Re: [Discuss] C++ filenames: hyphens or underscores?

2019-08-06 Thread Philipp Moritz
I also have a small preference for underscores but would also be fine with
dashes.

Underscores seem to be more common (and therefore blend better with vendored
code), agree with the style guide, and are closest to the existing code. Also,
as an aside, having file_names look like variable_names is nice. Compare the
Lispy way of using dashes for both.

Thanks for getting this discussion started, the mixture of dashes and
underscores has been bothering me too :)

On Tue, Aug 6, 2019 at 8:41 PM Micah Kornfield wrote:

> I also have a preference for underscore but can get used to anything.
>
> I agree with the points François made above about the recommendation of the
> style guide and the smaller change to the existing code base.
>
> On Tue, Aug 6, 2019 at 6:52 PM Francois Saint-Jacques <
> fsaintjacq...@gmail.com> wrote:
>
> > My vote would go with underscore to minimize changes and minimize
> > exceptions to the google style guide reference. I also suggest that
> > we add this to the linters somehow, if it's not too much trouble.
> >
> > François
> >
> > On Tue, Aug 6, 2019 at 9:35 PM Sutou Kouhei  wrote:
> > >
> > > Hi,
> > >
> > > I like hyphens.
> > >
> > > > Because more Linux commands use hyphens than
> > > > underscores. Here are counts on my Debian GNU/Linux machine:
> > >
> > > % ls /usr/bin/ | grep -- - | wc -l
> > > 956
> > > % ls /usr/bin/ | grep _ | wc -l
> > > 343
> > >
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> > > In <20190806140340.2a7ffab2@fsol>
> > >   "[Discuss] C++ filenames: hyphens or underscores?" on Tue, 6 Aug 2019
> > 14:03:40 +0200,
> > >   Antoine Pitrou  wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > > The filenames in the C++ source tree are a bit ad hoc and inconsistent.
> > > > > Sometimes they use hyphens for word separation, sometimes underscores.
> > > > > In ARROW-4648 it was proposed that we unify C++ file naming, therefore
> > > > > there are two possible options: only hyphens, or only underscores.
> > > > >
> > > > > What are your preferences?  Personally, I have a slight preference for
> > > > > hyphens, especially as they are already used in binary names.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> >
>


Re: [Discuss] C++ filenames: hyphens or underscores?

2019-08-06 Thread Micah Kornfield
I also have a preference for underscore but can get used to anything.

I agree with the points François made above about the recommendation of the
style guide and the smaller change to the existing code base.

On Tue, Aug 6, 2019 at 6:52 PM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> My vote would go with underscore to minimize changes and minimize
> exceptions to the google style guide reference. I also suggest that
> we add this to the linters somehow, if it's not too much trouble.
>
> François
>
> On Tue, Aug 6, 2019 at 9:35 PM Sutou Kouhei  wrote:
> >
> > Hi,
> >
> > I like hyphens.
> >
> > Because more Linux commands use hyphens than
> > underscores. Here are counts on my Debian GNU/Linux machine:
> >
> > % ls /usr/bin/ | grep -- - | wc -l
> > 956
> > % ls /usr/bin/ | grep _ | wc -l
> > 343
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In <20190806140340.2a7ffab2@fsol>
> >   "[Discuss] C++ filenames: hyphens or underscores?" on Tue, 6 Aug 2019 14:03:40 +0200,
> >   Antoine Pitrou  wrote:
> >
> > >
> > > Hello,
> > >
> > > The filenames in the C++ source tree are a bit ad hoc and inconsistent.
> > > Sometimes they use hyphens for word separation, sometimes underscores.
> > > In ARROW-4648 it was proposed that we unify C++ file naming, therefore
> > > there are two possible options: only hyphens, or only underscores.
> > >
> > > What are your preferences?  Personally, I have a slight preference for
> > > hyphens, especially as they are already used in binary names.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
>


[jira] [Created] (ARROW-6155) [Java] Extract a super interface for vectors whose elements reside in continuous memory segments

2019-08-06 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6155:
---

 Summary: [Java] Extract a super interface for vectors whose 
elements reside in continuous memory segments
 Key: ARROW-6155
 URL: https://issues.apache.org/jira/browse/ARROW-6155
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Liya Fan
Assignee: Liya Fan


For vectors whose data elements reside in continuous memory segments, they 
should implement a common super interface. This will avoid unnecessary code 
branches.

For now, such vectors include fixed-width vectors and variable-width vectors. 
In the future, there can be more vectors included.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


Re: [Discuss] C++ filenames: hyphens or underscores?

2019-08-06 Thread Francois Saint-Jacques
My vote would go with underscore to minimize changes and minimize
exceptions to the google style guide reference. I also suggest that
we add this to the linters somehow, if it's not too much trouble.

François
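
One way such a linter check could look is sketched below. It assumes
underscores end up being the chosen convention and that C++ sources use the
extensions listed in CPP_EXTENSIONS; the function names find_hyphenated and
suggest_rename are hypothetical and not part of any existing Arrow lint
tooling.

```python
import os

# Assumed C++ source extensions; adjust to the project's conventions.
CPP_EXTENSIONS = (".h", ".cc", ".cpp", ".hpp")


def find_hyphenated(root):
    """Return sorted paths of C++ files under `root` whose names contain a hyphen."""
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(CPP_EXTENSIONS) and "-" in name:
                offenders.append(os.path.join(dirpath, name))
    return sorted(offenders)


def suggest_rename(path):
    """Suggest the underscore spelling of a file name, leaving directories untouched."""
    dirname, name = os.path.split(path)
    return os.path.join(dirname, name.replace("-", "_"))
```

A lint step could fail when find_hyphenated returns a non-empty list and print
the suggested names. If hyphens were chosen instead, the same walk would apply
with the two characters swapped.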

On Tue, Aug 6, 2019 at 9:35 PM Sutou Kouhei  wrote:
>
> Hi,
>
> I like hyphens.
>
> Because more Linux commands use hyphens than
> underscores. Here are counts on my Debian GNU/Linux machine:
>
> % ls /usr/bin/ | grep -- - | wc -l
> 956
> % ls /usr/bin/ | grep _ | wc -l
> 343
>
>
> Thanks,
> --
> kou
>
> In <20190806140340.2a7ffab2@fsol>
>   "[Discuss] C++ filenames: hyphens or underscores?" on Tue, 6 Aug 2019 
> 14:03:40 +0200,
>   Antoine Pitrou  wrote:
>
> >
> > Hello,
> >
> > The filenames in the C++ source tree are a bit ad hoc and inconsistent.
> > Sometimes they use hyphens for word separation, sometimes underscores.
> > In ARROW-4648 it was proposed that we unify C++ file naming, therefore
> > there are two possible options: only hyphens, or only underscores.
> >
> > What are your preferences?  Personally, I have a slight preference for
> > hyphens, especially as they are already used in binary names.
> >
> > Regards
> >
> > Antoine.
> >
> >


Re: [Discuss] C++ filenames: hyphens or underscores?

2019-08-06 Thread Sutou Kouhei
Hi,

I like hyphens.

Because more Linux commands use hyphens than
underscores. Here are counts on my Debian GNU/Linux machine:

% ls /usr/bin/ | grep -- - | wc -l
956
% ls /usr/bin/ | grep _ | wc -l
343


Thanks,
--
kou

In <20190806140340.2a7ffab2@fsol>
  "[Discuss] C++ filenames: hyphens or underscores?" on Tue, 6 Aug 2019 
14:03:40 +0200,
  Antoine Pitrou  wrote:

> 
> Hello,
> 
> The filenames in the C++ source tree are a bit ad hoc and inconsistent.
> Sometimes they use hyphens for word separation, sometimes underscores.
> In ARROW-4648 it was proposed that we unify C++ file naming, therefore
> there are two possible options: only hyphens, or only underscores.
> 
> What are your preferences?  Personally, I have a slight preference for
> hyphens, especially as they are already used in binary names.
> 
> Regards
> 
> Antoine.
> 
> 


[jira] [Created] (ARROW-6154) Too many open files (os error 24)

2019-08-06 Thread Yesh (JIRA)
Yesh created ARROW-6154:
---

 Summary: Too many open files (os error 24)
 Key: ARROW-6154
 URL: https://issues.apache.org/jira/browse/ARROW-6154
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Yesh


I used the Rust parquet-read binary to read a deeply nested Parquet file and
got the stack trace below. Unfortunately, I won't be able to upload the file.
{code:java}
stack backtrace:
   0: std::panicking::default_hook::{{closure}}
   1: std::panicking::default_hook
   2: std::panicking::rust_panic_with_hook
   3: std::panicking::continue_panic_fmt
   4: rust_begin_unwind
   5: core::panicking::panic_fmt
   6: core::result::unwrap_failed
   7: parquet::util::io::FileSource::new
   8:  as parquet::file::reader::RowGroupReader>::get_column_page_reader
   9:  as parquet::file::reader::RowGroupReader>::get_column_reader
  10: parquet::record::reader::TreeBuilder::reader_tree
  11: parquet::record::reader::TreeBuilder::reader_tree
  12: parquet::record::reader::TreeBuilder::reader_tree
  13: parquet::record::reader::TreeBuilder::reader_tree
  14: parquet::record::reader::TreeBuilder::reader_tree
  15: parquet::record::reader::TreeBuilder::build
  16: ::next
  17: parquet_read::main
  18: std::rt::lang_start::{{closure}}
  19: std::panicking::try::do_call
  20: __rust_maybe_catch_panic
  21: std::rt::lang_start_internal
  22: main
{code}





[jira] [Created] (ARROW-6153) [R] Address parquet deprecation warning

2019-08-06 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-6153:
--

 Summary: [R] Address parquet deprecation warning
 Key: ARROW-6153
 URL: https://issues.apache.org/jira/browse/ARROW-6153
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Romain François


[~wesmckinn] has been refactoring the Parquet C++ library and there's now this 
deprecation warning appearing when I build the R package locally: 
{code:java}
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include"
-DNDEBUG -DNDEBUG -I/usr/local/include -DARROW_R_WITH_ARROW
-I"/Users/enpiar/R/Rcpp/include" -isysroot
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include  -fPIC
 -Wall -g -O2  -c parquet.cpp -o parquet.o
parquet.cpp:66:23: warning: 'OpenFile' is deprecated: Deprecated since 0.15.0. Use FileReaderBuilder [-Wdeprecated-declarations]
      parquet::arrow::OpenFile(file, arrow::default_memory_pool(), *props, &reader));
                      ^
{code}





Re: [Discuss] C++ filenames: hyphens or underscores?

2019-08-06 Thread Wes McKinney
I note that a change from underscores to hyphens would significantly
affect the Parquet, Plasma, and Gandiva libraries so I think we need
to hear from other developers of those subprojects. Underscores are
definitely less disruptive to the status quo.

On Tue, Aug 6, 2019 at 4:18 PM Wes McKinney  wrote:
>
> I have a slight gut preference for underscores but I am OK with
> changing everything to hyphens. The hyphens will probably grow on me
> as it means pressing the "shift" key less frequently. Is there any
> technical argument for using one over the other? My understanding is
> that `git blame` is pretty robust to renames
>
> On Tue, Aug 6, 2019 at 7:04 AM Antoine Pitrou  wrote:
> >
> >
> > Hello,
> >
> > The filenames in the C++ source tree are a bit ad hoc and inconsistent.
> > Sometimes they use hyphens for word separation, sometimes underscores.
> > In ARROW-4648 it was proposed that we unify C++ file naming, therefore
> > there are two possible options: only hyphens, or only underscores.
> >
> > What are your preferences?  Personally, I have a slight preference for
> > hyphens, especially as they are already used in binary names.
> >
> > Regards
> >
> > Antoine.
> >
> >


[jira] [Created] (ARROW-6152) [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter

2019-08-06 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6152:
---

 Summary: [C++][Parquet] Write arrow::Array directly into 
parquet::TypedColumnWriter
 Key: ARROW-6152
 URL: https://issues.apache.org/jira/browse/ARROW-6152
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.15.0


This is an initial refactoring task to enable the Arrow write layer to access 
some of the internal implementation details of 
{{parquet::TypedColumnWriter}}. See discussion in ARROW-3246





Re: [Discuss] C++ filenames: hyphens or underscores?

2019-08-06 Thread Wes McKinney
I have a slight gut preference for underscores but I am OK with
changing everything to hyphens. The hyphens will probably grow on me
as it means pressing the "shift" key less frequently. Is there any
technical argument for using one over the other? My understanding is
that `git blame` is pretty robust to renames.

On Tue, Aug 6, 2019 at 7:04 AM Antoine Pitrou  wrote:
>
>
> Hello,
>
> The filenames in the C++ source tree are a bit ad hoc and inconsistent.
> Sometimes they use hyphens for word separation, sometimes underscores.
> In ARROW-4648 it was proposed that we unify C++ file naming, therefore
> there are two possible options: only hyphens, or only underscores.
>
> What are your preferences?  Personally, I have a slight preference for
> hyphens, especially as they are already used in binary names.
>
> Regards
>
> Antoine.
>
>


[jira] [Created] (ARROW-6151) [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate information

2019-08-06 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-6151:
---

 Summary: [R] See if possible to generate r/inst/NOTICE.txt rather 
than duplicate information
 Key: ARROW-6151
 URL: https://issues.apache.org/jira/browse/ARROW-6151
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Wes McKinney


I noticed this file -- I am concerned about its maintainability. 





[VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements

2019-08-06 Thread Wes McKinney
hi all,

As we've been discussing for the last 5 weeks or so [1], there is a
need to introduce 4 bytes of padding into the preamble of the
"encapsulated IPC message" format to ensure that the Flatbuffers
metadata payload begins on an 8-byte aligned memory offset. The
alternative to this would be for Arrow implementations where alignment
is important (e.g. C or C++) to copy the metadata (which is not always
small) into memory when it is unaligned.

Micah has proposed to address this by adding a 4-byte "continuation"
value at the beginning of the payload having the value 0x. The
reason to do it this way is that old clients will see an invalid
length (what is currently the first 4 bytes of the message -- a 32-bit
little endian signed integer indicating the metadata length) rather
than potentially crashing on a valid length.

This would be a backwards incompatible protocol change, so older Arrow
libraries would not be able to read these new messages. Maintaining
forward compatibility (reading data produced by older libraries) would
be possible as we can reason that a value other than the continuation
value was produced by an older library (and then validate the
Flatbuffer message of course). Arrow implementations could offer a
backward compatibility mode for the sake of old readers if they desire
(this may also assist with testing).

The PR making these changes to the IPC documentation is here:

https://github.com/apache/arrow/pull/4951

Please vote to accept this change. This vote will be open for at least 72 hours.

[ ] +1 Adopt the Arrow protocol change
[ ] +0
[ ] -1 I disagree because...

Here is my vote: +1

Thanks,
Wes

[1]: 
https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
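
The reader-side logic described in this proposal can be sketched as follows.
This is a minimal illustration, not the Arrow implementation. The continuation
value is garbled in the text above, so the sketch assumes 0xFFFFFFFF (which
reads as -1 when interpreted as a signed 32-bit integer), and the names
CONTINUATION, write_prefix, read_metadata_length and legacy_view are
hypothetical.

```python
import struct

# Assumption: the 4-byte continuation marker is 0xFFFFFFFF.
CONTINUATION = 0xFFFFFFFF


def write_prefix(metadata_length, legacy=False):
    """Build the message prefix: [continuation marker] + little-endian int32 length."""
    length = struct.pack("<i", metadata_length)
    if legacy:
        return length
    return struct.pack("<I", CONTINUATION) + length


def read_metadata_length(stream):
    """Return (metadata_length, is_new_format) for either prefix layout."""
    first = stream.read(4)
    if len(first) < 4:
        raise EOFError("truncated message prefix")
    if struct.unpack("<I", first)[0] == CONTINUATION:
        # New layout: the real length follows the marker, so the metadata
        # begins 8 bytes after the start of the message.
        second = stream.read(4)
        if len(second) < 4:
            raise EOFError("truncated metadata length")
        return struct.unpack("<i", second)[0], True
    # Old layout: the first 4 bytes already are the signed length.
    return struct.unpack("<i", first)[0], False


def legacy_view(prefix):
    """What an old reader sees: the first 4 bytes as a signed int32 length."""
    return struct.unpack("<i", prefix[:4])[0]
```

Under this assumption, an old reader handed a new-format message sees a
negative length and can reject it, rather than acting on a plausible but wrong
length.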


[jira] [Created] (ARROW-6150) Intermittent Pyarrow HDFS IO error

2019-08-06 Thread Saurabh Bajaj (JIRA)
Saurabh Bajaj created ARROW-6150:


 Summary: Intermittent Pyarrow HDFS IO error
 Key: ARROW-6150
 URL: https://issues.apache.org/jira/browse/ARROW-6150
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
Reporter: Saurabh Bajaj


I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
shown in traceback below) using PyArrow's HDFS IO library. However, the job 
intermittently runs into the error shown below, not every run, only sometimes. 
I'm unable to determine the root cause of this issue.

 

{{ File "/extractor.py", line 87, in __call__
 json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
 File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
 File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
 File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
 pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port}}





Arrow sync call tomorrow (August 7) at 12:00 US/Eastern, 16:00 UTC

2019-08-06 Thread Neal Richardson
Hi all,
Reminder that the biweekly Arrow call is tomorrow at
https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes
will be sent out to the mailing list afterwards.

Neal


[jira] [Created] (ARROW-6149) [Parquet] Decimal comparisons used for min/max statistics are not correct

2019-08-06 Thread Philip Felton (JIRA)
Philip Felton created ARROW-6149:


 Summary: [Parquet] Decimal comparisons used for min/max statistics 
are not correct
 Key: ARROW-6149
 URL: https://issues.apache.org/jira/browse/ARROW-6149
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philip Felton


The
[Parquet Format specification|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]
says:

bq. If the column uses int32 or int64 physical types, then signed comparison of 
the integer values produces the correct ordering. If the physical type is 
fixed, then the correct ordering can be produced by flipping the 
most-significant bit in the first byte and then using unsigned byte-wise 
comparison.

However this isn't followed in the C++ Parquet code. 16-byte decimal comparison 
is implemented using a lexicographical comparison of signed chars.

This appears to be because the function at
[https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183]
only looks at the sort_order (signed) and physical_type
(FIXED_LENGTH_BYTE_ARRAY); there is no override for decimal.
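
The ordering rule quoted from the specification can be illustrated with a
small sketch. It assumes decimals are stored as big-endian two's-complement
values in a fixed-length byte array; the helper names are hypothetical and
this is not the Arrow C++ code.

```python
def encode_decimal(unscaled, width=16):
    """Encode an unscaled decimal value as big-endian two's complement."""
    return unscaled.to_bytes(width, byteorder="big", signed=True)


def sort_key(raw):
    """Flip the most-significant bit of the first byte, so that unsigned
    byte-wise comparison of the keys matches signed numeric order."""
    return bytes([raw[0] ^ 0x80]) + raw[1:]


def signed_char_key(raw):
    """Lexicographic comparison of signed chars, the behaviour the report
    describes; included only to show how it goes wrong."""
    return [b - 256 if b >= 128 else b for b in raw]
```

For example, 127 and 128 differ only in the last byte (0x7F versus 0x80). With
sort_key the encodings compare in the right order, while signed_char_key
treats 0x80 as -128 and places 128 before 127.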





[jira] [Created] (ARROW-6148) Missing debian build dependencies

2019-08-06 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-6148:
-

 Summary: Missing debian build dependencies
 Key: ARROW-6148
 URL: https://issues.apache.org/jira/browse/ARROW-6148
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Reporter: Francois Saint-Jacques








[jira] [Created] (ARROW-6147) [Go] implement a Flight client

2019-08-06 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-6147:
--

 Summary: [Go] implement a Flight client
 Key: ARROW-6147
 URL: https://issues.apache.org/jira/browse/ARROW-6147
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Sebastien Binet








[jira] [Created] (ARROW-6146) [Go] implement a Plasma client

2019-08-06 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-6146:
--

 Summary: [Go] implement a Plasma client
 Key: ARROW-6146
 URL: https://issues.apache.org/jira/browse/ARROW-6146
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Sebastien Binet








[Discuss] C++ filenames: hyphens or underscores?

2019-08-06 Thread Antoine Pitrou


Hello,

The filenames in the C++ source tree are a bit ad hoc and inconsistent.
Sometimes they use hyphens for word separation, sometimes underscores.
In ARROW-4648 it was proposed that we unify C++ file naming, therefore
there are two possible options: only hyphens, or only underscores.

What are your preferences?  Personally, I have a slight preference for
hyphens, especially as they are already used in binary names.

Regards

Antoine.




[jira] [Created] (ARROW-6145) [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly

2019-08-06 Thread Ji Liu (JIRA)
Ji Liu created ARROW-6145:
-

 Summary: [Java] UnionVector created by MinorType#getNewVector 
could not keep field type info properly
 Key: ARROW-6145
 URL: https://issues.apache.org/jira/browse/ARROW-6145
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


While working on other items, I found that a {{UnionVector}} created by
{{VectorSchemaRoot#create(Schema schema, BufferAllocator allocator)}} does not
keep field type info properly. For example, if we set metadata on a {{Field}} in
the schema, we cannot get it back via {{UnionVector#getField}}.

This is mainly because {{MinorType.Union.getNewVector}} does not pass the
{{FieldType}} to the vector, and {{UnionVector#getField}} creates a new
{{Field}}, which causes the inconsistency.





[jira] [Created] (ARROW-6144) Implement random function in Gandiva

2019-08-06 Thread Prudhvi Porandla (JIRA)
Prudhvi Porandla created ARROW-6144:
---

 Summary: Implement random function in Gandiva
 Key: ARROW-6144
 URL: https://issues.apache.org/jira/browse/ARROW-6144
 Project: Apache Arrow
  Issue Type: Task
  Components: C++ - Gandiva
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla


Implement random(), random(int seed) functions





Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-06 Thread Fan Liya
Hi Micah,

Thanks a lot for doing this.

I am a little concerned about whether there is any negative performance impact
on current applications that are based on 32-bit lengths.
Can we do some performance comparisons on our existing benchmarks?

Best,
Liya Fan


On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield wrote:

> There have been some previous discussions on the mailing list about supporting
> 64-bit lengths for Java ValueVectors (this is what the IPC specification
> and C++ support).  I created a PR [1] that changes all APIs that I could
> find that take an index to take a "long" instead of an "int" (and
> similarly change "size/rowcount" APIs).
>
> It is a big change, so I think it is worth discussing if it is something we
> still want to move forward with.  It would be nice to come to a conclusion
> quickly, ideally in the next few days, to avoid a lot of merge conflicts.
>
> The reason I did this work now is the C++ implementation has added support
> for LargeList, LargeBinary and LargeString arrays and based on prior
> discussions we need to have similar support in Java before our next
> release. Supporting 64-bit indexes means we can have full compatibility and
> make the most of these types in Java.
>
> Look forward to hearing feedback.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/5020
>


[Discuss][Java] 64-bit lengths for ValueVectors

2019-08-06 Thread Micah Kornfield
There have been some previous discussions on the mailing list about supporting
64-bit lengths for Java ValueVectors (this is what the IPC specification
and C++ support).  I created a PR [1] that changes all APIs that I could
find that take an index to take a "long" instead of an "int" (and
similarly change "size/rowcount" APIs).

It is a big change, so I think it is worth discussing if it is something we
still want to move forward with.  It would be nice to come to a conclusion
quickly, ideally in the next few days, to avoid a lot of merge conflicts.

The reason I did this work now is the C++ implementation has added support
for LargeList, LargeBinary and LargeString arrays and based on prior
discussions we need to have similar support in Java before our next
release. Supporting 64-bit indexes means we can have full compatibility and
make the most of these types in Java.
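
To make the size limit concrete, here is a small sketch (in Python rather than
Java, and not taken from the PR) of why variable-width data eventually needs
64-bit offsets. It assumes the usual layout in which an array of n values
carries n + 1 cumulative offsets; the function names are hypothetical.

```python
INT32_MAX = 2**31 - 1


def build_offsets(lengths):
    """Cumulative offsets buffer for a variable-width array (n + 1 entries)."""
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets


def needs_large_type(lengths):
    """True if the final offset no longer fits in a signed 32-bit integer."""
    return build_offsets(lengths)[-1] > INT32_MAX
```

Two values of 2**30 bytes each already push the last offset to 2**31, one past
INT32_MAX, which is the kind of case that 64-bit offsets and indexes are meant
to cover.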

Look forward to hearing feedback.

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/5020