Re: [Discuss] C++ filenames: hyphens or underscores?
I also have a small preference for underscores, but would also be fine with dashes. Underscores seem to be more common (and therefore blend better with vendored code), agree with the style guide, and are closest to the existing code. As an aside, having file_names that look like variable_names is nice; compare the Lispy way of using dashes for both.

Thanks for getting this discussion started, the mixture of dashes and underscores has been bothering me too :)

On Tue, Aug 6, 2019 at 8:41 PM Micah Kornfield wrote:
> I also have a preference for underscore but can get used to anything.
>
> I agree with the points François made above about the recommendation of the
> style guide and the smaller change to the existing code base.
Re: [Discuss] C++ filenames: hyphens or underscores?
I also have a preference for underscore but can get used to anything.

I agree with the points François made above about the recommendation of the style guide and the smaller change to the existing code base.

On Tue, Aug 6, 2019 at 6:52 PM Francois Saint-Jacques <fsaintjacq...@gmail.com> wrote:
> My vote would go with underscore to minimize changes and minimize
> exceptions to the google style guide reference. I also suggest that
> we add this to the linters somehow, if it's not too much trouble.
>
> François
[jira] [Created] (ARROW-6155) [Java] Extract a super interface for vectors whose elements reside in continuous memory segments
Liya Fan created ARROW-6155:

Summary: [Java] Extract a super interface for vectors whose elements reside in continuous memory segments
Key: ARROW-6155
URL: https://issues.apache.org/jira/browse/ARROW-6155
Project: Apache Arrow
Issue Type: New Feature
Reporter: Liya Fan
Assignee: Liya Fan

Vectors whose data elements reside in continuous memory segments should implement a common super interface. This will avoid unnecessary code branches.

For now, such vectors include fixed-width vectors and variable-width vectors. In the future, more vectors may be included.

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
Re: [Discuss] C++ filenames: hyphens or underscores?
My vote would go with underscores, to minimize changes and minimize exceptions to the Google style guide reference. I also suggest that we add this to the linters somehow, if it's not too much trouble.

François

On Tue, Aug 6, 2019 at 9:35 PM Sutou Kouhei wrote:
> Hi,
>
> I like hyphens.
>
> Because more Linux commands use hyphens than
> underscores. Here are counts on my Debian GNU/Linux machine:
>
>   % ls /usr/bin/ | grep -- - | wc -l
>   956
>   % ls /usr/bin/ | grep _ | wc -l
>   343
>
> Thanks,
> --
> kou
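François's suggestion above to enforce the convention through the linters could look roughly like the following sketch. It is only an illustration, not an existing Arrow lint script: the function name `find_misnamed_files`, the set of extensions, and the choice to flag hyphens (that is, assuming underscores are adopted) are all assumptions.

```python
import os

# Hypothetical set of extensions treated as C++ sources for this check.
CPP_EXTENSIONS = {".cc", ".cpp", ".h", ".hpp"}


def find_misnamed_files(root, bad_char="-"):
    """Return sorted paths (relative to root) of C++ files whose name contains bad_char."""
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            stem, ext = os.path.splitext(name)
            # Only the file name itself is checked, not the directory names.
            if ext in CPP_EXTENSIONS and bad_char in stem:
                full_path = os.path.join(dirpath, name)
                offenders.append(os.path.relpath(full_path, root))
    return sorted(offenders)
```

A lint step could call `find_misnamed_files` on the C++ source directory and fail when the returned list is non-empty. Passing `bad_char="_"` would enforce the opposite convention if hyphens were chosen instead.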
Re: [Discuss] C++ filenames: hyphens or underscores?
Hi,

I like hyphens, because more Linux commands use hyphens than underscores. Here are counts on my Debian GNU/Linux machine:

  % ls /usr/bin/ | grep -- - | wc -l
  956
  % ls /usr/bin/ | grep _ | wc -l
  343

Thanks,
--
kou

In <20190806140340.2a7ffab2@fsol>
  "[Discuss] C++ filenames: hyphens or underscores?" on Tue, 6 Aug 2019 14:03:40 +0200,
  Antoine Pitrou wrote:

> Hello,
>
> The filenames in the C++ source tree are a bit ad hoc and inconsistent.
> Sometimes they use hyphens for word separation, sometimes underscores.
> In ARROW-4648 it was proposed that we unify C++ file naming, therefore
> there are two possible options: only hyphens, or only underscores.
>
> What are your preferences? Personally, I have a slight preference for
> hyphens, especially as they are already used in binary names.
>
> Regards
>
> Antoine.
[jira] [Created] (ARROW-6154) Too many open files (os error 24)
Yesh created ARROW-6154:

Summary: Too many open files (os error 24)
Key: ARROW-6154
URL: https://issues.apache.org/jira/browse/ARROW-6154
Project: Apache Arrow
Issue Type: Bug
Components: Rust
Reporter: Yesh

I used the [rust] parquet-read binary to read a deeply nested Parquet file and got the stack trace below. Unfortunately, I won't be able to upload the file.

{code:java}
stack backtrace:
   0: std::panicking::default_hook::{{closure}}
   1: std::panicking::default_hook
   2: std::panicking::rust_panic_with_hook
   3: std::panicking::continue_panic_fmt
   4: rust_begin_unwind
   5: core::panicking::panic_fmt
   6: core::result::unwrap_failed
   7: parquet::util::io::FileSource::new
   8: as parquet::file::reader::RowGroupReader>::get_column_page_reader
   9: as parquet::file::reader::RowGroupReader>::get_column_reader
  10: parquet::record::reader::TreeBuilder::reader_tree
  11: parquet::record::reader::TreeBuilder::reader_tree
  12: parquet::record::reader::TreeBuilder::reader_tree
  13: parquet::record::reader::TreeBuilder::reader_tree
  14: parquet::record::reader::TreeBuilder::reader_tree
  15: parquet::record::reader::TreeBuilder::build
  16: ::next
  17: parquet_read::main
  18: std::rt::lang_start::{{closure}}
  19: std::panicking::try::do_call
  20: __rust_maybe_catch_panic
  21: std::rt::lang_start_internal
  22: main
{code}
[jira] [Created] (ARROW-6153) [R] Address parquet deprecation warning
Neal Richardson created ARROW-6153:

Summary: [R] Address parquet deprecation warning
Key: ARROW-6153
URL: https://issues.apache.org/jira/browse/ARROW-6153
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Neal Richardson
Assignee: Romain François

[~wesmckinn] has been refactoring the Parquet C++ library and there's now this deprecation warning appearing when I build the R package locally:

{code:java}
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -DNDEBUG -I/usr/local/include -DARROW_R_WITH_ARROW -I"/Users/enpiar/R/Rcpp/include" -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I/usr/local/include -fPIC -Wall -g -O2 -c parquet.cpp -o parquet.o
parquet.cpp:66:23: warning: 'OpenFile' is deprecated: Deprecated since 0.15.0. Use FileReaderBuilder [-Wdeprecated-declarations]
parquet::arrow::OpenFile(file, arrow::default_memory_pool(), *props, &reader));
^
{code}
Re: [Discuss] C++ filenames: hyphens or underscores?
I note that a change from underscores to hyphens would significantly affect the Parquet, Plasma, and Gandiva libraries, so I think we need to hear from other developers of those subprojects. Underscores are definitely less disruptive to the status quo.

On Tue, Aug 6, 2019 at 4:18 PM Wes McKinney wrote:
>
> I have a slight gut preference for underscores but I am OK with
> changing everything to hyphens. The hyphens will probably grow on me
> as it means pressing the "shift" key less frequently. Is there any
> technical argument for using one over the other? My understanding is
> that `git blame` is pretty robust to renames
[jira] [Created] (ARROW-6152) [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter
Wes McKinney created ARROW-6152:

Summary: [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter
Key: ARROW-6152
URL: https://issues.apache.org/jira/browse/ARROW-6152
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Wes McKinney
Fix For: 0.15.0

This is an initial refactoring task to enable the Arrow write layer to access some of the internal implementation details of {{parquet::TypedColumnWriter}}. See discussion in ARROW-3246
Re: [Discuss] C++ filenames: hyphens or underscores?
I have a slight gut preference for underscores, but I am OK with changing everything to hyphens. The hyphens will probably grow on me, as they mean pressing the "shift" key less frequently. Is there any technical argument for using one over the other? My understanding is that `git blame` is pretty robust to renames.

On Tue, Aug 6, 2019 at 7:04 AM Antoine Pitrou wrote:
>
> Hello,
>
> The filenames in the C++ source tree are a bit ad hoc and inconsistent.
> Sometimes they use hyphens for word separation, sometimes underscores.
> In ARROW-4648 it was proposed that we unify C++ file naming, therefore
> there are two possible options: only hyphens, or only underscores.
>
> What are your preferences? Personally, I have a slight preference for
> hyphens, especially as they are already used in binary names.
>
> Regards
>
> Antoine.
[jira] [Created] (ARROW-6151) [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate information
Wes McKinney created ARROW-6151:

Summary: [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate information
Key: ARROW-6151
URL: https://issues.apache.org/jira/browse/ARROW-6151
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Wes McKinney

I noticed this file -- I am concerned about its maintainability.
[VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements
hi all,

As we've been discussing for the last 5 weeks or so [1], there is a need to introduce 4 bytes of padding into the preamble of the "encapsulated IPC message" format to ensure that the Flatbuffers metadata payload begins on an 8-byte aligned memory offset. The alternative to this would be for Arrow implementations where alignment is important (e.g. C or C++) to copy the metadata (which is not always small) into memory when it is unaligned.

Micah has proposed to address this by adding a 4-byte "continuation" value at the beginning of the payload having the value 0x. The reason to do it this way is that old clients will see an invalid length (what is currently the first 4 bytes of the message -- a 32-bit little endian signed integer indicating the metadata length) rather than potentially crashing on a valid length.

This would be a backwards incompatible protocol change, so older Arrow libraries would not be able to read these new messages. Maintaining forward compatibility (reading data produced by older libraries) would be possible as we can reason that a value other than the continuation value was produced by an older library (and then validate the Flatbuffer message of course). Arrow implementations could offer a backward compatibility mode for the sake of old readers if they desire (this may also assist with testing).

The PR making these changes to the IPC documentation is here

https://github.com/apache/arrow/pull/4951

Please vote to accept this change. This vote will be open for at least 72 hours

[ ] +1 Adopt the Arrow protocol change
[ ] +0
[ ] -1 I disagree because...

Here is my vote: +1

Thanks,
Wes

[1]: https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
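The reader-side logic described in the vote message can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the PR: the function name `read_metadata_length` is made up, the continuation marker is taken as a parameter because its value is not spelled out in the message above, and it is assumed to be read as a little-endian unsigned 32-bit integer.

```python
import struct


def read_metadata_length(buf, continuation):
    """Return (metadata_length, metadata_offset) for an encapsulated message.

    `continuation` is the 4-byte marker as an unsigned integer. Its value is
    an input here (assumption: compared as little-endian uint32).
    """
    if len(buf) < 4:
        raise ValueError("buffer too short for a message preamble")
    (first,) = struct.unpack_from("<I", buf, 0)
    if first == continuation:
        # Proposed layout: <continuation: 4 bytes><length: int32><metadata>
        if len(buf) < 8:
            raise ValueError("buffer too short for continuation + length")
        (length,) = struct.unpack_from("<i", buf, 4)
        offset = 8
    else:
        # Legacy layout from older libraries: <length: int32><metadata>
        (length,) = struct.unpack_from("<i", buf, 0)
        offset = 4
    if length < 0:
        raise ValueError("invalid metadata length")
    return length, offset
```

With the extra 4 bytes the preamble is 8 bytes long, so the metadata starts at an 8-byte aligned offset whenever the message itself does. A reader would still validate the Flatbuffer message after locating it, as the message notes.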
[jira] [Created] (ARROW-6150) Intermittent Pyarrow HDFS IO error
Saurabh Bajaj created ARROW-6150:

Summary: Intermittent Pyarrow HDFS IO error
Key: ARROW-6150
URL: https://issues.apache.org/jira/browse/ARROW-6150
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.14.1
Reporter: Saurabh Bajaj

I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code shown in traceback below) using PyArrow's HDFS IO library. However, the job intermittently runs into the error shown below, not every run, only sometimes. I'm unable to determine the root cause of this issue.

{{
  File "/extractor.py", line 87, in __call__
    json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
  File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
  File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
}}
Arrow sync call tomorrow (August 7) at 12:00 US/Eastern, 16:00 UTC
Hi all, Reminder that the biweekly Arrow call is tomorrow at https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes will be sent out to the mailing list afterwards. Neal
[jira] [Created] (ARROW-6149) [Parquet] Decimal comparisons used for min/max statistics are not correct
Philip Felton created ARROW-6149:

Summary: [Parquet] Decimal comparisons used for min/max statistics are not correct
Key: ARROW-6149
URL: https://issues.apache.org/jira/browse/ARROW-6149
Project: Apache Arrow
Issue Type: Bug
Reporter: Philip Felton

The [Parquet Format specification|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] says

bq. If the column uses int32 or int64 physical types, then signed comparison of the integer values produces the correct ordering. If the physical type is fixed, then the correct ordering can be produced by flipping the most-significant bit in the first byte and then using unsigned byte-wise comparison.

However, this isn't followed in the C++ Parquet code: 16-byte decimal comparison is implemented using a lexicographical comparison of signed chars. This appears to be because the function [https://github.com/apache/arrow/blob/master/cpp/src/parquet/statistics.cc#L183] just goes off the sort_order (signed) and physical_type (FIXED_LENGTH_BYTE_ARRAY); there is no override for decimal.
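The ordering rule quoted from the specification, and the failure mode the report describes, can be illustrated with a short Python sketch. Python is used only for brevity; this is not the C++ Parquet code. It assumes the values are big-endian two's-complement byte strings of equal length, and the function names are made up for this illustration.

```python
def decimal_sort_key(value):
    """Flip the most-significant bit of the first byte.

    Python compares bytes objects as unsigned values, byte by byte, so the
    returned key orders fixed-length decimals correctly.
    """
    if not value:
        raise ValueError("empty decimal value")
    return bytes([value[0] ^ 0x80]) + value[1:]


def decimal_less(a, b):
    """Correct ordering: MSB flip followed by unsigned byte-wise comparison."""
    if len(a) != len(b):
        raise ValueError("fixed-length values must have the same length")
    return decimal_sort_key(a) < decimal_sort_key(b)


def signed_char_less(a, b):
    """Lexicographic comparison that treats every byte as a signed char.

    This mimics the behaviour described in the report, for contrast only.
    """
    def as_signed(value):
        return [x - 256 if x >= 128 else x for x in value]

    return as_signed(a) < as_signed(b)
```

For example, with 16-byte encodings of 1 and 128, `decimal_less` reports that 1 sorts before 128, while `signed_char_less` does not, because the trailing byte `0x80` is read as -128.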
[jira] [Created] (ARROW-6148) Missing debian build dependencies
Francois Saint-Jacques created ARROW-6148:

Summary: Missing debian build dependencies
Key: ARROW-6148
URL: https://issues.apache.org/jira/browse/ARROW-6148
Project: Apache Arrow
Issue Type: Bug
Components: Packaging
Reporter: Francois Saint-Jacques
[jira] [Created] (ARROW-6147) [Go] implement a Flight client
Sebastien Binet created ARROW-6147:

Summary: [Go] implement a Flight client
Key: ARROW-6147
URL: https://issues.apache.org/jira/browse/ARROW-6147
Project: Apache Arrow
Issue Type: New Feature
Reporter: Sebastien Binet
[jira] [Created] (ARROW-6146) [Go] implement a Plasma client
Sebastien Binet created ARROW-6146:

Summary: [Go] implement a Plasma client
Key: ARROW-6146
URL: https://issues.apache.org/jira/browse/ARROW-6146
Project: Apache Arrow
Issue Type: New Feature
Reporter: Sebastien Binet
[Discuss] C++ filenames: hyphens or underscores?
Hello, The filenames in the C++ source tree are a bit ad hoc and inconsistent. Sometimes they use hyphens for word separation, sometimes underscores. In ARROW-4648 it was proposed that we unify C++ file naming, therefore there are two possible options: only hyphens, or only underscores. What are your preferences? Personally, I have a slight preference for hyphens, especially as they are already used in binary names. Regards Antoine.
[jira] [Created] (ARROW-6145) [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly
Ji Liu created ARROW-6145:

Summary: [Java] UnionVector created by MinorType#getNewVector could not keep field type info properly
Key: ARROW-6145
URL: https://issues.apache.org/jira/browse/ARROW-6145
Project: Apache Arrow
Issue Type: Bug
Components: Java
Reporter: Ji Liu
Assignee: Ji Liu

While working on other items, I found that a {{UnionVector}} created by {{VectorSchemaRoot#create(Schema schema, BufferAllocator allocator)}} does not keep field type info properly. For example, if we set metadata on a {{Field}} in the schema, we cannot get it back via {{UnionVector#getField}}.

This is mainly because {{MinorType.Union.getNewVector}} does not pass the {{FieldType}} to the vector, and {{UnionVector#getField}} creates a new {{Field}}, which causes the inconsistency.
[jira] [Created] (ARROW-6144) Implement random function in Gandiva
Prudhvi Porandla created ARROW-6144:

Summary: Implement random function in Gandiva
Key: ARROW-6144
URL: https://issues.apache.org/jira/browse/ARROW-6144
Project: Apache Arrow
Issue Type: Task
Components: C++ - Gandiva
Reporter: Prudhvi Porandla
Assignee: Prudhvi Porandla

Implement random(), random(int seed) functions
Re: [Discuss][Java] 64-bit lengths for ValueVectors
Hi Micah,

Thanks a lot for doing this.

I am a little concerned about whether there is any negative performance impact on the current 32-bit-length based applications. Can we do some performance comparison on our existing benchmarks?

Best,
Liya Fan

On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield wrote:
> There have been some previous discussions on the mailing about supporting
> 64-bit lengths for Java ValueVectors (this is what the IPC specification
> and C++ support). I created a PR [1] that changes all APIs that I could
> find that take an index to take an "long" instead of an "int" (and
> similarly change "size/rowcount" APIs).
>
> It is a big change, so I think it is worth discussing if it is something we
> still want to move forward with. It would be nice to come to a conclusion
> quickly, ideally in the next few days, to avoid a lot of merge conflicts.
>
> The reason I did this work now is the C++ implementation has added support
> for LargeList, LargeBinary and LargeString arrays and based on prior
> discussions we need to have similar support in Java before our next
> release. Support 64-bit indexes means we can have full compatibility and
> make the most use of the types in Java.
>
> Look forward to hearing feedback.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/pull/5020
[Discuss][Java] 64-bit lengths for ValueVectors
There have been some previous discussions on the mailing list about supporting 64-bit lengths for Java ValueVectors (this is what the IPC specification and C++ support). I created a PR [1] that changes all APIs that I could find that take an index to take a "long" instead of an "int" (and similarly changes the "size/rowcount" APIs).

It is a big change, so I think it is worth discussing whether it is something we still want to move forward with. It would be nice to come to a conclusion quickly, ideally in the next few days, to avoid a lot of merge conflicts.

The reason I did this work now is that the C++ implementation has added support for LargeList, LargeBinary and LargeString arrays, and based on prior discussions we need to have similar support in Java before our next release. Supporting 64-bit indexes means we can have full compatibility and make the most use of the types in Java.

Look forward to hearing feedback.

Thanks,
Micah

[1] https://github.com/apache/arrow/pull/5020