[jira] [Commented] (ARROW-4800) [C++] Create/port a StatusOr implementation to be able to return a status or a type
[ https://issues.apache.org/jira/browse/ARROW-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846467#comment-16846467 ] Micah Kornfield commented on ARROW-4800: As long as we are bikeshedding, I think ErrorOr or StatusOr are more understandable without looking at the class. Agreed on trying to replace APIs, but I think this can be somewhat incremental; as we develop higher-level functionality we can go down the stack. The nice thing is these APIs can live side by side since the method signatures should always be different. > [C++] Create/port a StatusOr implementation to be able to return a status or > a type > --- > > Key: ARROW-4800 > URL: https://issues.apache.org/jira/browse/ARROW-4800 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.14.0 >Reporter: Micah Kornfield >Priority: Minor > > Example from grpc: > https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/stubs/statusor.h -- This message was sent by Atlassian JIRA (v7.6.3#76005)
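The pattern under discussion can be illustrated with a short sketch (Python here for brevity; Arrow's actual implementation would be C++, modeled on the protobuf `statusor.h` linked in the issue). All names below are illustrative, not Arrow API.

```python
class StatusOr:
    """Minimal sketch of the StatusOr/ErrorOr pattern: holds either a
    value or an error status, never both. Illustrative only."""

    def __init__(self, value=None, error=None):
        self._value = value
        self._error = error

    @classmethod
    def ok(cls, value):
        return cls(value=value)

    @classmethod
    def error(cls, message):
        return cls(error=message)

    def is_ok(self):
        return self._error is None

    def value(self):
        # Accessing the value of an errored result is a programming
        # error, so fail loudly instead of returning garbage.
        if self._error is not None:
            raise RuntimeError(self._error)
        return self._value


# A function can now return either a value or a status through one
# signature, which is why the old and new APIs can coexist.
def parse_int(text):
    try:
        return StatusOr.ok(int(text))
    except ValueError:
        return StatusOr.error("invalid integer: %r" % text)
```

The "side by side" point in the comment follows from this shape: a `StatusOr`-returning overload has a different signature than the old `Status`-plus-out-parameter form, so both can exist during an incremental migration.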
[jira] [Updated] (ARROW-4741) [Java] Add documentation to all classes and enable checkstyle for class javadocs
[ https://issues.apache.org/jira/browse/ARROW-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-4741: -- Labels: pull-request-available (was: ) > [Java] Add documentation to all classes and enable checkstyle for class > javadocs > > > Key: ARROW-4741 > URL: https://issues.apache.org/jira/browse/ARROW-4741 > Project: Apache Arrow > Issue Type: New Feature > Components: Documentation, Java >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Minor > Labels: pull-request-available > > This is likely a big issue. So it might pay to create subtasks for different > modules to add javadoc then do one final cleanup for enabling check-style. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5395) [C++] Utilize stream EOS in File format
[ https://issues.apache.org/jira/browse/ARROW-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5395: Summary: [C++] Utilize stream EOS in File format (was: Utilize stream EOS in File format) > [C++] Utilize stream EOS in File format > --- > > Key: ARROW-5395 > URL: https://issues.apache.org/jira/browse/ARROW-5395 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: John Muehlhausen >Priority: Minor > Labels: pull-request-available > Original Estimate: 0.25h > Time Spent: 1h 40m > Remaining Estimate: 0h > > We currently do not write EOS at the end of a Message stream inside the File > format. As a result, the file cannot be parsed sequentially. This change > prepares for other implementations or future reference features that parse a > File sequentially... i.e. without access to seek(). > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
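For context on why the missing EOS blocks sequential parsing, here is a hedged sketch of length-prefixed framing (assuming a 4-byte little-endian length prefix per message and a zero length as EOS; the real Arrow framing has additional detail such as padding and, in later format versions, a continuation marker):

```python
import io
import struct


def write_stream(messages):
    """Write length-prefixed messages followed by an EOS marker
    (a zero length), so a reader never needs seek()."""
    buf = io.BytesIO()
    for msg in messages:
        buf.write(struct.pack("<i", len(msg)))
        buf.write(msg)
    buf.write(struct.pack("<i", 0))  # EOS marker
    return buf.getvalue()


def read_stream(data):
    """Read messages strictly sequentially, stopping at EOS."""
    stream = io.BytesIO(data)
    out = []
    while True:
        (length,) = struct.unpack("<i", stream.read(4))
        if length == 0:
            break  # EOS reached; no seek() was ever required
        out.append(stream.read(length))
    return out
```

Without the terminating zero length, `read_stream` would run off the end of the message section into the file footer, which is exactly the failure mode the issue describes.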
[jira] [Updated] (ARROW-2885) [C++] Right-justify array values in PrettyPrint
[ https://issues.apache.org/jira/browse/ARROW-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2885: Fix Version/s: (was: 0.14.0) > [C++] Right-justify array values in PrettyPrint > --- > > Key: ARROW-2885 > URL: https://issues.apache.org/jira/browse/ARROW-2885 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Uwe L. Korn >Priority: Major > Labels: beginner > > Currently the output of {{PrettyPrint}} for an array looks as follows: > {code} > [ > 1, > NA > ] > {code} > We should right-justify it for better readability: > {code} > [ >  1, > NA > ] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
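The requested right-justification amounts to padding every rendered value to the width of the widest one. A sketch of the idea (the actual PrettyPrint lives in Arrow C++; the function name here is illustrative):

```python
def pretty_print(values):
    """Render a list Arrow-style, right-justifying each value to the
    width of the widest rendered entry ("NA" for nulls)."""
    rendered = ["NA" if v is None else str(v) for v in values]
    width = max((len(s) for s in rendered), default=0)
    lines = ["["]
    for i, s in enumerate(rendered):
        # Trailing comma on every entry except the last.
        suffix = "," if i < len(rendered) - 1 else ""
        lines.append("  " + s.rjust(width) + suffix)
    lines.append("]")
    return "\n".join(lines)
```

With `[1, None]` this pads `1` to the width of `NA`, aligning the digits on the right edge as the issue requests.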
[jira] [Updated] (ARROW-2873) [Python] Micro-optimize scalar value instantiation
[ https://issues.apache.org/jira/browse/ARROW-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2873: Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] Micro-optimize scalar value instantiation > -- > > Key: ARROW-2873 > URL: https://issues.apache.org/jira/browse/ARROW-2873 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Krisztian Szucs >Priority: Minor > Fix For: 0.15.0 > > > This led to a 20% time increase in __getitem__: > https://pandas.pydata.org/speed/arrow/#array_ops.ScalarAccess.time_getitem > See conversation: > https://github.com/apache/arrow/commit/dc80a768c0a15e62998ccd32d8353d2035302cb6#r29746119 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2628) [Python] parquet.write_to_dataset is memory-hungry on large DataFrames
[ https://issues.apache.org/jira/browse/ARROW-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2628: Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] parquet.write_to_dataset is memory-hungry on large DataFrames > -- > > Key: ARROW-2628 > URL: https://issues.apache.org/jira/browse/ARROW-2628 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > > See discussion in https://github.com/apache/arrow/issues/1749. We should > consider strategies for writing very large tables to a partitioned directory > scheme. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2625) [Python] Serialize timedelta64 values from pandas to Arrow interval types
[ https://issues.apache.org/jira/browse/ARROW-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2625: Fix Version/s: (was: 0.14.0) > [Python] Serialize timedelta64 values from pandas to Arrow interval types > - > > Key: ARROW-2625 > URL: https://issues.apache.org/jira/browse/ARROW-2625 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > > This work is blocked on ARROW-835 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2621) [Python/CI] Use pep8speaks for Python PRs
[ https://issues.apache.org/jira/browse/ARROW-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2621: Fix Version/s: (was: 0.14.0) > [Python/CI] Use pep8speaks for Python PRs > - > > Key: ARROW-2621 > URL: https://issues.apache.org/jira/browse/ARROW-2621 > Project: Apache Arrow > Issue Type: Task > Components: Continuous Integration, Python >Reporter: Uwe L. Korn >Priority: Major > Labels: beginner > > It would be nice if we would get automated comments by > [https://pep8speaks.com/] on the Python PRs. This should be much more > readable than the current `flake8` output in the Travis logs. This issue is > split up into two tasks: > * Create an issue with INFRA kindly asking them for activating pep8speaks > for Arrow > * Setup {{.pep8speaks.yml}} to align with our {{flake8}} config. For > reference, see Pandas' config: > [https://github.com/pandas-dev/pandas/blob/master/.pep8speaks.yml] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1299) [Doc] Publish nightly documentation against master somewhere
[ https://issues.apache.org/jira/browse/ARROW-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1299: Fix Version/s: (was: 0.14.0) > [Doc] Publish nightly documentation against master somewhere > > > Key: ARROW-1299 > URL: https://issues.apache.org/jira/browse/ARROW-1299 > Project: Apache Arrow > Issue Type: Task > Components: Documentation >Reporter: Wes McKinney >Priority: Major > > This will help catch problems with the generated documentation prior to > release time, and also allow users to read the latest prose documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1489) [C++] Add casting option to set unsafe casts to null rather than some garbage value
[ https://issues.apache.org/jira/browse/ARROW-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1489: Fix Version/s: (was: 0.14.0) > [C++] Add casting option to set unsafe casts to null rather than some garbage > value > --- > > Key: ARROW-1489 > URL: https://issues.apache.org/jira/browse/ARROW-1489 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > > Null is the obvious choice when certain casts fail, like string to number, > but in other kinds of unsafe casts there may be more ambiguity. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5371) [Release] Add tests for dev/release/00-prepare.sh
[ https://issues.apache.org/jira/browse/ARROW-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-5371. - Resolution: Fixed Fix Version/s: 0.14.0 Issue resolved by pull request 4343 [https://github.com/apache/arrow/pull/4343] > [Release] Add tests for dev/release/00-prepare.sh > - > > Key: ARROW-5371 > URL: https://issues.apache.org/jira/browse/ARROW-5371 > Project: Apache Arrow > Issue Type: Test > Components: Packaging >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5398) [Python] Flight tests broken by URI changes
[ https://issues.apache.org/jira/browse/ARROW-5398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5398: Component/s: FlightRPC > [Python] Flight tests broken by URI changes > --- > > Key: ARROW-5398 > URL: https://issues.apache.org/jira/browse/ARROW-5398 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC, Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > The URI changes merged cleanly but they hadn't been rebased so this is > happening > https://travis-ci.org/apache/arrow/jobs/535981561#L5267 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5398) [Python] Flight tests broken by URI changes
[ https://issues.apache.org/jira/browse/ARROW-5398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5398: Issue Type: Bug (was: Improvement) > [Python] Flight tests broken by URI changes > --- > > Key: ARROW-5398 > URL: https://issues.apache.org/jira/browse/ARROW-5398 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > The URI changes merged cleanly but they hadn't been rebased so this is > happening > https://travis-ci.org/apache/arrow/jobs/535981561#L5267 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5398) [Python] Flight tests broken by Uri changes
Wes McKinney created ARROW-5398: --- Summary: [Python] Flight tests broken by Uri changes Key: ARROW-5398 URL: https://issues.apache.org/jira/browse/ARROW-5398 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.14.0 The URI changes merged cleanly but they hadn't been rebased so this is happening https://travis-ci.org/apache/arrow/jobs/535981561#L5267 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5398) [Python] Flight tests broken by URI changes
[ https://issues.apache.org/jira/browse/ARROW-5398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5398: Summary: [Python] Flight tests broken by URI changes (was: [Python] Flight tests broken by Uri changes) > [Python] Flight tests broken by URI changes > --- > > Key: ARROW-5398 > URL: https://issues.apache.org/jira/browse/ARROW-5398 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > The URI changes merged cleanly but they hadn't been rebased so this is > happening > https://travis-ci.org/apache/arrow/jobs/535981561#L5267 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5396) [JS] Ensure reader and writer support files and streams with no RecordBatches
[ https://issues.apache.org/jira/browse/ARROW-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5396: -- Labels: pull-request-available (was: ) > [JS] Ensure reader and writer support files and streams with no RecordBatches > - > > Key: ARROW-5396 > URL: https://issues.apache.org/jira/browse/ARROW-5396 > Project: Apache Arrow > Issue Type: New Feature > Components: JavaScript >Affects Versions: 0.13.0 >Reporter: Paul Taylor >Assignee: Paul Taylor >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > Re: https://issues.apache.org/jira/browse/ARROW-2119 and > [https://github.com/apache/arrow/pull/3871], the JS reader and writer should > support files and streams with a Schema but no RecordBatches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2703) [C++] Always use statically-linked Boost with private namespace
[ https://issues.apache.org/jira/browse/ARROW-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2703: Fix Version/s: (was: 0.14.0) > [C++] Always use statically-linked Boost with private namespace > --- > > Key: ARROW-2703 > URL: https://issues.apache.org/jira/browse/ARROW-2703 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > We have recently added tooling to ship Python wheels with a bundled, private > Boost (using the bcp tool). We might consider statically-linking a private > Boost exclusively in libarrow (i.e. built via our thirdparty toolchain) to > avoid any conflicts with other libraries that may use a different version of > Boost -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2703) [C++] Always use statically-linked Boost with private namespace
[ https://issues.apache.org/jira/browse/ARROW-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846159#comment-16846159 ] Wes McKinney commented on ARROW-2703: - Removing this from any milestone as it bears more discussion and is not urgent > [C++] Always use statically-linked Boost with private namespace > --- > > Key: ARROW-2703 > URL: https://issues.apache.org/jira/browse/ARROW-2703 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > We have recently added tooling to ship Python wheels with a bundled, private > Boost (using the bcp tool). We might consider statically-linking a private > Boost exclusively in libarrow (i.e. built via our thirdparty toolchain) to > avoid any conflicts with other libraries that may use a different version of > Boost -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2217) [C++] Add option to use dynamic linking for compression library dependencies
[ https://issues.apache.org/jira/browse/ARROW-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2217. - Resolution: Fixed It seems that this is resolved now so long as the dynamic libraries are available {code} $ ldd ~/local/lib/libarrow.so linux-vdso.so.1 (0x7ffd1b748000) libbrotlienc.so.1 => /usr/lib/x86_64-linux-gnu/libbrotlienc.so.1 (0x7fed95782000) libbrotlidec.so.1 => /usr/lib/x86_64-linux-gnu/libbrotlidec.so.1 (0x7fed95773000) libglog.so.0 => /home/wesm/cpp-runtime-toolchain/lib/libglog.so.0 (0x7fed9573f000) libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7fed95739000) libbz2.so.1.0 => /home/wesm/cpp-runtime-toolchain/lib/libbz2.so.1.0 (0x7fed95725000) liblz4.so.1 => /home/wesm/cpp-runtime-toolchain/lib/liblz4.so.1 (0x7fed95515000) libsnappy.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libsnappy.so.1 (0x7fed95508000) libz.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libz.so.1 (0x7fed954ee000) libzstd.so.1.3.8 => /home/wesm/cpp-runtime-toolchain/lib/libzstd.so.1.3.8 (0x7fed9544) libboost_system.so.1.68.0 => /home/wesm/cpp-runtime-toolchain/lib/libboost_system.so.1.68.0 (0x7fed95439000) libboost_filesystem.so.1.68.0 => /home/wesm/cpp-runtime-toolchain/lib/libboost_filesystem.so.1.68.0 (0x7fed9541b000) libboost_regex.so.1.68.0 => /home/wesm/cpp-runtime-toolchain/lib/libboost_regex.so.1.68.0 (0x7fed95312000) libstdc++.so.6 => /home/wesm/cpp-runtime-toolchain/lib/libstdc++.so.6 (0x7fed951ce000) libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fed9508) libgcc_s.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libgcc_s.so.1 (0x7fed9506c000) libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7fed9504b000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fed94e6) /lib64/ld-linux-x86-64.so.2 (0x7fed96a89000) libbrotlicommon.so.1 => /usr/lib/x86_64-linux-gnu/libbrotlicommon.so.1 (0x7fed94e3d000) librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7fed94e3) libicudata.so.58 => 
/home/wesm/cpp-runtime-toolchain/lib/./libicudata.so.58 (0x7fed9352e000) libicui18n.so.58 => /home/wesm/cpp-runtime-toolchain/lib/./libicui18n.so.58 (0x7fed932af000) libicuuc.so.58 => /home/wesm/cpp-runtime-toolchain/lib/./libicuuc.so.58 (0x7fed930fc000) {code} > [C++] Add option to use dynamic linking for compression library dependencies > > > Key: ARROW-2217 > URL: https://issues.apache.org/jira/browse/ARROW-2217 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Fix For: 0.14.0 > > > See discussion in https://github.com/apache/arrow/issues/1661 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2221) [C++] Nightly build with "infer" tool
[ https://issues.apache.org/jira/browse/ARROW-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2221: Fix Version/s: (was: 0.14.0) > [C++] Nightly build with "infer" tool > - > > Key: ARROW-2221 > URL: https://issues.apache.org/jira/browse/ARROW-2221 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > > As a follow-up to ARROW-1626, we ought to periodically look at the output of > the "infer" tool to fix issues as they come up. This is probably too > heavyweight to run in each CI build > cc [~renesugar] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2967) [Python] Add option to treat invalid PyObject* values as null in pyarrow.array
[ https://issues.apache.org/jira/browse/ARROW-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2967: Fix Version/s: (was: 0.14.0) > [Python] Add option to treat invalid PyObject* values as null in pyarrow.array > -- > > Key: ARROW-2967 > URL: https://issues.apache.org/jira/browse/ARROW-2967 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > > See discussion in ARROW-2966 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2461) [Python] Build wheels for manylinux2010 tag
[ https://issues.apache.org/jira/browse/ARROW-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2461: Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] Build wheels for manylinux2010 tag > --- > > Key: ARROW-2461 > URL: https://issues.apache.org/jira/browse/ARROW-2461 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Uwe L. Korn >Priority: Blocker > Fix For: 0.15.0 > > > There is now work in progress on an updated manylinux tag based on CentOS6. > We should provide wheels for this tag and the old {{manylinux1}} tag for one > release and then switch to the new tag in the release afterwards. This should > enable us also to raise the minimum compiler requirement to gcc 4.9 (or > higher once conda-forge has migrated to a newer compiler). > The relevant PEP is https://www.python.org/dev/peps/pep-0571/ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5254) [Flight][Java] DoAction does not support result streams
[ https://issues.apache.org/jira/browse/ARROW-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-5254. - Resolution: Fixed Issue resolved by pull request 4250 [https://github.com/apache/arrow/pull/4250] > [Flight][Java] DoAction does not support result streams > --- > > Key: ARROW-5254 > URL: https://issues.apache.org/jira/browse/ARROW-5254 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC, Java >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: flight, pull-request-available > Fix For: 0.14.0 > > Time Spent: 50m > Remaining Estimate: 0h > > While Flight defines DoAction as returning a stream of results, the Java APIs > only allow returning a single result. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5395) Utilize stream EOS in File format
[ https://issues.apache.org/jira/browse/ARROW-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5395: -- Labels: pull-request-available (was: ) > Utilize stream EOS in File format > - > > Key: ARROW-5395 > URL: https://issues.apache.org/jira/browse/ARROW-5395 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: John Muehlhausen >Priority: Minor > Labels: pull-request-available > Original Estimate: 0.25h > Remaining Estimate: 0.25h > > We currently do not write EOS at the end of a Message stream inside the File > format. As a result, the file cannot be parsed sequentially. This change > prepares for other implementations or future reference features that parse a > File sequentially... i.e. without access to seek(). > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5239) Add support for interval types in javascript
[ https://issues.apache.org/jira/browse/ARROW-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846130#comment-16846130 ] Micah Kornfield commented on ARROW-5239: It sounds like you just need to add duration, and make sure that you can remove: https://github.com/apache/arrow/blob/master/integration/integration_test.py#L1109 > Add support for interval types in javascript > > > Key: ARROW-5239 > URL: https://issues.apache.org/jira/browse/ARROW-5239 > Project: Apache Arrow > Issue Type: New Feature > Components: JavaScript >Reporter: Micah Kornfield >Priority: Major > > Update integration_test.py to include interval tests for JSTest once this is > done. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5397) Test Flight TLS support
David Li created ARROW-5397: --- Summary: Test Flight TLS support Key: ARROW-5397 URL: https://issues.apache.org/jira/browse/ARROW-5397 Project: Apache Arrow Issue Type: Test Components: FlightRPC Reporter: David Li TLS support is not tested in Flight. We need to generate certificates/keys and provide them to the language-specific test runners. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5239) Add support for interval types in javascript
[ https://issues.apache.org/jira/browse/ARROW-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846108#comment-16846108 ] Paul Taylor commented on ARROW-5239: We have the Interval year_month and day_time types in JS, but I'm not sure if this issue is about a new kind of Interval DataType. [~emkornfi...@gmail.com], any thoughts? > Add support for interval types in javascript > > > Key: ARROW-5239 > URL: https://issues.apache.org/jira/browse/ARROW-5239 > Project: Apache Arrow > Issue Type: New Feature > Components: JavaScript >Reporter: Micah Kornfield >Priority: Major > > Update integration_test.py to include interval tests for JSTest once this is > done. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5318) [Python] pyarrow hdfs reader overrequests
[ https://issues.apache.org/jira/browse/ARROW-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846106#comment-16846106 ] Ivan Dimitrov commented on ARROW-5318: -- Ticket Resolution: The culprit was caching. The driver has functionality for pread that isn't exposed in the pyarrow api. The solution is to add a method for NativeFile to read_at at a specific offset. The function will make an underlying pread call that does not cache. > [Python] pyarrow hdfs reader overrequests > --- > > Key: ARROW-5318 > URL: https://issues.apache.org/jira/browse/ARROW-5318 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 >Reporter: Ivan Dimitrov >Priority: Blocker > > I am using pyarrow's HdfsFilesystem interface. When I call a read on n bytes, > I often get 0%-300% more data sent over the network. My suspicion is that > pyarrow is reading ahead. > The pyarrow parquet reader doesn't have this behavior, and I am looking for a > way to turn off read ahead for the general HDFS interface. > I am running on ubuntu 14.04. This issue is present in pyarrow 0.10 - 0.13 > (newest released version). I am on python 2.7 > I have been using wireshark to track the packets passed on the network. > I suspect it is read ahead since the time for the 1st read is much greater > than the time for 2nd read. > > The regular pyarrow reader > {code:java} > import pyarrow as pa > fs = pa.hdfs.connect(hostname, driver='libhdfs') > file_path = 'dataset/train/piece' > f = fs.open(file_path) > f.seek(0) > n_bytes = 300 > f.read(n_bytes) > {code} > > Parquet code without the same issue > {code:java} > parquet_file = 'dataset/train/parquet/part-22e3' > pf = fs.open(parquet_path) > pqf = pa.parquet.ParquetFile(pf) > data = pqf.read_row_group(0, columns=['col_name']) > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
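The resolution described in the comment (a positional read that bypasses the caching/read-ahead path) maps onto POSIX `pread`. A sketch of the idea against a local file; the actual fix would live in pyarrow's NativeFile against the libhdfs driver, so this is only an illustration of the mechanism:

```python
import os


def read_at(path, offset, nbytes):
    """Read nbytes at offset via pread(2). pread does not move the
    file position and, in drivers that expose it (such as libhdfs),
    avoids the sequential read-ahead cache that inflates the amount
    of data sent over the network."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, nbytes, offset)
    finally:
        os.close(fd)
```

This is why the parquet reader in the report did not exhibit the problem: row-group reads are positional, while the generic `f.seek(0); f.read(n)` path goes through the driver's sequential read-ahead.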
[jira] [Resolved] (ARROW-4651) [Format] Flight Location should be more flexible than a (host, port) pair
[ https://issues.apache.org/jira/browse/ARROW-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-4651. - Resolution: Fixed Issue resolved by pull request 4047 [https://github.com/apache/arrow/pull/4047] > [Format] Flight Location should be more flexible than a (host, port) pair > - > > Key: ARROW-4651 > URL: https://issues.apache.org/jira/browse/ARROW-4651 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC, Format >Affects Versions: 0.12.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 11h 10m > Remaining Estimate: 0h > > The more future-proof solution is probably to define a URI format. gRPC > already has something like that, though we might want to define our own > format: > https://grpc.io/grpc/cpp/md_doc_naming.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-4651) [Format] Flight Location should be more flexible than a (host, port) pair
[ https://issues.apache.org/jira/browse/ARROW-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-4651: --- Assignee: David Li > [Format] Flight Location should be more flexible than a (host, port) pair > - > > Key: ARROW-4651 > URL: https://issues.apache.org/jira/browse/ARROW-4651 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC, Format >Affects Versions: 0.12.0 >Reporter: Antoine Pitrou >Assignee: David Li >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 11h 10m > Remaining Estimate: 0h > > The more future-proof solution is probably to define a URI format. gRPC > already has something like that, though we might want to define our own > format: > https://grpc.io/grpc/cpp/md_doc_naming.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5396) [JS] Ensure reader and writer support files and streams with no RecordBatches
Paul Taylor created ARROW-5396: -- Summary: [JS] Ensure reader and writer support files and streams with no RecordBatches Key: ARROW-5396 URL: https://issues.apache.org/jira/browse/ARROW-5396 Project: Apache Arrow Issue Type: New Feature Components: JavaScript Affects Versions: 0.13.0 Reporter: Paul Taylor Assignee: Paul Taylor Fix For: 0.14.0 Re: https://issues.apache.org/jira/browse/ARROW-2119 and [https://github.com/apache/arrow/pull/3871], the JS reader and writer should support files and streams with a Schema but no RecordBatches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5395) Utilize stream EOS in File format
[ https://issues.apache.org/jira/browse/ARROW-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846074#comment-16846074 ] John Muehlhausen commented on ARROW-5395: - https://github.com/apache/arrow/pull/4372 > Utilize stream EOS in File format > - > > Key: ARROW-5395 > URL: https://issues.apache.org/jira/browse/ARROW-5395 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Documentation >Reporter: John Muehlhausen >Priority: Minor > Original Estimate: 0.25h > Remaining Estimate: 0.25h > > We currently do not write EOS at the end of a Message stream inside the File > format. As a result, the file cannot be parsed sequentially. This change > prepares for other implementations or future reference features that parse a > File sequentially... i.e. without access to seek(). > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5395) Utilize stream EOS in File format
John Muehlhausen created ARROW-5395: --- Summary: Utilize stream EOS in File format Key: ARROW-5395 URL: https://issues.apache.org/jira/browse/ARROW-5395 Project: Apache Arrow Issue Type: Improvement Components: C++, Documentation Reporter: John Muehlhausen We currently do not write EOS at the end of a Message stream inside the File format. As a result, the file cannot be parsed sequentially. This change prepares for other implementations or future reference features that parse a File sequentially... i.e. without access to seek(). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5394) [C++] Benchmarks for IsIn Kernel
Preeti Suman created ARROW-5394: --- Summary: [C++] Benchmarks for IsIn Kernel Key: ARROW-5394 URL: https://issues.apache.org/jira/browse/ARROW-5394 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Preeti Suman Add benchmarks for IsIn kernel. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5392) [C++][CI][MinGW] Disable static library build on AppVeyor
[ https://issues.apache.org/jira/browse/ARROW-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-5392. - Resolution: Fixed Fix Version/s: 0.14.0 Issue resolved by pull request 4367 [https://github.com/apache/arrow/pull/4367] > [C++][CI][MinGW] Disable static library build on AppVeyor > - > > Key: ARROW-5392 > URL: https://issues.apache.org/jira/browse/ARROW-5392 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5156) [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with `'NoneType' object has no attribute '_isfilestore'`
[ https://issues.apache.org/jira/browse/ARROW-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845979#comment-16845979 ] Martin Durant commented on ARROW-5156: -- Happy to add `_isfilestore` to s3fs/fsspec - I assume it just should return True? > [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with > `'NoneType' object has no attribute '_isfilestore'` > --- > > Key: ARROW-5156 > URL: https://issues.apache.org/jira/browse/ARROW-5156 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.1 > Environment: Mac, Linux >Reporter: Victor Shih >Priority: Major > Labels: parquet > Fix For: 0.14.0 > > > According to > [https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#partitioning-parquet-files], > writing a parquet to S3 with `partition_cols` should work, but it fails for > me. Example script: > {code:java} > import pandas as pd > import sys > print(sys.version) > print(pd.__version__) > df = pd.DataFrame([{'a': 1, 'b': 2}]) > df.to_parquet('s3://my_s3_bucket/x.parquet', engine='pyarrow') > print('OK 1') > df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], > engine='pyarrow') > print('OK 2') > {code} > Output: > {noformat} > 3.5.2 (default, Feb 14 2019, 01:46:27) > [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] > 0.24.2 > OK 1 > Traceback (most recent call last): > File "./t.py", line 14, in > df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], > engine='pyarrow') > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/core/frame.py", > line 2203, in to_parquet > partition_cols=partition_cols, **kwargs) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py", > line 252, in to_parquet > partition_cols=partition_cols, **kwargs) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py", > line 118, in write > 
partition_cols=partition_cols, **kwargs) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py", > line 1227, in write_to_dataset > _mkdir_if_not_exists(fs, root_path) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py", > line 1182, in _mkdir_if_not_exists > if fs._isfilestore() and not fs.exists(path): > AttributeError: 'NoneType' object has no attribute '_isfilestore' > {noformat} > > Original issue - [https://github.com/apache/arrow/issues/4030] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
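The fix Martin Durant offers can be sketched without either library: the guard pyarrow applies in `_mkdir_if_not_exists` only needs the filesystem object to exist and to answer `_isfilestore()`. The class below is a hypothetical stand-in for an s3fs/fsspec filesystem; whether S3 should answer True or False is exactly the open question in the comment, and False (an object store has no directories to create, so mkdir is skipped) is assumed here.

```python
class S3LikeFileSystem:
    """Hypothetical stand-in for an s3fs/fsspec filesystem object."""

    def _isfilestore(self):
        # Assumption: object stores have no directory hierarchy, so the
        # caller should skip mkdir entirely rather than attempt it.
        return False

    def exists(self, path):
        return False


def mkdir_if_not_exists(fs, path):
    # Mirrors the shape of pyarrow.parquet._mkdir_if_not_exists from the
    # traceback above; the reported AttributeError fires here when the
    # filesystem resolves to None.
    if fs is None:
        raise AttributeError("'NoneType' object has no attribute '_isfilestore'")
    if fs._isfilestore() and not fs.exists(path):
        fs.mkdir(path)  # never reached for the object-store case


# With a real filesystem object supplied, the partitioned write can proceed.
mkdir_if_not_exists(S3LikeFileSystem(), "my_s3_bucket/x2")
```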
[jira] [Updated] (ARROW-465) [C++] Investigate usage of madvise
[ https://issues.apache.org/jira/browse/ARROW-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-465: --- Fix Version/s: (was: 0.14.0) > [C++] Investigate usage of madvise > --- > > Key: ARROW-465 > URL: https://issues.apache.org/jira/browse/ARROW-465 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Priority: Major > > In some usecases (e.g. Pandas->Arrow conversion) our main constraint is page > faulting not yet accessed pages. > With {{madvise}} we can indicate our planned actions to the OS and may > improve the performance a bit in these cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
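For illustration, CPython exposes the same hint interface on memory maps (Python 3.8+, POSIX only), so the idea can be sketched without Arrow: advise the kernel which pages will be needed before touching them, so the later copy does not stall on page faults.

```python
import mmap
import os
import tempfile

# Write a few pages, map them read-only, then hint the kernel with
# MADV_WILLNEED before copying. mmap.madvise exists on Python >= 3.8 and only
# where madvise(2) is available, hence the feature checks.
size = mmap.PAGESIZE * 4
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * size)
os.close(fd)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED)  # hint: prefault these pages
    data = mm[:]  # this read should now hit already-resident pages
    mm.close()

os.unlink(path)
```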
[jira] [Updated] (ARROW-976) [Python] Provide API for defining and reading Parquet datasets with more ad hoc partition schemes
[ https://issues.apache.org/jira/browse/ARROW-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-976: --- Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] Provide API for defining and reading Parquet datasets with more ad > hoc partition schemes > - > > Key: ARROW-976 > URL: https://issues.apache.org/jira/browse/ARROW-976 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
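The gap here is that dataset discovery only understands hive-style key=value directories; an "ad hoc" scheme needs a user-supplied mapping from path segments to partition keys. A pure-Python sketch of both (the flat scheme and its key names are hypothetical examples):

```python
def parse_hive_partitions(path):
    # Hive-style layout: year=2019/month=05/part-0.parquet
    keys = {}
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep:
            keys[key] = value
    return keys


def parse_flat_partitions(path, names=("year", "month")):
    # Ad hoc layout: 2019/05/part-0.parquet -- the directory levels carry
    # only values, so the key names must be supplied by the caller.
    segments = [s for s in path.split("/") if s][:-1]
    return dict(zip(names, segments))


assert parse_hive_partitions("year=2019/month=05/part-0.parquet") == \
    {"year": "2019", "month": "05"}
assert parse_flat_partitions("2019/05/part-0.parquet") == \
    {"year": "2019", "month": "05"}
```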
[jira] [Updated] (ARROW-5156) [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with `'NoneType' object has no attribute '_isfilestore'`
[ https://issues.apache.org/jira/browse/ARROW-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5156: Summary: [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with `'NoneType' object has no attribute '_isfilestore'` (was: `df.to_parquet('s3://...', partition_cols=...)` fails with `'NoneType' object has no attribute '_isfilestore'`) > [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with > `'NoneType' object has no attribute '_isfilestore'` > --- > > Key: ARROW-5156 > URL: https://issues.apache.org/jira/browse/ARROW-5156 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.1 > Environment: Mac, Linux >Reporter: Victor Shih >Priority: Major > Labels: parquet > Fix For: 0.14.0 > > > According to > [https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#partitioning-parquet-files], > writing a parquet to S3 with `partition_cols` should work, but it fails for > me. Example script: > {code:java} > import pandas as pd > import sys > print(sys.version) > print(pd._version_) > df = pd.DataFrame([{'a': 1, 'b': 2}]) > df.to_parquet('s3://my_s3_bucket/x.parquet', engine='pyarrow') > print('OK 1') > df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], > engine='pyarrow') > print('OK 2') > {code} > Output: > {noformat} > 3.5.2 (default, Feb 14 2019, 01:46:27) > [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)] > 0.24.2 > OK 1 > Traceback (most recent call last): > File "./t.py", line 14, in > df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], > engine='pyarrow') > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/core/frame.py", > line 2203, in to_parquet > partition_cols=partition_cols, **kwargs) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py", > line 252, in to_parquet > partition_cols=partition_cols, **kwargs) > File > 
"/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py", > line 118, in write > partition_cols=partition_cols, **kwargs) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py", > line 1227, in write_to_dataset > _mkdir_if_not_exists(fs, root_path) > File > "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py", > line 1182, in _mkdir_if_not_exists > if fs._isfilestore() and not fs.exists(path): > AttributeError: 'NoneType' object has no attribute '_isfilestore' > {noformat} > > Original issue - [https://github.com/apache/arrow/issues/4030] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5279) [C++] Support reading delta dictionaries in IPC streams
[ https://issues.apache.org/jira/browse/ARROW-5279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845974#comment-16845974 ] Wes McKinney commented on ARROW-5279: - Don't think I can get to this for 0.14 > [C++] Support reading delta dictionaries in IPC streams > --- > > Key: ARROW-5279 > URL: https://issues.apache.org/jira/browse/ARROW-5279 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > This JIRA covers the read path for delta dictionaries. The write path is a > bit more of a can of worms (since the deltas must be computed) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5279) [C++] Support reading delta dictionaries in IPC streams
[ https://issues.apache.org/jira/browse/ARROW-5279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5279: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Support reading delta dictionaries in IPC streams > --- > > Key: ARROW-5279 > URL: https://issues.apache.org/jira/browse/ARROW-5279 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > This JIRA covers the read path for delta dictionaries. The write path is a > bit more of a can of worms (since the deltas must be computed) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
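The read-path semantics are small enough to sketch in plain Python: the IPC metadata marks a DictionaryBatch with an isDelta flag, and a delta batch appends to the existing dictionary for that id instead of replacing it, so indices already emitted stay valid. This is a sketch of the flag's semantics, not of Arrow's C++ reader.

```python
def apply_dictionary_batch(dictionaries, dict_id, values, is_delta):
    # Replacement batch: overwrite. Delta batch: append, preserving the
    # meaning of previously written indices.
    if is_delta and dict_id in dictionaries:
        dictionaries[dict_id] = dictionaries[dict_id] + list(values)
    else:
        dictionaries[dict_id] = list(values)


dicts = {}
apply_dictionary_batch(dicts, 0, ["a", "b"], is_delta=False)
apply_dictionary_batch(dicts, 0, ["c"], is_delta=True)
assert dicts[0] == ["a", "b", "c"]  # indices 0 and 1 still mean "a" and "b"
```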
[jira] [Commented] (ARROW-5128) [Packaging][CentOS][Conda] Numpy not found in nightly builds
[ https://issues.apache.org/jira/browse/ARROW-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845972#comment-16845972 ] Wes McKinney commented on ARROW-5128: - What is the status of this? > [Packaging][CentOS][Conda] Numpy not found in nightly builds > > > Key: ARROW-5128 > URL: https://issues.apache.org/jira/browse/ARROW-5128 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging >Reporter: Krisztian Szucs >Priority: Major > Fix For: 0.14.0 > > > In the last three days centos-7 and conda-win builds have been failing with > numpy not found > - https://travis-ci.org/kszucs/crossbow/builds/515638053 > - https://ci.appveyor.com/project/kszucs/crossbow/builds/23593736 > - https://ci.appveyor.com/project/kszucs/crossbow/builds/23563730 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5069) [C++] Implement direct support for shared memory arrow columns
[ https://issues.apache.org/jira/browse/ARROW-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845970#comment-16845970 ] Wes McKinney commented on ARROW-5069: - [~dimlek] It seems like you would need to draft a more detailed proposal document to go into detail about how things should ideally work. The Arrow data structures can reference the memory from any {{Buffer}} subclass, and we already have examples of referencing shared memory and GPU memory. So all of the machinery is built already. The question becomes what kind of API can yield shared-memory data structures. I'm interested to see what you propose > [C++] Implement direct support for shared memory arrow columns > -- > > Key: ARROW-5069 > URL: https://issues.apache.org/jira/browse/ARROW-5069 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ > Environment: Linux >Reporter: Dimitris Lekkas >Priority: Major > Labels: perfomance, proposal > > I consider the option of memory-mapping columns to shared memory to be > valuable. Such option will be triggered if specific metadata are supplied. > Given that many data frames backed by arrow are used for machine learning I > guess we could somehow benefit from treating differently the data (most > likely data buffer columns) that will be fed into the GPUs/FPGAs. To enable > such change we would need to address the following issues: > First, we need each column to hold an integer value representing its > associated file descriptor. The application developer could retrieve the > file-name from the file descriptor (i.e fstat syscall) and inform another > application to reference that file or inform an FPGA to DMA that memory-area. > We also need to support variable buffer alignment (restricted to powers-of-2 > of course) when initiating an arrow::AllocateBuffer() call. By inspecting > the current implementation, the alignment size is fixed at 64 bytes and to > change that value a recompilation is required [1]. 
> To justify the above suggestion, major FPGA vendors (i.e Xilinx) benefit > heavily from page-aligned buffers since their device memory is 4KB [2]. > Particularly, Xilinx warns users if they attempt to memcpy a non-page-aligned > buffer from CPU memory to FPGA's memory [3]. > Wouldn't it be nice if we could issue from_pandas() and then have our columns > memory mapped to shared memory for FPGAs to DMA such memory and accelerate > the workload? If there is already a workaround to achieve that I would like > more info on that. > I am open to discuss any suggestions, improvements or concerns. > > [1]: > [https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L40] > [2]: > [https://forums.xilinx.com/t5/SDAccel/memory-alignment-when-allocating-emmory-in-SDAccel/td-p/887593] > [3]: [https://forums.aws.amazon.com/thread.jspa?messageID=884615&tstart=0] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
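The zero-copy sharing the report asks for can be illustrated with the stdlib alone (Python 3.8+): a named shared-memory segment that a second handle, or in the proposal an FPGA DMA engine, attaches to by name. Arrow's route would be a Buffer subclass over the same mapping; this sketch stays at the OS level.

```python
from multiprocessing import shared_memory

# Create a named segment and write "column" bytes into it in place.
shm = shared_memory.SharedMemory(create=True, size=4096)
shm.buf[:4] = b"\x01\x02\x03\x04"

# A second handle attaches by name: a view of the same pages, no copy made.
other = shared_memory.SharedMemory(name=shm.name)
seen = bytes(other.buf[:4])

other.close()
shm.close()
shm.unlink()
```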
[jira] [Updated] (ARROW-5069) [C++] Implement direct support for shared memory arrow columns
[ https://issues.apache.org/jira/browse/ARROW-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5069: Fix Version/s: (was: 0.14.0) > [C++] Implement direct support for shared memory arrow columns > -- > > Key: ARROW-5069 > URL: https://issues.apache.org/jira/browse/ARROW-5069 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ > Environment: Linux >Reporter: Dimitris Lekkas >Priority: Major > Labels: perfomance, proposal > > I consider the option of memory-mapping columns to shared memory to be > valuable. Such option will be triggered if specific metadata are supplied. > Given that many data frames backed by arrow are used for machine learning I > guess we could somehow benefit from treating differently the data (most > likely data buffer columns) that will be fed into the GPUs/FPGAs. To enable > such change we would need to address the following issues: > First, we need each column to hold an integer value representing its > associated file descriptor. The application developer could retrieve the > file-name from the file descriptor (i.e fstat syscall) and inform another > application to reference that file or inform an FPGA to DMA that memory-area. > We also need to support variable buffer alignment (restricted to powers-of-2 > of course) when initiating an arrow::AllocateBuffer() call. By inspecting > the current implementation, the alignment size is fixed at 64 bytes and to > change that value a recompilation is required [1]. > To justify the above suggestion, major FPGA vendors (i.e Xilinx) benefit > heavily from page-aligned buffers since their device memory is 4KB [2]. > Particularly, Xilinx warns users if they attempt to memcpy a non-page-aligned > buffer from CPU memory to FPGA's memory [3]. > Wouldn't it be nice if we could issue from_pandas() and then have our columns > memory mapped to shared memory for FPGAs to DMA such memory and accelerate > the workload? 
If there is already a workaround to achieve that I would like > more info on that. > I am open to discuss any suggestions, improvements or concerns. > > [1]: > [https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L40] > [2]: > [https://forums.xilinx.com/t5/SDAccel/memory-alignment-when-allocating-emmory-in-SDAccel/td-p/887593] > [3]: [https://forums.aws.amazon.com/thread.jspa?messageID=884615&tstart=0] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5069) [C++] Implement direct support for shared memory arrow columns
[ https://issues.apache.org/jira/browse/ARROW-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5069: Summary: [C++] Implement direct support for shared memory arrow columns (was: Implement direct support for shared memory arrow columns) > [C++] Implement direct support for shared memory arrow columns > -- > > Key: ARROW-5069 > URL: https://issues.apache.org/jira/browse/ARROW-5069 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ > Environment: Linux >Reporter: Dimitris Lekkas >Priority: Major > Labels: perfomance, proposal > Fix For: 0.14.0 > > > I consider the option of memory-mapping columns to shared memory to be > valuable. Such option will be triggered if specific metadata are supplied. > Given that many data frames backed by arrow are used for machine learning I > guess we could somehow benefit from treating differently the data (most > likely data buffer columns) that will be fed into the GPUs/FPGAs. To enable > such change we would need to address the following issues: > First, we need each column to hold an integer value representing its > associated file descriptor. The application developer could retrieve the > file-name from the file descriptor (i.e fstat syscall) and inform another > application to reference that file or inform an FPGA to DMA that memory-area. > We also need to support variable buffer alignment (restricted to powers-of-2 > of course) when initiating an arrow::AllocateBuffer() call. By inspecting > the current implementation, the alignment size is fixed at 64 bytes and to > change that value a recompilation is required [1]. > To justify the above suggestion, major FPGA vendors (i.e Xilinx) benefit > heavily from page-aligned buffers since their device memory is 4KB [2]. > Particularly, Xilinx warns users if they attempt to memcpy a non-page-aligned > buffer from CPU memory to FPGA's memory [3]. 
> Wouldn't it be nice if we could issue from_pandas() and then have our columns > memory mapped to shared memory for FPGAs to DMA such memory and accelerate > the workload? If there is already a workaround to achieve that I would like > more info on that. > I am open to discuss any suggestions, improvements or concerns. > > [1]: > [https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L40] > [2]: > [https://forums.xilinx.com/t5/SDAccel/memory-alignment-when-allocating-emmory-in-SDAccel/td-p/887593] > [3]: [https://forums.aws.amazon.com/thread.jspa?messageID=884615&tstart=0] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5066) [Integration] Add flags to enable/disable implementations in integration/integration_test.py
[ https://issues.apache.org/jira/browse/ARROW-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-5066. - Resolution: Fixed Assignee: Wes McKinney I think the flags added in ARROW-3144 are sufficient > [Integration] Add flags to enable/disable implementations in > integration/integration_test.py > > > Key: ARROW-5066 > URL: https://issues.apache.org/jira/browse/ARROW-5066 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > This will make it easier to test pairwise binary protocol integration (e.g. > only C++ vs JS, or Java vs C++). Currently it's an all-or-nothing affair -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1894) [Python] Treat CPython memoryview or buffer objects equivalently to pyarrow.Buffer in pyarrow.serialize
[ https://issues.apache.org/jira/browse/ARROW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1894: Fix Version/s: (was: 0.14.0) > [Python] Treat CPython memoryview or buffer objects equivalently to > pyarrow.Buffer in pyarrow.serialize > --- > > Key: ARROW-1894 > URL: https://issues.apache.org/jira/browse/ARROW-1894 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > > These should be treated as Buffer-like on serialize. We should consider how > to "box" the buffers as the appropriate kind of object (Buffer, memoryview, > etc.) when being deserialized -- This message was sent by Atlassian JIRA (v7.6.3#76005)
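The equivalence asked for already exists at the C level: pyarrow.Buffer, memoryview, and other bytes-like objects all implement the buffer protocol, so a serializer can normalize them through one call. A stdlib-only sketch of that normalization:

```python
def as_buffer_view(obj):
    # Anything implementing the buffer protocol -- a pyarrow.Buffer, a
    # memoryview, bytes, bytearray -- flattens to the same zero-copy byte
    # view, which is what "treat equivalently on serialize" amounts to.
    return memoryview(obj).cast("B")


for candidate in (b"abc", bytearray(b"abc"), memoryview(b"abc")):
    view = as_buffer_view(candidate)
    assert view.nbytes == 3 and view.tobytes() == b"abc"
```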
[jira] [Closed] (ARROW-5052) [C++] Add an incomplete dictionary type
[ https://issues.apache.org/jira/browse/ARROW-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-5052. --- Resolution: Won't Fix Fix Version/s: (was: 0.14.0) Closing in favor of solution merged in ARROW-3144 > [C++] Add an incomplete dictionary type > --- > > Key: ARROW-5052 > URL: https://issues.apache.org/jira/browse/ARROW-5052 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.12.1 >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Time Spent: 1h 40m > Remaining Estimate: 0h > > This would allow to pass a {{DataType}} that means "dict-encoded data with > the given index types and value types, but the actual values are not yet > known" (they might be inferred by processing non-dict-encoded data, or they > might be transferred explicitly - but later - in the data stream). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1741) [C++] Comparison function for DictionaryArray to determine if indices are "compatible"
[ https://issues.apache.org/jira/browse/ARROW-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845961#comment-16845961 ] Wes McKinney commented on ARROW-1741: - Now that we have variable dictionaries in C++, having a function to determine if two DictionaryArrays can be compared without a unification step would be useful. > [C++] Comparison function for DictionaryArray to determine if indices are > "compatible" > -- > > Key: ARROW-1741 > URL: https://issues.apache.org/jira/browse/ARROW-1741 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > For example, if one array's dictionary is larger than the other, but the > overlapping beginning portion is the same, then the respective dictionary > indices correspond to the same values. Therefore, in analytics, one may > choose to drop the smaller dictionary in favor of the larger dictionary, and > this need not incur any computational overhead (beyond comparing the > dictionary prefixes -- there may be some way to engineer "dictionary lineage" > to make this comparison even cheaper) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
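The check described is just a prefix comparison; it can be sketched directly in Python, with list-valued dictionaries standing in for Arrow arrays:

```python
def indices_compatible(dict_a, dict_b):
    # Indices drawn against either dictionary mean the same values iff the
    # shorter dictionary is a prefix of the longer one; if so, the smaller
    # dictionary can be dropped without re-encoding anything.
    short, long_ = sorted((list(dict_a), list(dict_b)), key=len)
    return long_[: len(short)] == short


assert indices_compatible(["a", "b"], ["a", "b", "c"])      # drop the smaller
assert not indices_compatible(["a", "x"], ["a", "b", "c"])  # needs unification
```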
[jira] [Updated] (ARROW-1789) [Format] Consolidate specification documents and improve clarity for new implementation authors
[ https://issues.apache.org/jira/browse/ARROW-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1789: Fix Version/s: (was: 0.14.0) 1.0.0 > [Format] Consolidate specification documents and improve clarity for new > implementation authors > --- > > Key: ARROW-1789 > URL: https://issues.apache.org/jira/browse/ARROW-1789 > Project: Apache Arrow > Issue Type: Improvement > Components: Format >Reporter: Wes McKinney >Assignee: Micah Kornfield >Priority: Major > Fix For: 1.0.0 > > > See discussion in https://github.com/apache/arrow/issues/1296 > I believe the specification documents Layout.md, Metadata.md, and IPC.md > would benefit from being consolidated into a single Markdown document that > would be sufficient (along with the Flatbuffers schemas) to create a complete > Arrow implementation capable of reading and writing the binary format -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1786) [Format] List expected on-wire buffer layouts for each kind of Arrow physical type in specification
[ https://issues.apache.org/jira/browse/ARROW-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1786: Fix Version/s: (was: 0.14.0) 1.0.0 > [Format] List expected on-wire buffer layouts for each kind of Arrow physical > type in specification > --- > > Key: ARROW-1786 > URL: https://issues.apache.org/jira/browse/ARROW-1786 > Project: Apache Arrow > Issue Type: Improvement > Components: Format >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: columnar-format-1.0 > Fix For: 1.0.0 > > > see ARROW-1693, ARROW-1785 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1700) [JS] Implement Node.js client for Plasma store
[ https://issues.apache.org/jira/browse/ARROW-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1700: Fix Version/s: (was: 0.14.0) > [JS] Implement Node.js client for Plasma store > -- > > Key: ARROW-1700 > URL: https://issues.apache.org/jira/browse/ARROW-1700 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ - Plasma, JavaScript >Reporter: Robert Nishihara >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1599) [C++][Parquet] Unable to read Parquet files with list inside struct
[ https://issues.apache.org/jira/browse/ARROW-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1599: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++][Parquet] Unable to read Parquet files with list inside struct > --- > > Key: ARROW-1599 > URL: https://issues.apache.org/jira/browse/ARROW-1599 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.7.0 > Environment: Ubuntu >Reporter: Jovann Kung >Assignee: Joshua Storck >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > > Is PyArrow currently unable to read in Parquet files with a vector as a > column? For example, the schema of such a file is below: > {{ > mbc: FLOAT > deltae: FLOAT > labels: FLOAT > features.type: INT32 INT_8 > features.size: INT32 > features.indices.list.element: INT32 > features.values.list.element: DOUBLE}} > Using either pq.read_table() or pq.ParquetDataset('/path/to/parquet').read() > yields the following error: ArrowNotImplementedError: Currently only nesting > with Lists is supported. > From the error I assume that this may be implemented in further releases? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels
[ https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1644: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++][Parquet] Read and write nested Parquet data with a mix of struct and > list nesting levels > -- > > Key: ARROW-1644 > URL: https://issues.apache.org/jira/browse/ARROW-1644 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Affects Versions: 0.8.0 >Reporter: DB Tsai >Assignee: Joshua Storck >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.15.0 > > > We have many nested parquet files generated from Apache Spark for ranking > problems, and we would like to load them in python for other programs to > consume. > The schema looks like > {code:java} > root > |-- profile_id: long (nullable = true) > |-- country_iso_code: string (nullable = true) > |-- items: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- show_title_id: integer (nullable = true) > |||-- duration: double (nullable = true) > {code} > And when I tried to load it with nightly build pyarrow on Oct 4, 2017, I got > the following error. > {code:python} > Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) > [GCC 7.2.0] on linux > Type "help", "copyright", "credits" or "license" for more information. 
> >>> import numpy as np > >>> import pandas as pd > >>> import pyarrow as pa > >>> import pyarrow.parquet as pq > >>> table2 = pq.read_table('part-0') > Traceback (most recent call last): > File "", line 1, in > File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", > line 823, in read_table > use_pandas_metadata=use_pandas_metadata) > File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", > line 119, in read > nthreads=nthreads) > File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all > File "error.pxi", line 85, in pyarrow.lib.check_status > pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported. > {code} > I somehow get the impression that after > https://issues.apache.org/jira/browse/PARQUET-911 is merged, we should be > able to load the nested parquet in pyarrow. > Any insight about this? > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1682) [Python] Add documentation / example for reading a directory of Parquet files on S3
[ https://issues.apache.org/jira/browse/ARROW-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1682: Fix Version/s: (was: 0.14.0) > [Python] Add documentation / example for reading a directory of Parquet files > on S3 > --- > > Key: ARROW-1682 > URL: https://issues.apache.org/jira/browse/ARROW-1682 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: filesystem, parquet > > Opened based on comment > https://github.com/apache/arrow/pull/916#issuecomment-337563492 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
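The missing documentation example is short; here is a sketch of the pattern from that era (the bucket path is hypothetical, and since it needs s3fs, pyarrow, and credentials, it is wrapped in a function rather than executed):

```python
def read_parquet_dir_from_s3(path="my-bucket/dataset-dir"):
    # Hypothetical bucket path; credentials come from the usual boto sources.
    import s3fs
    import pyarrow.parquet as pq

    fs = s3fs.S3FileSystem()
    dataset = pq.ParquetDataset(path, filesystem=fs)
    return dataset.read()  # an Arrow Table spanning every file found
```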
[jira] [Commented] (ARROW-1621) [JAVA] Reduce Heap Usage per Vector
[ https://issues.apache.org/jira/browse/ARROW-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845956#comment-16845956 ] Wes McKinney commented on ARROW-1621: - [~siddteotia] ? > [JAVA] Reduce Heap Usage per Vector > --- > > Key: ARROW-1621 > URL: https://issues.apache.org/jira/browse/ARROW-1621 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Siddharth Teotia >Assignee: Siddharth Teotia >Priority: Major > Fix For: 0.14.0 > > > https://docs.google.com/document/d/1MU-ah_bBHIxXNrd7SkwewGCOOexkXJ7cgKaCis5f-PI/edit -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1570) [C++] Define API for creating a kernel instance from function of scalar input and output with a particular signature
[ https://issues.apache.org/jira/browse/ARROW-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1570: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Define API for creating a kernel instance from function of scalar input > and output with a particular signature > > > Key: ARROW-1570 > URL: https://issues.apache.org/jira/browse/ARROW-1570 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: Analytics > Fix For: 0.15.0 > > > This could include an {{std::function}} instance (but these cannot be inlined > by the C++ compiler), but should also permit use with inline-able functions > or functors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
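Language aside (the ticket is about C++ and inlining), the shape of the API can be sketched in Python: a factory that lifts a scalar-valued function into an array kernel with null propagation.

```python
def make_unary_kernel(scalar_fn):
    # Lift scalar -> scalar into array -> array, skipping nulls. The C++
    # design would accept a functor or inlineable function here, so the
    # compiler can avoid paying an indirect call per value.
    def kernel(values):
        return [None if v is None else scalar_fn(v) for v in values]
    return kernel


double = make_unary_kernel(lambda x: 2 * x)
assert double([1, None, 3]) == [2, None, 6]
```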
[jira] [Updated] (ARROW-1470) [C++] Add BufferAllocator abstract interface
[ https://issues.apache.org/jira/browse/ARROW-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1470: Fix Version/s: (was: 0.14.0) > [C++] Add BufferAllocator abstract interface > > > Key: ARROW-1470 > URL: https://issues.apache.org/jira/browse/ARROW-1470 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > > There are some situations ({{arrow::ipc::SerializeRecordBatch}}) where we pass > a {{MemoryPool*}} solely to call {{AllocateBuffer}} using it. This is not as > flexible as it could be, since there are situations where we may wish to > allocate from shared memory instead. > So instead: > {code} > Func(..., BufferAllocator* allocator, ...) { > ... > std::shared_ptr<Buffer> buffer; > RETURN_NOT_OK(allocator->Allocate(nbytes, &buffer)); > ... > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1324) [C++] Support ARROW_BOOST_VENDORED on Windows / MSVC
[ https://issues.apache.org/jira/browse/ARROW-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1324: Fix Version/s: (was: 0.14.0) > [C++] Support ARROW_BOOST_VENDORED on Windows / MSVC > > > Key: ARROW-1324 > URL: https://issues.apache.org/jira/browse/ARROW-1324 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: windows > > Follow up to ARROW-1303 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1382) [Python] Deduplicate non-scalar Python objects when using pyarrow.serialize
[ https://issues.apache.org/jira/browse/ARROW-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1382: Fix Version/s: (was: 0.14.0) > [Python] Deduplicate non-scalar Python objects when using pyarrow.serialize > --- > > Key: ARROW-1382 > URL: https://issues.apache.org/jira/browse/ARROW-1382 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Robert Nishihara >Priority: Major > Labels: pull-request-available > Time Spent: 4h 20m > Remaining Estimate: 0h > > If a Python object appears multiple times within a list/tuple/dictionary, > then when pyarrow serializes the object, it will duplicate the object many > times. This leads to a potentially huge expansion in the size of the object > (e.g., the serialized version of {{100 * [np.zeros(10 ** 6)]}} will be 100 > times bigger than it needs to be). > {code} > import pyarrow as pa > l = [0] > original_object = [l, l] > # Serialize and deserialize the object. > buf = pa.serialize(original_object).to_buffer() > new_object = pa.deserialize(buf) > # This works. > assert original_object[0] is original_object[1] > # This fails. > assert new_object[0] is new_object[1] > {code} > One potential way to address this is to use the Arrow dictionary encoding. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
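The fix the report asks for is the memo-table idea pickle uses: record each container the first time it is seen and emit a back-reference on repeats. A self-contained sketch of the encoding side:

```python
def encode(obj, memo=None):
    # First sighting of a list gets an id and is encoded in full; repeats
    # collapse to ("ref", id), so 100 * [big] costs one copy plus 99 refs.
    memo = {} if memo is None else memo
    if isinstance(obj, list):
        if id(obj) in memo:
            return ("ref", memo[id(obj)])
        memo[id(obj)] = len(memo)
        return ("list", memo[id(obj)], [encode(x, memo) for x in obj])
    return ("val", obj)


l = [0]
tagged = encode([l, l])
first, second = tagged[2]
assert first[0] == "list" and second == ("ref", first[1])
```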
[jira] [Updated] (ARROW-1266) [Plasma] Move heap allocations to arrow memory pool
[ https://issues.apache.org/jira/browse/ARROW-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1266: Fix Version/s: (was: 0.14.0) > [Plasma] Move heap allocations to arrow memory pool > --- > > Key: ARROW-1266 > URL: https://issues.apache.org/jira/browse/ARROW-1266 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Plasma >Reporter: Philipp Moritz >Priority: Major > > At the moment we are allocating memory with std::vectors and even new in some > places, this should be cleaned up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1271) [Packaging] Build scripts for creating nightly conda-forge-compatible package builds
[ https://issues.apache.org/jira/browse/ARROW-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1271: Fix Version/s: (was: 0.14.0) 0.15.0 > [Packaging] Build scripts for creating nightly conda-forge-compatible package > builds > > > Key: ARROW-1271 > URL: https://issues.apache.org/jira/browse/ARROW-1271 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Wes McKinney >Assignee: Phillip Cloud >Priority: Major > Fix For: 0.15.0 > > > cc [~cpcloud] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
[ https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845947#comment-16845947 ] Martin Durant commented on ARROW-5349: -- > in which this would be wrong if it is inside the file itself Agreed, the path would be wrong. Even in the simpler case, above, you could say it was wrong based on the thrift template - and this could make sense, as it maybe implies opening a new file. > [Python/C++] Provide a way to specify the file path in parquet > ColumnChunkMetaData > -- > > Key: ARROW-5349 > URL: https://issues.apache.org/jira/browse/ARROW-5349 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.14.0 > > Time Spent: 2h > Remaining Estimate: 0h > > After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now > possible to collect the file metadata while writing different files (then how > to write those metadata was not yet addressed -> original issue ARROW-1983). > However, currently, the {{file_path}} information in the ColumnChunkMetaData > object is not set. This is, I think, expected / correct for the metadata as > included within the single file; but for using the metadata in the combined > dataset `_metadata`, it needs a file path set. > So if you want to use this metadata for a partitioned dataset, there needs to > be a way to specify this file path. > Ideas I am thinking of currently: either, we could specify a file path to be > used when writing, or expose the `set_file_path` method on the Python side so > you can create an updated version of the metadata after collecting it. > cc [~pearu] [~mdurant] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5218) [C++] Improve build when third-party library locations are specified
[ https://issues.apache.org/jira/browse/ARROW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-5218. - Resolution: Fixed Fix Version/s: 0.14.0 Issue resolved by pull request 4207 [https://github.com/apache/arrow/pull/4207] > [C++] Improve build when third-party library locations are specified > - > > Key: ARROW-5218 > URL: https://issues.apache.org/jira/browse/ARROW-5218 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Deepak Majeti >Assignee: Deepak Majeti >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > The current CMake build system does not handle user specified third-party > library locations well. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1119) [Python/C++] Implement NativeFile interfaces for Amazon S3
[ https://issues.apache.org/jira/browse/ARROW-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1119: Fix Version/s: (was: 0.14.0) 0.15.0 > [Python/C++] Implement NativeFile interfaces for Amazon S3 > -- > > Key: ARROW-1119 > URL: https://issues.apache.org/jira/browse/ARROW-1119 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > Fix For: 0.15.0 > > > While we support HDFS and the local file system now, it would be nice to also > support S3 and eventually other cloud storage natively -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage
[ https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1231: Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Add filesystem / IO implementation for Google Cloud Storage > - > > Key: ARROW-1231 > URL: https://issues.apache.org/jira/browse/ARROW-1231 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > Fix For: 0.15.0 > > > See example jumping off point > https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1089) [C++/Python] Add API to write an Arrow stream into either the stream or file formats on disk
[ https://issues.apache.org/jira/browse/ARROW-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845934#comment-16845934 ] Wes McKinney commented on ARROW-1089: - cc [~npr] > [C++/Python] Add API to write an Arrow stream into either the stream or file > formats on disk > > > Key: ARROW-1089 > URL: https://issues.apache.org/jira/browse/ARROW-1089 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > For Arrow streams with unknown size, it would be useful to be able to write > the data to disk either as a stream or as the file format (for random access) > with minimal overhead; i.e. we would avoid record batch IPC loading and write > the raw messages directly to disk -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1013) [C++] Add asynchronous StreamWriter
[ https://issues.apache.org/jira/browse/ARROW-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1013: Fix Version/s: (was: 0.14.0) > [C++] Add asynchronous StreamWriter > --- > > Key: ARROW-1013 > URL: https://issues.apache.org/jira/browse/ARROW-1013 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > > We may want to provide an option to limit the queuing depth. The async writer > can be initialized from a synchronous writer -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-974) [Website] Add Use Cases section to the website
[ https://issues.apache.org/jira/browse/ARROW-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-974: --- Fix Version/s: (was: 0.14.0) 0.15.0 > [Website] Add Use Cases section to the website > -- > > Key: ARROW-974 > URL: https://issues.apache.org/jira/browse/ARROW-974 > Project: Apache Arrow > Issue Type: New Feature > Components: Website >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > This will contain a list of "canonical use cases" for Arrow: > * In-memory data structure for vectorized analytics / SIMD, or creating a > column-oriented analytic database system > * Reading and writing columnar storage formats like Apache Parquet > * Faster alternative to Thrift, Protobuf, or Avro in RPC > * Shared memory IPC (zero-copy in-situ analytics) > Any other ideas? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1042) [Python] C++ API plumbing for returning generic instance of ipc::RecordBatchReader to user
[ https://issues.apache.org/jira/browse/ARROW-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1042: Fix Version/s: (was: 0.14.0) > [Python] C++ API plumbing for returning generic instance of > ipc::RecordBatchReader to user > -- > > Key: ARROW-1042 > URL: https://issues.apache.org/jira/browse/ARROW-1042 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Wes McKinney >Priority: Major > > Currently we have no mechanism of wrapping a > {{std::shared_ptr}} like we do with some other > Arrow types -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1009) [C++] Create asynchronous version of StreamReader
[ https://issues.apache.org/jira/browse/ARROW-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1009: Fix Version/s: (was: 0.14.0) > [C++] Create asynchronous version of StreamReader > - > > Key: ARROW-1009 > URL: https://issues.apache.org/jira/browse/ARROW-1009 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > > the {{AsyncStreamReader}} would buffer the next record batch in a background > thread, while emulating the current synchronous / blocking API -- This message was sent by Atlassian JIRA (v7.6.3#76005)
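The AsyncStreamReader behavior described in ARROW-1009 above can be sketched with the stdlib (the helper name is hypothetical): a background thread buffers upcoming batches in a bounded queue while the consumer keeps a plain blocking iterator, which also covers the queuing-depth limit mentioned for ARROW-1013:

```python
import queue
import threading

def async_batches(sync_iter, depth=4):
    # read-ahead: a background thread fills a bounded queue while the
    # caller consumes through an ordinary (blocking) iterator
    q = queue.Queue(maxsize=depth)   # `depth` bounds the queuing depth
    END = object()                   # sentinel marking end of stream

    def worker():
        for batch in sync_iter:
            q.put(batch)             # blocks when the queue is full
        q.put(END)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is END:
            return
        yield item
```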
[jira] [Updated] (ARROW-973) [Website] Add FAQ page about project
[ https://issues.apache.org/jira/browse/ARROW-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-973: --- Fix Version/s: (was: 0.14.0) 0.15.0 > [Website] Add FAQ page about project > > > Key: ARROW-973 > URL: https://issues.apache.org/jira/browse/ARROW-973 > Project: Apache Arrow > Issue Type: New Feature > Components: Website >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > As some suggested initial topics for the FAQ: > * How Apache Arrow is related to Apache Parquet (the difference between a > "storage format" and an "in-memory format" causes confusion) > * How is Arrow similar to / different from Flatbuffers and Cap'n Proto > * How Arrow uses Flatbuffers (I have had people incorrectly state to me > things like "Arrow is just Flatbuffers under the hood") > Any other ideas? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-823) [Python] Devise a means to serialize arrays of arbitrary Python objects in Arrow IPC messages
[ https://issues.apache.org/jira/browse/ARROW-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-823: --- Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] Devise a means to serialize arrays of arbitrary Python objects in > Arrow IPC messages > - > > Key: ARROW-823 > URL: https://issues.apache.org/jira/browse/ARROW-823 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.15.0 > > > Practically speaking, this would involve a "custom" logical type that is > "pyobject", represented physically as an array of 64-bit pointers. On > serialization, this would need to be converted to a BinaryArray containing > pickled objects as binary values > At the moment, we don't yet have the machinery to deal with "custom" types > where the in-memory representation is different from the on-wire > representation. This would be a useful use case to work through the design > issues > Interestingly, if done properly, this would enable other Arrow > implementations to manipulate (filter, etc.) serialized Python objects as > binary blobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
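The serialization path sketched in ARROW-823 above (a pyobject array converted on the wire to a BinaryArray of pickled payloads) can be illustrated with the stdlib alone:

```python
import pickle

objs = [{"a": 1}, (2, 3), "text"]        # arbitrary Python objects

# on the wire: a BinaryArray-like list of pickled blobs, which other
# Arrow implementations could filter as opaque binary values
binary_values = [pickle.dumps(o) for o in objs]

# back in Python: unpickle each blob to recover the objects
restored = [pickle.loads(b) for b in binary_values]
assert restored == objs
```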
[jira] [Updated] (ARROW-971) [C++/Python] Implement Array.isvalid/notnull/isnull
[ https://issues.apache.org/jira/browse/ARROW-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-971: --- Labels: dataframe (was: pull-request-available) > [C++/Python] Implement Array.isvalid/notnull/isnull > --- > > Key: ARROW-971 > URL: https://issues.apache.org/jira/browse/ARROW-971 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: dataframe > Fix For: 0.14.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > For arrays with nulls, this amounts to returning the validity bitmap. Without > nulls, an array of all 1 bits must be constructed. For isnull, the bits must > be flipped (in this case, the un-set part of the new bitmap must stay 0, > though). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
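The bit-flipping described in ARROW-971 above, including the requirement that the unused tail of the new bitmap stay zero, can be sketched in pure Python (the function name is illustrative):

```python
def is_null_bitmap(validity: bytes, length: int) -> bytes:
    # flip every bit: valid (1) becomes null-flag 0 and vice versa
    out = bytearray(b ^ 0xFF for b in validity)
    # clear the padding bits beyond `length` so they remain 0
    tail = length % 8
    if tail:
        out[-1] &= (1 << tail) - 1
    return bytes(out)

# validity 0b101 over 3 slots: slots 0 and 2 valid, slot 1 null
assert is_null_bitmap(bytes([0b00000101]), 3) == bytes([0b00000010])
```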
[jira] [Commented] (ARROW-721) [Java] Read and write record batches to shared memory
[ https://issues.apache.org/jira/browse/ARROW-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845929#comment-16845929 ] Wes McKinney commented on ARROW-721: [~siddteotia] is this something of interest for the next release to validate the new Java capabilities? > [Java] Read and write record batches to shared memory > - > > Key: ARROW-721 > URL: https://issues.apache.org/jira/browse/ARROW-721 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > It would be useful for a Java application to be able to read a record batch > as a set of memory mapped byte buffers given a file name and a memory address > for the metadata. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-653) [Python / C++] Add debugging function to print an array's buffer contents in hexadecimal
[ https://issues.apache.org/jira/browse/ARROW-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-653: -- Assignee: Anatoly Myachev > [Python / C++] Add debugging function to print an array's buffer contents in > hexadecimal > > > Key: ARROW-653 > URL: https://issues.apache.org/jira/browse/ARROW-653 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Wes McKinney >Assignee: Anatoly Myachev >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > This would help with debugging and illustrating the Arrow internals -- This message was sent by Atlassian JIRA (v7.6.3#76005)
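A minimal stdlib version of the kind of hex-dump debugging helper requested in ARROW-653 above (the name and row format are illustrative, not the API that was merged):

```python
def hex_dump(buf: bytes, width: int = 8) -> str:
    # group the buffer into `width`-byte rows of two-digit hex values
    lines = []
    for i in range(0, len(buf), width):
        row = buf[i:i + width]
        lines.append(" ".join(f"{b:02x}" for b in row))
    return "\n".join(lines)
```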
[jira] [Updated] (ARROW-517) [C++] Verbose Array::Equals
[ https://issues.apache.org/jira/browse/ARROW-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-517: --- Fix Version/s: (was: 0.14.0) > [C++] Verbose Array::Equals > --- > > Key: ARROW-517 > URL: https://issues.apache.org/jira/browse/ARROW-517 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Uwe L. Korn >Assignee: Benjamin Kietzman >Priority: Major > > In failing unit tests I often wished {{Array::Equals}} would tell me where > they aren't equal. This would save a lot of time in debugging. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
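The diagnostic behavior wished for in ARROW-517 above can be sketched in a few lines (pure Python, hypothetical helper): report where two sequences first diverge instead of returning a bare False:

```python
def verbose_equals(a, b):
    # like the proposed verbose Array::Equals: on mismatch, say where
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return False, f"index {i}: {x!r} != {y!r}"
    if len(a) != len(b):
        return False, f"length {len(a)} != {len(b)}"
    return True, ""
```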
[jira] [Updated] (ARROW-488) [Python] Implement conversion between integer coded as floating points with NaN to an Arrow integer type
[ https://issues.apache.org/jira/browse/ARROW-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-488: --- Fix Version/s: (was: 0.14.0) 0.15.0 > [Python] Implement conversion between integer coded as floating points with > NaN to an Arrow integer type > > > Key: ARROW-488 > URL: https://issues.apache.org/jira/browse/ARROW-488 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: Analytics > Fix For: 0.15.0 > > > For example: if pandas has casted integer data to float, this would enable > the integer data to be recovered (so long as the values fall in the ~2^53 > floating point range for exact integer representation) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
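A stdlib-only sketch of the recovery step described in ARROW-488 above (the function name is hypothetical): NaN slots become nulls in a validity mask, and the remaining values cast back exactly as long as they stay within the ~2**53 exact-integer range of doubles:

```python
import math

def floats_to_ints(values):
    # values: floats in which NaN marks a null integer slot
    out, validity = [], []
    for v in values:
        if math.isnan(v):
            out.append(0)        # placeholder under the null slot
            validity.append(False)
        else:
            out.append(int(v))   # exact while abs(v) < 2**53
            validity.append(True)
    return out, validity

assert floats_to_ints([1.0, float("nan"), 3.0]) == ([1, 0, 3], [True, False, True])
```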
[jira] [Updated] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-555: --- Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] String algorithm library for StringArray/BinaryArray > -- > > Key: ARROW-555 > URL: https://issues.apache.org/jira/browse/ARROW-555 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: Analytics > Fix For: 0.15.0 > > > This is a parent JIRA for starting a module for processing strings in-memory > arranged in Arrow format. This will include using the re2 C++ regular > expression library and other standard string manipulations (such as those > found on Python's string objects) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-501) [C++] Implement concurrent / buffering InputStream for streaming data use cases
[ https://issues.apache.org/jira/browse/ARROW-501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-501: --- Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Implement concurrent / buffering InputStream for streaming data use > cases > --- > > Key: ARROW-501 > URL: https://issues.apache.org/jira/browse/ARROW-501 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: csv, filesystem, pull-request-available > Fix For: 0.15.0 > > Time Spent: 5.5h > Remaining Estimate: 0h > > Related to ARROW-500, when processing an input data stream, we may wish to > continue buffering input (up to an maximum buffer size) in between > synchronous Read calls -- This message was sent by Atlassian JIRA (v7.6.3#76005)
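The stdlib's io.BufferedReader shows the shape of the InputStream proposed in ARROW-501 above: it fills an internal buffer, up to a maximum size, so small synchronous Read calls are served from memory rather than from the underlying stream:

```python
import io

raw = io.BytesIO(b"x" * 100)                       # stand-in for a slow stream
buffered = io.BufferedReader(raw, buffer_size=32)  # reads ahead up to 32 bytes

# small reads are served from the internal buffer between refills
assert buffered.read(10) == b"x" * 10
assert buffered.read(90) == b"x" * 90
```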
[jira] [Updated] (ARROW-453) [C++] Add file interface implementations for Amazon S3
[ https://issues.apache.org/jira/browse/ARROW-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-453: --- Fix Version/s: (was: 0.14.0) 0.15.0 > [C++] Add file interface implementations for Amazon S3 > -- > > Key: ARROW-453 > URL: https://issues.apache.org/jira/browse/ARROW-453 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: filesystem > Fix For: 0.15.0 > > > The BSD-licensed C++ code in SFrame > (https://github.com/turi-code/SFrame/tree/master/oss_src/fileio) may provide > some inspiration. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (ARROW-258) [Format] clarify definition of Buffer in context of RPC, IPC, File
[ https://issues.apache.org/jira/browse/ARROW-258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-258. -- Resolution: Won't Fix The "page" field was removed from Buffer in the 0.8.0 release > [Format] clarify definition of Buffer in context of RPC, IPC, File > -- > > Key: ARROW-258 > URL: https://issues.apache.org/jira/browse/ARROW-258 > Project: Apache Arrow > Issue Type: Improvement > Components: Format >Reporter: Julien Le Dem >Priority: Major > Fix For: 0.14.0 > > > currently Buffer has a loosely defined page field used for shared memory only. > https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L109 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-473) [C++/Python] Add public API for retrieving block locations for a particular HDFS file
[ https://issues.apache.org/jira/browse/ARROW-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-473: --- Fix Version/s: (was: 0.14.0) > [C++/Python] Add public API for retrieving block locations for a particular > HDFS file > - > > Key: ARROW-473 > URL: https://issues.apache.org/jira/browse/ARROW-473 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Wes McKinney >Priority: Major > Labels: filesystem, hdfs, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > This is necessary for applications looking to schedule data-local work. > libhdfs does not have APIs to request the block locations directly, so we > need to see if the {{hdfsGetHosts}} function will do what we need. For > libhdfs3 there is a public API function -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
[ https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845920#comment-16845920 ] Martin Durant commented on ARROW-5349: -- It depends on what is passed back to the caller: just the metadata object, or some indication of which file it went into (sorry, I don't know the API that's being built exactly). If the caller defines which file to write to, it would seem reasonable to let it set this attribute on the metadata object before writing to `_metadata`. However, that might be muddied if partitioning is also happening upon write and you end up with multiple files for each piece. > [Python/C++] Provide a way to specify the file path in parquet > ColumnChunkMetaData > -- > > Key: ARROW-5349 > URL: https://issues.apache.org/jira/browse/ARROW-5349 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.14.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now > possible to collect the file metadata while writing different files (then how > to write those metadata was not yet addressed -> original issue ARROW-1983). > However, currently, the {{file_path}} information in the ColumnChunkMetaData > object is not set. This is, I think, expected / correct for the metadata as > included within the single file; but for using the metadata in the combined > dataset `_metadata`, it needs a file path set. > So if you want to use this metadata for a partitioned dataset, there needs to > be a way to specify this file path. > Ideas I am thinking of currently: either, we could specify a file path to be > used when writing, or expose the `set_file_path` method on the Python side so > you can create an updated version of the metadata after collecting it. 
> cc [~pearu] [~mdurant] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5393) [R] Add tests and example for read_parquet()
[ https://issues.apache.org/jira/browse/ARROW-5393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5393: -- Labels: parquet pull-request-available (was: parquet) > [R] Add tests and example for read_parquet() > > > Key: ARROW-5393 > URL: https://issues.apache.org/jira/browse/ARROW-5393 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: parquet, pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5393) [R] Add tests and example for read_parquet()
Neal Richardson created ARROW-5393: -- Summary: [R] Add tests and example for read_parquet() Key: ARROW-5393 URL: https://issues.apache.org/jira/browse/ARROW-5393 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-412) [Format] Handling of buffer padding in the IPC metadata
[ https://issues.apache.org/jira/browse/ARROW-412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-412: - Labels: pull-request-available (was: ) > [Format] Handling of buffer padding in the IPC metadata > --- > > Key: ARROW-412 > URL: https://issues.apache.org/jira/browse/ARROW-412 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > See discussion in ARROW-399. Do we include padding bytes in the metadata or > set the actual used bytes? In the latter case, the padding would be a part of > the format (any buffers continue to be expected to be 64-byte padded, to > permit AVX512 instructions) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
[ https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845908#comment-16845908 ] Rick Zamora edited comment on ARROW-5349 at 5/22/19 2:17 PM: - Okay - the file path should not be set in the footer metadata (only in _metadata). Does this mean that a mechanism for setting the file_path in C++ is completely unnecessary? My understanding is that the motivation for this issue was to populate the file_path for the following step of writing the metadata file. Is it sufficient to add a python-only mechanism to set the path? Or should we leave it up to the user to modify the metadata object themselves? was (Author: rjzamora): Okay - the file path should not be set in the footer metadata (only in _metadata). Does this mean that a mechanism for setting the file_path in C++ is completely unnecessary? My understanding is that the motivation for this issue was to populate the file_path for the following step of writing the metadata file. Is it sufficient to add a python-only mechanism to set the path? > [Python/C++] Provide a way to specify the file path in parquet > ColumnChunkMetaData > -- > > Key: ARROW-5349 > URL: https://issues.apache.org/jira/browse/ARROW-5349 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.14.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now > possible to collect the file metadata while writing different files (then how > to write those metadata was not yet addressed -> original issue ARROW-1983). > However, currently, the {{file_path}} information in the ColumnChunkMetaData > object is not set. 
This is, I think, expected / correct for the metadata as > included within the single file; but for using the metadata in the combined > dataset `_metadata`, it needs a file path set. > So if you want to use this metadata for a partitioned dataset, there needs to > be a way to specify this file path. > Ideas I am thinking of currently: either, we could specify a file path to be > used when writing, or expose the `set_file_path` method on the Python side so > you can create an updated version of the metadata after collecting it. > cc [~pearu] [~mdurant] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
[ https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845908#comment-16845908 ] Rick Zamora commented on ARROW-5349: Okay - the file path should not be set in the footer metadata (only in _metadata). Does this mean that a mechanism for setting the file_path in C++ is completely unnecessary? My understanding is that the motivation for this issue was to populate the file_path for the following step of writing the metadata file. Is it sufficient to add a python-only mechanism to set the path? > [Python/C++] Provide a way to specify the file path in parquet > ColumnChunkMetaData > -- > > Key: ARROW-5349 > URL: https://issues.apache.org/jira/browse/ARROW-5349 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.14.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now > possible to collect the file metadata while writing different files (then how > to write those metadata was not yet addressed -> original issue ARROW-1983). > However, currently, the {{file_path}} information in the ColumnChunkMetaData > object is not set. This is, I think, expected / correct for the metadata as > included within the single file; but for using the metadata in the combined > dataset `_metadata`, it needs a file path set. > So if you want to use this metadata for a partitioned dataset, there needs to > be a way to specify this file path. > Ideas I am thinking of currently: either, we could specify a file path to be > used when writing, or expose the `set_file_path` method on the Python side so > you can create an updated version of the metadata after collecting it. > cc [~pearu] [~mdurant] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
[ https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845886#comment-16845886 ] Martin Durant commented on ARROW-5349: -- > I think it's acceptable to set the path in the file's internal metadata. A library loading that data file in isolation can (and maybe *should*) be confused by this, though. Maybe that would not be typical operation, but we shouldn't preclude it. > [Python/C++] Provide a way to specify the file path in parquet > ColumnChunkMetaData > -- > > Key: ARROW-5349 > URL: https://issues.apache.org/jira/browse/ARROW-5349 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.14.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now > possible to collect the file metadata while writing different files (then how > to write those metadata was not yet addressed -> original issue ARROW-1983). > However, currently, the {{file_path}} information in the ColumnChunkMetaData > object is not set. This is, I think, expected / correct for the metadata as > included within the single file; but for using the metadata in the combined > dataset `_metadata`, it needs a file path set. > So if you want to use this metadata for a partitioned dataset, there needs to > be a way to specify this file path. > Ideas I am thinking of currently: either, we could specify a file path to be > used when writing, or expose the `set_file_path` method on the Python side so > you can create an updated version of the metadata after collecting it. > cc [~pearu] [~mdurant] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
[ https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845885#comment-16845885 ] Martin Durant commented on ARROW-5349: -- Agreed on that last point, to let the caller set the path - if only because this is basically what fastparquet does. > [Python/C++] Provide a way to specify the file path in parquet > ColumnChunkMetaData > -- > > Key: ARROW-5349 > URL: https://issues.apache.org/jira/browse/ARROW-5349 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.14.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now > possible to collect the file metadata while writing different files (then how > to write those metadata was not yet addressed -> original issue ARROW-1983). > However, currently, the {{file_path}} information in the ColumnChunkMetaData > object is not set. This is, I think, expected / correct for the metadata as > included within the single file; but for using the metadata in the combined > dataset `_metadata`, it needs a file path set. > So if you want to use this metadata for a partitioned dataset, there needs to > be a way to specify this file path. > Ideas I am thinking of currently: either, we could specify a file path to be > used when writing, or expose the `set_file_path` method on the Python side so > you can create an updated version of the metadata after collecting it. > cc [~pearu] [~mdurant] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
[ https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845883#comment-16845883 ] Joris Van den Bossche commented on ARROW-5349: -- Given that, for the API it might make more sense to add a way to set the file path directly on the metadata object, instead of passing it to {{ParquetFileWriter}}. That way, as a user of this API in Python, you can set the path yourself on the metadata object returned by {{pq.ParquetWriter}} (which is appended to the metadata_collector).
[jira] [Commented] (ARROW-3822) [C++] parquet::arrow::FileReader::GetRecordBatchReader has logical error on row groups with chunked columns
[ https://issues.apache.org/jira/browse/ARROW-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845880#comment-16845880 ] Benjamin Kietzman commented on ARROW-3822: -- [~wesmckinn] is this still an issue? > [C++] parquet::arrow::FileReader::GetRecordBatchReader has logical error on > row groups with chunked columns > --- > > Key: ARROW-3822 > URL: https://issues.apache.org/jira/browse/ARROW-3822 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Benjamin Kietzman >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 40m > Remaining Estimate: 0h > > If a BinaryArray / StringArray overflows a single column when reading a row > group, the resulting table will have a ChunkedArray. Using TableBatchReader > in > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L176 > will therefore only return a part of the row group, discarding the rest -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
[ https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845879#comment-16845879 ] Joris Van den Bossche commented on ARROW-5349: -- Thanks. It is actually also quite clear in the Thrift file description of {{file_path}}: "File where column data is stored. If not set, assumed to be same file as metadata. This path is relative to the current file."
[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
[ https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845871#comment-16845871 ] Martin Durant commented on ARROW-5349: -- No, I don't have an explicit reference for this, and I believe I got the original model from Spark (i.e., presumably the same as Hive), which I suppose would make it "common" by itself. I think it's the only thing that makes sense, since each data file should be readable in isolation, and there would be no way of knowing it was part of a collection and that the paths should therefore be ignored. At a guess, the design of the standard may have foreseen data-only chunk files, without footer information.
[jira] [Commented] (ARROW-5381) Crash at arrow::internal::CountSetBits
[ https://issues.apache.org/jira/browse/ARROW-5381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845741#comment-16845741 ] Tham commented on ARROW-5381: - Thanks for the quick response. I'll send this to my customer and ask them to run it. The response won't be as fast as yours :) > Crash at arrow::internal::CountSetBits > -- > > Key: ARROW-5381 > URL: https://issues.apache.org/jira/browse/ARROW-5381 > Project: Apache Arrow > Issue Type: Bug > Environment: Operating System: Windows 7 Professional 64-bit (6.1, > Build 7601) Service Pack 1(7601.win7sp1_ldr_escrow.181110-1429) > Language: English (Regional Setting: English) > System Manufacturer: SAMSUNG ELECTRONICS CO., LTD. > System Model: RV420/RV520/RV720/E3530/S3530/E3420/E3520 > BIOS: Phoenix SecureCore-Tiano(tm) NB Version 2.1 05PQ > Processor: Intel(R) Pentium(R) CPU B950 @ 2.10GHz (2 CPUs), ~2.1GHz > Memory: 2048MB RAM > Available OS Memory: 1962MB RAM > Page File: 1517MB used, 2405MB available > Windows Dir: C:\Windows > DirectX Version: DirectX 11 >Reporter: Tham >Priority: Major > > I've got a lot of crash dumps from a customer's Windows machine. The > stack trace shows that it crashed at arrow::internal::CountSetBits.
> > {code:java} > STACK_TEXT: > 00c9`5354a4c0 7ff7`2f2830fd : 00c9`544841c0 ` > `1e00 ` : > CortexService!arrow::internal::CountSetBits+0x16d > 00c9`5354a550 7ff7`2f2834b7 : 00c9`5337c930 ` > ` ` : > CortexService!arrow::ArrayData::GetNullCount+0x8d > 00c9`5354a580 7ff7`2f13df55 : 00c9`54476080 00c9`5354a5d8 > ` ` : > CortexService!arrow::Array::null_count+0x37 > 00c9`5354a5b0 7ff7`2f13fb68 : 00c9`5354ab40 00c9`5354a6f8 > 00c9`54476080 ` : > CortexService!parquet::arrow::`anonymous > namespace'::LevelBuilder::Visit >+0xa5 > 00c9`5354a640 7ff7`2f12fa34 : 00c9`5354a6f8 00c9`54476080 > 00c9`5354ab40 ` : > CortexService!arrow::VisitArrayInline namespace'::LevelBuilder>+0x298 > 00c9`5354a680 7ff7`2f14bf03 : 00c9`5354ab40 00c9`5354a6f8 > 00c9`54476080 ` : > CortexService!parquet::arrow::`anonymous > namespace'::LevelBuilder::VisitInline+0x44 > 00c9`5354a6c0 7ff7`2f12fe2a : 00c9`5354ab40 00c9`5354ae18 > 00c9`54476080 00c9`5354b208 : > CortexService!parquet::arrow::`anonymous > namespace'::LevelBuilder::GenerateLevels+0x93 > 00c9`5354aa00 7ff7`2f14de56 : 00c9`5354b1f8 00c9`5354afc8 > 00c9`54476080 `1e00 : > CortexService!parquet::arrow::`anonymous > namespace'::ArrowColumnWriter::Write+0x25a > 00c9`5354af20 7ff7`2f14e66b : 00c9`5354b1f8 00c9`5354b238 > 00c9`54445c20 ` : > CortexService!parquet::arrow::`anonymous > namespace'::ArrowColumnWriter::Write+0x2a6 > 00c9`5354b040 7ff7`2f12f137 : 00c9`544041f0 00c9`5354b4d8 > 00c9`5354b4a8 ` : > CortexService!parquet::arrow::FileWriter::Impl::WriteColumnChunk+0x70b > 00c9`5354b400 7ff7`2f14b4d5 : 00c9`54431180 00c9`5354b4d8 > 00c9`5354b4a8 ` : > CortexService!parquet::arrow::FileWriter::WriteColumnChunk+0x67 > 00c9`5354b450 7ff7`2f12eef1 : 00c9`5354b5d8 00c9`5354b648 > ` `1e00 : > CortexService!::operator()+0x195 > 00c9`5354b530 7ff7`2eb8e31e : 00c9`54431180 00c9`5354b760 > 00c9`54442fb0 `1e00 : > CortexService!parquet::arrow::FileWriter::WriteTable+0x521 > 00c9`5354b730 7ff7`2eb58ac5 : 00c9`5307bd88 00c9`54442fb0 > ` ` : > 
CortexService!Cortex::Storage::ParquetStreamWriter::writeRowGroup+0xfe > 00c9`5354b860 7ff7`2eafdce6 : 00c9`5307bd80 00c9`5354ba08 > 00c9`5354b9e0 00c9`5354b9d8 : > CortexService!Cortex::Storage::ParquetFileWriter::writeRowGroup+0x545 > 00c9`5354b9a0 7ff7`2eaf8bae : 00c9`53275600 00c9`53077220 > `fffe ` : > CortexService!Cortex::Storage::DataStreamWriteWorker::onNewData+0x1a6 > {code} > {code:java} > FAILED_INSTRUCTION_ADDRESS: > CortexService!arrow::internal::CountSetBits+16d > [c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc > @ 99] > 7ff7`2f3a4e4d f3480fb800 popcnt rax,qword ptr [rax] > FOLLOWUP_IP: > CortexService!arrow::internal::CountSetBits+16d > [c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc > @ 99] > 7ff7`2f3a4e4d f3480fb800 popcnt rax,qword ptr [rax] > FAULTING_SOU
[jira] [Resolved] (ARROW-5389) [C++] Add an internal temporary directory API
[ https://issues.apache.org/jira/browse/ARROW-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-5389. --- Resolution: Fixed Issue resolved by pull request 4364 [https://github.com/apache/arrow/pull/4364] > [C++] Add an internal temporary directory API > - > > Key: ARROW-5389 > URL: https://issues.apache.org/jira/browse/ARROW-5389 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > This is needed to easily write tests involving filesystem operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4800) [C++] Create/port a StatusOr implementation to be able to return a status or a type
[ https://issues.apache.org/jira/browse/ARROW-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845686#comment-16845686 ] Antoine Pitrou commented on ARROW-4800: --- I would rather call this {{Result<>}}. Ideally we would rewrite all Status-returning APIs to return a {{Result<>}} instead. Of course, that's probably out of the question (both because it breaks compatibility and because of the huge hassle of refactoring all Arrow code).
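To make the discussion concrete, here is a deliberately bare-bones sketch of what such a {{Result<>}} / StatusOr wrapper could look like. The {{Status}} type, class names, and the {{ParseNonNegative}} example are simplified stand-ins invented for illustration, not Arrow's actual classes:

```cpp
#include <cassert>
#include <string>
#include <utility>

// Simplified stand-in for a Status type (not Arrow's real arrow::Status).
struct Status {
  bool ok_;
  std::string message;
  static Status OK() { return {true, ""}; }
  static Status Invalid(std::string msg) { return {false, std::move(msg)}; }
  bool ok() const { return ok_; }
};

// Minimal Result<T>: holds either a value or an error Status.
template <typename T>
class Result {
 public:
  Result(T value) : status_(Status::OK()), value_(std::move(value)) {}
  Result(Status s) : status_(std::move(s)) {}
  bool ok() const { return status_.ok(); }
  const Status& status() const { return status_; }
  const T& ValueOrDie() const {
    assert(ok());  // a real implementation would abort with the message
    return value_;
  }

 private:
  Status status_;
  T value_{};
};

// Hypothetical example: return a value or an error in one object,
// instead of the Status-plus-out-parameter convention.
Result<int> ParseNonNegative(const std::string& s) {
  if (s.empty() || s.find_first_not_of("0123456789") != std::string::npos) {
    return Status::Invalid("not a non-negative integer: " + s);
  }
  return std::stoi(s);
}
```

Because a `Result<T>`-returning signature always differs from the corresponding `Status`-returning one, the two styles can indeed coexist during an incremental migration, as noted in the comments above.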
[jira] [Commented] (ARROW-5381) Crash at arrow::internal::CountSetBits
[ https://issues.apache.org/jira/browse/ARROW-5381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845678#comment-16845678 ] Antoine Pitrou commented on ARROW-5381: --- Can you download and run this program: https://docs.microsoft.com/en-us/sysinternals/downloads/coreinfo Among its output will be a line saying "Supports POPCNT instruction". It will tell you whether the CPU supports the required instruction. > Crash at arrow::internal::CountSetBits > -- > > Key: ARROW-5381 > URL: https://issues.apache.org/jira/browse/ARROW-5381 > Project: Apache Arrow > Issue Type: Bug > Environment: Operating System: Windows 7 Professional 64-bit (6.1, > Build 7601) Service Pack 1(7601.win7sp1_ldr_escrow.181110-1429) > Language: English (Regional Setting: English) > System Manufacturer: SAMSUNG ELECTRONICS CO., LTD. > System Model: RV420/RV520/RV720/E3530/S3530/E3420/E3520 > BIOS: Phoenix SecureCore-Tiano(tm) NB Version 2.1 05PQ > Processor: Intel(R) Pentium(R) CPU B950 @ 2.10GHz (2 CPUs), ~2.1GHz > Memory: 2048MB RAM > Available OS Memory: 1962MB RAM > Page File: 1517MB used, 2405MB available > Windows Dir: C:\Windows > DirectX Version: DirectX 11 >Reporter: Tham >Priority: Major > > I've got a lot of crash dump from a customer's windows machine. The > stacktrace shows that it crashed at arrow::internal::CountSetBits. 
> > {code:java} > STACK_TEXT: > 00c9`5354a4c0 7ff7`2f2830fd : 00c9`544841c0 ` > `1e00 ` : > CortexService!arrow::internal::CountSetBits+0x16d > 00c9`5354a550 7ff7`2f2834b7 : 00c9`5337c930 ` > ` ` : > CortexService!arrow::ArrayData::GetNullCount+0x8d > 00c9`5354a580 7ff7`2f13df55 : 00c9`54476080 00c9`5354a5d8 > ` ` : > CortexService!arrow::Array::null_count+0x37 > 00c9`5354a5b0 7ff7`2f13fb68 : 00c9`5354ab40 00c9`5354a6f8 > 00c9`54476080 ` : > CortexService!parquet::arrow::`anonymous > namespace'::LevelBuilder::Visit >+0xa5 > 00c9`5354a640 7ff7`2f12fa34 : 00c9`5354a6f8 00c9`54476080 > 00c9`5354ab40 ` : > CortexService!arrow::VisitArrayInline namespace'::LevelBuilder>+0x298 > 00c9`5354a680 7ff7`2f14bf03 : 00c9`5354ab40 00c9`5354a6f8 > 00c9`54476080 ` : > CortexService!parquet::arrow::`anonymous > namespace'::LevelBuilder::VisitInline+0x44 > 00c9`5354a6c0 7ff7`2f12fe2a : 00c9`5354ab40 00c9`5354ae18 > 00c9`54476080 00c9`5354b208 : > CortexService!parquet::arrow::`anonymous > namespace'::LevelBuilder::GenerateLevels+0x93 > 00c9`5354aa00 7ff7`2f14de56 : 00c9`5354b1f8 00c9`5354afc8 > 00c9`54476080 `1e00 : > CortexService!parquet::arrow::`anonymous > namespace'::ArrowColumnWriter::Write+0x25a > 00c9`5354af20 7ff7`2f14e66b : 00c9`5354b1f8 00c9`5354b238 > 00c9`54445c20 ` : > CortexService!parquet::arrow::`anonymous > namespace'::ArrowColumnWriter::Write+0x2a6 > 00c9`5354b040 7ff7`2f12f137 : 00c9`544041f0 00c9`5354b4d8 > 00c9`5354b4a8 ` : > CortexService!parquet::arrow::FileWriter::Impl::WriteColumnChunk+0x70b > 00c9`5354b400 7ff7`2f14b4d5 : 00c9`54431180 00c9`5354b4d8 > 00c9`5354b4a8 ` : > CortexService!parquet::arrow::FileWriter::WriteColumnChunk+0x67 > 00c9`5354b450 7ff7`2f12eef1 : 00c9`5354b5d8 00c9`5354b648 > ` `1e00 : > CortexService!::operator()+0x195 > 00c9`5354b530 7ff7`2eb8e31e : 00c9`54431180 00c9`5354b760 > 00c9`54442fb0 `1e00 : > CortexService!parquet::arrow::FileWriter::WriteTable+0x521 > 00c9`5354b730 7ff7`2eb58ac5 : 00c9`5307bd88 00c9`54442fb0 > ` ` : > 
CortexService!Cortex::Storage::ParquetStreamWriter::writeRowGroup+0xfe > 00c9`5354b860 7ff7`2eafdce6 : 00c9`5307bd80 00c9`5354ba08 > 00c9`5354b9e0 00c9`5354b9d8 : > CortexService!Cortex::Storage::ParquetFileWriter::writeRowGroup+0x545 > 00c9`5354b9a0 7ff7`2eaf8bae : 00c9`53275600 00c9`53077220 > `fffe ` : > CortexService!Cortex::Storage::DataStreamWriteWorker::onNewData+0x1a6 > {code} > {code:java} > FAILED_INSTRUCTION_ADDRESS: > CortexService!arrow::internal::CountSetBits+16d > [c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc > @ 99] > 7ff7`2f3a4e4d f3480fb800 popcnt rax,qword ptr [rax] > FOLLOWUP_IP: > CortexService!arrow::internal::CountSetBits+16d > [c:\jenkins\workspace\cortexv2-dev-win64-service\
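For context, the faulting instruction in the dump is {{popcnt}}, which older CPUs lack. A common way for application code to avoid such crashes is to detect the instruction at runtime and fall back to a software popcount. The sketch below uses a GCC/Clang x86 builtin (MSVC would query {{__cpuid}} instead) and is only an illustration of the technique, not how Arrow itself dispatches:

```cpp
#include <cstdint>

// Runtime feature check: true if the CPU supports the POPCNT instruction.
// Uses a GCC/Clang x86 builtin; returns a conservative false elsewhere.
bool CpuSupportsPopcnt() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("popcnt");
#else
  return false;
#endif
}

// Portable software fallback (Kernighan's bit-clearing trick) for use
// when the hardware instruction is unavailable.
int PopcountFallback(uint64_t x) {
  int n = 0;
  while (x != 0) {
    x &= x - 1;  // clear the lowest set bit
    ++n;
  }
  return n;
}
```

Code compiled with unconditional {{popcnt}} (e.g. via {{-msse4.2}}) will still crash on such CPUs regardless of any check elsewhere, so the detection has to guard every hardware-popcount call site.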
[jira] [Commented] (ARROW-5381) Crash at arrow::internal::CountSetBits
[ https://issues.apache.org/jira/browse/ARROW-5381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845673#comment-16845673 ] Tham commented on ARROW-5381: - > Are you running this in a VM? No, it's not a virtual machine. I've got another machine which has the same crash: {code:java} Operating System: Windows 10 Pro 64-bit (10.0, Build 10240) (10240.th1.170602-2340) Language: English (Regional Setting: English) System Manufacturer: HP System Model: HP Laptop 14-bs0xx BIOS: F.31 Processor: Intel(R) Celeron(R) CPU N3060 @ 1.60GHz (2 CPUs), ~1.6GHz Memory: 4096MB RAM Available OS Memory: 4002MB RAM Page File: 2189MB used, 2516MB available Windows Dir: C:\Windows DirectX Version: 12 DX Setup Parameters: Not found User DPI Setting: Using System DPI System DPI Setting: 96 DPI (100 percent) DWM DPI Scaling: Disabled Miracast: Available, with HDCP Microsoft Graphics Hybrid: Not Supported DxDiag Version: 10.00.10240.16384 64bit Unicode {code} Can you please take a look?