[jira] [Commented] (ARROW-4800) [C++] Create/port a StatusOr implementation to be able to return a status or a type

2019-05-22 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846467#comment-16846467
 ] 

Micah Kornfield commented on ARROW-4800:


As long as we are bikeshedding, I think ErrorOr or StatusOr are more 
understandable without looking at the class.

Agreed on trying to replace APIs, but I think this can be somewhat incremental: 
as we develop higher-level functionality we can work down the stack.  The nice 
thing is that these APIs can live side by side, since the method signatures 
should always be different.
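For reference, the pattern under discussion can be sketched as follows. This is an illustrative, simplified version of the StatusOr idea (the name ValueOrDie is borrowed from the protobuf/grpc implementation linked in the issue; this is not Arrow's eventual API):

```cpp
#include <cassert>
#include <string>
#include <utility>

// Minimal sketch of the StatusOr/Result pattern: holds either a value or an
// error message, never both. Not Arrow's actual implementation.
template <typename T>
class StatusOr {
 public:
  // Success case: wrap a value (implicit, so functions can `return x;`).
  StatusOr(T value) : ok_(true), value_(std::move(value)) {}

  // Failure case: wrap an error message.
  static StatusOr<T> Error(std::string msg) {
    StatusOr<T> r;
    r.error_ = std::move(msg);
    return r;
  }

  bool ok() const { return ok_; }
  const T& ValueOrDie() const { assert(ok_); return value_; }
  const std::string& error() const { return error_; }

 private:
  StatusOr() : ok_(false), value_() {}  // requires T default-constructible
  bool ok_;
  T value_;
  std::string error_;
};

// Hypothetical example function returning a status or a value.
StatusOr<int> ParsePositive(int x) {
  if (x < 0) return StatusOr<int>::Error("negative input");
  return x;
}
```

A caller checks `ok()` before touching the value, which is how the two APIs can coexist: the Status-returning signature and the StatusOr-returning one differ.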

> [C++] Create/port a StatusOr implementation to be able to return a status or 
> a type
> ---
>
> Key: ARROW-4800
> URL: https://issues.apache.org/jira/browse/ARROW-4800
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Micah Kornfield
>Priority: Minor
>
> Example from grpc: 
> https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/stubs/statusor.h



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4741) [Java] Add documentation to all classes and enable checkstyle for class javadocs

2019-05-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4741:
--
Labels: pull-request-available  (was: )

> [Java] Add documentation to all classes and enable checkstyle for class 
> javadocs
> 
>
> Key: ARROW-4741
> URL: https://issues.apache.org/jira/browse/ARROW-4741
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Documentation, Java
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Minor
>  Labels: pull-request-available
>
> This is likely a big issue, so it might pay to create subtasks for different 
> modules to add javadoc, then do one final cleanup to enable checkstyle.





[jira] [Updated] (ARROW-5395) [C++] Utilize stream EOS in File format

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5395:

Summary: [C++] Utilize stream EOS in File format  (was: Utilize stream EOS 
in File format)

> [C++] Utilize stream EOS in File format
> ---
>
> Key: ARROW-5395
> URL: https://issues.apache.org/jira/browse/ARROW-5395
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: John Muehlhausen
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 0.25h
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> We currently do not write EOS at the end of a Message stream inside the File 
> format.  As a result, the file cannot be parsed sequentially.  This change 
> prepares for other implementations or future reference features that parse a 
> File sequentially... i.e. without access to seek().
>  
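As a sketch of why the EOS marker matters for sequential parsing: with length-prefixed messages, a reader can consume the stream front-to-back and stop at the marker without seek(). The framing below is deliberately simplified (a plain little-endian int32 length prefix, with EOS as length 0); the real Arrow IPC encapsulation differs (alignment padding, metadata, and a 0xFFFFFFFF continuation marker in the current spec):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Simplified framing: each message is [int32 length][payload], and a zero
// length signals end-of-stream. A sequential reader stops at EOS without
// ever needing to seek to a footer.
std::vector<std::vector<uint8_t>> ReadSequentially(
    const std::vector<uint8_t>& buf) {
  std::vector<std::vector<uint8_t>> messages;
  size_t pos = 0;
  while (pos + 4 <= buf.size()) {
    int32_t len;
    std::memcpy(&len, buf.data() + pos, 4);  // assumes little-endian host
    pos += 4;
    if (len == 0) break;  // EOS marker: stop, no seek() required
    messages.emplace_back(buf.begin() + pos, buf.begin() + pos + len);
    pos += len;
  }
  return messages;
}
```

Without the trailing EOS inside the File format's message stream, this kind of reader cannot tell where the stream of messages ends.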





[jira] [Updated] (ARROW-2885) [C++] Right-justify array values in PrettyPrint

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2885:

Fix Version/s: (was: 0.14.0)

> [C++] Right-justify array values in PrettyPrint
> ---
>
> Key: ARROW-2885
> URL: https://issues.apache.org/jira/browse/ARROW-2885
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: beginner
>
> Currently the output of {{PrettyPrint}} for an array looks as follows:
> {code}
> [
>   1,
>   NA
> ]
> {code}
> We should right-justify it for better readability:
> {code}
> [
>1,
>   NA
> ]
> {code}
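The requested behavior amounts to padding each rendered value to the width of the longest one. A minimal hypothetical sketch (not Arrow's actual PrettyPrint code) over already-rendered value strings:

```cpp
#include <algorithm>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Right-justify each value to the width of the widest rendered value,
// matching the desired output shown in the issue.
std::string PrettyPrint(const std::vector<std::string>& values) {
  size_t width = 0;
  for (const auto& v : values) width = std::max(width, v.size());
  std::ostringstream out;
  out << "[\n";
  for (size_t i = 0; i < values.size(); ++i) {
    // std::setw right-justifies by default.
    out << "  " << std::setw(static_cast<int>(width)) << values[i];
    if (i + 1 < values.size()) out << ",";
    out << "\n";
  }
  out << "]";
  return out.str();
}
```

For the example in the issue, PrettyPrint({"1", "NA"}) produces the right-justified form with "1" padded to the width of "NA".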





[jira] [Updated] (ARROW-2873) [Python] Micro-optimize scalar value instantiation

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2873:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] Micro-optimize scalar value instantiation
> --
>
> Key: ARROW-2873
> URL: https://issues.apache.org/jira/browse/ARROW-2873
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Krisztian Szucs
>Priority: Minor
> Fix For: 0.15.0
>
>
> This led to a 20% time increase in __getitem__: 
> https://pandas.pydata.org/speed/arrow/#array_ops.ScalarAccess.time_getitem
> See conversation: 
> https://github.com/apache/arrow/commit/dc80a768c0a15e62998ccd32d8353d2035302cb6#r29746119





[jira] [Updated] (ARROW-2628) [Python] parquet.write_to_dataset is memory-hungry on large DataFrames

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2628:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] parquet.write_to_dataset is memory-hungry on large DataFrames
> --
>
> Key: ARROW-2628
> URL: https://issues.apache.org/jira/browse/ARROW-2628
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1749. We should 
> consider strategies for writing very large tables to a partitioned directory 
> scheme. 





[jira] [Updated] (ARROW-2625) [Python] Serialize timedelta64 values from pandas to Arrow interval types

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2625:

Fix Version/s: (was: 0.14.0)

> [Python] Serialize timedelta64 values from pandas to Arrow interval types
> -
>
> Key: ARROW-2625
> URL: https://issues.apache.org/jira/browse/ARROW-2625
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> This work is blocked on ARROW-835





[jira] [Updated] (ARROW-2621) [Python/CI] Use pep8speaks for Python PRs

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2621:

Fix Version/s: (was: 0.14.0)

> [Python/CI] Use pep8speaks for Python PRs
> -
>
> Key: ARROW-2621
> URL: https://issues.apache.org/jira/browse/ARROW-2621
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Continuous Integration, Python
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: beginner
>
> It would be nice if we could get automated comments by 
> [https://pep8speaks.com/] on the Python PRs. This should be much more 
> readable than the current `flake8` output in the Travis logs. This issue is 
> split into two tasks:
>  * Create an issue with INFRA kindly asking them for activating pep8speaks 
> for Arrow
>  * Setup {{.pep8speaks.yml}} to align with our {{flake8}} config. For 
> reference, see Pandas' config: 
> [https://github.com/pandas-dev/pandas/blob/master/.pep8speaks.yml] 





[jira] [Updated] (ARROW-1299) [Doc] Publish nightly documentation against master somewhere

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1299:

Fix Version/s: (was: 0.14.0)

> [Doc] Publish nightly documentation against master somewhere
> 
>
> Key: ARROW-1299
> URL: https://issues.apache.org/jira/browse/ARROW-1299
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Wes McKinney
>Priority: Major
>
> This will help catch problems with the generated documentation prior to 
> release time, and also allow users to read the latest prose documentation.





[jira] [Updated] (ARROW-1489) [C++] Add casting option to set unsafe casts to null rather than some garbage value

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1489:

Fix Version/s: (was: 0.14.0)

> [C++] Add casting option to set unsafe casts to null rather than some garbage 
> value
> ---
>
> Key: ARROW-1489
> URL: https://issues.apache.org/jira/browse/ARROW-1489
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> Null is the obvious choice when certain casts fail, like string to number, 
> but in other kinds of unsafe casts there may be more ambiguity. 
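A hypothetical sketch of such an option, using string-to-number as the unambiguous case the issue mentions: a failed conversion yields null (an empty optional here) rather than a garbage value:

```cpp
#include <optional>
#include <string>

// Illustrative only: cast a string to int, returning an empty optional
// ("null") instead of an undefined/garbage value when the cast is unsafe.
std::optional<int> CastStringToInt(const std::string& s) {
  try {
    size_t pos = 0;
    int v = std::stoi(s, &pos);
    if (pos != s.size()) return std::nullopt;  // trailing junk -> null
    return v;
  } catch (...) {
    return std::nullopt;  // not a number / out of range -> null
  }
}
```

For other unsafe casts (e.g. overflowing numeric downcasts) the right behavior is less obvious, which is the ambiguity the issue raises; an options struct on the cast kernel could let callers choose.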





[jira] [Resolved] (ARROW-5371) [Release] Add tests for dev/release/00-prepare.sh

2019-05-22 Thread Kouhei Sutou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-5371.
-
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4343
[https://github.com/apache/arrow/pull/4343]

> [Release] Add tests for dev/release/00-prepare.sh
> -
>
> Key: ARROW-5371
> URL: https://issues.apache.org/jira/browse/ARROW-5371
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Packaging
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-5398) [Python] Flight tests broken by URI changes

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5398:

Component/s: FlightRPC

> [Python] Flight tests broken by URI changes
> ---
>
> Key: ARROW-5398
> URL: https://issues.apache.org/jira/browse/ARROW-5398
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> The URI changes merged cleanly, but they hadn't been rebased, so this is 
> happening:
> https://travis-ci.org/apache/arrow/jobs/535981561#L5267





[jira] [Updated] (ARROW-5398) [Python] Flight tests broken by URI changes

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5398:

Issue Type: Bug  (was: Improvement)

> [Python] Flight tests broken by URI changes
> ---
>
> Key: ARROW-5398
> URL: https://issues.apache.org/jira/browse/ARROW-5398
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> The URI changes merged cleanly, but they hadn't been rebased, so this is 
> happening:
> https://travis-ci.org/apache/arrow/jobs/535981561#L5267





[jira] [Created] (ARROW-5398) [Python] Flight tests broken by Uri changes

2019-05-22 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5398:
---

 Summary: [Python] Flight tests broken by Uri changes
 Key: ARROW-5398
 URL: https://issues.apache.org/jira/browse/ARROW-5398
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.14.0


The URI changes merged cleanly, but they hadn't been rebased, so this is happening:

https://travis-ci.org/apache/arrow/jobs/535981561#L5267





[jira] [Updated] (ARROW-5398) [Python] Flight tests broken by URI changes

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5398:

Summary: [Python] Flight tests broken by URI changes  (was: [Python] Flight 
tests broken by Uri changes)

> [Python] Flight tests broken by URI changes
> ---
>
> Key: ARROW-5398
> URL: https://issues.apache.org/jira/browse/ARROW-5398
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> The URI changes merged cleanly, but they hadn't been rebased, so this is 
> happening:
> https://travis-ci.org/apache/arrow/jobs/535981561#L5267





[jira] [Updated] (ARROW-5396) [JS] Ensure reader and writer support files and streams with no RecordBatches

2019-05-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5396:
--
Labels: pull-request-available  (was: )

> [JS] Ensure reader and writer support files and streams with no RecordBatches
> -
>
> Key: ARROW-5396
> URL: https://issues.apache.org/jira/browse/ARROW-5396
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: JavaScript
>Affects Versions: 0.13.0
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Re: https://issues.apache.org/jira/browse/ARROW-2119 and 
> [https://github.com/apache/arrow/pull/3871], the JS reader and writer should 
> support files and streams with a Schema but no RecordBatches.





[jira] [Updated] (ARROW-2703) [C++] Always use statically-linked Boost with private namespace

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2703:

Fix Version/s: (was: 0.14.0)

> [C++] Always use statically-linked Boost with private namespace
> ---
>
> Key: ARROW-2703
> URL: https://issues.apache.org/jira/browse/ARROW-2703
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> We have recently added tooling to ship Python wheels with a bundled, private 
> Boost (using the bcp tool). We might consider statically-linking a private 
> Boost exclusively in libarrow (i.e. built via our thirdparty toolchain) to 
> avoid any conflicts with other libraries that may use a different version of 
> Boost





[jira] [Commented] (ARROW-2703) [C++] Always use statically-linked Boost with private namespace

2019-05-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846159#comment-16846159
 ] 

Wes McKinney commented on ARROW-2703:
-

Removing this from any milestone, as it bears more discussion and is not 
urgent.

> [C++] Always use statically-linked Boost with private namespace
> ---
>
> Key: ARROW-2703
> URL: https://issues.apache.org/jira/browse/ARROW-2703
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> We have recently added tooling to ship Python wheels with a bundled, private 
> Boost (using the bcp tool). We might consider statically-linking a private 
> Boost exclusively in libarrow (i.e. built via our thirdparty toolchain) to 
> avoid any conflicts with other libraries that may use a different version of 
> Boost





[jira] [Resolved] (ARROW-2217) [C++] Add option to use dynamic linking for compression library dependencies

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2217.
-
Resolution: Fixed

It seems that this is resolved now so long as the dynamic libraries are 
available

{code}
$ ldd ~/local/lib/libarrow.so
linux-vdso.so.1 (0x7ffd1b748000)
libbrotlienc.so.1 => /usr/lib/x86_64-linux-gnu/libbrotlienc.so.1 
(0x7fed95782000)
libbrotlidec.so.1 => /usr/lib/x86_64-linux-gnu/libbrotlidec.so.1 
(0x7fed95773000)
libglog.so.0 => /home/wesm/cpp-runtime-toolchain/lib/libglog.so.0 
(0x7fed9573f000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7fed95739000)
libbz2.so.1.0 => /home/wesm/cpp-runtime-toolchain/lib/libbz2.so.1.0 
(0x7fed95725000)
liblz4.so.1 => /home/wesm/cpp-runtime-toolchain/lib/liblz4.so.1 
(0x7fed95515000)
libsnappy.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libsnappy.so.1 
(0x7fed95508000)
libz.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libz.so.1 
(0x7fed954ee000)
libzstd.so.1.3.8 => 
/home/wesm/cpp-runtime-toolchain/lib/libzstd.so.1.3.8 (0x7fed9544)
libboost_system.so.1.68.0 => 
/home/wesm/cpp-runtime-toolchain/lib/libboost_system.so.1.68.0 
(0x7fed95439000)
libboost_filesystem.so.1.68.0 => 
/home/wesm/cpp-runtime-toolchain/lib/libboost_filesystem.so.1.68.0 
(0x7fed9541b000)
libboost_regex.so.1.68.0 => 
/home/wesm/cpp-runtime-toolchain/lib/libboost_regex.so.1.68.0 
(0x7fed95312000)
libstdc++.so.6 => /home/wesm/cpp-runtime-toolchain/lib/libstdc++.so.6 
(0x7fed951ce000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fed9508)
libgcc_s.so.1 => /home/wesm/cpp-runtime-toolchain/lib/libgcc_s.so.1 
(0x7fed9506c000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
(0x7fed9504b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fed94e6)
/lib64/ld-linux-x86-64.so.2 (0x7fed96a89000)
libbrotlicommon.so.1 => /usr/lib/x86_64-linux-gnu/libbrotlicommon.so.1 
(0x7fed94e3d000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7fed94e3)
libicudata.so.58 => 
/home/wesm/cpp-runtime-toolchain/lib/./libicudata.so.58 (0x7fed9352e000)
libicui18n.so.58 => 
/home/wesm/cpp-runtime-toolchain/lib/./libicui18n.so.58 (0x7fed932af000)
libicuuc.so.58 => /home/wesm/cpp-runtime-toolchain/lib/./libicuuc.so.58 
(0x7fed930fc000)
{code}

> [C++] Add option to use dynamic linking for compression library dependencies
> 
>
> Key: ARROW-2217
> URL: https://issues.apache.org/jira/browse/ARROW-2217
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.14.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1661





[jira] [Updated] (ARROW-2221) [C++] Nightly build with "infer" tool

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2221:

Fix Version/s: (was: 0.14.0)

> [C++] Nightly build with "infer" tool
> -
>
> Key: ARROW-2221
> URL: https://issues.apache.org/jira/browse/ARROW-2221
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> As a follow-up to ARROW-1626, we ought to periodically look at the output of 
> the "infer" tool to fix issues as they come up. This is probably too 
> heavyweight to run in each CI build
> cc [~renesugar]





[jira] [Updated] (ARROW-2967) [Python] Add option to treat invalid PyObject* values as null in pyarrow.array

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2967:

Fix Version/s: (was: 0.14.0)

> [Python] Add option to treat invalid PyObject* values as null in pyarrow.array
> --
>
> Key: ARROW-2967
> URL: https://issues.apache.org/jira/browse/ARROW-2967
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> See discussion in ARROW-2966





[jira] [Updated] (ARROW-2461) [Python] Build wheels for manylinux2010 tag

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2461:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] Build wheels for manylinux2010 tag
> ---
>
> Key: ARROW-2461
> URL: https://issues.apache.org/jira/browse/ARROW-2461
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Blocker
> Fix For: 0.15.0
>
>
> There is now work in progress on an updated manylinux tag based on CentOS6. 
> We should provide wheels for this tag and the old {{manylinux1}} tag for one 
> release and then switch to the new tag in the release afterwards. This should 
> also enable us to raise the minimum compiler requirement to gcc 4.9 (or 
> higher once conda-forge has migrated to a newer compiler).
> The relevant PEP is https://www.python.org/dev/peps/pep-0571/





[jira] [Resolved] (ARROW-5254) [Flight][Java] DoAction does not support result streams

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5254.
-
Resolution: Fixed

Issue resolved by pull request 4250
[https://github.com/apache/arrow/pull/4250]

> [Flight][Java] DoAction does not support result streams
> ---
>
> Key: ARROW-5254
> URL: https://issues.apache.org/jira/browse/ARROW-5254
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Java
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: flight, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> While Flight defines DoAction as returning a stream of results, the Java APIs 
> only allow returning a single result.





[jira] [Updated] (ARROW-5395) Utilize stream EOS in File format

2019-05-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5395:
--
Labels: pull-request-available  (was: )

> Utilize stream EOS in File format
> -
>
> Key: ARROW-5395
> URL: https://issues.apache.org/jira/browse/ARROW-5395
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: John Muehlhausen
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> We currently do not write EOS at the end of a Message stream inside the File 
> format.  As a result, the file cannot be parsed sequentially.  This change 
> prepares for other implementations or future reference features that parse a 
> File sequentially... i.e. without access to seek().
>  





[jira] [Commented] (ARROW-5239) Add support for interval types in javascript

2019-05-22 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846130#comment-16846130
 ] 

Micah Kornfield commented on ARROW-5239:


It sounds like you just need to add duration, and make sure that you can
remove:
https://github.com/apache/arrow/blob/master/integration/integration_test.py#L1109




> Add support for interval types in javascript
> 
>
> Key: ARROW-5239
> URL: https://issues.apache.org/jira/browse/ARROW-5239
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: JavaScript
>Reporter: Micah Kornfield
>Priority: Major
>
> Update integration_test.py to include interval tests for JSTest once this is 
> done.





[jira] [Created] (ARROW-5397) Test Flight TLS support

2019-05-22 Thread David Li (JIRA)
David Li created ARROW-5397:
---

 Summary: Test Flight TLS support 
 Key: ARROW-5397
 URL: https://issues.apache.org/jira/browse/ARROW-5397
 Project: Apache Arrow
  Issue Type: Test
  Components: FlightRPC
Reporter: David Li


TLS support is not tested in Flight. We need to generate certificates/keys and 
provide them to the language-specific test runners.





[jira] [Commented] (ARROW-5239) Add support for interval types in javascript

2019-05-22 Thread Paul Taylor (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846108#comment-16846108
 ] 

Paul Taylor commented on ARROW-5239:


We have the Interval year_month and day_time types in JS, but I'm not sure if 
this issue is about a new kind of Interval DataType. [~emkornfi...@gmail.com], 
any thoughts?

> Add support for interval types in javascript
> 
>
> Key: ARROW-5239
> URL: https://issues.apache.org/jira/browse/ARROW-5239
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: JavaScript
>Reporter: Micah Kornfield
>Priority: Major
>
> Update integration_test.py to include interval tests for JSTest once this is 
> done.





[jira] [Commented] (ARROW-5318) [Python] pyarrow hdfs reader overrequests

2019-05-22 Thread Ivan Dimitrov (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846106#comment-16846106
 ] 

Ivan Dimitrov commented on ARROW-5318:
--

Ticket resolution:

The culprit was caching. The driver has pread functionality that isn't 
exposed in the pyarrow API. The solution is to add a read_at method to 
NativeFile that reads at a specific offset; the function makes an underlying 
pread call that does not cache.
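A POSIX-level sketch of the read_at idea: pread() reads at an explicit offset without moving the file position, so no buffered read-ahead path in user code is involved. This is illustrative only, not the actual pyarrow/libhdfs code:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <vector>

// Read exactly nbytes at the given offset using pread(). Unlike a seek()
// followed by read(), this does not touch the shared file position and does
// not go through any user-level read-ahead buffer.
std::vector<char> ReadAt(int fd, off_t offset, size_t nbytes) {
  std::vector<char> buf(nbytes);
  size_t total = 0;
  while (total < nbytes) {
    ssize_t n = pread(fd, buf.data() + total, nbytes - total,
                      offset + static_cast<off_t>(total));
    if (n <= 0) break;  // error or EOF: return what we have
    total += static_cast<size_t>(n);
  }
  buf.resize(total);
  return buf;
}
```

Whether the HDFS driver caches behind such a call is driver-specific; the point of the fix described above is that the pread-style entry point avoids the caching path.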

> [Python] pyarrow hdfs reader overrequests  
> ---
>
> Key: ARROW-5318
> URL: https://issues.apache.org/jira/browse/ARROW-5318
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Ivan Dimitrov
>Priority: Blocker
>
> I am using pyarrow's HdfsFilesystem interface. When I call a read on n bytes, 
> I often get 0%-300% more data sent over the network. My suspicion is that 
> pyarrow is reading ahead.
> The pyarrow parquet reader doesn't have this behavior, and I am looking for a 
> way to turn off read-ahead for the general HDFS interface.
> I am running on Ubuntu 14.04. This issue is present in pyarrow 0.10 - 0.13 
> (the newest released version). I am on Python 2.7.
> I have been using wireshark to track the packets passed on the network.
> I suspect it is read-ahead, since the time for the 1st read is much greater 
> than the time for the 2nd read.
>  
> The regular pyarrow reader
> {code:java}
> import pyarrow as pa 
> fs = pa.hdfs.connect(hostname, driver='libhdfs') 
> file_path = 'dataset/train/piece' 
> f = fs.open(file_path) 
> f.seek(0) 
> n_bytes = 300 
> f.read(n_bytes)
> {code}
>  
> Parquet code without the same issue
> {code:java}
> parquet_file = 'dataset/train/parquet/part-22e3' 
> pf = fs.open(parquet_file) 
> pqf = pa.parquet.ParquetFile(pf)
> data = pqf.read_row_group(0, columns=['col_name'])
>  {code}
>  
>  





[jira] [Resolved] (ARROW-4651) [Format] Flight Location should be more flexible than a (host, port) pair

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-4651.
-
Resolution: Fixed

Issue resolved by pull request 4047
[https://github.com/apache/arrow/pull/4047]

> [Format] Flight Location should be more flexible than a (host, port) pair
> -
>
> Key: ARROW-4651
> URL: https://issues.apache.org/jira/browse/ARROW-4651
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Format
>Affects Versions: 0.12.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> The more future-proof solution is probably to define a URI format. gRPC 
> already has something like that, though we might want to define our own 
> format:
> https://grpc.io/grpc/cpp/md_doc_naming.html
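A minimal sketch of what a URI-based location could look like, assuming a simple scheme://host:port form. The scheme names and exact grammar are assumptions here (they would be defined by the Flight spec), and IPv6 literals are ignored for brevity:

```cpp
#include <string>

// Hypothetical parsed location. A URI carries the transport in the scheme
// (e.g. "grpc+tcp"), which a (host, port) pair cannot express.
struct Location {
  std::string scheme;
  std::string host;
  int port = 0;
};

// Parse "scheme://host:port" into its parts; returns false on malformed
// input. Illustrative only; no IPv6 or percent-encoding handling.
bool ParseLocation(const std::string& uri, Location* out) {
  const auto sep = uri.find("://");
  if (sep == std::string::npos) return false;
  out->scheme = uri.substr(0, sep);
  const std::string rest = uri.substr(sep + 3);
  const auto colon = rest.rfind(':');
  if (colon == std::string::npos || colon + 1 == rest.size()) return false;
  out->host = rest.substr(0, colon);
  try {
    out->port = std::stoi(rest.substr(colon + 1));
  } catch (...) {
    return false;
  }
  return true;
}
```

The future-proofing win is that new transports only need a new scheme string, not a protocol change.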





[jira] [Assigned] (ARROW-4651) [Format] Flight Location should be more flexible than a (host, port) pair

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-4651:
---

Assignee: David Li

> [Format] Flight Location should be more flexible than a (host, port) pair
> -
>
> Key: ARROW-4651
> URL: https://issues.apache.org/jira/browse/ARROW-4651
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Format
>Affects Versions: 0.12.0
>Reporter: Antoine Pitrou
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> The more future-proof solution is probably to define a URI format. gRPC 
> already has something like that, though we might want to define our own 
> format:
> https://grpc.io/grpc/cpp/md_doc_naming.html





[jira] [Created] (ARROW-5396) [JS] Ensure reader and writer support files and streams with no RecordBatches

2019-05-22 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-5396:
--

 Summary: [JS] Ensure reader and writer support files and streams 
with no RecordBatches
 Key: ARROW-5396
 URL: https://issues.apache.org/jira/browse/ARROW-5396
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: 0.13.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: 0.14.0


Re: https://issues.apache.org/jira/browse/ARROW-2119 and 
[https://github.com/apache/arrow/pull/3871], the JS reader and writer should 
support files and streams with a Schema but no RecordBatches.





[jira] [Commented] (ARROW-5395) Utilize stream EOS in File format

2019-05-22 Thread John Muehlhausen (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846074#comment-16846074
 ] 

John Muehlhausen commented on ARROW-5395:
-

https://github.com/apache/arrow/pull/4372

> Utilize stream EOS in File format
> -
>
> Key: ARROW-5395
> URL: https://issues.apache.org/jira/browse/ARROW-5395
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: John Muehlhausen
>Priority: Minor
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> We currently do not write EOS at the end of a Message stream inside the File 
> format.  As a result, the file cannot be parsed sequentially.  This change 
> prepares for other implementations or future reference features that parse a 
> File sequentially... i.e. without access to seek().
>  





[jira] [Created] (ARROW-5395) Utilize stream EOS in File format

2019-05-22 Thread John Muehlhausen (JIRA)
John Muehlhausen created ARROW-5395:
---

 Summary: Utilize stream EOS in File format
 Key: ARROW-5395
 URL: https://issues.apache.org/jira/browse/ARROW-5395
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: John Muehlhausen


We currently do not write EOS at the end of a Message stream inside the File 
format.  As a result, the file cannot be parsed sequentially.  This change 
prepares for other implementations or future reference features that parse a 
File sequentially... i.e. without access to seek().
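To illustrate what sequential parsing buys: with EOS present, a reader can walk the stream of length-prefixed messages forward and stop at the end-of-stream marker without ever calling seek(). The framing below is a simplified stand-in (4-byte little-endian length prefix, zero length as EOS), not a byte-exact Arrow IPC stream:

```python
import struct

def split_messages(stream: bytes):
    """Walk a length-prefixed message stream sequentially until a
    zero-length prefix (EOS), with no use of seek()."""
    offset, messages = 0, []
    while True:
        (length,) = struct.unpack_from("<i", stream, offset)
        offset += 4
        if length == 0:  # EOS marker: stop scanning
            break
        messages.append(stream[offset:offset + length])
        offset += length
    return messages

# Toy stream: two "messages" followed by EOS.
stream = (struct.pack("<i", 3) + b"abc" +
          struct.pack("<i", 2) + b"xy" +
          struct.pack("<i", 0))
msgs = split_messages(stream)
```

Without the trailing zero-length marker, the loop above has no way to know the message section has ended, which is the gap this change closes.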

 





[jira] [Created] (ARROW-5394) [C++] Benchmarks for IsIn Kernel

2019-05-22 Thread Preeti Suman (JIRA)
Preeti Suman created ARROW-5394:
---

 Summary: [C++] Benchmarks for IsIn Kernel
 Key: ARROW-5394
 URL: https://issues.apache.org/jira/browse/ARROW-5394
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Preeti Suman


Add benchmarks for IsIn kernel.
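A plain-Python sketch of the shape such a benchmark might take (the `is_in` function and the sizes are illustrative, not Arrow's benchmark harness):

```python
import timeit

def is_in(values, lookup):
    """Reference semantics of an IsIn kernel: per-element membership test."""
    return [v in lookup for v in values]

values = list(range(100_000))
lookup = set(range(0, 100_000, 7))

# Time repeated runs; a real benchmark would also vary the value-set
# size, the element type, and the null density of the input column.
elapsed = timeit.timeit(lambda: is_in(values, lookup), number=10)
result = is_in([1, 7, 8], {7, 14})
```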





[jira] [Resolved] (ARROW-5392) [C++][CI][MinGW] Disable static library build on AppVeyor

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5392.
-
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4367
[https://github.com/apache/arrow/pull/4367]

> [C++][CI][MinGW] Disable static library build on AppVeyor
> -
>
> Key: ARROW-5392
> URL: https://issues.apache.org/jira/browse/ARROW-5392
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-5156) [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with `'NoneType' object has no attribute '_isfilestore'`

2019-05-22 Thread Martin Durant (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845979#comment-16845979
 ] 

Martin Durant commented on ARROW-5156:
--

Happy to add `_isfilestore` to s3fs/fsspec - I assume it should just return 
True?
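For illustration, a sketch of how the guard shown in the quoted traceback interacts with `_isfilestore`. The `S3LikeFileSystem` class is hypothetical; whether the real s3fs wrapper should return True or False is exactly the open question, and this sketch returns False so the mkdir guard becomes a no-op:

```python
class S3LikeFileSystem:
    """Hypothetical stand-in for an object-store filesystem wrapper."""

    def _isfilestore(self):
        # Object stores have no real directories; returning False makes
        # the mkdir guard below a no-op.
        return False

    def exists(self, path):
        return False

    def mkdir(self, path):
        raise NotImplementedError("no directories in an object store")

def _mkdir_if_not_exists(fs, path):
    # Paraphrase of the guard from the quoted traceback: it assumes `fs`
    # is a filesystem object, so passing None produces the reported error.
    if fs._isfilestore() and not fs.exists(path):
        fs.mkdir(path)

# Runs without error because _isfilestore() short-circuits the guard.
_mkdir_if_not_exists(S3LikeFileSystem(), "my_s3_bucket/x2.parquet")
```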

> [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with 
> `'NoneType' object has no attribute '_isfilestore'`
> ---
>
> Key: ARROW-5156
> URL: https://issues.apache.org/jira/browse/ARROW-5156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
> Environment: Mac, Linux
>Reporter: Victor Shih
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> According to 
> [https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#partitioning-parquet-files],
>  writing a parquet to S3 with `partition_cols` should work, but it fails for 
> me. Example script:
> {code:java}
> import pandas as pd
> import sys
> print(sys.version)
> print(pd.__version__)
> df = pd.DataFrame([{'a': 1, 'b': 2}])
> df.to_parquet('s3://my_s3_bucket/x.parquet', engine='pyarrow')
> print('OK 1')
> df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], 
> engine='pyarrow')
> print('OK 2')
> {code}
> Output:
> {noformat}
> 3.5.2 (default, Feb 14 2019, 01:46:27)
> [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)]
> 0.24.2
> OK 1
> Traceback (most recent call last):
> File "./t.py", line 14, in 
> df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], 
> engine='pyarrow')
> File 
> "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/core/frame.py",
>  line 2203, in to_parquet
> partition_cols=partition_cols, **kwargs)
> File 
> "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py",
>  line 252, in to_parquet
> partition_cols=partition_cols, **kwargs)
> File 
> "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py",
>  line 118, in write
> partition_cols=partition_cols, **kwargs)
> File 
> "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py",
>  line 1227, in write_to_dataset
> _mkdir_if_not_exists(fs, root_path)
> File 
> "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py",
>  line 1182, in _mkdir_if_not_exists
> if fs._isfilestore() and not fs.exists(path):
> AttributeError: 'NoneType' object has no attribute '_isfilestore'
> {noformat}
>  
> Original issue - [https://github.com/apache/arrow/issues/4030]





[jira] [Updated] (ARROW-465) [C++] Investigate usage of madvise

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-465:
---
Fix Version/s: (was: 0.14.0)

> [C++] Investigate usage of madvise 
> ---
>
> Key: ARROW-465
> URL: https://issues.apache.org/jira/browse/ARROW-465
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Uwe L. Korn
>Priority: Major
>
> In some use cases (e.g. Pandas->Arrow conversion) our main constraint is 
> page-faulting in not-yet-accessed pages. 
> With {{madvise}} we can declare our planned access pattern to the OS and may 
> improve performance a bit in these cases.
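As a stdlib-only sketch of the idea (Python's mmap module exposes madvise on 3.8+; the advice constants are platform-dependent, hence the guards):

```python
import mmap
import os
import tempfile

# Map a few pages of a temp file, then hint the kernel that the whole
# range will be needed soon (WILLNEED) and read in order (SEQUENTIAL),
# so it can prefetch instead of demand-faulting one page at a time.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\0" * mmap.PAGESIZE * 4)
    mm = mmap.mmap(fd, 0)
    for advice in ("MADV_WILLNEED", "MADV_SEQUENTIAL"):
        # madvise and its flags are platform-dependent (Python 3.8+)
        if hasattr(mm, "madvise") and hasattr(mmap, advice):
            mm.madvise(getattr(mmap, advice))
    first_page = mm[:mmap.PAGESIZE]
    mm.close()
finally:
    os.close(fd)
    os.unlink(path)
```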





[jira] [Updated] (ARROW-976) [Python] Provide API for defining and reading Parquet datasets with more ad hoc partition schemes

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-976:
---
Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] Provide API for defining and reading Parquet datasets with more ad 
> hoc partition schemes
> -
>
> Key: ARROW-976
> URL: https://issues.apache.org/jira/browse/ARROW-976
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>






[jira] [Updated] (ARROW-5156) [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with `'NoneType' object has no attribute '_isfilestore'`

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5156:

Summary: [Python] `df.to_parquet('s3://...', partition_cols=...)` fails 
with `'NoneType' object has no attribute '_isfilestore'`  (was: 
`df.to_parquet('s3://...', partition_cols=...)` fails with `'NoneType' object 
has no attribute '_isfilestore'`)

> [Python] `df.to_parquet('s3://...', partition_cols=...)` fails with 
> `'NoneType' object has no attribute '_isfilestore'`
> ---
>
> Key: ARROW-5156
> URL: https://issues.apache.org/jira/browse/ARROW-5156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
> Environment: Mac, Linux
>Reporter: Victor Shih
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> According to 
> [https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#partitioning-parquet-files],
>  writing a parquet to S3 with `partition_cols` should work, but it fails for 
> me. Example script:
> {code:java}
> import pandas as pd
> import sys
> print(sys.version)
> print(pd.__version__)
> df = pd.DataFrame([{'a': 1, 'b': 2}])
> df.to_parquet('s3://my_s3_bucket/x.parquet', engine='pyarrow')
> print('OK 1')
> df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], 
> engine='pyarrow')
> print('OK 2')
> {code}
> Output:
> {noformat}
> 3.5.2 (default, Feb 14 2019, 01:46:27)
> [GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.5)]
> 0.24.2
> OK 1
> Traceback (most recent call last):
> File "./t.py", line 14, in 
> df.to_parquet('s3://my_s3_bucket/x2.parquet', partition_cols=['a'], 
> engine='pyarrow')
> File 
> "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/core/frame.py",
>  line 2203, in to_parquet
> partition_cols=partition_cols, **kwargs)
> File 
> "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py",
>  line 252, in to_parquet
> partition_cols=partition_cols, **kwargs)
> File 
> "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pandas/io/parquet.py",
>  line 118, in write
> partition_cols=partition_cols, **kwargs)
> File 
> "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py",
>  line 1227, in write_to_dataset
> _mkdir_if_not_exists(fs, root_path)
> File 
> "/Users/vshih/.pyenv/versions/3.5.2/lib/python3.5/site-packages/pyarrow/parquet.py",
>  line 1182, in _mkdir_if_not_exists
> if fs._isfilestore() and not fs.exists(path):
> AttributeError: 'NoneType' object has no attribute '_isfilestore'
> {noformat}
>  
> Original issue - [https://github.com/apache/arrow/issues/4030]





[jira] [Commented] (ARROW-5279) [C++] Support reading delta dictionaries in IPC streams

2019-05-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845974#comment-16845974
 ] 

Wes McKinney commented on ARROW-5279:
-

Don't think I can get to this for 0.14

> [C++] Support reading delta dictionaries in IPC streams
> ---
>
> Key: ARROW-5279
> URL: https://issues.apache.org/jira/browse/ARROW-5279
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> This JIRA covers the read path for delta dictionaries. The write path is a 
> bit more of a can of worms (since the deltas must be computed)





[jira] [Updated] (ARROW-5279) [C++] Support reading delta dictionaries in IPC streams

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5279:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Support reading delta dictionaries in IPC streams
> ---
>
> Key: ARROW-5279
> URL: https://issues.apache.org/jira/browse/ARROW-5279
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> This JIRA covers the read path for delta dictionaries. The write path is a 
> bit more of a can of worms (since the deltas must be computed)





[jira] [Commented] (ARROW-5128) [Packaging][CentOS][Conda] Numpy not found in nightly builds

2019-05-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845972#comment-16845972
 ] 

Wes McKinney commented on ARROW-5128:
-

What is the status of this?

> [Packaging][CentOS][Conda] Numpy not found in nightly builds
> 
>
> Key: ARROW-5128
> URL: https://issues.apache.org/jira/browse/ARROW-5128
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 0.14.0
>
>
> In the last three days centos-7 and conda-win builds have been failing with 
> numpy not found
> - https://travis-ci.org/kszucs/crossbow/builds/515638053
> - https://ci.appveyor.com/project/kszucs/crossbow/builds/23593736
> - https://ci.appveyor.com/project/kszucs/crossbow/builds/23563730





[jira] [Commented] (ARROW-5069) [C++] Implement direct support for shared memory arrow columns

2019-05-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845970#comment-16845970
 ] 

Wes McKinney commented on ARROW-5069:
-

[~dimlek] It seems like you would need to draft a more detailed proposal 
document explaining how things should ideally work. The Arrow data 
structures can reference the memory from any {{Buffer}} subclass, and we 
already have examples of referencing shared memory and GPU memory, so all of 
the machinery is built already. The question becomes what kind of API can 
yield shared-memory data structures. I'm interested to see what you propose.
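As a plain-Python analogue of the Buffer-referencing-shared-memory idea (using the stdlib `multiprocessing.shared_memory` module, not Arrow's C++ API):

```python
from multiprocessing import shared_memory

# Create a named shared-memory block and take a zero-copy view of it,
# playing the role an arrow::Buffer subclass plays in C++: the bytes
# live in memory another process could also map by name via shm.name.
shm = shared_memory.SharedMemory(create=True, size=64)
try:
    view = memoryview(shm.buf)
    view[:4] = b"\x01\x00\x00\x00"   # write through the view, no copy
    first = bytes(view[:4])
    view.release()                   # release exports before closing
finally:
    shm.close()
    shm.unlink()
```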

> [C++] Implement direct support for shared memory arrow columns
> --
>
> Key: ARROW-5069
> URL: https://issues.apache.org/jira/browse/ARROW-5069
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
> Environment: Linux
>Reporter: Dimitris Lekkas
>Priority: Major
>  Labels: perfomance, proposal
>
> I consider the option of memory-mapping columns to shared memory to be 
> valuable. Such an option would be triggered if specific metadata are supplied. 
> Given that many data frames backed by Arrow are used for machine learning, I 
> expect we could benefit from treating differently the data (most likely the 
> data buffer columns) that will be fed into GPUs/FPGAs. To enable such a 
> change we would need to address the following issues:
> First, we need each column to hold an integer value representing its 
> associated file descriptor. The application developer could retrieve the 
> file name from the file descriptor (e.g. via the fstat syscall) and tell 
> another application to reference that file, or tell an FPGA to DMA that 
> memory area.
> We also need to support variable buffer alignment (restricted to powers of 2, 
> of course) when initiating an arrow::AllocateBuffer() call. In the current 
> implementation, the alignment size is fixed at 64 bytes, and changing that 
> value requires a recompilation [1].
> To justify the above suggestion: major FPGA vendors (e.g. Xilinx) benefit 
> heavily from page-aligned buffers since their device memory works in 4KB 
> pages [2]. In particular, Xilinx warns users who attempt to memcpy a 
> non-page-aligned buffer from CPU memory to FPGA memory [3]. 
> Wouldn't it be nice if we could issue from_pandas() and then have our columns 
> memory-mapped to shared memory, so FPGAs could DMA that memory and accelerate 
> the workload? If there is already a workaround to achieve this, I would like 
> more info on it.
> I am open to discussing any suggestions, improvements, or concerns. 
>  
> [1]: 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L40]
> [2]: 
> [https://forums.xilinx.com/t5/SDAccel/memory-alignment-when-allocating-emmory-in-SDAccel/td-p/887593]
> [3]: [https://forums.aws.amazon.com/thread.jspa?messageID=884615&tstart=0]
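The page-alignment property the proposal asks for can be demonstrated from the standard library alone: anonymous mmap allocations are handed out on page boundaries, which is the guarantee a 4KB-aligned DMA buffer needs (a sketch, not Arrow's AllocateBuffer):

```python
import ctypes
import mmap

# Anonymous mappings come straight from the kernel, so their base
# address is a multiple of the page size: the guarantee a 4KB-aligned
# DMA buffer needs, unlike an arbitrary 64-byte-aligned heap buffer.
buf = mmap.mmap(-1, 4 * mmap.PAGESIZE)
c = ctypes.c_char.from_buffer(buf)   # borrow the mapping's base address
addr = ctypes.addressof(c)
aligned = (addr % mmap.PAGESIZE == 0)
del c                                # drop the export before closing
buf.close()
```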





[jira] [Updated] (ARROW-5069) [C++] Implement direct support for shared memory arrow columns

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5069:

Fix Version/s: (was: 0.14.0)

> [C++] Implement direct support for shared memory arrow columns
> --
>
> Key: ARROW-5069
> URL: https://issues.apache.org/jira/browse/ARROW-5069
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
> Environment: Linux
>Reporter: Dimitris Lekkas
>Priority: Major
>  Labels: perfomance, proposal
>
> I consider the option of memory-mapping columns to shared memory to be 
> valuable. Such an option would be triggered if specific metadata are supplied. 
> Given that many data frames backed by Arrow are used for machine learning, I 
> expect we could benefit from treating differently the data (most likely the 
> data buffer columns) that will be fed into GPUs/FPGAs. To enable such a 
> change we would need to address the following issues:
> First, we need each column to hold an integer value representing its 
> associated file descriptor. The application developer could retrieve the 
> file name from the file descriptor (e.g. via the fstat syscall) and tell 
> another application to reference that file, or tell an FPGA to DMA that 
> memory area.
> We also need to support variable buffer alignment (restricted to powers of 2, 
> of course) when initiating an arrow::AllocateBuffer() call. In the current 
> implementation, the alignment size is fixed at 64 bytes, and changing that 
> value requires a recompilation [1].
> To justify the above suggestion: major FPGA vendors (e.g. Xilinx) benefit 
> heavily from page-aligned buffers since their device memory works in 4KB 
> pages [2]. In particular, Xilinx warns users who attempt to memcpy a 
> non-page-aligned buffer from CPU memory to FPGA memory [3]. 
> Wouldn't it be nice if we could issue from_pandas() and then have our columns 
> memory-mapped to shared memory, so FPGAs could DMA that memory and accelerate 
> the workload? If there is already a workaround to achieve this, I would like 
> more info on it.
> I am open to discussing any suggestions, improvements, or concerns. 
>  
> [1]: 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L40]
> [2]: 
> [https://forums.xilinx.com/t5/SDAccel/memory-alignment-when-allocating-emmory-in-SDAccel/td-p/887593]
> [3]: [https://forums.aws.amazon.com/thread.jspa?messageID=884615&tstart=0]





[jira] [Updated] (ARROW-5069) [C++] Implement direct support for shared memory arrow columns

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5069:

Summary: [C++] Implement direct support for shared memory arrow columns  
(was: Implement direct support for shared memory arrow columns)

> [C++] Implement direct support for shared memory arrow columns
> --
>
> Key: ARROW-5069
> URL: https://issues.apache.org/jira/browse/ARROW-5069
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
> Environment: Linux
>Reporter: Dimitris Lekkas
>Priority: Major
>  Labels: perfomance, proposal
> Fix For: 0.14.0
>
>
> I consider the option of memory-mapping columns to shared memory to be 
> valuable. Such an option would be triggered if specific metadata are supplied. 
> Given that many data frames backed by Arrow are used for machine learning, I 
> expect we could benefit from treating differently the data (most likely the 
> data buffer columns) that will be fed into GPUs/FPGAs. To enable such a 
> change we would need to address the following issues:
> First, we need each column to hold an integer value representing its 
> associated file descriptor. The application developer could retrieve the 
> file name from the file descriptor (e.g. via the fstat syscall) and tell 
> another application to reference that file, or tell an FPGA to DMA that 
> memory area.
> We also need to support variable buffer alignment (restricted to powers of 2, 
> of course) when initiating an arrow::AllocateBuffer() call. In the current 
> implementation, the alignment size is fixed at 64 bytes, and changing that 
> value requires a recompilation [1].
> To justify the above suggestion: major FPGA vendors (e.g. Xilinx) benefit 
> heavily from page-aligned buffers since their device memory works in 4KB 
> pages [2]. In particular, Xilinx warns users who attempt to memcpy a 
> non-page-aligned buffer from CPU memory to FPGA memory [3]. 
> Wouldn't it be nice if we could issue from_pandas() and then have our columns 
> memory-mapped to shared memory, so FPGAs could DMA that memory and accelerate 
> the workload? If there is already a workaround to achieve this, I would like 
> more info on it.
> I am open to discussing any suggestions, improvements, or concerns. 
>  
> [1]: 
> [https://github.com/apache/arrow/blob/master/cpp/src/arrow/memory_pool.cc#L40]
> [2]: 
> [https://forums.xilinx.com/t5/SDAccel/memory-alignment-when-allocating-emmory-in-SDAccel/td-p/887593]
> [3]: [https://forums.aws.amazon.com/thread.jspa?messageID=884615&tstart=0]





[jira] [Resolved] (ARROW-5066) [Integration] Add flags to enable/disable implementations in integration/integration_test.py

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5066.
-
Resolution: Fixed
  Assignee: Wes McKinney

I think the flags added in ARROW-3144 are sufficient

> [Integration] Add flags to enable/disable implementations in 
> integration/integration_test.py
> 
>
> Key: ARROW-5066
> URL: https://issues.apache.org/jira/browse/ARROW-5066
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> This will make it easier to test pairwise binary protocol integration (e.g. 
> only C++ vs JS, or Java vs C++). Currently it's an all-or-nothing affair





[jira] [Updated] (ARROW-1894) [Python] Treat CPython memoryview or buffer objects equivalently to pyarrow.Buffer in pyarrow.serialize

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1894:

Fix Version/s: (was: 0.14.0)

> [Python] Treat CPython memoryview or buffer objects equivalently to 
> pyarrow.Buffer in pyarrow.serialize
> ---
>
> Key: ARROW-1894
> URL: https://issues.apache.org/jira/browse/ARROW-1894
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>
> These should be treated as Buffer-like on serialize. We should consider how 
> to "box" the buffers as the appropriate kind of object (Buffer, memoryview, 
> etc.) when being deserialized





[jira] [Closed] (ARROW-5052) [C++] Add an incomplete dictionary type

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-5052.
---
   Resolution: Won't Fix
Fix Version/s: (was: 0.14.0)

Closing in favor of solution merged in ARROW-3144

> [C++] Add an incomplete dictionary type
> ---
>
> Key: ARROW-5052
> URL: https://issues.apache.org/jira/browse/ARROW-5052
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.12.1
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> This would allow passing a {{DataType}} that means "dict-encoded data with 
> the given index types and value types, but the actual values are not yet 
> known" (they might be inferred by processing non-dict-encoded data, or they 
> might be transferred explicitly - but later - in the data stream).





[jira] [Commented] (ARROW-1741) [C++] Comparison function for DictionaryArray to determine if indices are "compatible"

2019-05-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845961#comment-16845961
 ] 

Wes McKinney commented on ARROW-1741:
-

Now that we have variable dictionaries in C++, having a function to determine 
if two DictionaryArrays can be compared without a unification step would be 
useful. 

> [C++] Comparison function for DictionaryArray to determine if indices are 
> "compatible"
> --
>
> Key: ARROW-1741
> URL: https://issues.apache.org/jira/browse/ARROW-1741
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> For example, if one array's dictionary is larger than the other, but the 
> overlapping beginning portion is the same, then the respective dictionary 
> indices correspond to the same values. Therefore, in analytics, one may 
> choose to drop the smaller dictionary in favor of the larger dictionary, and 
> this need not incur any computational overhead (beyond comparing the 
> dictionary prefixes -- there may be some way to engineer "dictionary lineage" 
> to make this comparison even cheaper)
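In plain Python, the prefix-compatibility check described above amounts to the following (the function name and list-based dictionaries are illustrative, not the C++ signature):

```python
def indices_compatible(small, large):
    """True when indices encoded against `small` decode to the same
    values against `large`, i.e. `small` is a prefix of `large`."""
    return len(small) <= len(large) and large[:len(small)] == small

d_small = ["apple", "banana"]
d_large = ["apple", "banana", "cherry"]
d_other = ["apple", "kiwi", "cherry"]

ok = indices_compatible(d_small, d_large)   # prefix: indices carry over
bad = indices_compatible(d_other, d_large)  # values differ: must unify
```

When the check passes, the smaller dictionary can be dropped in favor of the larger one with no re-encoding of indices.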





[jira] [Updated] (ARROW-1789) [Format] Consolidate specification documents and improve clarity for new implementation authors

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1789:

Fix Version/s: (was: 0.14.0)
   1.0.0

> [Format] Consolidate specification documents and improve clarity for new 
> implementation authors
> ---
>
> Key: ARROW-1789
> URL: https://issues.apache.org/jira/browse/ARROW-1789
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Micah Kornfield
>Priority: Major
> Fix For: 1.0.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1296
> I believe the specification documents Layout.md, Metadata.md, and IPC.md 
> would benefit from being consolidated into a single Markdown document that 
> would be sufficient (along with the Flatbuffers schemas) to create a complete 
> Arrow implementation capable of reading and writing the binary format





[jira] [Updated] (ARROW-1786) [Format] List expected on-wire buffer layouts for each kind of Arrow physical type in specification

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1786:

Fix Version/s: (was: 0.14.0)
   1.0.0

> [Format] List expected on-wire buffer layouts for each kind of Arrow physical 
> type in specification
> ---
>
> Key: ARROW-1786
> URL: https://issues.apache.org/jira/browse/ARROW-1786
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: columnar-format-1.0
> Fix For: 1.0.0
>
>
> see ARROW-1693, ARROW-1785





[jira] [Updated] (ARROW-1700) [JS] Implement Node.js client for Plasma store

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1700:

Fix Version/s: (was: 0.14.0)

> [JS] Implement Node.js client for Plasma store
> --
>
> Key: ARROW-1700
> URL: https://issues.apache.org/jira/browse/ARROW-1700
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Plasma, JavaScript
>Reporter: Robert Nishihara
>Priority: Major
>






[jira] [Updated] (ARROW-1599) [C++][Parquet] Unable to read Parquet files with list inside struct

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1599:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++][Parquet] Unable to read Parquet files with list inside struct
> ---
>
> Key: ARROW-1599
> URL: https://issues.apache.org/jira/browse/ARROW-1599
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.7.0
> Environment: Ubuntu
>Reporter: Jovann Kung
>Assignee: Joshua Storck
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
>
> Is PyArrow currently unable to read in Parquet files with a vector as a 
> column? For example, the schema of such a file is below:
> {{
> mbc: FLOAT
> deltae: FLOAT
> labels: FLOAT
> features.type: INT32 INT_8
> features.size: INT32
> features.indices.list.element: INT32
> features.values.list.element: DOUBLE}}
> Using either pq.read_table() or pq.ParquetDataset('/path/to/parquet').read() 
> yields the following error: ArrowNotImplementedError: Currently only nesting 
> with Lists is supported.
> From the error I assume that this may be implemented in a future release?





[jira] [Updated] (ARROW-1644) [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1644:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++][Parquet] Read and write nested Parquet data with a mix of struct and 
> list nesting levels
> --
>
> Key: ARROW-1644
> URL: https://issues.apache.org/jira/browse/ARROW-1644
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: DB Tsai
>Assignee: Joshua Storck
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.15.0
>
>
> We have many nested parquet files generated from Apache Spark for ranking 
> problems, and we would like to load them in python for other programs to 
> consume. 
> The schema looks like 
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- show_title_id: integer (nullable = true)
>  |||-- duration: double (nullable = true)
> {code}
> And when I tried to load it with a nightly build of pyarrow from Oct 4, 2017, 
> I got the following error.
> {code:python}
> Python 3.6.2 |Anaconda, Inc.| (default, Sep 30 2017, 18:42:57) 
> [GCC 7.2.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> table2 = pq.read_table('part-0')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 823, in read_table
> use_pandas_metadata=use_pandas_metadata)
>   File "/home/dbt/miniconda3/lib/python3.6/site-packages/pyarrow/parquet.py", 
> line 119, in read
> nthreads=nthreads)
>   File "_parquet.pyx", line 466, in pyarrow._parquet.ParquetReader.read_all
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: lists with structs are not supported.
> {code}
> I was under the impression that after 
> https://issues.apache.org/jira/browse/PARQUET-911 was merged, we would be 
> able to load nested Parquet files in pyarrow. 
> Any insight on this? 
> Thanks.





[jira] [Updated] (ARROW-1682) [Python] Add documentation / example for reading a directory of Parquet files on S3

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1682:

Fix Version/s: (was: 0.14.0)

> [Python] Add documentation / example for reading a directory of Parquet files 
> on S3
> ---
>
> Key: ARROW-1682
> URL: https://issues.apache.org/jira/browse/ARROW-1682
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem, parquet
>
> Opened based on comment 
> https://github.com/apache/arrow/pull/916#issuecomment-337563492





[jira] [Commented] (ARROW-1621) [JAVA] Reduce Heap Usage per Vector

2019-05-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845956#comment-16845956
 ] 

Wes McKinney commented on ARROW-1621:
-

[~siddteotia] ?

> [JAVA] Reduce Heap Usage per Vector
> ---
>
> Key: ARROW-1621
> URL: https://issues.apache.org/jira/browse/ARROW-1621
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
> Fix For: 0.14.0
>
>
> https://docs.google.com/document/d/1MU-ah_bBHIxXNrd7SkwewGCOOexkXJ7cgKaCis5f-PI/edit





[jira] [Updated] (ARROW-1570) [C++] Define API for creating a kernel instance from function of scalar input and output with a particular signature

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1570:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Define API for creating a kernel instance from function of scalar input 
> and output with a particular signature
> 
>
> Key: ARROW-1570
> URL: https://issues.apache.org/jira/browse/ARROW-1570
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 0.15.0
>
>
> This could include an {{std::function}} instance (but these cannot be inlined 
> by the C++ compiler), but should also permit use with inline-able functions 
> or functors





[jira] [Updated] (ARROW-1470) [C++] Add BufferAllocator abstract interface

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1470:

Fix Version/s: (was: 0.14.0)

> [C++] Add BufferAllocator abstract interface
> 
>
> Key: ARROW-1470
> URL: https://issues.apache.org/jira/browse/ARROW-1470
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>
> There are some situations ({{arrow::ipc::SerializeRecordBatch}}) where we pass 
> a {{MemoryPool*}} solely to call {{AllocateBuffer}} using it. This is not as 
> flexible as it could be, since there are situations where we may wish to 
> allocate from shared memory instead. 
> So instead:
> {code}
> Func(..., BufferAllocator* allocator, ...) {
>   ...
>   std::shared_ptr<Buffer> buffer;
>   RETURN_NOT_OK(allocator->Allocate(nbytes, &buffer));
>   ...
> }
> {code}





[jira] [Updated] (ARROW-1324) [C++] Support ARROW_BOOST_VENDORED on Windows / MSVC

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1324:

Fix Version/s: (was: 0.14.0)

> [C++] Support ARROW_BOOST_VENDORED on Windows / MSVC
> 
>
> Key: ARROW-1324
> URL: https://issues.apache.org/jira/browse/ARROW-1324
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: windows
>
> Follow up to ARROW-1303





[jira] [Updated] (ARROW-1382) [Python] Deduplicate non-scalar Python objects when using pyarrow.serialize

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1382:

Fix Version/s: (was: 0.14.0)

> [Python] Deduplicate non-scalar Python objects when using pyarrow.serialize
> ---
>
> Key: ARROW-1382
> URL: https://issues.apache.org/jira/browse/ARROW-1382
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> If a Python object appears multiple times within a list/tuple/dictionary, 
> then when pyarrow serializes the object, it will duplicate the object many 
> times. This leads to a potentially huge expansion in the size of the object 
> (e.g., the serialized version of {{100 * [np.zeros(10 ** 6)]}} will be 100 
> times bigger than it needs to be).
> {code}
> import pyarrow as pa
> l = [0]
> original_object = [l, l]
> # Serialize and deserialize the object.
> buf = pa.serialize(original_object).to_buffer()
> new_object = pa.deserialize(buf)
> # This works.
> assert original_object[0] is original_object[1]
> # This fails.
> assert new_object[0] is new_object[1]
> {code}
> One potential way to address this is to use the Arrow dictionary encoding.
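The memo-table idea that dictionary encoding would enable can be sketched in plain Python. This is an illustrative sketch of identity-based deduplication (the way pickle's memo works), not pyarrow's implementation; the function names are invented for the example.

```python
# Sketch: deduplicate shared objects by identity during serialization, so a
# list that appears twice is encoded once and referenced afterwards.
def serialize_with_memo(obj, memo=None):
    """Encode obj as a tree where repeated lists become ('ref', idx) nodes."""
    if memo is None:
        memo = {}
    if isinstance(obj, list):
        key = id(obj)
        if key in memo:                      # already encoded: emit a reference
            return ('ref', memo[key])
        memo[key] = len(memo)
        return ('list', memo[key],
                [serialize_with_memo(x, memo) for x in obj])
    return ('val', obj)

def deserialize_with_memo(node, seen=None):
    """Rebuild the object graph, resolving references to shared lists."""
    if seen is None:
        seen = {}
    tag = node[0]
    if tag == 'ref':
        return seen[node[1]]
    if tag == 'list':
        _, idx, children = node
        out = []
        seen[idx] = out                      # register before recursing
        out.extend(deserialize_with_memo(c, seen) for c in children)
        return out
    return node[1]
```

With this scheme, the round trip preserves the `new_object[0] is new_object[1]` identity that the report above shows pyarrow losing.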





[jira] [Updated] (ARROW-1266) [Plasma] Move heap allocations to arrow memory pool

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1266:

Fix Version/s: (was: 0.14.0)

> [Plasma] Move heap allocations to arrow memory pool
> ---
>
> Key: ARROW-1266
> URL: https://issues.apache.org/jira/browse/ARROW-1266
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma
>Reporter: Philipp Moritz
>Priority: Major
>
> At the moment we are allocating memory with std::vectors and even new in some 
> places; this should be cleaned up.





[jira] [Updated] (ARROW-1271) [Packaging] Build scripts for creating nightly conda-forge-compatible package builds

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1271:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Packaging] Build scripts for creating nightly conda-forge-compatible package 
> builds
> 
>
> Key: ARROW-1271
> URL: https://issues.apache.org/jira/browse/ARROW-1271
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.15.0
>
>
> cc [~cpcloud]





[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-22 Thread Martin Durant (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845947#comment-16845947
 ] 

Martin Durant commented on ARROW-5349:
--

>  in which this would be wrong if it is inside the file itself

 

Agreed, the path would be wrong. Even in the simpler case above, you could say 
it was wrong based on the Thrift template, and this could make sense, as it 
may imply opening a new file. 

 

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (though how 
> to write those metadata was not yet addressed; see the original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]





[jira] [Resolved] (ARROW-5218) [C++] Improve build when third-party library locations are specified

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5218.
-
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4207
[https://github.com/apache/arrow/pull/4207]

> [C++] Improve build when third-party library locations are specified 
> -
>
> Key: ARROW-5218
> URL: https://issues.apache.org/jira/browse/ARROW-5218
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The current CMake build system does not handle user specified third-party 
> library locations well.





[jira] [Updated] (ARROW-1119) [Python/C++] Implement NativeFile interfaces for Amazon S3

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1119:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python/C++] Implement NativeFile interfaces for Amazon S3
> --
>
> Key: ARROW-1119
> URL: https://issues.apache.org/jira/browse/ARROW-1119
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 0.15.0
>
>
> While we support HDFS and the local file system now, it would be nice to also 
> support S3 and eventually other cloud storage natively





[jira] [Updated] (ARROW-1231) [C++] Add filesystem / IO implementation for Google Cloud Storage

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1231:

Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Add filesystem / IO implementation for Google Cloud Storage
> -
>
> Key: ARROW-1231
> URL: https://issues.apache.org/jira/browse/ARROW-1231
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 0.15.0
>
>
> See example jumping off point
> https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud





[jira] [Commented] (ARROW-1089) [C++/Python] Add API to write an Arrow stream into either the stream or file formats on disk

2019-05-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845934#comment-16845934
 ] 

Wes McKinney commented on ARROW-1089:
-

cc [~npr]

> [C++/Python] Add API to write an Arrow stream into either the stream or file 
> formats on disk
> 
>
> Key: ARROW-1089
> URL: https://issues.apache.org/jira/browse/ARROW-1089
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> For Arrow streams with unknown size, it would be useful to be able to write 
> the data to disk either as a stream or as the file format (for random access) 
> with minimal overhead; i.e. we would avoid record batch IPC loading and write 
> the raw messages directly to disk





[jira] [Updated] (ARROW-1013) [C++] Add asynchronous StreamWriter

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1013:

Fix Version/s: (was: 0.14.0)

> [C++] Add asynchronous StreamWriter
> ---
>
> Key: ARROW-1013
> URL: https://issues.apache.org/jira/browse/ARROW-1013
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> We may want to provide an option to limit the queuing depth. The async writer 
> can be initialized from a synchronous writer
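The pattern described above (an async writer wrapping a synchronous one, with a bounded queue to limit queuing depth) can be sketched generically. This is a hedged illustration of the design, not Arrow's API; `AsyncWriter`, `sync_write`, and `max_depth` are invented names for the example.

```python
import queue
import threading

class AsyncWriter:
    """Wrap a synchronous write(batch) callable; writes are drained on a
    background thread, and a bounded queue limits the queuing depth."""
    _CLOSE = object()  # sentinel telling the drain thread to stop

    def __init__(self, sync_write, max_depth=4):
        self._q = queue.Queue(maxsize=max_depth)  # backpressure past max_depth
        self._t = threading.Thread(target=self._drain, args=(sync_write,))
        self._t.start()

    def _drain(self, sync_write):
        while True:
            item = self._q.get()
            if item is self._CLOSE:
                return
            sync_write(item)

    def write(self, batch):
        self._q.put(batch)  # blocks when the queue is full (depth limit)

    def close(self):
        self._q.put(self._CLOSE)
        self._t.join()
```

Initializing from the synchronous writer, as the issue suggests, would just mean passing its write method as `sync_write`.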





[jira] [Updated] (ARROW-974) [Website] Add Use Cases section to the website

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-974:
---
Fix Version/s: (was: 0.14.0)
   0.15.0

> [Website] Add Use Cases section to the website
> --
>
> Key: ARROW-974
> URL: https://issues.apache.org/jira/browse/ARROW-974
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> This will contain a list of "canonical use cases" for Arrow:
> * In-memory data structure for vectorized analytics / SIMD, or creating a 
> column-oriented analytic database system
> * Reading and writing columnar storage formats like Apache Parquet
> * Faster alternative to Thrift, Protobuf, or Avro in RPC
> * Shared memory IPC (zero-copy in-situ analytics)
> Any other ideas?





[jira] [Updated] (ARROW-1042) [Python] C++ API plumbing for returning generic instance of ipc::RecordBatchReader to user

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1042:

Fix Version/s: (was: 0.14.0)

> [Python] C++ API plumbing for returning generic instance of 
> ipc::RecordBatchReader to user
> --
>
> Key: ARROW-1042
> URL: https://issues.apache.org/jira/browse/ARROW-1042
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>
> Currently we have no mechanism of wrapping a 
> {{std::shared_ptr<ipc::RecordBatchReader>}} like we do with some other 
> Arrow types





[jira] [Updated] (ARROW-1009) [C++] Create asynchronous version of StreamReader

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1009:

Fix Version/s: (was: 0.14.0)

> [C++] Create asynchronous version of StreamReader
> -
>
> Key: ARROW-1009
> URL: https://issues.apache.org/jira/browse/ARROW-1009
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> the {{AsyncStreamReader}} would buffer the next record batch in a background 
> thread, while emulating the current synchronous / blocking API
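The buffering idea above can be sketched with a background thread: the next item is prefetched while the caller keeps the same blocking API. This is an illustrative sketch, not Arrow's `AsyncStreamReader`; the `read_next` protocol (returning `None` at end of stream) is an assumption of the example.

```python
import queue
import threading

class PrefetchingReader:
    """Wrap a blocking read_next() source so the next item is fetched on a
    background thread while callers see the same synchronous API."""

    def __init__(self, read_next):
        self._q = queue.Queue(maxsize=1)  # buffer exactly one batch ahead
        self._t = threading.Thread(target=self._pump, args=(read_next,),
                                   daemon=True)
        self._t.start()

    def _pump(self, read_next):
        while True:
            item = read_next()
            self._q.put(item)
            if item is None:  # end of stream: propagate and stop
                return

    def read_next(self):
        return self._q.get()  # blocking, like the synchronous reader
```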





[jira] [Updated] (ARROW-973) [Website] Add FAQ page about project

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-973:
---
Fix Version/s: (was: 0.14.0)
   0.15.0

> [Website] Add FAQ page about project
> 
>
> Key: ARROW-973
> URL: https://issues.apache.org/jira/browse/ARROW-973
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Website
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> As some suggested initial topics for the FAQ:
> * How Apache Arrow is related to Apache Parquet (the difference between a 
> "storage format" and an "in-memory format" causes confusion)
> * How is Arrow similar to / different from Flatbuffers and Cap'n Proto
> * How Arrow uses Flatbuffers (I have had people incorrectly state to me 
> things like "Arrow is just Flatbuffers under the hood")
> Any other ideas?





[jira] [Updated] (ARROW-823) [Python] Devise a means to serialize arrays of arbitrary Python objects in Arrow IPC messages

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-823:
---
Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] Devise a means to serialize arrays of arbitrary Python objects in 
> Arrow IPC messages
> -
>
> Key: ARROW-823
> URL: https://issues.apache.org/jira/browse/ARROW-823
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> Practically speaking, this would involve a "custom" logical type that is 
> "pyobject", represented physically as an array of 64-bit pointers. On 
> serialization, this would need to be converted to a BinaryArray containing 
> pickled objects as binary values
> At the moment, we don't yet have the machinery to deal with "custom" types 
> where the in-memory representation is different from the on-wire 
> representation. This would be a useful use case to work through the design 
> issues
> Interestingly, if done properly, this would enable other Arrow 
> implementations to manipulate (filter, etc.) serialized Python objects as 
> binary blobs. 
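The serialization step described above (pickling each object into a binary value) can be sketched in a few lines. This is a minimal illustration of the on-wire representation, assuming pickle as the encoding; the helper names are invented and this is not pyarrow's implementation.

```python
import pickle

def to_binary_column(objs):
    """Encode an array of arbitrary Python objects as binary (pickled) values,
    the on-wire form a 'pyobject' logical type would serialize to."""
    return [pickle.dumps(o) for o in objs]

def from_binary_column(blobs):
    """Decode the binary values back into Python objects."""
    return [pickle.loads(b) for b in blobs]
```

Because the wire form is just a binary column, another Arrow implementation could filter or slice it without understanding the pickled payloads.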





[jira] [Updated] (ARROW-971) [C++/Python] Implement Array.isvalid/notnull/isnull

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-971:
---
Labels: dataframe  (was: pull-request-available)

> [C++/Python] Implement Array.isvalid/notnull/isnull
> ---
>
> Key: ARROW-971
> URL: https://issues.apache.org/jira/browse/ARROW-971
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: dataframe
> Fix For: 0.14.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For arrays with nulls, this amounts to returning the validity bitmap. Without 
> nulls, an array of all 1 bits must be constructed. For isnull, the bits must 
> be flipped (in this case, the un-set part of the new bitmap must stay 0, 
> though).
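The bitmap logic in the description can be sketched directly. This is an illustrative sketch in Python, not the C++ implementation; it assumes LSB bit order within each byte (as in the Arrow spec) and returns the per-element bits rather than a packed bitmap.

```python
def is_valid_bits(validity, length):
    """Per-element validity. With no nulls (validity is None), construct an
    all-ones result; otherwise read bits LSB-first from the bitmap bytes."""
    if validity is None:
        return [1] * length
    return [(validity[i // 8] >> (i % 8)) & 1 for i in range(length)]

def is_null_bits(validity, length):
    """isnull is the flipped bitmap; iterating only to `length` is what keeps
    the padding bits past the array length at 0 in a packed implementation."""
    return [1 - b for b in is_valid_bits(validity, length)]
```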





[jira] [Commented] (ARROW-721) [Java] Read and write record batches to shared memory

2019-05-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845929#comment-16845929
 ] 

Wes McKinney commented on ARROW-721:


[~siddteotia] is this something of interest for the next release to validate 
the new Java capabilities?

> [Java] Read and write record batches to shared memory
> -
>
> Key: ARROW-721
> URL: https://issues.apache.org/jira/browse/ARROW-721
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> It would be useful for a Java application to be able to read a record batch 
> as a set of memory mapped byte buffers given a file name and a memory address 
> for the metadata. 





[jira] [Assigned] (ARROW-653) [Python / C++] Add debugging function to print an array's buffer contents in hexadecimal

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-653:
--

Assignee: Anatoly Myachev

> [Python / C++] Add debugging function to print an array's buffer contents in 
> hexadecimal
> 
>
> Key: ARROW-653
> URL: https://issues.apache.org/jira/browse/ARROW-653
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: Anatoly Myachev
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> This would help with debugging and illustrating the Arrow internals
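A debugging helper of the kind requested can be sketched in a few lines. This is a hypothetical illustration (the function name and layout are invented, not the API that was added to Arrow): dump a buffer's bytes in hexadecimal with offsets.

```python
def hexdump(buf, width=8):
    """Render a bytes-like buffer as hex rows of `width` bytes, each row
    prefixed with its byte offset."""
    lines = []
    for off in range(0, len(buf), width):
        chunk = buf[off:off + width]
        lines.append(f"{off:08x}  " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines)
```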





[jira] [Updated] (ARROW-517) [C++] Verbose Array::Equals

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-517:
---
Fix Version/s: (was: 0.14.0)

> [C++] Verbose Array::Equals
> ---
>
> Key: ARROW-517
> URL: https://issues.apache.org/jira/browse/ARROW-517
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Benjamin Kietzman
>Priority: Major
>
> In failing unit tests I often wished {{Array::Equals}} would tell me where 
> they aren't equal. This would save a lot of time in debugging.





[jira] [Updated] (ARROW-488) [Python] Implement conversion between integer coded as floating points with NaN to an Arrow integer type

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-488:
---
Fix Version/s: (was: 0.14.0)
   0.15.0

> [Python] Implement conversion between integer coded as floating points with 
> NaN to an Arrow integer type
> 
>
> Key: ARROW-488
> URL: https://issues.apache.org/jira/browse/ARROW-488
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 0.15.0
>
>
> For example: if pandas has casted integer data to float, this would enable 
> the integer data to be recovered (so long as the values fall in the ~2^53 
> floating point range for exact integer representation)
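The recovery described above can be sketched as a scalar conversion. This is an illustrative sketch (invented helper name, not pyarrow's API): NaN becomes null, and values outside exact float representability (beyond 2^53) or with a fractional part are rejected rather than silently truncated.

```python
import math

def floats_to_nullable_ints(values):
    """Recover integers from a float column that pandas upcast, mapping NaN
    to None. Raises if a value is not losslessly convertible."""
    out = []
    for v in values:
        if math.isnan(v):
            out.append(None)
        elif v != int(v) or abs(v) > 2**53:
            raise ValueError(f"not losslessly convertible: {v}")
        else:
            out.append(int(v))
    return out
```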





[jira] [Updated] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-555:
---
Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] String algorithm library for StringArray/BinaryArray
> --
>
> Key: ARROW-555
> URL: https://issues.apache.org/jira/browse/ARROW-555
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 0.15.0
>
>
> This is a parent JIRA for starting a module for processing strings in-memory 
> arranged in Arrow format. This will include using the re2 C++ regular 
> expression library and other standard string manipulations (such as those 
> found on Python's string objects)





[jira] [Updated] (ARROW-501) [C++] Implement concurrent / buffering InputStream for streaming data use cases

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-501:
---
Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Implement concurrent / buffering InputStream for streaming data use 
> cases
> ---
>
> Key: ARROW-501
> URL: https://issues.apache.org/jira/browse/ARROW-501
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: csv, filesystem, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Related to ARROW-500, when processing an input data stream, we may wish to 
> continue buffering input (up to a maximum buffer size) in between 
> synchronous Read calls





[jira] [Updated] (ARROW-453) [C++] Add file interface implementations for Amazon S3

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-453:
---
Fix Version/s: (was: 0.14.0)
   0.15.0

> [C++] Add file interface implementations for Amazon S3
> --
>
> Key: ARROW-453
> URL: https://issues.apache.org/jira/browse/ARROW-453
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem
> Fix For: 0.15.0
>
>
> The BSD-licensed C++ code in SFrame 
> (https://github.com/turi-code/SFrame/tree/master/oss_src/fileio) may provide 
> some inspiration. 





[jira] [Closed] (ARROW-258) [Format] clarify definition of Buffer in context of RPC, IPC, File

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-258.
--
Resolution: Won't Fix

The "page" field was removed from Buffer in the 0.8.0 release

> [Format] clarify definition of Buffer in context of RPC, IPC, File
> --
>
> Key: ARROW-258
> URL: https://issues.apache.org/jira/browse/ARROW-258
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Julien Le Dem
>Priority: Major
> Fix For: 0.14.0
>
>
> currently Buffer has a loosely defined page field used for shared memory only.
> https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L109





[jira] [Updated] (ARROW-473) [C++/Python] Add public API for retrieving block locations for a particular HDFS file

2019-05-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-473:
---
Fix Version/s: (was: 0.14.0)

> [C++/Python] Add public API for retrieving block locations for a particular 
> HDFS file
> -
>
> Key: ARROW-473
> URL: https://issues.apache.org/jira/browse/ARROW-473
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: filesystem, hdfs, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is necessary for applications looking to schedule data-local work. 
> libhdfs does not have APIs to request the block locations directly, so we 
> need to see if the {{hdfsGetHosts}} function will do what we need. For 
> libhdfs3 there is a public API function 





[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-22 Thread Martin Durant (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845920#comment-16845920
 ] 

Martin Durant commented on ARROW-5349:
--

It depends on what is passed back to the caller: just the metadata object, or 
some indication of which file it went into (sorry, I don't know the API that's 
being built exactly). If the caller defines which file to write to, it would 
seem reasonable to let it set this attribute on the metadata object before 
writing to `_metadata`. However, that might be muddied if partitioning is also 
happening upon write and you end up with multiple files for each piece.

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (though how 
> to write those metadata was not yet addressed; see the original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]





[jira] [Updated] (ARROW-5393) [R] Add tests and example for read_parquet()

2019-05-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5393:
--
Labels: parquet pull-request-available  (was: parquet)

> [R] Add tests and example for read_parquet()
> 
>
> Key: ARROW-5393
> URL: https://issues.apache.org/jira/browse/ARROW-5393
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Minor
>  Labels: parquet, pull-request-available
>






[jira] [Created] (ARROW-5393) [R] Add tests and example for read_parquet()

2019-05-22 Thread Neal Richardson (JIRA)
Neal Richardson created ARROW-5393:
--

 Summary: [R] Add tests and example for read_parquet()
 Key: ARROW-5393
 URL: https://issues.apache.org/jira/browse/ARROW-5393
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson








[jira] [Updated] (ARROW-412) [Format] Handling of buffer padding in the IPC metadata

2019-05-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-412:
-
Labels: pull-request-available  (was: )

> [Format] Handling of buffer padding in the IPC metadata
> ---
>
> Key: ARROW-412
> URL: https://issues.apache.org/jira/browse/ARROW-412
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> See discussion in ARROW-399. Do we include padding bytes in the metadata or 
> set the actual used bytes? In the latter case, the padding would be a part of 
> the format (any buffers continue to be expected to be 64-byte padded, to 
> permit AVX512 instructions)





[jira] [Comment Edited] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-22 Thread Rick Zamora (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845908#comment-16845908
 ] 

Rick Zamora edited comment on ARROW-5349 at 5/22/19 2:17 PM:
-

Okay - the file path should not be set in the footer metadata (only in 
_metadata).  Does this mean that a mechanism for setting the file_path in C++ 
is completely unnecessary?  My understanding is that the motivation for this 
issue was to populate the file_path for the following step of writing the 
metadata file.  Is it sufficient to add a python-only mechanism to set the 
path? Or should we leave it up to the user to modify the metadata object 
themselves?


was (Author: rjzamora):
Okay - the file path should not be set in the footer metadata (only in 
_metadata).  Does this mean that a mechanism for setting the file_path in C++ 
is completely unnecessary?  My understanding is that the motivation for this 
issue was to populate the file_path for the following step of writing the 
metadata file.  Is it sufficient to add a python-only mechanism to set the path?

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (then how 
> to write those metadata was not yet addressed -> original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]





[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-22 Thread Rick Zamora (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845908#comment-16845908
 ] 

Rick Zamora commented on ARROW-5349:


Okay - the file path should not be set in the footer metadata (only in 
_metadata).  Does this mean that a mechanism for setting the file_path in C++ 
is completely unnecessary?  My understanding is that the motivation for this 
issue was to populate the file_path for the following step of writing the 
metadata file.  Is it sufficient to add a python-only mechanism to set the path?

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (then how 
> to write those metadata was not yet addressed -> original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]





[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-22 Thread Martin Durant (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845886#comment-16845886
 ] 

Martin Durant commented on ARROW-5349:
--

> I think it's acceptable to set the path in the file's internal metadata.

 

A library loading that data file in isolation can (and maybe *should*) be 
confused by this, though. Maybe that would not be typical operation, but we 
shouldn't preclude it.

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (then how 
> to write those metadata was not yet addressed -> original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]





[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-22 Thread Martin Durant (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845885#comment-16845885
 ] 

Martin Durant commented on ARROW-5349:
--

Agreed on that last point, to let the caller set the path - if only because 
this is basically what fastparquet does.

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (then how 
> to write those metadata was not yet addressed -> original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]





[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-22 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845883#comment-16845883
 ] 

Joris Van den Bossche commented on ARROW-5349:
--

Given that, for the API, it might make more sense to actually add a way to set 
the file path directly on the metadata object, instead of passing it to 
{{ParquetFileWriter}}. So that as a user of this API in python, you can set the 
path yourself on the metadata object that is returned by {{pq.ParquetWriter}} 
(which is appended to the metadata_collector).
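The workflow being proposed might look like this (pseudocode against a hypothetical `set_file_path` Python binding; the names and call shapes here are assumptions, not an existing API):

```
metadata_collector = []
for path, table in partitions:
    write parquet file for `table` at `path`,
        appending its FileMetaData to metadata_collector
    metadata_collector[-1].set_file_path(path)   # hypothetical binding
combine collected metadata and write the dataset-level _metadata file
```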

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (then how 
> to write those metadata was not yet addressed -> original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]





[jira] [Commented] (ARROW-3822) [C++] parquet::arrow::FileReader::GetRecordBatchReader has logical error on row groups with chunked columns

2019-05-22 Thread Benjamin Kietzman (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845880#comment-16845880
 ] 

Benjamin Kietzman commented on ARROW-3822:
--

[~wesmckinn] is this still an issue?

> [C++] parquet::arrow::FileReader::GetRecordBatchReader has logical error on 
> row groups with chunked columns
> ---
>
> Key: ARROW-3822
> URL: https://issues.apache.org/jira/browse/ARROW-3822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> If a BinaryArray / StringArray overflows a single column when reading a row 
> group, the resulting table will have a ChunkedArray. Using TableBatchReader 
> in 
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L176
> will therefore only return a part of the row group, discarding the rest





[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-22 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845879#comment-16845879
 ] 

Joris Van den Bossche commented on ARROW-5349:
--

Thanks. It is actually also quite clear in the thrift file description of 
{{file_path}}: "File where column data is stored.  If not set, assumed to be 
same file as metadata.  This path is relative to the current file."

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (then how 
> to write those metadata was not yet addressed -> original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]





[jira] [Commented] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-22 Thread Martin Durant (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845871#comment-16845871
 ] 

Martin Durant commented on ARROW-5349:
--

No, I don't have an explicit reference for this, and I believe I got the 
original model from spark (i.e., presumably same as hive), which I suppose 
would make it "common" by itself. I think it's the only thing that makes sense, 
since each data file should be readable in isolation, and there would be no way 
of knowing it was part of a collection and that the paths should therefore be 
ignored. At a guess, the design of the standard may have foreseen data-only 
chunk files, without footer information.

> [Python/C++] Provide a way to specify the file path in parquet 
> ColumnChunkMetaData
> --
>
> Key: ARROW-5349
> URL: https://issues.apache.org/jira/browse/ARROW-5349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now 
> possible to collect the file metadata while writing different files (then how 
> to write those metadata was not yet addressed -> original issue ARROW-1983).
> However, currently, the {{file_path}} information in the ColumnChunkMetaData 
> object is not set. This is, I think, expected / correct for the metadata as 
> included within the single file; but for using the metadata in the combined 
> dataset `_metadata`, it needs a file path set.
> So if you want to use this metadata for a partitioned dataset, there needs to 
> be a way to specify this file path. 
> Ideas I am thinking of currently: either, we could specify a file path to be 
> used when writing, or expose the `set_file_path` method on the Python side so 
> you can create an updated version of the metadata after collecting it.
> cc [~pearu] [~mdurant]





[jira] [Commented] (ARROW-5381) Crash at arrow::internal::CountSetBits

2019-05-22 Thread Tham (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845741#comment-16845741
 ] 

Tham commented on ARROW-5381:
-

Thanks for the quick response. I'll send this to my customer and ask them to run 
it. Their response won't be as fast as yours :)

> Crash at arrow::internal::CountSetBits
> --
>
> Key: ARROW-5381
> URL: https://issues.apache.org/jira/browse/ARROW-5381
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: Operating System: Windows 7 Professional 64-bit (6.1, 
> Build 7601) Service Pack 1(7601.win7sp1_ldr_escrow.181110-1429)
> Language: English (Regional Setting: English)
> System Manufacturer: SAMSUNG ELECTRONICS CO., LTD.
> System Model: RV420/RV520/RV720/E3530/S3530/E3420/E3520
> BIOS: Phoenix SecureCore-Tiano(tm) NB Version 2.1 05PQ
> Processor: Intel(R) Pentium(R) CPU B950 @ 2.10GHz (2 CPUs), ~2.1GHz
> Memory: 2048MB RAM
> Available OS Memory: 1962MB RAM
>   Page File: 1517MB used, 2405MB available
> Windows Dir: C:\Windows
> DirectX Version: DirectX 11
>Reporter: Tham
>Priority: Major
>
> I've got a lot of crash dumps from a customer's Windows machine. The 
> stacktrace shows that it crashed at arrow::internal::CountSetBits.
>  
> {code:java}
> STACK_TEXT:  
> 00c9`5354a4c0 7ff7`2f2830fd : 00c9`544841c0 ` 
> `1e00 ` : 
> CortexService!arrow::internal::CountSetBits+0x16d
> 00c9`5354a550 7ff7`2f2834b7 : 00c9`5337c930 ` 
> ` ` : 
> CortexService!arrow::ArrayData::GetNullCount+0x8d
> 00c9`5354a580 7ff7`2f13df55 : 00c9`54476080 00c9`5354a5d8 
> ` ` : 
> CortexService!arrow::Array::null_count+0x37
> 00c9`5354a5b0 7ff7`2f13fb68 : 00c9`5354ab40 00c9`5354a6f8 
> 00c9`54476080 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::Visit >+0xa5
> 00c9`5354a640 7ff7`2f12fa34 : 00c9`5354a6f8 00c9`54476080 
> 00c9`5354ab40 ` : 
> CortexService!arrow::VisitArrayInline namespace'::LevelBuilder>+0x298
> 00c9`5354a680 7ff7`2f14bf03 : 00c9`5354ab40 00c9`5354a6f8 
> 00c9`54476080 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::VisitInline+0x44
> 00c9`5354a6c0 7ff7`2f12fe2a : 00c9`5354ab40 00c9`5354ae18 
> 00c9`54476080 00c9`5354b208 : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::GenerateLevels+0x93
> 00c9`5354aa00 7ff7`2f14de56 : 00c9`5354b1f8 00c9`5354afc8 
> 00c9`54476080 `1e00 : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::ArrowColumnWriter::Write+0x25a
> 00c9`5354af20 7ff7`2f14e66b : 00c9`5354b1f8 00c9`5354b238 
> 00c9`54445c20 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::ArrowColumnWriter::Write+0x2a6
> 00c9`5354b040 7ff7`2f12f137 : 00c9`544041f0 00c9`5354b4d8 
> 00c9`5354b4a8 ` : 
> CortexService!parquet::arrow::FileWriter::Impl::WriteColumnChunk+0x70b
> 00c9`5354b400 7ff7`2f14b4d5 : 00c9`54431180 00c9`5354b4d8 
> 00c9`5354b4a8 ` : 
> CortexService!parquet::arrow::FileWriter::WriteColumnChunk+0x67
> 00c9`5354b450 7ff7`2f12eef1 : 00c9`5354b5d8 00c9`5354b648 
> ` `1e00 : 
> CortexService!::operator()+0x195
> 00c9`5354b530 7ff7`2eb8e31e : 00c9`54431180 00c9`5354b760 
> 00c9`54442fb0 `1e00 : 
> CortexService!parquet::arrow::FileWriter::WriteTable+0x521
> 00c9`5354b730 7ff7`2eb58ac5 : 00c9`5307bd88 00c9`54442fb0 
> ` ` : 
> CortexService!Cortex::Storage::ParquetStreamWriter::writeRowGroup+0xfe
> 00c9`5354b860 7ff7`2eafdce6 : 00c9`5307bd80 00c9`5354ba08 
> 00c9`5354b9e0 00c9`5354b9d8 : 
> CortexService!Cortex::Storage::ParquetFileWriter::writeRowGroup+0x545
> 00c9`5354b9a0 7ff7`2eaf8bae : 00c9`53275600 00c9`53077220 
> `fffe ` : 
> CortexService!Cortex::Storage::DataStreamWriteWorker::onNewData+0x1a6
> {code}
> {code:java}
> FAILED_INSTRUCTION_ADDRESS: 
> CortexService!arrow::internal::CountSetBits+16d 
> [c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc
>  @ 99]
> 7ff7`2f3a4e4d f3480fb800  popcnt  rax,qword ptr [rax]
> FOLLOWUP_IP: 
> CortexService!arrow::internal::CountSetBits+16d 
> [c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc
>  @ 99]
> 7ff7`2f3a4e4d f3480fb800  popcnt  rax,qword ptr [rax]
> FAULTING_SOU

[jira] [Resolved] (ARROW-5389) [C++] Add an internal temporary directory API

2019-05-22 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-5389.
---
Resolution: Fixed

Issue resolved by pull request 4364
[https://github.com/apache/arrow/pull/4364]

> [C++] Add an internal temporary directory API
> -
>
> Key: ARROW-5389
> URL: https://issues.apache.org/jira/browse/ARROW-5389
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This is needed to easily write tests involving filesystem operations.





[jira] [Commented] (ARROW-4800) [C++] Create/port a StatusOr implementation to be able to return a status or a type

2019-05-22 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845686#comment-16845686
 ] 

Antoine Pitrou commented on ARROW-4800:
---

I would rather call this {{Result<>}}.

Ideally we would rewrite all Status-returning APIs to return a {{Result<>}} 
instead. Of course that's probably out of the question (both because it breaks 
compatibility, and because of the huge hassle of refactoring all the Arrow code).
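The error-or-value pattern under discussion can be illustrated with a minimal stand-in (Arrow's implementation would be a C++ template like the linked protobuf StatusOr; Python is used here only to show the shape of the API):

```python
class Result:
    """Sketch of the 'status or value' pattern: a call returns one
    object carrying either a value or an error, never both."""

    def __init__(self, value=None, error=None):
        if (value is None) == (error is None):
            raise ValueError("exactly one of value/error must be set")
        self._value, self._error = value, error

    @property
    def ok(self):
        return self._error is None

    def value_or_raise(self):
        # Forces callers to handle the error path before touching the value.
        if not self.ok:
            raise RuntimeError(self._error)
        return self._value


def safe_div(a, b):
    # Instead of a bare status plus an out-parameter, the function
    # returns a single Result object.
    if b == 0:
        return Result(error="division by zero")
    return Result(value=a / b)
```

The point about side-by-side coexistence holds here too: a `Result`-returning overload has a different signature than a status-plus-out-parameter one, so APIs can migrate incrementally.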

> [C++] Create/port a StatusOr implementation to be able to return a status or 
> a type
> ---
>
> Key: ARROW-4800
> URL: https://issues.apache.org/jira/browse/ARROW-4800
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.14.0
>Reporter: Micah Kornfield
>Priority: Minor
>
> Example from grpc: 
> https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/stubs/statusor.h





[jira] [Commented] (ARROW-5381) Crash at arrow::internal::CountSetBits

2019-05-22 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845678#comment-16845678
 ] 

Antoine Pitrou commented on ARROW-5381:
---

Can you download and run this program:
https://docs.microsoft.com/en-us/sysinternals/downloads/coreinfo

Among its output will be a line saying "Supports POPCNT instruction". It will 
tell you whether the CPU supports the required instruction.
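The faulting instruction in the dump is a hardware {{popcnt}}. For context, the portable software fallback that a build targeting CPUs without POPCNT would have to use looks roughly like this (a classic SWAR bit-trick sketch, not Arrow's exact code):

```python
def popcount64(x: int) -> int:
    """Count set bits in a 64-bit word without a POPCNT instruction."""
    x &= 0xFFFFFFFFFFFFFFFF
    # Sum bits pairwise, then in nibbles, then in bytes.
    x = x - ((x >> 1) & 0x5555555555555555)
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    # Multiply spreads the byte sums; the top byte holds the total.
    return ((x * 0x0101010101010101) & 0xFFFFFFFFFFFFFFFF) >> 56

print(popcount64(0))          # 0
print(popcount64(0b1011))     # 3
print(popcount64(2**64 - 1))  # 64
```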

> Crash at arrow::internal::CountSetBits
> --
>
> Key: ARROW-5381
> URL: https://issues.apache.org/jira/browse/ARROW-5381
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: Operating System: Windows 7 Professional 64-bit (6.1, 
> Build 7601) Service Pack 1(7601.win7sp1_ldr_escrow.181110-1429)
> Language: English (Regional Setting: English)
> System Manufacturer: SAMSUNG ELECTRONICS CO., LTD.
> System Model: RV420/RV520/RV720/E3530/S3530/E3420/E3520
> BIOS: Phoenix SecureCore-Tiano(tm) NB Version 2.1 05PQ
> Processor: Intel(R) Pentium(R) CPU B950 @ 2.10GHz (2 CPUs), ~2.1GHz
> Memory: 2048MB RAM
> Available OS Memory: 1962MB RAM
>   Page File: 1517MB used, 2405MB available
> Windows Dir: C:\Windows
> DirectX Version: DirectX 11
>Reporter: Tham
>Priority: Major
>
> I've got a lot of crash dumps from a customer's Windows machine. The 
> stacktrace shows that it crashed at arrow::internal::CountSetBits.
>  
> {code:java}
> STACK_TEXT:  
> 00c9`5354a4c0 7ff7`2f2830fd : 00c9`544841c0 ` 
> `1e00 ` : 
> CortexService!arrow::internal::CountSetBits+0x16d
> 00c9`5354a550 7ff7`2f2834b7 : 00c9`5337c930 ` 
> ` ` : 
> CortexService!arrow::ArrayData::GetNullCount+0x8d
> 00c9`5354a580 7ff7`2f13df55 : 00c9`54476080 00c9`5354a5d8 
> ` ` : 
> CortexService!arrow::Array::null_count+0x37
> 00c9`5354a5b0 7ff7`2f13fb68 : 00c9`5354ab40 00c9`5354a6f8 
> 00c9`54476080 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::Visit >+0xa5
> 00c9`5354a640 7ff7`2f12fa34 : 00c9`5354a6f8 00c9`54476080 
> 00c9`5354ab40 ` : 
> CortexService!arrow::VisitArrayInline namespace'::LevelBuilder>+0x298
> 00c9`5354a680 7ff7`2f14bf03 : 00c9`5354ab40 00c9`5354a6f8 
> 00c9`54476080 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::VisitInline+0x44
> 00c9`5354a6c0 7ff7`2f12fe2a : 00c9`5354ab40 00c9`5354ae18 
> 00c9`54476080 00c9`5354b208 : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::GenerateLevels+0x93
> 00c9`5354aa00 7ff7`2f14de56 : 00c9`5354b1f8 00c9`5354afc8 
> 00c9`54476080 `1e00 : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::ArrowColumnWriter::Write+0x25a
> 00c9`5354af20 7ff7`2f14e66b : 00c9`5354b1f8 00c9`5354b238 
> 00c9`54445c20 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::ArrowColumnWriter::Write+0x2a6
> 00c9`5354b040 7ff7`2f12f137 : 00c9`544041f0 00c9`5354b4d8 
> 00c9`5354b4a8 ` : 
> CortexService!parquet::arrow::FileWriter::Impl::WriteColumnChunk+0x70b
> 00c9`5354b400 7ff7`2f14b4d5 : 00c9`54431180 00c9`5354b4d8 
> 00c9`5354b4a8 ` : 
> CortexService!parquet::arrow::FileWriter::WriteColumnChunk+0x67
> 00c9`5354b450 7ff7`2f12eef1 : 00c9`5354b5d8 00c9`5354b648 
> ` `1e00 : 
> CortexService!::operator()+0x195
> 00c9`5354b530 7ff7`2eb8e31e : 00c9`54431180 00c9`5354b760 
> 00c9`54442fb0 `1e00 : 
> CortexService!parquet::arrow::FileWriter::WriteTable+0x521
> 00c9`5354b730 7ff7`2eb58ac5 : 00c9`5307bd88 00c9`54442fb0 
> ` ` : 
> CortexService!Cortex::Storage::ParquetStreamWriter::writeRowGroup+0xfe
> 00c9`5354b860 7ff7`2eafdce6 : 00c9`5307bd80 00c9`5354ba08 
> 00c9`5354b9e0 00c9`5354b9d8 : 
> CortexService!Cortex::Storage::ParquetFileWriter::writeRowGroup+0x545
> 00c9`5354b9a0 7ff7`2eaf8bae : 00c9`53275600 00c9`53077220 
> `fffe ` : 
> CortexService!Cortex::Storage::DataStreamWriteWorker::onNewData+0x1a6
> {code}
> {code:java}
> FAILED_INSTRUCTION_ADDRESS: 
> CortexService!arrow::internal::CountSetBits+16d 
> [c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util\bit-util.cc
>  @ 99]
> 7ff7`2f3a4e4d f3480fb800  popcnt  rax,qword ptr [rax]
> FOLLOWUP_IP: 
> CortexService!arrow::internal::CountSetBits+16d 
> [c:\jenkins\workspace\cortexv2-dev-win64-service\

[jira] [Commented] (ARROW-5381) Crash at arrow::internal::CountSetBits

2019-05-22 Thread Tham (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845673#comment-16845673
 ] 

Tham commented on ARROW-5381:
-

> Are you running this in a VM?

No, it's not a virtual machine.

I've got another machine which has the same crash:
{code:java}
Operating System: Windows 10 Pro 64-bit (10.0, Build 10240) (10240.th1.170602-2340)
Language: English (Regional Setting: English)
System Manufacturer: HP
System Model: HP Laptop 14-bs0xx
BIOS: F.31
Processor: Intel(R) Celeron(R) CPU N3060 @ 1.60GHz (2 CPUs), ~1.6GHz
Memory: 4096MB RAM
Available OS Memory: 4002MB RAM
Page File: 2189MB used, 2516MB available
Windows Dir: C:\Windows
DirectX Version: 12
DX Setup Parameters: Not found
User DPI Setting: Using System DPI
System DPI Setting: 96 DPI (100 percent)
DWM DPI Scaling: Disabled
Miracast: Available, with HDCP
Microsoft Graphics Hybrid: Not Supported
DxDiag Version: 10.00.10240.16384 64bit Unicode
{code}
Can you please take a look?

> Crash at arrow::internal::CountSetBits
> --
>
> Key: ARROW-5381
> URL: https://issues.apache.org/jira/browse/ARROW-5381
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: Operating System: Windows 7 Professional 64-bit (6.1, 
> Build 7601) Service Pack 1(7601.win7sp1_ldr_escrow.181110-1429)
> Language: English (Regional Setting: English)
> System Manufacturer: SAMSUNG ELECTRONICS CO., LTD.
> System Model: RV420/RV520/RV720/E3530/S3530/E3420/E3520
> BIOS: Phoenix SecureCore-Tiano(tm) NB Version 2.1 05PQ
> Processor: Intel(R) Pentium(R) CPU B950 @ 2.10GHz (2 CPUs), ~2.1GHz
> Memory: 2048MB RAM
> Available OS Memory: 1962MB RAM
>   Page File: 1517MB used, 2405MB available
> Windows Dir: C:\Windows
> DirectX Version: DirectX 11
>Reporter: Tham
>Priority: Major
>
> I've got a lot of crash dumps from a customer's Windows machine. The 
> stacktrace shows that it crashed at arrow::internal::CountSetBits.
>  
> {code:java}
> STACK_TEXT:  
> 00c9`5354a4c0 7ff7`2f2830fd : 00c9`544841c0 ` 
> `1e00 ` : 
> CortexService!arrow::internal::CountSetBits+0x16d
> 00c9`5354a550 7ff7`2f2834b7 : 00c9`5337c930 ` 
> ` ` : 
> CortexService!arrow::ArrayData::GetNullCount+0x8d
> 00c9`5354a580 7ff7`2f13df55 : 00c9`54476080 00c9`5354a5d8 
> ` ` : 
> CortexService!arrow::Array::null_count+0x37
> 00c9`5354a5b0 7ff7`2f13fb68 : 00c9`5354ab40 00c9`5354a6f8 
> 00c9`54476080 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::Visit >+0xa5
> 00c9`5354a640 7ff7`2f12fa34 : 00c9`5354a6f8 00c9`54476080 
> 00c9`5354ab40 ` : 
> CortexService!arrow::VisitArrayInline namespace'::LevelBuilder>+0x298
> 00c9`5354a680 7ff7`2f14bf03 : 00c9`5354ab40 00c9`5354a6f8 
> 00c9`54476080 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::VisitInline+0x44
> 00c9`5354a6c0 7ff7`2f12fe2a : 00c9`5354ab40 00c9`5354ae18 
> 00c9`54476080 00c9`5354b208 : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::GenerateLevels+0x93
> 00c9`5354aa00 7ff7`2f14de56 : 00c9`5354b1f8 00c9`5354afc8 
> 00c9`54476080 `1e00 : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::ArrowColumnWriter::Write+0x25a
> 00c9`5354af20 7ff7`2f14e66b : 00c9`5354b1f8 00c9`5354b238 
> 00c9`54445c20 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::ArrowColumnWriter::Write+0x2a6
> 00c9`5354b040 7ff7`2f12f137 : 00c9`544041f0 00c9`5354b4d8 
> 00c9`5354b4a8 ` : 
> CortexService!parquet::arrow::FileWriter::Impl::WriteColumnChunk+0x70b
> 00c9`5354b400 7ff7`2f14b4d5 : 00c9`54431180 00c9`5354b4d8 
> 00c9`5354b4a8 ` : 
> CortexService!parquet::arrow::FileWriter::WriteColumnChunk+0x67
> 00c9`5354b450 7ff7`2f12eef1 : 00c9`5354b5d8 00c9`5354b648 
> ` `1e00 : 
> CortexService!::operator()+0x195
> 00c9`5354b530 7ff7`2eb8e31e : 00c9`54431180 00c9`5354b760 
> 00c9`54442fb0 `1e00 : 
> CortexService!parquet::arrow::FileWriter::WriteTable+0x521
> 00c9`5354b730 7ff7`2eb58ac5 : 00c9`5307bd88 00c9`54442fb0 
> ` ` : 
> CortexService!Cortex::Storage::ParquetStreamWriter::writeRowGroup+0xfe
> 00
