[jira] [Commented] (ARROW-4930) [Python] Remove LIBDIR assumptions in Python build

2019-09-18 Thread Suvayu Ali (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933039#comment-16933039
 ] 

Suvayu Ali commented on ARROW-4930:
---

I have some time this weekend, so I'll have a go at it.

> [Python] Remove LIBDIR assumptions in Python build
> --
>
> Key: ARROW-4930
> URL: https://issues.apache.org/jira/browse/ARROW-4930
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: setup.py
> Fix For: 2.0.0
>
>
> This is in reference to (4) in 
> [this|http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C0AF328A1-ED2A-457F-B72D-3B49C8614850%40xhochy.com%3E]
>  mailing list discussion.
> Certain sections of setup.py assume a specific location of the C++ libraries. 
> Removing this hard assumption will simplify PyArrow builds significantly. As 
> far as I could tell, these assumptions are made in the 
> {{build_ext._run_cmake()}} method (wherever bundling of the C++ libraries is 
> handled).
>  # The first occurrence is before invoking cmake (see line 237).
>  # The second occurrence is when the C++ libraries are moved from their build 
> directory to the Python tree (see line 347). The actual implementation is in 
> the function {{_move_shared_libs_unix(..)}} (see line 468).
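> As an illustration, a sketch (the helper name is hypothetical, not the actual 
> setup.py code) of locating the built libraries by searching instead of 
> assuming a fixed LIBDIR:
> {code}
> import glob
> import os
> 
> def _find_arrow_shared_libs(build_prefix):
>     # Search the common library dirs rather than hard-coding one LIBDIR.
>     found = []
>     for libdir in ('lib', 'lib64'):
>         found.extend(glob.glob(os.path.join(build_prefix, libdir, 'libarrow*')))
>     return found
> {code}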
> Hope this helps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6599) [Rust] [DataFusion] Implement SUM aggregate expression

2019-09-18 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-6599.
---
Resolution: Fixed

Issue resolved by pull request 5419
[https://github.com/apache/arrow/pull/5419]

> [Rust] [DataFusion] Implement SUM aggregate expression
> --
>
> Key: ARROW-6599
> URL: https://issues.apache.org/jira/browse/ARROW-6599
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust, Rust - DataFusion
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Implement the SUM aggregate function in the new physical query plan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6566) Implement VarChar in Scala

2019-09-18 Thread Liya Fan (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933018#comment-16933018
 ] 

Liya Fan commented on ARROW-6566:
-

I want to take a look at this issue later today, if the problem is not solved 
by then.

> Implement VarChar in Scala
> --
>
> Key: ARROW-6566
> URL: https://issues.apache.org/jira/browse/ARROW-6566
> Project: Apache Arrow
>  Issue Type: Test
>  Components: Java
>Affects Versions: 0.14.1
>Reporter: Boris V.Kuznetsov
>Priority: Major
>
> Hello
> I'm trying to write and read a zio.Chunk of strings, which is essentially an 
> array of strings.
> My implementation fails the test; how should I fix that?
> [Writer|https://github.com/Neurodyne/zio-serdes/blob/9e2128ff64ffa0e7c64167a5ee46584c3fcab9e4/src/main/scala/zio/serdes/arrow/ArrowUtils.scala#L48]
>  code
> [Reader|https://github.com/Neurodyne/zio-serdes/blob/9e2128ff64ffa0e7c64167a5ee46584c3fcab9e4/src/main/scala/zio/serdes/arrow/ArrowUtils.scala#L108]
>  code
> [Test|https://github.com/Neurodyne/zio-serdes/blob/9e2128ff64ffa0e7c64167a5ee46584c3fcab9e4/src/test/scala/arrow/Base.scala#L115]
>  code
> Any help, links and advice are highly appreciated
> Thank you!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-2317) [Python] fix C linkage warning

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2317.
-
Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5425
[https://github.com/apache/arrow/pull/5425]

> [Python] fix C linkage warning
> --
>
> Key: ARROW-2317
> URL: https://issues.apache.org/jira/browse/ARROW-2317
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Viktor Gal
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When using pyarrow interface from a c++ library one will get the following 
> compiler warning:
> {quote}{{warning: 'unwrap_table' has C-linkage specified, but returns 
> user-defined type 'arrow::Status' which is incompatible with C 
> [-Wreturn-type-c-linkage]}}
> {{ARROW_EXPORT Status unwrap_table(PyObject* table, std::shared_ptr<Table>* 
> out);}}
> {quote}
> This is due to a Cython artifact.
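> For illustration, a minimal sketch (not from the issue) of how the warning 
> arises: a function declared inside an {{extern "C"}} block that returns a C++ 
> type which is not C-compatible triggers it.
> {code}
> #include <string>
> 
> struct Status {
>   std::string msg;  // non-trivial member: Status is incompatible with C
> };
> 
> extern "C" {
>   // clang: warning: 'get_status' has C-linkage specified, but returns
>   // user-defined type 'Status' which is incompatible with C
>   // [-Wreturn-type-c-linkage]
>   Status get_status();
> }
> {code}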
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6573) [Python] Segfault when writing to parquet

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-6573:

Summary: [Python] Segfault when writing to parquet  (was: Segfault when 
writing to parquet)

> [Python] Segfault when writing to parquet
> -
>
> Key: ARROW-6573
> URL: https://issues.apache.org/jira/browse/ARROW-6573
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.14.1
> Environment: Ubuntu 16.04. Pyarrow 0.14.1 installed through pip. 
> Using Anaconda distribution of Python 3.7. 
>Reporter: Josh Weinstock
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When attempting to write out a pyarrow table to parquet I am observing a 
> segfault when there is a mismatch between the schema and the datatypes. 
> Here is a reproducible example:
>  
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> data = dict()
> data["key"] = [0, 1, 2, 3] # segfault
> #data["key"] = ["0", "1", "2", "3"] # no segfault
> schema = pa.schema({"key" : pa.string()})
> table = pa.Table.from_pydict(data, schema = schema)
> print("now writing out test file")
> pq.write_table(table, "test.parquet") 
> {code}
> This results in a segfault when writing the table. Running 
>  
> {code:java}
> gdb -ex r --args python test.py 
> {code}
> Yields
>  
>  
> {noformat}
> Program received signal SIGSEGV, Segmentation fault. 0x7fffe8173917 in 
> virtual thunk to 
> parquet::DictEncoderImpl<parquet::ByteArrayType>::Put(parquet::ByteArray 
> const*, int) () from 
> /net/fantasia/home/jweinstk/anaconda3/lib/python3.7/site-packages/pyarrow/libparquet.so.14
> {noformat}
>  
>  
> Thanks for all of your arrow work,
> Josh



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6573) [Python] Segfault when writing to parquet

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6573:
---

Assignee: Wes McKinney

> [Python] Segfault when writing to parquet
> -
>
> Key: ARROW-6573
> URL: https://issues.apache.org/jira/browse/ARROW-6573
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.14.1
> Environment: Ubuntu 16.04. Pyarrow 0.14.1 installed through pip. 
> Using Anaconda distribution of Python 3.7. 
>Reporter: Josh Weinstock
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When attempting to write out a pyarrow table to parquet I am observing a 
> segfault when there is a mismatch between the schema and the datatypes. 
> Here is a reproducible example:
>  
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> data = dict()
> data["key"] = [0, 1, 2, 3] # segfault
> #data["key"] = ["0", "1", "2", "3"] # no segfault
> schema = pa.schema({"key" : pa.string()})
> table = pa.Table.from_pydict(data, schema = schema)
> print("now writing out test file")
> pq.write_table(table, "test.parquet") 
> {code}
> This results in a segfault when writing the table. Running 
>  
> {code:java}
> gdb -ex r --args python test.py 
> {code}
> Yields
>  
>  
> {noformat}
> Program received signal SIGSEGV, Segmentation fault. 0x7fffe8173917 in 
> virtual thunk to 
> parquet::DictEncoderImpl<parquet::ByteArrayType>::Put(parquet::ByteArray 
> const*, int) () from 
> /net/fantasia/home/jweinstk/anaconda3/lib/python3.7/site-packages/pyarrow/libparquet.so.14
> {noformat}
>  
>  
> Thanks for all of your arrow work,
> Josh



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6573) Segfault when writing to parquet

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6573.
-
Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5426
[https://github.com/apache/arrow/pull/5426]

> Segfault when writing to parquet
> 
>
> Key: ARROW-6573
> URL: https://issues.apache.org/jira/browse/ARROW-6573
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.14.1
> Environment: Ubuntu 16.04. Pyarrow 0.14.1 installed through pip. 
> Using Anaconda distribution of Python 3.7. 
>Reporter: Josh Weinstock
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When attempting to write out a pyarrow table to parquet I am observing a 
> segfault when there is a mismatch between the schema and the datatypes. 
> Here is a reproducible example:
>  
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> data = dict()
> data["key"] = [0, 1, 2, 3] # segfault
> #data["key"] = ["0", "1", "2", "3"] # no segfault
> schema = pa.schema({"key" : pa.string()})
> table = pa.Table.from_pydict(data, schema = schema)
> print("now writing out test file")
> pq.write_table(table, "test.parquet") 
> {code}
> This results in a segfault when writing the table. Running 
>  
> {code:java}
> gdb -ex r --args python test.py 
> {code}
> Yields
>  
>  
> {noformat}
> Program received signal SIGSEGV, Segmentation fault. 0x7fffe8173917 in 
> virtual thunk to 
> parquet::DictEncoderImpl<parquet::ByteArrayType>::Put(parquet::ByteArray 
> const*, int) () from 
> /net/fantasia/home/jweinstk/anaconda3/lib/python3.7/site-packages/pyarrow/libparquet.so.14
> {noformat}
>  
>  
> Thanks for all of your arrow work,
> Josh



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5717) [Python] Support dictionary unification when converting variable dictionaries to pandas

2019-09-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932964#comment-16932964
 ] 

Wes McKinney commented on ARROW-5717:
-

I'm just barely going to be able to get this done in time for 0.15.0

> [Python] Support dictionary unification when converting variable dictionaries 
> to pandas
> ---
>
> Key: ARROW-5717
> URL: https://issues.apache.org/jira/browse/ARROW-5717
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> Follow up work to ARROW-5335



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-5717) [Python] Support dictionary unification when converting variable dictionaries to pandas

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5717:
---

Assignee: Wes McKinney

> [Python] Support dictionary unification when converting variable dictionaries 
> to pandas
> ---
>
> Key: ARROW-5717
> URL: https://issues.apache.org/jira/browse/ARROW-5717
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> Follow up work to ARROW-5335



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5343) [C++] Consider using Buffer for transpose maps in DictionaryType::Unify instead of std::vector

2019-09-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5343:
--
Labels: pull-request-available  (was: )

> [C++] Consider using Buffer for transpose maps in DictionaryType::Unify 
> instead of std::vector
> --
>
> Key: ARROW-5343
> URL: https://issues.apache.org/jira/browse/ARROW-5343
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> In the spirit of "track all the allocations", if dictionaries have 
> non-trivial length, we may want to account for this memory more precisely. 
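> A sketch of the direction (assuming the {{AllocateBuffer}} overload that 
> takes a {{MemoryPool}}): put the transpose map in a pool-tracked {{Buffer}} 
> instead of an untracked {{std::vector}}.
> {code}
> std::shared_ptr<Buffer> transpose_buf;
> RETURN_NOT_OK(AllocateBuffer(pool, length * sizeof(int32_t), &transpose_buf));
> auto* transpose_map = reinterpret_cast<int32_t*>(transpose_buf->mutable_data());
> {code}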



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6616) [Website] Release annoucement blog post for 0.15

2019-09-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6616:
--
Labels: pull-request-available  (was: )

> [Website] Release annoucement blog post for 0.15
> 
>
> Key: ARROW-6616
> URL: https://issues.apache.org/jira/browse/ARROW-6616
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Website
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6616) [Website] Release announcement blog post for 0.15

2019-09-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6616:
---
Summary: [Website] Release announcement blog post for 0.15  (was: [Website] 
Release annoucement blog post for 0.15)

> [Website] Release announcement blog post for 0.15
> -
>
> Key: ARROW-6616
> URL: https://issues.apache.org/jira/browse/ARROW-6616
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Website
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6616) [Website] Release annoucement blog post for 0.15

2019-09-18 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6616:
--

 Summary: [Website] Release annoucement blog post for 0.15
 Key: ARROW-6616
 URL: https://issues.apache.org/jira/browse/ARROW-6616
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.15.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6235) [R] Conversion from arrow::BinaryArray to R character vector not implemented

2019-09-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6235:
---
Fix Version/s: (was: 0.15.0)
   1.0.0

> [R] Conversion from arrow::BinaryArray to R character vector not implemented
> 
>
> Key: ARROW-6235
> URL: https://issues.apache.org/jira/browse/ARROW-6235
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> See unhandled case at 
> https://github.com/apache/arrow/blob/master/r/src/array__to_vector.cpp#L644



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6235) [R] Conversion from arrow::BinaryArray to R character vector not implemented

2019-09-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6235:
---
Description: 
See unhandled case at 

https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L644

  was:
See unhandled case at 

https://github.com/apache/arrow/blob/master/r/src/array__to_vector.cpp#L644


> [R] Conversion from arrow::BinaryArray to R character vector not implemented
> 
>
> Key: ARROW-6235
> URL: https://issues.apache.org/jira/browse/ARROW-6235
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> See unhandled case at 
> https://github.com/apache/arrow/blob/master/r/src/array_to_vector.cpp#L644



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6539) [R] Provide mechanism to write out old format

2019-09-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6539:
--

Assignee: Romain François  (was: Neal Richardson)

> [R] Provide mechanism to write out old format
> -
>
> Key: ARROW-6539
> URL: https://issues.apache.org/jira/browse/ARROW-6539
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Neal Richardson
>Assignee: Romain François
>Priority: Critical
> Fix For: 0.15.0
>
>
> See ARROW-6474. {{sparklyr}} will have the same issue so we should make sure 
> this is supported in R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6539) [R] Provide mechanism to write out old format

2019-09-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6539:
--

Assignee: Neal Richardson  (was: Romain François)

> [R] Provide mechanism to write out old format
> -
>
> Key: ARROW-6539
> URL: https://issues.apache.org/jira/browse/ARROW-6539
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Critical
> Fix For: 0.15.0
>
>
> See ARROW-6474. {{sparklyr}} will have the same issue so we should make sure 
> this is supported in R.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-5343) [C++] Consider using Buffer for transpose maps in DictionaryType::Unify instead of std::vector

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5343:
---

Assignee: Wes McKinney

> [C++] Consider using Buffer for transpose maps in DictionaryType::Unify 
> instead of std::vector
> --
>
> Key: ARROW-5343
> URL: https://issues.apache.org/jira/browse/ARROW-5343
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> In the spirit of "track all the allocations", if dictionaries have 
> non-trivial length, we may want to account for this memory more precisely. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6501) [C++] Remove non_zero_length field from SparseIndex

2019-09-18 Thread Kenta Murata (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932905#comment-16932905
 ] 

Kenta Murata commented on ARROW-6501:
-

No, this doesn’t block 0.15.0.

> [C++] Remove non_zero_length field from SparseIndex
> ---
>
> Key: ARROW-6501
> URL: https://issues.apache.org/jira/browse/ARROW-6501
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kenta Murata
>Assignee: Kenta Murata
>Priority: Major
> Fix For: 0.15.0
>
>
> We can remove non_zero_length field from SparseIndex because it can be 
> supplied from the shape of the indices tensor.
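> For a COO index, for example, the count would just be the first dimension of 
> the indices tensor (a sketch, assuming the current C++ accessors):
> {code}
> int64_t non_zero_length = sparse_index.indices()->shape()[0];
> {code}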



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6590) [C++] Do not require ARROW_JSON=ON when ARROW_IPC=ON

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6590.
-
Resolution: Fixed

Issue resolved by pull request 5428
[https://github.com/apache/arrow/pull/5428]

> [C++] Do not require ARROW_JSON=ON when ARROW_IPC=ON
> 
>
> Key: ARROW-6590
> URL: https://issues.apache.org/jira/browse/ARROW-6590
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> arrow/CMakeLists.txt currently has
> {code}
> if(ARROW_IPC AND NOT ARROW_JSON)
>   message(FATAL_ERROR "JSON support is required for Arrow IPC")
> endif()
> {code}
> Building the JSON scanner component should not be a prerequisite of building 
> IPC support.
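> An illustrative sketch of the decoupling (not necessarily the merged change; 
> see the PR above): tie the JSON requirement to the pieces that actually use it.
> {code}
> if(ARROW_BUILD_INTEGRATION AND NOT ARROW_JSON)
>   message(FATAL_ERROR "JSON support is required for integration tests")
> endif()
> {code}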



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-3298) [C++] Move murmur3 hash implementation to arrow/util

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3298.
---
Resolution: Won't Fix

> [C++] Move murmur3 hash implementation to arrow/util
> 
>
> Key: ARROW-3298
> URL: https://issues.apache.org/jira/browse/ARROW-3298
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> It would be good to consolidate hashing utility code in a central place (this 
> is currently in src/parquet)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3298) [C++] Move murmur3 hash implementation to arrow/util

2019-09-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932892#comment-16932892
 ] 

Wes McKinney commented on ARROW-3298:
-

Parquet has also dropped murmurhash from the Bloom filter implementation 
https://github.com/apache/parquet-format/commit/8f1783ec0b273e89c884b46c0f527d0a48321826#diff-d96aef0e8954afde569c8b40b8748081.
 So I'll close this one.

> [C++] Move murmur3 hash implementation to arrow/util
> 
>
> Key: ARROW-3298
> URL: https://issues.apache.org/jira/browse/ARROW-3298
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> It would be good to consolidate hashing utility code in a central place (this 
> is currently in src/parquet)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6501) [C++] Remove non_zero_length field from SparseIndex

2019-09-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932888#comment-16932888
 ] 

Wes McKinney commented on ARROW-6501:
-

Is this still a priority for 0.15.0?

> [C++] Remove non_zero_length field from SparseIndex
> ---
>
> Key: ARROW-6501
> URL: https://issues.apache.org/jira/browse/ARROW-6501
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kenta Murata
>Assignee: Kenta Murata
>Priority: Major
> Fix For: 0.15.0
>
>
> We can remove non_zero_length field from SparseIndex because it can be 
> supplied from the shape of the indices tensor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5086) [Python] Space leak in ParquetFile.read_row_group()

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932884#comment-16932884
 ] 

Antoine Pitrou commented on ARROW-5086:
---

Also, see ARROW-465. Not sure this requires a fix, though it's definitely 
surprising.

> [Python] Space leak in  ParquetFile.read_row_group()
> 
>
> Key: ARROW-5086
> URL: https://issues.apache.org/jira/browse/ARROW-5086
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Jakub Okoński
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.15.0
>
> Attachments: all.png, all.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I have a code pattern like this:
> {code}
> reader = pq.ParquetFile(path)
> for ix in range(0, reader.num_row_groups):
>     table = reader.read_row_group(ix, columns=self._columns)
>     # operate on table
> {code}
> But it leaks memory over time, only releasing it when the reader object is 
> collected. Here's a workaround:
> {code}
> num_row_groups = pq.ParquetFile(path).num_row_groups
> for ix in range(0, num_row_groups):
>     table = pq.ParquetFile(path).read_row_group(ix, columns=self._columns)
>     # operate on table
> {code}
> This puts an upper bound on memory usage and is what I'd expect from the 
> code. I also call gc.collect() at the end of every loop.
> I charted out memory usage for a small benchmark that just copies a file, one 
> row group at a time, converting to pandas and back to arrow on the writer 
> path. The black line is the first approach, using a single reader object; the 
> blue line instantiates a fresh reader in every iteration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6562) [GLib] Fix wrong sliced data of GArrowBuffer

2019-09-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-6562:
---
Component/s: GLib

> [GLib] Fix wrong sliced data of GArrowBuffer
> 
>
> Key: ARROW-6562
> URL: https://issues.apache.org/jira/browse/ARROW-6562
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: GLib
>Reporter: Sutou Kouhei
>Assignee: Sutou Kouhei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5086) [Python] Space leak in ParquetFile.read_row_group()

2019-09-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5086:
--
Labels: parquet pull-request-available  (was: parquet)

> [Python] Space leak in  ParquetFile.read_row_group()
> 
>
> Key: ARROW-5086
> URL: https://issues.apache.org/jira/browse/ARROW-5086
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Jakub Okoński
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.15.0
>
> Attachments: all.png, all.png
>
>
> I have a code pattern like this:
> {code}
> reader = pq.ParquetFile(path)
> for ix in range(0, reader.num_row_groups):
>     table = reader.read_row_group(ix, columns=self._columns)
>     # operate on table
> {code}
> But it leaks memory over time, only releasing it when the reader object is 
> collected. Here's a workaround:
> {code}
> num_row_groups = pq.ParquetFile(path).num_row_groups
> for ix in range(0, num_row_groups):
>     table = pq.ParquetFile(path).read_row_group(ix, columns=self._columns)
>     # operate on table
> {code}
> This puts an upper bound on memory usage and is what I'd expect from the 
> code. I also call gc.collect() at the end of every loop.
> I charted out memory usage for a small benchmark that just copies a file, one 
> row group at a time, converting to pandas and back to arrow on the writer 
> path. The black line is the first approach, using a single reader object; the 
> blue line instantiates a fresh reader in every iteration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5086) [Python] Space leak in ParquetFile.read_row_group()

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932883#comment-16932883
 ] 

Antoine Pitrou commented on ARROW-5086:
---

I would be surprised if the kernel copies the memory into RSS. I think the 
paged in memory simply accounts into RSS.

As long as there is no memory pressure, the kernel probably doesn't feel any 
need to page out those pages.

> [Python] Space leak in  ParquetFile.read_row_group()
> 
>
> Key: ARROW-5086
> URL: https://issues.apache.org/jira/browse/ARROW-5086
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Jakub Okoński
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.15.0
>
> Attachments: all.png, all.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I have a code pattern like this:
> {code}
> reader = pq.ParquetFile(path)
> for ix in range(0, reader.num_row_groups):
>     table = reader.read_row_group(ix, columns=self._columns)
>     # operate on table
> {code}
> But it leaks memory over time, only releasing it when the reader object is 
> collected. Here's a workaround:
> {code}
> num_row_groups = pq.ParquetFile(path).num_row_groups
> for ix in range(0, num_row_groups):
>     table = pq.ParquetFile(path).read_row_group(ix, columns=self._columns)
>     # operate on table
> {code}
> This puts an upper bound on memory usage and is what I'd expect from the 
> code. I also call gc.collect() at the end of every loop.
> I charted out memory usage for a small benchmark that just copies a file, one 
> row group at a time, converting to pandas and back to arrow on the writer 
> path. The black line is the first approach, using a single reader object; the 
> blue line instantiates a fresh reader in every iteration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5673) [Crossbow] Support GitLab runners

2019-09-18 Thread Sutou Kouhei (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932882#comment-16932882
 ] 

Sutou Kouhei commented on ARROW-5673:
-

GitHub Actions doesn't support self-hosted workers yet. GitLab supports it.

If we can use self-hosted workers, we can run tests that require GPUs, ARM, and 
so on in CI.

> [Crossbow] Support GitLab runners
> -
>
> Key: ARROW-5673
> URL: https://issues.apache.org/jira/browse/ARROW-5673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> Description is by [~kou]:
> I want to use GitLab Runner instead of CircleCI, because we can add custom 
> GitLab Runners of our own. For example, we can add a GPU-enabled GitLab Runner 
> to test a CUDA-enabled Apache Arrow build. We can also increase the timeout 
> beyond 5h for our GitLab Runners.
> We can use https://gitlab.com/ to run GitLab Runners: 
> https://about.gitlab.com/solutions/github/
> This feature isn't included in the Free tier on GitLab.com (it's available 
> with the Free tier as part of a campaign for now (*1)), but GitLab.com provides Gold 
> tier features to open source projects (*2). So we can use this feature by 
> choosing "CI/CD for external repo" in "New project page" 
> https://gitlab.com/projects/new .
> (*1)
> So, for the next year we are making the GitLab CI/CD for GitHub feature a 
> part of our GitLab.com Free tier.
> (*2)
> As part of our commitment to open source, we offer all public projects 
> our highest tier features (Gold) for free.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5086) [Python] Space leak in ParquetFile.read_row_group()

2019-09-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932875#comment-16932875
 ] 

Wes McKinney commented on ARROW-5086:
-

OK, I figured it out. I changed the reader call to

{code}
reader = pq.ParquetFile('/tmp/test.parquet', memory_map=False)
{code}

and the memory increase goes away. 

It seems that when we access parts of the memory map, the kernel copies the 
accessed memory into RSS. 
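
A minimal sketch of the observation loop (assuming {{psutil}} for the RSS 
readings; not the exact scripts used in the investigation):

{code}
import gc
import psutil
import pyarrow.parquet as pq

proc = psutil.Process()
reader = pq.ParquetFile('/tmp/test.parquet', memory_map=True)
for ix in range(reader.num_row_groups):
    table = reader.read_row_group(ix)
    del table
    gc.collect()
    # RSS keeps climbing while the memory-mapped reader stays alive
    print(ix, proc.memory_info().rss)
{code}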

> [Python] Space leak in  ParquetFile.read_row_group()
> 
>
> Key: ARROW-5086
> URL: https://issues.apache.org/jira/browse/ARROW-5086
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Jakub Okoński
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: all.png, all.png
>
>
> I have a code pattern like this:
> {code}
> reader = pq.ParquetFile(path)
> for ix in range(0, reader.num_row_groups):
>     table = reader.read_row_group(ix, columns=self._columns)
>     # operate on table
> {code}
> But it leaks memory over time, only releasing it when the reader object is 
> collected. Here's a workaround:
> {code}
> num_row_groups = pq.ParquetFile(path).num_row_groups
> for ix in range(0, num_row_groups):
>     table = pq.ParquetFile(path).read_row_group(ix, columns=self._columns)
>     # operate on table
> {code}
> This puts an upper bound on memory usage and is what I'd expect from the 
> code. I also call gc.collect() at the end of every loop.
> I charted out memory usage for a small benchmark that just copies a file, one 
> row group at a time, converting to pandas and back to arrow on the writer 
> path. The black line is the first approach, using a single reader object; the 
> blue line instantiates a fresh reader in every iteration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5086) [Python] Space leak in ParquetFile.read_row_group()

2019-09-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932871#comment-16932871
 ] 

Wes McKinney commented on ARROW-5086:
-

I've been looking at this for about an hour. This is really strange; here is 
the example code I'm using to investigate:

https://gist.github.com/wesm/27a1c65aa8329855ff80dd0157553fa5

here is the output

https://gist.github.com/wesm/8ad9f224b64862ca31c28183effa82b4

Weirdly, on each iteration RSS goes up by ~8MB, which is the amount of 
Arrow memory allocated in that iteration, even though the memory pool is 
claiming that the memory is being released. But then once the file reader 
object goes out of scope, RSS is released in bulk. 

I suspect that there is a rogue heap allocation someplace but I haven't found 
it yet. I checked that the destructors in the various C++ objects are firing on 
each iteration and no dice yet

> [Python] Space leak in  ParquetFile.read_row_group()
> 
>
> Key: ARROW-5086
> URL: https://issues.apache.org/jira/browse/ARROW-5086
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Jakub Okoński
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: all.png, all.png
>
>
> I have a code pattern like this:
> {code}
> reader = pq.ParquetFile(path)
> for ix in range(0, reader.num_row_groups):
>     table = reader.read_row_group(ix, columns=self._columns)
>     # operate on table
> {code}
> But it leaks memory over time, only releasing it when the reader object is 
> collected. Here's a workaround:
> {code}
> num_row_groups = pq.ParquetFile(path).num_row_groups
> for ix in range(0, num_row_groups):
>     table = pq.ParquetFile(path).read_row_group(ix, columns=self._columns)
>     # operate on table
> {code}
> This puts an upper bound on memory usage and is what I'd expect from the 
> code. I also call gc.collect() at the end of every loop.
> I charted out memory usage for a small benchmark that just copies a file, one 
> row group at a time, converting to pandas and back to arrow on the writer 
> path. The black line is the first approach, using a single reader object; the 
> blue line instantiates a fresh reader in every iteration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6615) [C++] Add filtering option to fs::Selector

2019-09-18 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6615:
-

 Summary: [C++] Add filtering option to fs::Selector
 Key: ARROW-6615
 URL: https://issues.apache.org/jira/browse/ARROW-6615
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Francois Saint-Jacques


It would be convenient if Selector could support file path filtering, either 
via a regex or globbing applied to the path.

This is more or less required for filtering files in Dataset so that the file 
format can be applied properly.
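
A sketch of what this could look like (the {{glob}} field is hypothetical, not 
an existing API):

{code}
// Hypothetical extension of arrow::fs::Selector with an optional glob filter.
struct Selector {
  std::string base_dir;    // directory to walk
  bool recursive = false;  // descend into subdirectories
  std::string glob;        // e.g. "*.parquet"; empty means no filtering
};
{code}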



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6378) [C++][Dataset] Implement TreeDataSource

2019-09-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6378:
--
Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Implement TreeDataSource
> ---
>
> Key: ARROW-6378
> URL: https://issues.apache.org/jira/browse/ARROW-6378
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: dataset, pull-request-available
>
> The TreeDataSource is required to support partition pruning of sub-trees.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6614) [C++][Dataset] Implement FileSystemDataSourceDiscovery

2019-09-18 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6614:
-

 Summary: [C++][Dataset] Implement FileSystemDataSourceDiscovery
 Key: ARROW-6614
 URL: https://issues.apache.org/jira/browse/ARROW-6614
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Francois Saint-Jacques


DataSourceDiscovery is what allows inferring a Schema and constructing a 
DataSource with a PartitionScheme.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-5086) [Python] Space leak in ParquetFile.read_row_group()

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-5086:
---

Assignee: Wes McKinney

> [Python] Space leak in  ParquetFile.read_row_group()
> 
>
> Key: ARROW-5086
> URL: https://issues.apache.org/jira/browse/ARROW-5086
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Jakub Okoński
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.15.0
>
> Attachments: all.png, all.png
>
>
> I have a code pattern like this:
> {code}
> reader = pq.ParquetFile(path)
> for ix in range(0, reader.num_row_groups):
>     table = reader.read_row_group(ix, columns=self._columns)
>     # operate on table
> {code}
> But it leaks memory over time, only releasing it when the reader object is 
> collected. Here's a workaround:
> {code}
> num_row_groups = pq.ParquetFile(path).num_row_groups
> for ix in range(0, num_row_groups):
>     table = pq.ParquetFile(path).read_row_group(ix, columns=self._columns)
>     # operate on table
> {code}
> This puts an upper bound on memory usage and is what I'd expect from the 
> code. I also call gc.collect() at the end of every loop.
> I charted out memory usage for a small benchmark that just copies a file, one 
> row group at a time, converting to pandas and back to arrow on the writer 
> path. The black line is the first approach, using a single reader object; the 
> blue line instantiates a fresh reader in every iteration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932827#comment-16932827
 ] 

Joris Van den Bossche commented on ARROW-1664:
--

In general, xarray datasets/dataarrays do not necessarily match Arrow's data 
model (e.g. they can have multiple dimensions). Of course, you can have a subset 
of cases where your xarray object would map nicely to an Arrow table.
Also, given that xarray uses contiguous multi-dimensional numpy arrays while 
Arrow uses 1D arrays, I am not sure that Arrow is well suited for zero-copy 
serialization of such objects (converting to arrow could be zero-copy, but not 
the other way around?).

So given that, I am not sure pyarrow should necessarily support xarray objects 
specifically. 
We could indeed think about a "table protocol", but for that I think it would 
be nice to have some more practical use cases.


> [Python] Support for xarray.DataArray and xarray.Dataset
> 
>
> Key: ARROW-1664
> URL: https://issues.apache.org/jira/browse/ARROW-1664
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mitar
>Priority: Minor
>
> DataArray and Dataset are efficient in-memory representations for multi 
> dimensional data. It would be great if one could share them between processes 
> using Arrow.
> http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
> http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-1564) [C++] Kernel functions for computing minimum and maximum of an array in one pass

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-1564.
---

> [C++] Kernel functions for computing minimum and maximum of an array in one 
> pass
> 
>
> Key: ARROW-1564
> URL: https://issues.apache.org/jira/browse/ARROW-1564
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>
> This is useful for determining whether a small-range integer O(n) sort can 
> be used in some circumstances. It can also be used for simply computing array 
> statistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-2200) [C++] Investigate use of google/crc32c vs xxhash in Plasma

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-2200.
---

> [C++] Investigate use of google/crc32c vs xxhash in Plasma
> --
>
> Key: ARROW-2200
> URL: https://issues.apache.org/jira/browse/ARROW-2200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Plasma
>Reporter: Zongheng Yang
>Priority: Major
>
> I'm looking into Plasma's use of XXH64 hash library, and whether we can 
> replace it with google/crc32c.
>  
> Here's my build 
> [change|https://github.com/concretevitamin/arrow/commit/e4abaddf55255bf2e773b1094287bfd99a6dfb69].
>  
>  
> With this change, for some reason, libcrc32c.a (which is successfully built) 
> did NOT get linked into the plasma_static library, whereas plasma_shared 
> and plasma_store did link with it:
>  
> ---
>  » tail ./src/plasma/CMakeFiles/plasma_\{static,shared,store}.dir/link.txt
> ==> ./src/plasma/CMakeFiles/plasma_static.dir/link.txt <==
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ar
>  qc ../../release/libplasma.a  CMakeFiles/plasma_objlib.dir/client.cc.o 
> CMakeFiles/plasma_objlib.dir/common.cc.o 
> CMakeFiles/plasma_objlib.dir/eviction_policy.cc.o 
> CMakeFiles/plasma_objlib.dir/events.cc.o 
> CMakeFiles/plasma_objlib.dir/fling.cc.o CMakeFiles/plasma_objlib.dir/io.cc.o 
> CMakeFiles/plasma_objlib.dir/malloc.cc.o 
> CMakeFiles/plasma_objlib.dir/plasma.cc.o 
> CMakeFiles/plasma_objlib.dir/protocol.cc.o 
> CMakeFiles/plasma_objlib.dir/thirdparty/ae/ae.c.o 
> CMakeFiles/plasma_objlib.dir/thirdparty/xxhash.cc.o
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib
>  ../../release/libplasma.a
>  
> ==> ./src/plasma/CMakeFiles/plasma_shared.dir/link.txt <==
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
>  -g -O3 -O3 -DNDEBUG  -Wall -std=c++11 -msse3 -stdlib=libc++  
> -Qunused-arguments  -D_XOPEN_SOURCE=500 -D_POSIX_C_SOURCE=200809L -fPIC -O3 
> -DNDEBUG -dynamiclib -Wl,-headerpad_max_install_names -undefined 
> dynamic_lookup  -o ../../release/libplasma.0.0.0.dylib -install_name 
> @rpath/libplasma.0.dylib 
> ...
> ../../crc32c_ep/src/crc32c_ep-install/lib/libcrc32c.a 
> ../../release/libarrow.a /usr/lib/libpthread.dylib 
> /usr/local/lib/libboost_system-mt.a /usr/local/lib/libboost_filesystem-mt.a
>  
> ==> ./src/plasma/CMakeFiles/plasma_store.dir/link.txt <==
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
>   -g -O3 -O3 -DNDEBUG  -Wall -std=c++11 -msse3 -stdlib=libc++  
> -Qunused-arguments  -D_XOPEN_SOURCE=500 -D_POSIX_C_SOURCE=200809L -fPIC -O3 
> -DNDEBUG -Wl,-search_paths_first -Wl,-headerpad_max_install_names  
> CMakeFiles/plasma_store.dir/store.cc.o  -o ../../release/plasma_store 
> ../../release/libplasma.a 
> ../../crc32c_ep/src/crc32c_ep-install/lib/libcrc32c.a 
> ../../release/libarrow.a /usr/lib/libpthread.dylib 
> /usr/local/lib/libboost_system-mt.a /usr/local/lib/libboost_filesystem-mt.a
> ---
>  
> Do you see what's going on?  What am I doing wrong to not have 
> "plasma_static" depend on "crc32c_ep"?
>  
> Any advice will be greatly appreciated,
> Zongheng



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3509) [C++] Inconsistent child accessor naming

2019-09-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932825#comment-16932825
 ] 

Wes McKinney commented on ARROW-3509:
-

Yeah, I think {{StructType::field}} would be better

> [C++] Inconsistent child accessor naming
> 
>
> Key: ARROW-3509
> URL: https://issues.apache.org/jira/browse/ARROW-3509
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Minor
> Fix For: 1.0.0
>
>
> {code}
> UnionArray::child
> StructArray::field
> DataType::child
> StructType::child
> Schema::field
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-983) [C++] Implement InputStream and OutputStream classes for interacting with socket connections

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-983.
--
Resolution: Won't Fix

Let's revisit when and if it is actually needed. 

> [C++] Implement InputStream and OutputStream classes for interacting with 
> socket connections
> 
>
> Key: ARROW-983
> URL: https://issues.apache.org/jira/browse/ARROW-983
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6555) [C++] Implement benchmarks for BufferedInputStream / BufferedOutputStream

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-6555.
---
Fix Version/s: (was: 1.0.0)
   Resolution: Not A Problem

> [C++] Implement benchmarks for BufferedInputStream / BufferedOutputStream
> -
>
> Key: ARROW-6555
> URL: https://issues.apache.org/jira/browse/ARROW-6555
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> See discussion in ARROW-4220



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-2196) [C++] Consider quarantining platform code with dependency on non-header Boost code

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932817#comment-16932817
 ] 

Antoine Pitrou commented on ARROW-2196:
---

If we avoid other usages of boost::filesystem then it should be relatively easy 
to remove uses in arrow::fs as well.

> [C++] Consider quarantining platform code with dependency on non-header Boost 
> code
> --
>
> Key: ARROW-2196
> URL: https://issues.apache.org/jira/browse/ARROW-2196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> see discussion in ARROW-2193 for the motivation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5377) [C++] Develop interface for writing a RecordBatch IPC stream into pre-allocated space (e.g. memory map) that avoids unnecessary serialization

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5377:

Fix Version/s: 1.0.0

> [C++] Develop interface for writing a RecordBatch IPC stream into 
> pre-allocated space (e.g. memory map) that avoids unnecessary serialization
> -
>
> Key: ARROW-5377
> URL: https://issues.apache.org/jira/browse/ARROW-5377
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> As discussed in a recent mailing list thread
> https://lists.apache.org/thread.html/b756209052fecb8c28a5eb37db7aecb82a5f5351fa79a9d86f0dba3e@%3Cuser.arrow.apache.org%3E
> The only viable process at the moment for getting an accurate report of 
> stream size is to write a simulated stream using {{MockOutputStream}}. This 
> is suboptimal for a couple of reasons:
> * Flatbuffers metadata must be created twice
> * Record batch disassembly into IpcPayload must be performed twice
> It seems like an interface with a very constrained public API could be 
> provided to deconstruct a sequence of RecordBatches and report the size of 
> the produced IPC stream (based on metadata sizes, and padding), and then this 
> deconstructed set of IPC payloads can be written out to a stream (e.g. using 
> {{FixedSizeBufferWriter}})
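> For reference, the workaround described above looks roughly like this (a 
> sketch against the ~0.15 C++ API):
> {code}
> arrow::io::MockOutputStream mock;  // counts bytes written, stores nothing
> std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
> RETURN_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(&mock, schema, &writer));
> RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
> RETURN_NOT_OK(writer->Close());
> int64_t stream_size = mock.GetExtentBytesWritten();
> {code}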



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-5377) [C++] Develop interface for writing a RecordBatch IPC stream into pre-allocated space (e.g. memory map) that avoids unnecessary serialization

2019-09-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932814#comment-16932814
 ] 

Wes McKinney edited comment on ARROW-5377 at 9/18/19 8:40 PM:
--

This is still incomplete -- having {{IpcPayload}} has gotten us most of the way 
there. I think we only need to implement a function to return the exact 
encapsulated message size given an {{IpcPayload}}, so that an appropriate piece 
of memory can be allocated. 


was (Author: wesmckinn):
This is still incomplete -- having {{IpcPayload} has gotten us most of the way 
there. I think we only need to implement a function to return the exact 
encapsulated message size given an {{IpcPayload}}, so that an appropriate piece 
of memory can be allocated. 

> [C++] Develop interface for writing a RecordBatch IPC stream into 
> pre-allocated space (e.g. memory map) that avoids unnecessary serialization
> -
>
> Key: ARROW-5377
> URL: https://issues.apache.org/jira/browse/ARROW-5377
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> As discussed in a recent mailing list thread
> https://lists.apache.org/thread.html/b756209052fecb8c28a5eb37db7aecb82a5f5351fa79a9d86f0dba3e@%3Cuser.arrow.apache.org%3E
> The only viable process at the moment for getting an accurate report of 
> stream size is to write a simulated stream using {{MockOutputStream}}. This 
> is suboptimal for a couple of reasons:
> * Flatbuffers metadata must be created twice
> * Record batch disassembly into IpcPayload must be performed twice
> It seems like an interface with a very constrained public API could be 
> provided to deconstruct a sequence of RecordBatches and report the size of 
> the produced IPC stream (based on metadata sizes, and padding), and then this 
> deconstructed set of IPC payloads can be written out to a stream (e.g. using 
> {{FixedSizeBufferWriter}})



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932813#comment-16932813
 ] 

Antoine Pitrou commented on ARROW-1664:
---

Ah, perhaps at some point we want to define a PyArrow table protocol like we 
already have a PyArrow array protocol. [~jorisvandenbossche] what do you think?

> [Python] Support for xarray.DataArray and xarray.Dataset
> 
>
> Key: ARROW-1664
> URL: https://issues.apache.org/jira/browse/ARROW-1664
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mitar
>Priority: Minor
>
> DataArray and Dataset are efficient in-memory representations for multi 
> dimensional data. It would be great if one could share them between processes 
> using Arrow.
> http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
> http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5377) [C++] Develop interface for writing a RecordBatch IPC stream into pre-allocated space (e.g. memory map) that avoids unnecessary serialization

2019-09-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932814#comment-16932814
 ] 

Wes McKinney commented on ARROW-5377:
-

This is still incomplete -- having {{IpcPayload} has gotten us most of the way 
there. I think we only need to implement a function to return the exact 
encapsulated message size given an {{IpcPayload}}, so that an appropriate piece 
of memory can be allocated. 

> [C++] Develop interface for writing a RecordBatch IPC stream into 
> pre-allocated space (e.g. memory map) that avoids unnecessary serialization
> -
>
> Key: ARROW-5377
> URL: https://issues.apache.org/jira/browse/ARROW-5377
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> As discussed in a recent mailing list thread
> https://lists.apache.org/thread.html/b756209052fecb8c28a5eb37db7aecb82a5f5351fa79a9d86f0dba3e@%3Cuser.arrow.apache.org%3E
> The only viable process at the moment for getting an accurate report of 
> stream size is to write a simulated stream using {{MockOutputStream}}. This 
> is suboptimal for a couple of reasons:
> * Flatbuffers metadata must be created twice
> * Record batch disassembly into IpcPayload must be performed twice
> It seems like an interface with a very constrained public API could be 
> provided to deconstruct a sequence of RecordBatches and report the size of 
> the produced IPC stream (based on metadata sizes, and padding), and then this 
> deconstructed set of IPC payloads can be written out to a stream (e.g. using 
> {{FixedSizeBufferWriter}})



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6613) [C++] Remove dependency on boost::filesystem

2019-09-18 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6613:
-

 Summary: [C++] Remove dependency on boost::filesystem
 Key: ARROW-6613
 URL: https://issues.apache.org/jira/browse/ARROW-6613
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou
 Fix For: 1.0.0


See ARROW-2196 for details.
boost::filesystem should not be required at least for base functionality 
(probably including the filesystem layer).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-2196) [C++] Consider quarantining platform code with dependency on non-header Boost code

2019-09-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2196:
--
Fix Version/s: 1.0.0

> [C++] Consider quarantining platform code with dependency on non-header Boost 
> code
> --
>
> Key: ARROW-2196
> URL: https://issues.apache.org/jira/browse/ARROW-2196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> see discussion in ARROW-2193 for the motivation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6612) [C++] Add ARROW_CSV CMake build flag

2019-09-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6612:
---

 Summary: [C++] Add ARROW_CSV CMake build flag
 Key: ARROW-6612
 URL: https://issues.apache.org/jira/browse/ARROW-6612
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


I think it would be better not to build this part of the project 
unconditionally



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6611) [C++] Make ARROW_JSON=OFF the default

2019-09-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6611:
---

 Summary: [C++] Make ARROW_JSON=OFF the default
 Key: ARROW-6611
 URL: https://issues.apache.org/jira/browse/ARROW-6611
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


The JSON-based functionality is only needed for 

* Integration tests
* Unit tests
* JSON scanning

If the user opts in to unit tests or integration tests, then we can flip it on, 
but I think that the user should opt in when building libarrow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Mitar (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932797#comment-16932797
 ] 

Mitar commented on ARROW-1664:
--

It is like an extension of DataFrame to multiple dimensions.
{quote}Xarray introduces labels in the form of dimensions, coordinates and 
attributes on top of raw [NumPy|http://www.numpy.org/]-like arrays, which 
allows for a more intuitive, more concise, and less error-prone developer 
experience. The package includes a large and growing library of domain-agnostic 
functions for advanced analytics and visualization with these data structures.

Xarray was inspired by and borrows heavily from 
[pandas|http://pandas.pydata.org/], the popular data analysis package focused 
on labelled tabular data.
{quote}
So internally it is backed by ndarrays. This is why I think serialization could be 
possible, similar to how Pandas DataFrames internally use ndarrays.
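
As a rough illustration of what is already possible by hand today (lossy, since
the labelled dimensions get flattened into ordinary columns; method names per
current xarray):

{code:python}
import pyarrow as pa
import xarray as xr

ds = xr.Dataset({"temp": (("x", "y"), [[1.0, 2.0], [3.0, 4.0]])})

# Flatten the labelled dimensions into columns, then ingest the resulting
# DataFrame. The n-dimensional structure is lost, which is why a dedicated
# conversion path for xarray would be preferable.
df = ds.to_dataframe().reset_index()
table = pa.Table.from_pandas(df)
{code}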

> [Python] Support for xarray.DataArray and xarray.Dataset
> 
>
> Key: ARROW-1664
> URL: https://issues.apache.org/jira/browse/ARROW-1664
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mitar
>Priority: Minor
>
> DataArray and Dataset are efficient in-memory representations for 
> multi-dimensional data. It would be great if one could share them between processes 
> using Arrow.
> http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
> http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6610) [C++] Add ARROW_FILESYSTEM=ON/OFF CMake configuration flag

2019-09-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6610:
---

 Summary: [C++] Add ARROW_FILESYSTEM=ON/OFF CMake configuration flag
 Key: ARROW-6610
 URL: https://issues.apache.org/jira/browse/ARROW-6610
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Building this code should not be required in order to take advantage of the 
columnar core (memory allocation, data structures, IPC)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-2196) [C++] Consider quarantining platform code with dependency on non-header Boost code

2019-09-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932794#comment-16932794
 ] 

Wes McKinney commented on ARROW-2196:
-

I think we should make the arrow/filesystem code optional with 
{{ARROW_FILESYSTEM=ON/OFF}} and try to avoid other usages of boost::filesystem 
when producing a barebones libarrow, per mailing list discussion

> [C++] Consider quarantining platform code with dependency on non-header Boost 
> code
> --
>
> Key: ARROW-2196
> URL: https://issues.apache.org/jira/browse/ARROW-2196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> see discussion in ARROW-2193 for the motivation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-1669) [C++] Consider adding Abseil (Google C++11 standard library extensions) to toolchain

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-1669.
---

> [C++] Consider adding Abseil (Google C++11 standard library extensions) to 
> toolchain
> 
>
> Key: ARROW-1669
> URL: https://issues.apache.org/jira/browse/ARROW-1669
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Google has released a library of C++11-compliant extensions to the STL that 
> may help make a lot of Arrow code simpler:
> https://github.com/abseil/abseil-cpp/
> This code is not header-only and so would require some effort to add to the 
> toolchain at the moment since it only supports the Bazel build system



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6609) [C++] Add minimal build Dockerfile example

2019-09-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6609:
--
Labels: pull-request-available  (was: )

> [C++] Add minimal build Dockerfile example
> --
>
> Key: ARROW-6609
> URL: https://issues.apache.org/jira/browse/ARROW-6609
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> This will also help developers test a minimal build configuration



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932790#comment-16932790
 ] 

Antoine Pitrou commented on ARROW-1664:
---

Does xarray have a Table-like or DataFrame-like concept?

> [Python] Support for xarray.DataArray and xarray.Dataset
> 
>
> Key: ARROW-1664
> URL: https://issues.apache.org/jira/browse/ARROW-1664
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mitar
>Priority: Minor
>
> DataArray and Dataset are efficient in-memory representations for 
> multi-dimensional data. It would be great if one could share them between processes 
> using Arrow.
> http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
> http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Mitar (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932788#comment-16932788
 ] 

Mitar commented on ARROW-1664:
--

I see. So why not then also have `pa.Table.from_xarray`?

> [Python] Support for xarray.DataArray and xarray.Dataset
> 
>
> Key: ARROW-1664
> URL: https://issues.apache.org/jira/browse/ARROW-1664
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mitar
>Priority: Minor
>
> DataArray and Dataset are efficient in-memory representations for multi 
> dimensional data. It would be great if one could share them between processes 
> using Arrow.
> http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
> http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-2051) [Python] Support serializing UUID objects to tables

2019-09-18 Thread Mitar (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932786#comment-16932786
 ] 

Mitar commented on ARROW-2051:
--

Sounds good. I will then explore how to do that through extension types.
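
For the record, a minimal sketch of what such an extension type might look like
on the Python side (the {{ExtensionType}} API details may differ between pyarrow
versions):

{code:python}
import uuid
import pyarrow as pa

class UuidType(pa.ExtensionType):
    def __init__(self):
        # Store UUIDs in their 16-byte (128-bit) representation
        super().__init__(pa.binary(16), "example.uuid")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to serialize

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return UuidType()

values = [uuid.uuid4() for _ in range(3)]
storage = pa.array([u.bytes for u in values], pa.binary(16))
arr = pa.ExtensionArray.from_storage(UuidType(), storage)
{code}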

> [Python] Support serializing UUID objects to tables
> ---
>
> Key: ARROW-2051
> URL: https://issues.apache.org/jira/browse/ARROW-2051
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Omer Katz
>Priority: Major
>
> UUID objects can be easily supported and can be represented as 128-bit 
> integers or a stream of bytes.
> The fastest way I know to construct a UUID object is by using its 128-bit 
> (16-byte) integer representation.
>  
> {code:java}
> %timeit uuid.UUID(int=24197857161011715162171839636988778104)
> 611 ns ± 6.27 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
> %timeit uuid.UUID(bytes=b'\x124Vx\x124Vx\x124Vx\x124Vx')
> 1.17 µs ± 7.5 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
> %timeit uuid.UUID('12345678-1234-5678-1234-567812345678')
> 1.47 µs ± 6.08 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
> {code}
>  
> Right now I have to do this manually which is pretty tedious.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6606) [C++] Construct tree structure from std::vector

2019-09-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6606:
--
Labels: pull-request-available  (was: )

> [C++] Construct tree structure from std::vector
> --
>
> Key: ARROW-6606
> URL: https://issues.apache.org/jira/browse/ARROW-6606
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
>
> This will be used by FileSystemDataSource for pushdown predicate pruning of 
> branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6605) [C++] Add recursion depth control to fs::Selector

2019-09-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6605:
--
Labels: pull-request-available  (was: )

> [C++] Add recursion depth control to fs::Selector
> -
>
> Key: ARROW-6605
> URL: https://issues.apache.org/jira/browse/ARROW-6605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> This is similar to the recursive option, but also controls the depth.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6606) [C++] Construct tree structure from std::vector

2019-09-18 Thread Francois Saint-Jacques (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932772#comment-16932772
 ] 

Francois Saint-Jacques commented on ARROW-6606:
---

This is my plan/implementation; PR coming in a few hours.

> [C++] Construct tree structure from std::vector
> --
>
> Key: ARROW-6606
> URL: https://issues.apache.org/jira/browse/ARROW-6606
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> This will be used by FileSystemDataSource for pushdown predicate pruning of 
> branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6590) [C++] Do not require ARROW_JSON=ON when ARROW_IPC=ON

2019-09-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6590:
--
Labels: pull-request-available  (was: )

> [C++] Do not require ARROW_JSON=ON when ARROW_IPC=ON
> 
>
> Key: ARROW-6590
> URL: https://issues.apache.org/jira/browse/ARROW-6590
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> arrow/CMakeLists.txt currently has
> {code}
> if(ARROW_IPC AND NOT ARROW_JSON)
>   message(FATAL_ERROR "JSON support is required for Arrow IPC")
> endif()
> {code}
> Building the JSON scanner component should not be a prerequisite for building 
> IPC support



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6609) [C++] Add minimal build Dockerfile example

2019-09-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6609:
---

 Summary: [C++] Add minimal build Dockerfile example
 Key: ARROW-6609
 URL: https://issues.apache.org/jira/browse/ARROW-6609
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.15.0


This will also help developers test a minimal build configuration



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6589) [C++] Support BinaryType in MakeArrayOfNull

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6589.
-
Resolution: Fixed

Issue resolved by pull request 5421
[https://github.com/apache/arrow/pull/5421]

> [C++] Support BinaryType in MakeArrayOfNull
> ---
>
> Key: ARROW-6589
> URL: https://issues.apache.org/jira/browse/ARROW-6589
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This function does not even return an error Status when passed an instance of 
> BinaryType



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6336) [Python] Clarify pyarrow.serialize/deserialize docstrings viz-a-viz relationship with Arrow IPC protocol

2019-09-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6336:
--
Labels: pull-request-available  (was: )

> [Python] Clarify pyarrow.serialize/deserialize docstrings viz-a-viz 
> relationship with Arrow IPC protocol
> 
>
> Key: ARROW-6336
> URL: https://issues.apache.org/jira/browse/ARROW-6336
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> Some users have mistakenly assumed that these functions are in some way 
> equivalent to IPC streams. We should add language explaining in more detail 
> what they do and when to use them
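
For context, one way the distinction could be illustrated (a sketch, not the
final docstring wording): {{pa.serialize}} handles (almost) arbitrary Python
objects with Arrow-specific framing, while the IPC stream API writes the
standard Arrow format that other implementations can read.

{code:python}
import pyarrow as pa

# pyarrow.serialize: ad-hoc serialization of general Python objects
buf = pa.serialize({"a": [1, 2, 3]}).to_buffer()
obj = pa.deserialize(buf)

# IPC stream API: the standard Arrow format for record batches
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["x"])
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()
stream_buf = sink.getvalue()
{code}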



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6590) [C++] Do not require ARROW_JSON=ON when ARROW_IPC=ON

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6590:
---

Assignee: Wes McKinney

> [C++] Do not require ARROW_JSON=ON when ARROW_IPC=ON
> 
>
> Key: ARROW-6590
> URL: https://issues.apache.org/jira/browse/ARROW-6590
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> arrow/CMakeLists.txt currently has
> {code}
> if(ARROW_IPC AND NOT ARROW_JSON)
>   message(FATAL_ERROR "JSON support is required for Arrow IPC")
> endif()
> {code}
> Building the JSON scanner component should not be a prerequisite for building 
> IPC support



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6336) [Python] Clarify pyarrow.serialize/deserialize docstrings viz-a-viz relationship with Arrow IPC protocol

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-6336:
---

Assignee: Wes McKinney

> [Python] Clarify pyarrow.serialize/deserialize docstrings viz-a-viz 
> relationship with Arrow IPC protocol
> 
>
> Key: ARROW-6336
> URL: https://issues.apache.org/jira/browse/ARROW-6336
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.15.0
>
>
> Some users have mistakenly assumed that these functions are in some way 
> equivalent to IPC streams. We should add language explaining in more detail 
> what they do and when to use them



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6474) [Python] Provide mechanism for python to write out old format

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6474.
-
Resolution: Fixed

Issue resolved by pull request 5396
[https://github.com/apache/arrow/pull/5396]

> [Python] Provide mechanism for python to write out old format
> -
>
> Key: ARROW-6474
> URL: https://issues.apache.org/jira/browse/ARROW-6474
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Format, Python
>Reporter: Micah Kornfield
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> I think this needs to be an environment variable, so it can be made to work 
> with old versions of the Java library in the pyspark integration.
>  
>  [~bryanc] can you check if this captures the requirements?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6527) [C++] Add OutputStream::Write() variant taking an owned buffer

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6527.
-
Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5406
[https://github.com/apache/arrow/pull/5406]

> [C++] Add OutputStream::Write() variant taking an owned buffer
> --
>
> Key: ARROW-6527
> URL: https://issues.apache.org/jira/browse/ARROW-6527
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When Write() takes an arbitrary data pointer and needs to buffer it, it is 
> mandatory to copy the data because the pointer may go stale, or the data may 
> be overwritten.
> But if the user has an immutable Buffer, then it should be enough to store 
> the Buffer as necessary, without doing a memory copy. We could add a special 
> Write() variant for that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6124) [C++] ArgSort kernel should sort in a single pass (with nulls)

2019-09-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932740#comment-16932740
 ] 

Wes McKinney commented on ARROW-6124:
-

In general I would like to see a mergesort implementation at some point that 
isn't leaning on these crutches
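
For intuition, the "single pass with nulls" behaviour the issue asks for looks
something like this NumPy sketch (illustrative only, not the kernel
implementation):

{code:python}
import numpy as np

def arg_sort_nulls_last(values, valid):
    values = np.asarray(values, dtype="float64")
    # Substitute +inf for null slots so one stable sort sends them last
    keys = np.where(valid, values, np.inf)
    return np.argsort(keys, kind="stable")

print(arg_sort_nulls_last([3.0, 1.0, 2.0, 0.0], [True, True, False, True]))
# -> [3 1 0 2]: ascending order, with the null at index 2 sorted last
{code}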

> [C++] ArgSort kernel should sort in a single pass (with nulls)
> --
>
> Key: ARROW-6124
> URL: https://issues.apache.org/jira/browse/ARROW-6124
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.0
>Reporter: Francois Saint-Jacques
>Priority: Minor
> Fix For: 2.0.0
>
>
> There's a good chance that merge sort must be implemented (spill to disk, 
> ChunkedArray, ...)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6608) [C++] Make default for ARROW_HDFS to be OFF

2019-09-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6608:
---

 Summary: [C++] Make default for ARROW_HDFS to be OFF
 Key: ARROW-6608
 URL: https://issues.apache.org/jira/browse/ARROW-6608
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This is one optional usage of {{boost::filesystem}} that could be eliminated 
from the simple "core" build



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6573) Segfault when writing to parquet

2019-09-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-6573:
--
Labels: pull-request-available  (was: )

> Segfault when writing to parquet
> 
>
> Key: ARROW-6573
> URL: https://issues.apache.org/jira/browse/ARROW-6573
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.14.1
> Environment: Ubuntu 16.04. Pyarrow 0.14.1 installed through pip. 
> Using Anaconda distribution of Python 3.7. 
>Reporter: Josh Weinstock
>Priority: Minor
>  Labels: pull-request-available
>
> When attempting to write out a pyarrow table to parquet I am observing a 
> segfault when there is a mismatch between the schema and the datatypes. 
> Here is a reproducible example:
>  
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> data = dict()
> data["key"] = [0, 1, 2, 3] # segfault
> #data["key"] = ["0", "1", "2", "3"] # no segfault
> schema = pa.schema({"key" : pa.string()})
> table = pa.Table.from_pydict(data, schema = schema)
> print("now writing out test file")
> pq.write_table(table, "test.parquet") 
> {code}
> This results in a segfault when writing the table. Running 
>  
> {code:java}
> gdb -ex r --args python test.py 
> {code}
> Yields
>  
>  
> {noformat}
> Program received signal SIGSEGV, Segmentation fault. 0x7fffe8173917 in 
> virtual thunk to 
> parquet::DictEncoderImpl 
> >::Put(parquet::ByteArray const*, int) () from 
> /net/fantasia/home/jweinstk/anaconda3/lib/python3.7/site-packages/pyarrow/libparquet.so.14
> {noformat}
>  
>  
> Thanks for all of your arrow work,
> Josh



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6337) [R] as_tibble in R API is a misnomer

2019-09-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-6337.

Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5399
[https://github.com/apache/arrow/pull/5399]

> [R] as_tibble in R API is a misnomer
> 
>
> Key: ARROW-6337
> URL: https://issues.apache.org/jira/browse/ARROW-6337
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: James Lamb
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> As of ARROW-5190, the R package's dependency on *tibble* has been moved from 
> *Imports* to *Suggests*. However, the signatures of *read_parquet()*, 
> *read_delim_arrow()*, *read_json_arrow()*, and *read_feather()* all still have 
> an argument called "as_tibble".
> Per [this 
> comment|https://github.com/apache/arrow/pull/4454#issuecomment-500563214], 
> the argument "as_tibble" is a misnomer. If that argument is TRUE, 
> those functions will return a data.frame, not a tibble. I agree with Neal's 
> comment... I think that instead of changing the return type based on a boolean, 
> it would be better for all the *read_** functions to return a data.frame and 
> for other types of operations to get their own functions.
> I think it's important to consider this type of user-facing breaking change 
> _now_ while the R package's version is still 0.x.
> As always, thanks for hearing me out!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6337) [R] as_tibble in R API is a misnomer

2019-09-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-6337:
--

Assignee: James Lamb

> [R] as_tibble in R API is a misnomer
> 
>
> Key: ARROW-6337
> URL: https://issues.apache.org/jira/browse/ARROW-6337
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: James Lamb
>Assignee: James Lamb
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> As of ARROW-5190, the R package's dependency on *tibble* has been moved from 
> *Imports* to *Suggests*. However, the signatures of *read_parquet()*, 
> *read_delim_arrow()*, *read_json_arrow()*, and *read_feather()* all still have 
> an argument called "as_tibble".
> Per [this 
> comment|https://github.com/apache/arrow/pull/4454#issuecomment-500563214], 
> the argument "as_tibble" is a misnomer. If that argument is TRUE, 
> those functions will return a data.frame, not a tibble. I agree with Neal's 
> comment... I think that instead of changing the return type based on a boolean, 
> it would be better for all the *read_** functions to return a data.frame and 
> for other types of operations to get their own functions.
> I think it's important to consider this type of user-facing breaking change 
> _now_ while the R package's version is still 0.x.
> As always, thanks for hearing me out!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-5333) [C++] Fit build option summary into narrower console

2019-09-18 Thread Benjamin Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Kietzman reassigned ARROW-5333:


Assignee: Benjamin Kietzman

> [C++] Fit build option summary into narrower console
> 
>
> Key: ARROW-5333
> URL: https://issues.apache.org/jira/browse/ARROW-5333
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Minor
> Fix For: 1.0.0
>
>
> Not urgent, but I noticed that the new build option summary has lines up to 
> 177 characters long in one of the outputs. It would be nice to fit this 
> output into an 80-char-wide or 100-char-wide console for better readability



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-5336) [C++] Implement arrow::Concatenate for dictionary-encoded arrays with unequal dictionaries

2019-09-18 Thread Benjamin Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Kietzman reassigned ARROW-5336:


Assignee: Benjamin Kietzman

> [C++] Implement arrow::Concatenate for dictionary-encoded arrays with unequal 
> dictionaries
> --
>
> Key: ARROW-5336
> URL: https://issues.apache.org/jira/browse/ARROW-5336
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Major
> Fix For: 1.0.0
>
>
> Currently (as of ARROW-3144) if any dictionary is different, an error is 
> returned



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6564) [Python] Do not require pandas for invoking Array.__array__

2019-09-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-6564.
-
Resolution: Fixed

Issue resolved by pull request 5402
[https://github.com/apache/arrow/pull/5402]

> [Python] Do not require pandas for invoking Array.__array__
> ---
>
> Key: ARROW-6564
> URL: https://issues.apache.org/jira/browse/ARROW-6564
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> See ARROW-6560



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6607) Support for set/list columns in python

2019-09-18 Thread Giora Simchoni (Jira)
Giora Simchoni created ARROW-6607:
-

 Summary: Support for set/list columns in python
 Key: ARROW-6607
 URL: https://issues.apache.org/jira/browse/ARROW-6607
 Project: Apache Arrow
  Issue Type: Wish
  Components: Python
 Environment: python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in 
Windows 10
Reporter: Giora Simchoni


Hi,

Using python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in Windows 10...

```python
import pandas as pd

df = pd.DataFrame({'a': [1,2,3], 'b': [set([1,2]), set([2,3]), set([3,4,5])]})

df.to_feather('test.ft')
```

I get:

```
Traceback (most recent call last):
 File "", line 1, in 
 File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
line 2131, in to_feather
 to_feather(self, fname)
 File 
"/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py", 
line 83, in to_feather
 feather.write_feather(df, path)
 File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
line 182, in write_feather
 writer.write(df)
 File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
line 93, in write
 table = Table.from_pandas(df, preserve_index=False)
 File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
 File 
"/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
line 496, in dataframe_to_arrays
 for c, f in zip(columns_to_convert, convert_fields)]
 File 
"/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
 line 496, in <listcomp>
 for c, f in zip(columns_to_convert, convert_fields)]
 File 
"/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
line 487, in convert_column
 raise e
 File 
"/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
line 481, in convert_column
 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
 File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
 File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
 File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert {1, 2} with type set: did not 
recognize Python value type when inferring an Arrow data type', 'Conversion 
failed for column b with type object')
```

And obviously `df.drop('b', axis=1).to_feather('test.ft')` works.

Questions:
(1) Is it possible to support these kinds of set/list columns?
(2) Does anyone have an idea how to deal with this? I *cannot* unnest these 
set/list columns, as that would explode the DataFrame. My only other idea is to 
convert the set `{1,2}` into a string `1,2` and parse it after reading the file, 
hoping it won't be slow (a sketch of this workaround follows below).
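
A minimal sketch of that stringify workaround, assuming pandas with feather
support (the encode/decode lambdas are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [{1, 2}, {2, 3}, {3, 4, 5}]})

# Encode each set as a comma-separated string before writing...
df['b'] = df['b'].apply(lambda s: ','.join(map(str, sorted(s))))
df.to_feather('test.ft')

# ...and decode it back into a set after reading.
back = pd.read_feather('test.ft')
back['b'] = back['b'].apply(lambda v: set(map(int, v.split(','))))
```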

 

Update:

With lists column the error is different:

```python
import pandas as pd

df = pd.DataFrame({'a': [1,2,3], 'b': [[1,2], [2,3], [3,4,5]]})

df.to_feather('test.ft')
```

```

Traceback (most recent call last):
 File "", line 1, in 
 File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
line 2131, in to_feather
 to_feather(self, fname)
 File 
"/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py", 
line 83, in to_feather
 feather.write_feather(df, path)
 File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
line 182, in write_feather
 writer.write(df)
 File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
line 97, in write
 self.writer.write_array(name, col.data.chunk(0))
 File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherWriter.write_array
 File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: list

```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932715#comment-16932715
 ] 

Antoine Pitrou commented on ARROW-6157:
---

Hmm. Perhaps that validation can be moved to a separate method :-)
Then we'll have to make sure that all tests call the thorough validation 
method, rather than the light one.
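
For instance, the thorough method could perform O(n) data checks such as this
Python sketch of validating a dense union's type ids (names are illustrative,
not final API):

{code:python}
def validate_dense_union(arr):
    # Every type id must select an existing child array
    num_children = arr.type.num_children
    for i, code in enumerate(arr.type_codes.to_pylist()):
        if not 0 <= code < num_children:
            raise ValueError(
                "type id {} at index {} is out of range".format(code, i))
{code}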

> [Python][C++] UnionArray with invalid data passes validation / leads to 
> segfaults
> -
>
> Key: ARROW-6157
> URL: https://issues.apache.org/jira/browse/ARROW-6157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> From the Python side, you can create an "invalid" UnionArray:
> {code}
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary') 
> int64 = pa.array([1, 2, 3], type='int64') 
> types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')   # <- value of 2 is out 
> of bounds for the number of children
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
> a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
> {code}
> E.g. on conversion to Python this leads to a segfault:
> {code}
> In [7]: a.to_pylist()
> Segmentation fault (core dumped)
> {code}
> On the other hand, doing an explicit validation does not give an error:
> {code}
> In [8]: a.validate()
> {code}
> Should the validation raise errors for this case? (the C++ 
> {{ValidateVisitor}} for UnionArray does nothing) 
> (so that this can be called from the Python API to avoid creating invalid 
> arrays / segfaults there)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6606) [C++] Construct tree structure from std::vector

2019-09-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6606:
--
Component/s: C++

> [C++] Construct tree structure from std::vector
> --
>
> Key: ARROW-6606
> URL: https://issues.apache.org/jira/browse/ARROW-6606
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> This will be used by FileSystemDataSource for pushdown predicate pruning of 
> branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6606) [C++] Construct tree structure from std::vector

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932711#comment-16932711
 ] 

Antoine Pitrou commented on ARROW-6606:
---

You mean a tree that mimics the filesystem hierarchy of the given filesystem 
entries? Perhaps that can be done at the filesystem level.
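
For intuition, building such a tree from a flat listing is straightforward; a
Python sketch of the idea (the C++ helper would work over
std::vector<FileStats>):

{code:python}
def build_tree(paths):
    # Insert each '/'-delimited path into a nested dict of children
    root = {}
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})
    return root

tree = build_tree(["a/x.parquet", "a/y.parquet", "b/z.parquet"])
# {'a': {'x.parquet': {}, 'y.parquet': {}}, 'b': {'z.parquet': {}}}
{code}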

> [C++] Construct tree structure from std::vector
> --
>
> Key: ARROW-6606
> URL: https://issues.apache.org/jira/browse/ARROW-6606
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Francois Saint-Jacques
>Priority: Major
>
> This will be used by FileSystemDataSource for pushdown predicate pruning of 
> branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4830) [Python] Remove backward compatibility hacks from pyarrow.pandas_compat

2019-09-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932708#comment-16932708
 ] 

Joris Van den Bossche commented on ARROW-4830:
--

For this, I think we should ideally decide which minimum version of pyarrow 
we want to support (and then also test for this). Or at least have something 
to communicate, like: "files written by pyarrow >= xx can be read"

> [Python] Remove backward compatibility hacks from pyarrow.pandas_compat
> ---
>
> Key: ARROW-4830
> URL: https://issues.apache.org/jira/browse/ARROW-4830
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This code is growing less maintainable. I think we can remove these backwards 
> compatibility hacks since there are released versions of pyarrow that can be 
> used to read old metadata and "fix" Parquet files if need be



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-09-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932703#comment-16932703
 ] 

Joris Van den Bossche commented on ARROW-6157:
--

The ListArray validation actually does something like the latter (it checks if 
all offsets are valid), so there is at least _some_ precedent.

> [Python][C++] UnionArray with invalid data passes validation / leads to 
> segfaults
> -
>
> Key: ARROW-6157
> URL: https://issues.apache.org/jira/browse/ARROW-6157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> From the Python side, you can create an "invalid" UnionArray:
> {code}
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary') 
> int64 = pa.array([1, 2, 3], type='int64') 
> types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')   # <- value of 2 is out 
> of bounds for the number of children
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
> a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
> {code}
> E.g. on conversion to Python this leads to a segfault:
> {code}
> In [7]: a.to_pylist()
> Segmentation fault (core dumped)
> {code}
> On the other hand, doing an explicit validation does not give an error:
> {code}
> In [8]: a.validate()
> {code}
> Should the validation raise errors for this case? (the C++ 
> {{ValidateVisitor}} for UnionArray does nothing) 
> (so that this can be called from the Python API to avoid creating invalid 
> arrays / segfaults there)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6606) [C++] Construct tree structure from std::vector

2019-09-18 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6606:
-

 Summary: [C++] Construct tree structure from 
std::vector
 Key: ARROW-6606
 URL: https://issues.apache.org/jira/browse/ARROW-6606
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


This will be used by FileSystemDataSource for pushdown predicate pruning of 
branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6605) [C++] Add recursion depth control to fs::Selector

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932695#comment-16932695
 ] 

Antoine Pitrou commented on ARROW-6605:
---

Do you expect it to be necessary for the dataset work?

> [C++] Add recursion depth control to fs::Selector
> -
>
> Key: ARROW-6605
> URL: https://issues.apache.org/jira/browse/ARROW-6605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Minor
> Fix For: 1.0.0
>
>
> This is similar to the recursive options, but also control the depth.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6605) [C++] Add recursion depth control to fs::Selector

2019-09-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6605:
--
Fix Version/s: 1.0.0

> [C++] Add recursion depth control to fs::Selector
> -
>
> Key: ARROW-6605
> URL: https://issues.apache.org/jira/browse/ARROW-6605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Francois Saint-Jacques
>Priority: Minor
> Fix For: 1.0.0
>
>
> This is similar to the recursive options, but also control the depth.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-09-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6157:
--
Fix Version/s: 1.0.0

> [Python][C++] UnionArray with invalid data passes validation / leads to 
> segfaults
> -
>
> Key: ARROW-6157
> URL: https://issues.apache.org/jira/browse/ARROW-6157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 1.0.0
>
>
> From the Python side, you can create an "invalid" UnionArray:
> {code}
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary') 
> int64 = pa.array([1, 2, 3], type='int64') 
> types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')   # <- value of 2 is out 
> of bounds for the number of children
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
> a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
> {code}
> E.g. on conversion to Python this leads to a segfault:
> {code}
> In [7]: a.to_pylist()
> Segmentation fault (core dumped)
> {code}
> On the other hand, doing an explicit validation does not give an error:
> {code}
> In [8]: a.validate()
> {code}
> Should the validation raise errors for this case? (the C++ 
> {{ValidateVisitor}} for UnionArray does nothing) 
> (so that this can be called from the Python API to avoid creating invalid 
> arrays / segfaults there)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932693#comment-16932693
 ] 

Antoine Pitrou commented on ARROW-6157:
---

Yes, we may need a {{ValidateData}} method that's more thorough.

> [Python][C++] UnionArray with invalid data passes validation / leads to 
> segfaults
> -
>
> Key: ARROW-6157
> URL: https://issues.apache.org/jira/browse/ARROW-6157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> From the Python side, you can create an "invalid" UnionArray:
> {code}
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary') 
> int64 = pa.array([1, 2, 3], type='int64') 
> types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')   # <- value of 2 is out 
> of bounds for the number of children
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
> a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
> {code}
> E.g. on conversion to Python this leads to a segfault:
> {code}
> In [7]: a.to_pylist()
> Segmentation fault (core dumped)
> {code}
> On the other hand, doing an explicit validation does not give an error:
> {code}
> In [8]: a.validate()
> {code}
> Should the validation raise errors for this case? (the C++ 
> {{ValidateVisitor}} for UnionArray does nothing) 
> (so that this can be called from the Python API to avoid creating invalid 
> arrays / segfaults there)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6573) Segfault when writing to parquet

2019-09-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932691#comment-16932691
 ] 

Wes McKinney commented on ARROW-6573:
-

This raises an exception in master

{code}
import pyarrow as pa
import pyarrow.parquet as pq

data = dict()
data["key"] = [0, 1, 2, 3] # segfault
#data["key"] = ["0", "1", "2", "3"] # no segfault

schema = pa.schema({"key" : pa.string()})

table = pa.Table.from_pydict(data, schema = schema)
print("now writing out test file")
pq.write_table(table, "test.parquet")

## -- End pasted text --
---
ArrowTypeErrorTraceback (most recent call last)
 in 
  8 schema = pa.schema({"key" : pa.string()})
  9 
---> 10 table = pa.Table.from_pydict(data, schema = schema)
 11 print("now writing out test file")
 12 pq.write_table(table, "test.parquet")

~/code/arrow/python/pyarrow/types.pxi in __iter__()

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: Expected a string or bytes object, got a 'int' object
In ../src/arrow/python/common.h, line 241, code: FromBinary(obj, "a string or 
bytes object")
In ../src/arrow/python/python_to_arrow.cc, line 549, code: 
string_view_.FromString(obj, _utf8)
In ../src/arrow/python/python_to_arrow.cc, line 570, code: Append(obj, _full)
In ../src/arrow/python/iterators.h, line 70, code: func(value, 
static_cast(i), _going)
In ../src/arrow/python/python_to_arrow.cc, line 1097, code: 
converter->AppendMultiple(seq, size)
{code}

Might want to add a unit test, though
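
Something like this, perhaps (a sketch only; test name and placement are
hypothetical):

{code:python}
import pytest
import pyarrow as pa

def test_from_pydict_schema_type_mismatch():
    # Integer data declared as string in the schema should raise, not crash
    data = {"key": [0, 1, 2, 3]}
    schema = pa.schema({"key": pa.string()})
    with pytest.raises(pa.ArrowTypeError):
        pa.Table.from_pydict(data, schema=schema)
{code}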

> Segfault when writing to parquet
> 
>
> Key: ARROW-6573
> URL: https://issues.apache.org/jira/browse/ARROW-6573
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.14.1
> Environment: Ubuntu 16.04. Pyarrow 0.14.1 installed through pip. 
> Using Anaconda distribution of Python 3.7. 
>Reporter: Josh Weinstock
>Priority: Minor
>
> When attempting to write out a pyarrow table to parquet I am observing a 
> segfault when there is a mismatch between the schema and the datatypes. 
> Here is a reproducible example:
>  
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq
> data = dict()
> data["key"] = [0, 1, 2, 3] # segfault
> #data["key"] = ["0", "1", "2", "3"] # no segfault
> schema = pa.schema({"key" : pa.string()})
> table = pa.Table.from_pydict(data, schema = schema)
> print("now writing out test file")
> pq.write_table(table, "test.parquet") 
> {code}
> This results in a segfault when writing the table. Running 
>  
> {code:java}
> gdb -ex r --args python test.py 
> {code}
> Yields
>  
>  
> {noformat}
> Program received signal SIGSEGV, Segmentation fault. 0x7fffe8173917 in 
> virtual thunk to 
> parquet::DictEncoderImpl 
> >::Put(parquet::ByteArray const*, int) () from 
> /net/fantasia/home/jweinstk/anaconda3/lib/python3.7/site-packages/pyarrow/libparquet.so.14
> {noformat}
>  
>  
> Thanks for all of your arrow work,
> Josh



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6605) [C++] Add recursion depth control to fs::Selector

2019-09-18 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6605:
-

 Summary: [C++] Add recursion depth control to fs::Selector
 Key: ARROW-6605
 URL: https://issues.apache.org/jira/browse/ARROW-6605
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Francois Saint-Jacques


This is similar to the recursive options, but also control the depth.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5673) [Crossbow] Support GitLab runners

2019-09-18 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932690#comment-16932690
 ] 

Neal Richardson commented on ARROW-5673:


Now that GitHub Actions are coming out, would that be better/more natural than 
using GitLab?

> [Crossbow] Support GitLab runners
> -
>
> Key: ARROW-5673
> URL: https://issues.apache.org/jira/browse/ARROW-5673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Priority: Major
> Fix For: 2.0.0
>
>
> Description is by [~kou]:
> I want to use GitLab Runner instead of CircleCI, because we can add our own 
> custom GitLab Runners. For example, we can add a GPU-enabled GitLab Runner to 
> test CUDA-enabled Apache Arrow builds. We can also increase the timeout beyond 
> 5h for our GitLab Runners.
> We can use https://gitlab.com/ to run GitLab Runners: 
> https://about.gitlab.com/solutions/github/
> This feature isn't included in the Free tier on GitLab.com (it's available 
> in the Free tier as part of a campaign for now (*1)) but GitLab.com provides Gold 
> tier features to open source projects (*2). So we can use this feature by 
> choosing "CI/CD for external repo" in "New project page" 
> https://gitlab.com/projects/new .
> (*1)
> So, for the next year we are making the GitLab CI/CD for GitHub feature a 
> part of our GitLab.com Free tier.
> (*2)
> As part of our commitment to open source, we offer all public projects 
> our highest tier features (Gold) for free.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-5854) [Python] Expose compare kernels on Array class

2019-09-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-5854:
--
Fix Version/s: 2.0.0

> [Python] Expose compare kernels on Array class
> --
>
> Key: ARROW-5854
> URL: https://issues.apache.org/jira/browse/ARROW-5854
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 2.0.0
>
>
> Expose the compare kernel for comparing with scalar or array (ARROW-3087, 
> ARROW-4990) on the python Array class.
> This can implement the {{\_\_eq\_\_}} et al dunder methods on the Array class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932688#comment-16932688
 ] 

Antoine Pitrou commented on ARROW-1664:
---

> There is no special handling of Pandas DataFrame in arrow?

What do you mean? You can ingest a DataFrame using pa.Table.from_pandas(), for 
example.

> [Python] Support for xarray.DataArray and xarray.Dataset
> 
>
> Key: ARROW-1664
> URL: https://issues.apache.org/jira/browse/ARROW-1664
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Mitar
>Priority: Minor
>
> DataArray and Dataset are efficient in-memory representations for 
> multi-dimensional data. It would be great if one could share them between processes 
> using Arrow.
> http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
> http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4633) [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932685#comment-16932685
 ] 

Antoine Pitrou commented on ARROW-4633:
---

[~jorisvandenbossche]

 

> [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway
> --
>
> Key: ARROW-4633
> URL: https://issues.apache.org/jira/browse/ARROW-4633
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.12.0
> Environment: Linux, Python 3.7.1, pyarrow.__version__ = 0.12.0
>Reporter: Taylor Johnson
>Priority: Minor
>  Labels: newbie, parquet
>
> The following code seems to suggest that ParquetFile.read(use_threads=False) 
> still creates a ThreadPool.  This is observed in 
> ParquetFile.read_row_group(use_threads=False) as well. 
> This does not appear to be a problem in 
> pyarrow.Table.to_pandas(use_threads=False).
> I've tried tracing the error.  Starting in python/pyarrow/parquet.py, both 
> ParquetReader.read_all() and ParquetReader.read_row_group() pass the 
> use_threads input along to self.reader which is a ParquetReader imported from 
> _parquet.pyx
> Following the calls into python/pyarrow/_parquet.pyx, we see that 
> ParquetReader.read_all() and ParquetReader.read_row_group() have the 
> following code which seems a bit suspicious
> {quote}if use_threads:
>     self.set_use_threads(use_threads)
> {quote}
> Why not just always call self.set_use_threads(use_threads)?
> The ParquetReader.set_use_threads simply calls 
> self.reader.get().set_use_threads(use_threads).  This self.reader is assigned 
> as unique_ptr[FileReader].  I think this points to 
> cpp/src/parquet/arrow/reader.cc, but I'm not sure about that.  The 
> FileReader::Impl::ReadRowGroup logic looks OK, since 
> ::arrow::internal::GetCpuThreadPool() is only called if use_threads is True.  
> The same is true for ReadTable.
> So when is the ThreadPool getting created?
> Example code:
> --
> {quote}import pandas as pd
> import psutil
> import pyarrow as pa
> import pyarrow.parquet as pq
> use_threads=False
> p=psutil.Process()
> print('Starting with {} threads'.format(p.num_threads()))
> df = pd.DataFrame({'x':[0]})
> table = pa.Table.from_pandas(df)
> print('After table creation, {} threads'.format(p.num_threads()))
> df = table.to_pandas(use_threads=use_threads)
> print('table.to_pandas(use_threads={}), {} threads'.format(use_threads, 
> p.num_threads()))
> writer = pq.ParquetWriter('tmp.parquet', table.schema)
> writer.write_table(table)
> writer.close()
> print('After writing parquet file, {} threads'.format(p.num_threads()))
> pf = pq.ParquetFile('tmp.parquet')
> print('After ParquetFile, {} threads'.format(p.num_threads()))
> df = pf.read(use_threads=use_threads).to_pandas()
> print('After pf.read(use_threads={}), {} threads'.format(use_threads, 
> p.num_threads()))
> {quote}
> ---
> $ python pyarrow_test.py
> Starting with 1 threads
> After table creation, 1 threads
> table.to_pandas(use_threads=False), 1 threads
> After writing parquet file, 1 threads
> After ParquetFile, 1 threads
> After pf.read(use_threads=False), 5 threads



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4633) [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway

2019-09-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4633:
--
Fix Version/s: 1.0.0

> [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway
> --
>
> Key: ARROW-4633
> URL: https://issues.apache.org/jira/browse/ARROW-4633
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1, 0.12.0
> Environment: Linux, Python 3.7.1, pyarrow.__version__ = 0.12.0
>Reporter: Taylor Johnson
>Priority: Minor
>  Labels: newbie, parquet
> Fix For: 1.0.0
>
>
> The following code seems to suggest that ParquetFile.read(use_threads=False) 
> still creates a ThreadPool.  This is observed in 
> ParquetFile.read_row_group(use_threads=False) as well. 
> This does not appear to be a problem in 
> pyarrow.Table.to_pandas(use_threads=False).
> I've tried tracing the error.  Starting in python/pyarrow/parquet.py, both 
> ParquetReader.read_all() and ParquetReader.read_row_group() pass the 
> use_threads input along to self.reader, which is a ParquetReader imported 
> from _parquet.pyx.
> Following the calls into python/pyarrow/_parquet.pyx, we see that 
> ParquetReader.read_all() and ParquetReader.read_row_group() contain the 
> following code, which seems a bit suspicious:
> {code}
> if use_threads:
>     self.set_use_threads(use_threads)
> {code}
> Why not just always call self.set_use_threads(use_threads)?
> ParquetReader.set_use_threads simply calls 
> self.reader.get().set_use_threads(use_threads).  This self.reader is declared 
> as a unique_ptr[FileReader].  I think this points to 
> cpp/src/parquet/arrow/reader.cc, but I'm not sure about that.  The 
> FileReader::Impl::ReadRowGroup logic looks ok: 
> ::arrow::internal::GetCpuThreadPool() is only called if use_threads is True.  
> The same is true for ReadTable.
> So when is the ThreadPool getting created?
> Example code:
> --
> {code}
> import pandas as pd
> import psutil
> import pyarrow as pa
> import pyarrow.parquet as pq
> use_threads = False
> p = psutil.Process()
> print('Starting with {} threads'.format(p.num_threads()))
> df = pd.DataFrame({'x': [0]})
> table = pa.Table.from_pandas(df)
> print('After table creation, {} threads'.format(p.num_threads()))
> df = table.to_pandas(use_threads=use_threads)
> print('table.to_pandas(use_threads={}), {} threads'.format(use_threads, 
> p.num_threads()))
> writer = pq.ParquetWriter('tmp.parquet', table.schema)
> writer.write_table(table)
> writer.close()
> print('After writing parquet file, {} threads'.format(p.num_threads()))
> pf = pq.ParquetFile('tmp.parquet')
> print('After ParquetFile, {} threads'.format(p.num_threads()))
> df = pf.read(use_threads=use_threads).to_pandas()
> print('After pf.read(use_threads={}), {} threads'.format(use_threads, 
> p.num_threads()))
> {code}
> ---
> $ python pyarrow_test.py
> Starting with 1 threads
> After table creation, 1 threads
> table.to_pandas(use_threads=False), 1 threads
> After writing parquet file, 1 threads
> After ParquetFile, 1 threads
> After pf.read(use_threads=False), 5 threads



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-3457) [Python] Plasma test fails with GeForce GTX 1050 Ti GPU

2019-09-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-3457:
--
Fix Version/s: 2.0.0

> [Python] Plasma test fails with GeForce GTX 1050 Ti GPU
> --
>
> Key: ARROW-3457
> URL: https://issues.apache.org/jira/browse/ARROW-3457
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Sutou Kouhei
>Priority: Minor
> Fix For: 2.0.0
>
>
> {{ARROW_HAVE_GPU=yes dev/release/verify-release-candidate.sh source 0.11.0 
> 1}} with a GeForce GTX 1050 Ti reports the following failure:
> {noformat}
> pyarrow/tests/test_plasma.py::TestPlasmaClient::test_create_with_metadata 
> ERROR [ 81%]
> >>> captured stderr >>>
> /tmp/arrow-0.11.0.Y6KZK/apache-arrow-0.11.0/cpp/src/plasma/store.cc:971: 
> Allowing the Plasma store to use up to 0.1GB of memory.
> /tmp/arrow-0.11.0.Y6KZK/apache-arrow-0.11.0/cpp/src/plasma/store.cc:1001: 
> Starting object store with directory /dev/shm and huge page support disabled
> munmap_chunk(): invalid pointer
> >>> traceback >>>
> self = <pyarrow.tests.test_plasma.TestPlasmaClient object>
> test_method = <bound method TestPlasmaClient.test_create_with_metadata>
> def teardown_method(self, test_method):
> try:
> # Check that the Plasma store is still alive.
> assert self.p.poll() is None
> # Ensure Valgrind and/or coverage have a clean exit
> # Valgrind misses SIGTERM if it is delivered before the
> # event loop is ready; this race condition is mitigated
> # but not solved by time.sleep().
> if USE_VALGRIND:
> time.sleep(1.0)
> self.p.send_signal(signal.SIGTERM)
> if sys.version_info >= (3, 3):
> self.p.wait(timeout=5)
> else:
> self.p.wait()
> >   assert self.p.returncode == 0
> E   assert -6 == 0
> E+  where -6 = <subprocess.Popen object at 0x7fc4201aef98>.returncode
> E+where <subprocess.Popen object at 0x7fc4201aef98> = 
> <pyarrow.tests.test_plasma.TestPlasmaClient object>.p
> pyarrow/tests/test_plasma.py:141: AssertionError
> >>> entering PDB >>>
> > /tmp/arrow-0.11.0.Y6KZK/apache-arrow-0.11.0/python/pyarrow/tests/test_plasma.py(141)teardown_method()
> -> assert self.p.returncode == 0
> (Pdb) 
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6281) [Python] Produce chunked arrays for nested types in pyarrow.array

2019-09-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-6281:
--
Fix Version/s: 1.0.0

> [Python] Produce chunked arrays for nested types in pyarrow.array
> -
>
> Key: ARROW-6281
> URL: https://issues.apache.org/jira/browse/ARROW-6281
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> As follow up to ARROW-5028 and other issues, in a case like
> {code}
> import pyarrow as pa
> 
> vals = [['x' * 1024]] * ((2 << 20) + 1)
> arr = pa.array(vals)  # raises: the child array cannot hold all the data
> {code}
> The child array of the ListArray cannot hold all of the string data. After 
> the patch for ARROW-5028, an exception is raised rather than returning a 
> malformed array. We could (with some effort) instead produce a chunked array 
> of list type.
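> As a rough sketch of the idea (an editor's illustration of a manual 
> workaround, not the proposed implementation; the chunk size is arbitrary):
> {code}
> import pyarrow as pa
> 
> vals = [['x' * 1024]] * ((2 << 20) + 1)
> # Split the input so that each chunk's child array stays well under the
> # limit, then reassemble the pieces as a ChunkedArray of list<string>.
> chunk_size = 1 << 20
> chunks = [pa.array(vals[i:i + chunk_size])
>           for i in range(0, len(vals), chunk_size)]
> arr = pa.chunked_array(chunks)
> {code}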



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-2939) [Python] Provide links to documentation pages for old versions

2019-09-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2939:
--
Component/s: Documentation

> [Python] Provide links to documentation pages for old versions
> --
>
> Key: ARROW-2939
> URL: https://issues.apache.org/jira/browse/ARROW-2939
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Ian Robertson
>Priority: Minor
>  Labels: documentation
>
> Hey folks, apologies if this isn't the right place to raise this.  In poking 
> around the web documentation (for pyarrow specifically), it looks like the 
> auto-generated API docs include changes merged after the 0.9.0 release.  For 
> example:
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.column]
>  * Contains the changes merged here: 
> [https://github.com/apache/arrow/pull/1923]
>  * But the latest PyPI/conda version of pyarrow is 0.9.0, which doesn't 
> include that change.
> Not sure if the docs are auto-built off master somewhere; I couldn't find 
> anything about building the docs in the docs themselves.  I would guess that 
> you may want some of the usage docs to be published in between releases if 
> they're not about new functionality, but the API reference being out of date 
> can be confusing.  Is it possible to anchor the API docs to the latest 
> released version?  Or even do something like Pandas, which keeps a whole 
> bunch of old versions available? (e.g. 
> [https://pandas.pydata.org/pandas-docs/stable/] vs. old versions like 
> [http://pandas.pydata.org/pandas-docs/version/0.17.0/])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-2939) [Python] Provide links to documentation pages for old versions

2019-09-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2939:
--
Fix Version/s: 1.0.0

> [Python] Provide links to documentation pages for old versions
> --
>
> Key: ARROW-2939
> URL: https://issues.apache.org/jira/browse/ARROW-2939
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Ian Robertson
>Priority: Minor
>  Labels: documentation
> Fix For: 1.0.0
>
>
> Hey folks, apologies if this isn't the right place to raise this.  In poking 
> around the web documentation (for pyarrow specifically), it looks like the 
> auto-generated API docs include changes merged after the 0.9.0 release.  For 
> example:
>  * 
> [https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.column]
>  * Contains the changes merged here: 
> [https://github.com/apache/arrow/pull/1923]
>  * But the latest PyPI/conda version of pyarrow is 0.9.0, which doesn't 
> include that change.
> Not sure if the docs are auto-built off master somewhere; I couldn't find 
> anything about building the docs in the docs themselves.  I would guess that 
> you may want some of the usage docs to be published in between releases if 
> they're not about new functionality, but the API reference being out of date 
> can be confusing.  Is it possible to anchor the API docs to the latest 
> released version?  Or even do something like Pandas, which keeps a whole 
> bunch of old versions available? (e.g. 
> [https://pandas.pydata.org/pandas-docs/stable/] vs. old versions like 
> [http://pandas.pydata.org/pandas-docs/version/0.17.0/])



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4930) [Python] Remove LIBDIR assumptions in Python build

2019-09-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4930:
--
Fix Version/s: 2.0.0

> [Python] Remove LIBDIR assumptions in Python build
> --
>
> Key: ARROW-4930
> URL: https://issues.apache.org/jira/browse/ARROW-4930
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: setup.py
> Fix For: 2.0.0
>
>
> This is in reference to (4) in 
> [this|http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C0AF328A1-ED2A-457F-B72D-3B49C8614850%40xhochy.com%3E]
>  mailing list discussion.
> Certain sections of setup.py assume a specific location for the C++ 
> libraries. Removing this hard assumption will simplify PyArrow builds 
> significantly. As far as I could tell, these assumptions are made in the 
> {{build_ext._run_cmake()}} method (wherever bundling of the C++ libraries is 
> handled).
>  # The first occurrence is before invoking cmake (see line 237).
>  # The second occurrence is when the C++ libraries are moved from their build 
> directory to the Python tree (see line 347). The actual implementation is in 
> the function {{_move_shared_libs_unix(..)}} (see line 468); a sketch of an 
> alternative follows below.
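> One possible direction, as a minimal sketch (the helper name and the use of 
> pkg-config are assumptions on my part, not what setup.py currently does):
> {code}
> import subprocess
> 
> def arrow_libdir():
>     # Ask pkg-config where the Arrow C++ libraries were installed,
>     # instead of hard-coding a lib/ path relative to the build tree.
>     out = subprocess.check_output(
>         ["pkg-config", "--variable=libdir", "arrow"])
>     return out.decode().strip()
> 
> print(arrow_libdir())  # e.g. /usr/lib64 on Fedora
> {code}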
> Hope this helps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4930) [Python] Remove LIBDIR assumptions in Python build

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932684#comment-16932684
 ] 

Antoine Pitrou commented on ARROW-4930:
---

[~suvayu] If this is important for you, can you try submitting a PR?

> [Python] Remove LIBDIR assumptions in Python build
> --
>
> Key: ARROW-4930
> URL: https://issues.apache.org/jira/browse/ARROW-4930
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: setup.py
>
> This is in reference to (4) in 
> [this|http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C0AF328A1-ED2A-457F-B72D-3B49C8614850%40xhochy.com%3E]
>  mailing list discussion.
> Certain sections of setup.py assume a specific location for the C++ 
> libraries. Removing this hard assumption will simplify PyArrow builds 
> significantly. As far as I could tell, these assumptions are made in the 
> {{build_ext._run_cmake()}} method (wherever bundling of the C++ libraries is 
> handled).
>  # The first occurrence is before invoking cmake (see line 237).
>  # The second occurrence is when the C++ libraries are moved from their build 
> directory to the Python tree (see line 347). The actual implementation is in 
> the function {{_move_shared_libs_unix(..)}} (see line 468).
> Hope this helps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

