[jira] [Commented] (ARROW-2262) [Python] Support slicing on pyarrow.ChunkedArray

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391042#comment-16391042
 ] 

ASF GitHub Bot commented on ARROW-2262:
---

pitrou commented on a change in pull request #1702: ARROW-2262: [Python] 
Support slicing on pyarrow.ChunkedArray
URL: https://github.com/apache/arrow/pull/1702#discussion_r173120083
 
 

 ##
 File path: python/pyarrow/table.pxi
 ##
 @@ -77,6 +77,52 @@ cdef class ChunkedArray:
         self._check_nullptr()
         return self.chunked_array.null_count()
 
+    def __getitem__(self, key):
+        cdef int64_t item
+        cdef int i
+        self._check_nullptr()
+        if isinstance(key, slice):
+            return _normalize_slice(self, key)
+        elif isinstance(key, six.integer_types):
+            item = key
+            if item >= self.chunked_array.length() or item < 0:
+                return IndexError("ChunkedArray selection out of bounds")
 
 Review comment:
   If we allow negative slice bounds, I would expect us to also allow negative 
indices. Seems like it's time for a `_normalize_index` function?
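
For context, a minimal sketch of what such a `_normalize_index` helper could look like (the name comes from the review suggestion above; the placement and exact signature are assumptions, not the code that was eventually merged):

```python
def _normalize_index(index, length):
    # Map a possibly negative index onto [0, length), mirroring list semantics,
    # and raise (rather than return) IndexError when out of bounds.
    if index < 0:
        index += length
    if index < 0 or index >= length:
        raise IndexError("index out of bounds")
    return index
```

With such a helper, `chunked_array[-1]` would address the last element just as `chunked_array[len(chunked_array) - 1]` does.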


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Support slicing on pyarrow.ChunkedArray
> 
>
> Key: ARROW-2262
> URL: https://issues.apache.org/jira/browse/ARROW-2262
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2288) [Python] slicing logic defective

2018-03-08 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2288:
-

 Summary: [Python] slicing logic defective
 Key: ARROW-2288
 URL: https://issues.apache.org/jira/browse/ARROW-2288
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


The slicing logic tends to go too far when normalizing large negative bounds, 
which leads to results not in line with Python's slicing semantics:
{code}
>>> arr = pa.array([1,2,3,4])
>>> arr[-99:100]

[
  2,
  3,
  4
]
>>> arr.to_pylist()[-99:100]
[1, 2, 3, 4]
>>> 
>>> 
>>> arr[-6:-5]

[
  3
]
>>> arr.to_pylist()[-6:-5]
[]
{code}
Also note this crash:
{code}
>>> arr[10:13]
/home/antoine/arrow/cpp/src/arrow/array.cc:105 Check failed: (offset) <= 
(data.length) 
Abandon (core dumped)
{code}
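
As a reference point, Python's built-in {{slice.indices()}} already implements the clamping behaviour the list examples above show; here is a minimal sketch of an (offset, length) normalization based on it (the helper name and return convention are illustrative only, not the shipped fix):
{code:python}
def normalize_slice_bounds(start, stop, length):
    # slice.indices() clamps out-of-range bounds instead of wrapping them,
    # which is exactly the Python semantics the report expects.
    start, stop, _ = slice(start, stop).indices(length)
    return start, max(stop - start, 0)

normalize_slice_bounds(-99, 100, 4)   # (0, 4): whole array, like a list
normalize_slice_bounds(-6, -5, 4)     # (0, 0): empty, like a list
normalize_slice_bounds(10, 13, 4)     # (4, 0): empty instead of crashing
{code}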



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2288) [Python] slicing logic defective

2018-03-08 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391052#comment-16391052
 ] 

Antoine Pitrou commented on ARROW-2288:
---

As for the crash: since {{Array::Slice}} adjusts the length when too large, it 
would make sense for it to also adjust the offset instead of crashing, IMO.
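
A hedged illustration of that behaviour from the Python side, using only the public {{Array.slice(offset, length)}} method (the wrapper name is hypothetical):
{code:python}
import pyarrow as pa

def safe_slice(arr, offset, length=None):
    # Clamp the offset into [0, len(arr)] and the length into what remains,
    # so an out-of-range request yields an empty slice rather than an abort.
    offset = min(max(offset, 0), len(arr))
    if length is None:
        length = len(arr) - offset
    length = min(max(length, 0), len(arr) - offset)
    return arr.slice(offset, length)

arr = pa.array([1, 2, 3, 4])
safe_slice(arr, 10, 3)   # empty array instead of the Check failure above
{code}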

> [Python] slicing logic defective
> 
>
> Key: ARROW-2288
> URL: https://issues.apache.org/jira/browse/ARROW-2288
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> The slicing logic tends to go too far when normalizing large negative bounds, 
> which leads to results not in line with Python's slicing semantics:
> {code}
> >>> arr = pa.array([1,2,3,4])
> >>> arr[-99:100]
> 
> [
>   2,
>   3,
>   4
> ]
> >>> arr.to_pylist()[-99:100]
> [1, 2, 3, 4]
> >>> 
> >>> 
> >>> arr[-6:-5]
> 
> [
>   3
> ]
> >>> arr.to_pylist()[-6:-5]
> []
> {code}
> Also note this crash:
> {code}
> >>> arr[10:13]
> /home/antoine/arrow/cpp/src/arrow/array.cc:105 Check failed: (offset) <= 
> (data.length) 
> Abandon (core dumped)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2288) [Python] slicing logic defective

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391119#comment-16391119
 ] 

ASF GitHub Bot commented on ARROW-2288:
---

pitrou opened a new pull request #1723: ARROW-2288: [Python] Fix slicing logic
URL: https://github.com/apache/arrow/pull/1723
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] slicing logic defective
> 
>
> Key: ARROW-2288
> URL: https://issues.apache.org/jira/browse/ARROW-2288
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> The slicing logic tends to go too far when normalizing large negative bounds, 
> which leads to results not in line with Python's slicing semantics:
> {code}
> >>> arr = pa.array([1,2,3,4])
> >>> arr[-99:100]
> 
> [
>   2,
>   3,
>   4
> ]
> >>> arr.to_pylist()[-99:100]
> [1, 2, 3, 4]
> >>> 
> >>> 
> >>> arr[-6:-5]
> 
> [
>   3
> ]
> >>> arr.to_pylist()[-6:-5]
> []
> {code}
> Also note this crash:
> {code}
> >>> arr[10:13]
> /home/antoine/arrow/cpp/src/arrow/array.cc:105 Check failed: (offset) <= 
> (data.length) 
> Abandon (core dumped)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2288) [Python] slicing logic defective

2018-03-08 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2288:
--
Labels: pull-request-available  (was: )

> [Python] slicing logic defective
> 
>
> Key: ARROW-2288
> URL: https://issues.apache.org/jira/browse/ARROW-2288
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> The slicing logic tends to go too far when normalizing large negative bounds, 
> which leads to results not in line with Python's slicing semantics:
> {code}
> >>> arr = pa.array([1,2,3,4])
> >>> arr[-99:100]
> 
> [
>   2,
>   3,
>   4
> ]
> >>> arr.to_pylist()[-99:100]
> [1, 2, 3, 4]
> >>> 
> >>> 
> >>> arr[-6:-5]
> 
> [
>   3
> ]
> >>> arr.to_pylist()[-6:-5]
> []
> {code}
> Also note this crash:
> {code}
> >>> arr[10:13]
> /home/antoine/arrow/cpp/src/arrow/array.cc:105 Check failed: (offset) <= 
> (data.length) 
> Abandon (core dumped)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2284) [Python] test_plasma error on plasma_store error

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391130#comment-16391130
 ] 

ASF GitHub Bot commented on ARROW-2284:
---

pitrou opened a new pull request #1724: ARROW-2284: [Python] Fix error display 
on test_plasma error
URL: https://github.com/apache/arrow/pull/1724
 
 
   Just a trivial fix.  stderr is captured by py.test, not by the subprocess 
call.
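
A minimal sketch of the failure mode being addressed (the command below is a stand-in, not the actual plasma_store invocation used by the test):

```python
import subprocess

# Without stderr=subprocess.PIPE, Popen leaves proc.stderr as None,
# so an error path must not call .read() on it unconditionally.
proc = subprocess.Popen(["false"])  # placeholder command that exits non-zero
proc.wait()
err = proc.stderr.read().decode() if proc.stderr is not None else ""
print(repr(err))  # "" -- stderr went to the parent (here: captured by py.test)
```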


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] test_plasma error on plasma_store error
> 
>
> Key: ARROW-2284
> URL: https://issues.apache.org/jira/browse/ARROW-2284
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
>
> This appears caused by my latest changes:
> {code:python}
> Traceback (most recent call last):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 192, 
> in setup_method
>     plasma_store_name, self.p = self.plasma_store_ctx.__enter__()
>   File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/contextlib.py", 
> line 81, in __enter__
>     return next(self.gen)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 168, 
> in start_plasma_store
>     err = proc.stderr.read().decode()
> AttributeError: 'NoneType' object has no attribute 'read'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2284) [Python] test_plasma error on plasma_store error

2018-03-08 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2284:
--
Labels: pull-request-available  (was: )

> [Python] test_plasma error on plasma_store error
> 
>
> Key: ARROW-2284
> URL: https://issues.apache.org/jira/browse/ARROW-2284
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
>
> This appears caused by my latest changes:
> {code:python}
> Traceback (most recent call last):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 192, 
> in setup_method
>     plasma_store_name, self.p = self.plasma_store_ctx.__enter__()
>   File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/contextlib.py", 
> line 81, in __enter__
>     return next(self.gen)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 168, 
> in start_plasma_store
>     err = proc.stderr.read().decode()
> AttributeError: 'NoneType' object has no attribute 'read'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2288) [Python] slicing logic defective

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391158#comment-16391158
 ] 

ASF GitHub Bot commented on ARROW-2288:
---

pitrou commented on issue #1723: ARROW-2288: [Python] Fix slicing logic
URL: https://github.com/apache/arrow/pull/1723#issuecomment-371470276
 
 
   AppVeyor at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.173


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] slicing logic defective
> 
>
> Key: ARROW-2288
> URL: https://issues.apache.org/jira/browse/ARROW-2288
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> The slicing logic tends to go too far when normalizing large negative bounds, 
> which leads to results not in line with Python's slicing semantics:
> {code}
> >>> arr = pa.array([1,2,3,4])
> >>> arr[-99:100]
> 
> [
>   2,
>   3,
>   4
> ]
> >>> arr.to_pylist()[-99:100]
> [1, 2, 3, 4]
> >>> 
> >>> 
> >>> arr[-6:-5]
> 
> [
>   3
> ]
> >>> arr.to_pylist()[-6:-5]
> []
> {code}
> Also note this crash:
> {code}
> >>> arr[10:13]
> /home/antoine/arrow/cpp/src/arrow/array.cc:105 Check failed: (offset) <= 
> (data.length) 
> Abandon (core dumped)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391179#comment-16391179
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-371474232
 
 
   Rebased.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] NaN values silently casted to int64 when passing explicit schema for 
> conversion in Table.from_pandas
> -
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value, the 
> NaN is improperly cast. Since pandas casts these to floats, when converted to 
> a table the NaN is interpreted as an integer. This seems like a bug, since a 
> known limitation in pandas (the inability to have null-valued integer data) 
> is taking precedence over Arrow's ability to store these as an IntArray 
> with nulls.
>  
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1,
>   2,
>   -9223372036854775808
> ]{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391241#comment-16391241
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-371484306
 
 
   AppVeyor at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.175


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] NaN values silently casted to int64 when passing explicit schema for 
> conversion in Table.from_pandas
> -
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value, the 
> NaN is improperly cast. Since pandas casts these to floats, when converted to 
> a table the NaN is interpreted as an integer. This seems like a bug, since a 
> known limitation in pandas (the inability to have null-valued integer data) 
> is taking precedence over Arrow's ability to store these as an IntArray 
> with nulls.
>  
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1,
>   2,
>   -9223372036854775808
> ]{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391292#comment-16391292
 ] 

Antoine Pitrou commented on ARROW-1974:
---

The problem here is that {{FileReader::Impl::ReadTable}} creates a {{Table}} 
with a schema that has one more field than the number of physical columns. The 
underlying cause seems to be that {{ColumnIndicesToFieldIndices}} uses 
{{Group::FieldIndex}}, which looks up the field by name... Also, 
{{Group::Equals}} has somewhat surprising semantics (why doesn't 
{{GroupNode::FieldIndex(const Node& node)}} simply look up the node by pointer 
equality?).

> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Phillip Cloud
>Priority: Minor
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2267) Rust bindings

2018-03-08 Thread Joshua Howard (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391302#comment-16391302
 ] 

Joshua Howard commented on ARROW-2267:
--

I spent some time looking into the C++ implementation, and it seems like the 
initial steps should be to port the following objects to Rust:
 # MemoryPool
 # Buffer
 # Builder
 # Array

The biggest divergence from C++ that I see is the implementation of the memory 
pool. Implementing MemoryPool would require unsafe code in Rust (which is 
obviously undesirable). There is an open issue about modifying the memory 
alignment of structs: https://github.com/rust-lang/rust/issues/33626. I think 
it would be well worth skipping the memory alignment work until that 
development is finished.

> Rust bindings
> -
>
> Key: ARROW-2267
> URL: https://issues.apache.org/jira/browse/ARROW-2267
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Joshua Howard
>Priority: Major
>
> Provide Rust bindings for Arrow. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1974:
--
Labels: pull-request-available  (was: )

> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Phillip Cloud
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391353#comment-16391353
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

pitrou opened a new pull request #447: ARROW-1974: Fix creating Arrow table 
with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Phillip Cloud
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391355#comment-16391355
 ] 

Antoine Pitrou commented on ARROW-1974:
---

With https://github.com/apache/parquet-cpp/pull/447, the {{to_pandas()}} call 
will fail with the following error:
{code:python}
  File "table.pxi", line 1059, in pyarrow.lib.Table.to_pandas
  File "/home/antoine/arrow/python/pyarrow/pandas_compat.py", line 611, in 
table_to_blockmanager
columns = _flatten_single_level_multiindex(columns)
  File "/home/antoine/arrow/python/pyarrow/pandas_compat.py", line 673, in 
_flatten_single_level_multiindex
raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
{code}

> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Phillip Cloud
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391413#comment-16391413
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

cpcloud commented on issue #447: ARROW-1974: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371525784
 
 
   Thanks for doing this. Will review shortly


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Phillip Cloud
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391418#comment-16391418
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

pitrou commented on issue #447: ARROW-1974: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371527191
 
 
   Unfortunately this doesn't seem sufficient. If I add the following test, I 
get an error and a crash:
   ```diff
   diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc b/src/parquet/arrow/arrow-reader-writer-test.cc
   index 72e65d4..eb5a8ec 100644
   --- a/src/parquet/arrow/arrow-reader-writer-test.cc
   +++ b/src/parquet/arrow/arrow-reader-writer-test.cc
   @@ -1669,6 +1669,27 @@ TEST(TestArrowReadWrite, TableWithChunkedColumns) {
      }
    }
    
   +TEST(TestArrowReadWrite, TableWithDuplicateColumns) {
   +  using ::arrow::ArrayFromVector;
   +
   +  auto f0 = field("duplicate", ::arrow::int8());
   +  auto f1 = field("duplicate", ::arrow::int16());
   +  auto schema = ::arrow::schema({f0, f1});
   +
   +  std::vector<int8_t> a0_values = {1, 2, 3};
   +  std::vector<int16_t> a1_values = {14, 15, 16};
   +
   +  std::shared_ptr<Array> a0, a1;
   +
   +  ArrayFromVector<::arrow::Int8Type, int8_t>(a0_values, &a0);
   +  ArrayFromVector<::arrow::Int16Type, int16_t>(a1_values, &a1);
   +
   +  auto table = Table::Make(schema,
   +                           {std::make_shared<Column>(f0->name(), a0),
   +                            std::make_shared<Column>(f1->name(), a1)});
   +  CheckSimpleRoundtrip(table, table->num_rows());
   +}
   +
    TEST(TestArrowWrite, CheckChunkSize) {
      const int num_columns = 2;
      const int num_rows = 128;
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Phillip Cloud
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391449#comment-16391449
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

pitrou commented on issue #447: ARROW-1974: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371534052
 
 
   Ok, the reason for the error is that a similar pattern needs fixing in 
`SchemaDescriptor`. Updating shortly.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Phillip Cloud
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2239) [C++] Update build docs for Windows

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391505#comment-16391505
 ] 

ASF GitHub Bot commented on ARROW-2239:
---

wesm closed pull request #1722: ARROW-2239: [C++] Update Windows build docs
URL: https://github.com/apache/arrow/pull/1722
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/apidoc/Windows.md b/cpp/apidoc/Windows.md
index dae5040c2..965369521 100644
--- a/cpp/apidoc/Windows.md
+++ b/cpp/apidoc/Windows.md
@@ -44,9 +44,8 @@ Now, you can bootstrap a build environment
 conda create -n arrow-dev cmake git boost-cpp flatbuffers rapidjson cmake 
thrift-cpp snappy zlib brotli gflags lz4-c zstd -c conda-forge
 ```
 
-***Note:***
-> *Make sure to get the `conda-forge` build of `gflags` as the
-  naming of the library differs from that in the `defaults` channel*
+> **Note:** Make sure to get the `conda-forge` build of `gflags` as the
+> naming of the library differs from that in the `defaults` channel.
 
 Activate just created conda environment with pre-installed packages from
 previous step:
@@ -116,52 +115,85 @@ zstd%ZSTD_SUFFIX%.lib.
 ### Visual Studio
 
 Microsoft provides the free Visual Studio Community edition. When doing
-development, you must launch the developer command prompt using
+development, you must launch the developer command prompt using:
 
 #### Visual Studio 2015
 
-```"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" 
amd64```
+```
+"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" amd64
+```
 
 #### Visual Studio 2017
 
-```"C:\Program Files (x86)\Microsoft Visual 
Studio\2017\Community\Common7\Tools\VsDevCmd.bat" -arch=amd64```
+```
+"C:\Program Files (x86)\Microsoft Visual 
Studio\2017\Community\Common7\Tools\VsDevCmd.bat" -arch=amd64
+```
 
 It's easiest to configure a console emulator like [cmder][3] to automatically
 launch this when starting a new development console.
 
+## Building with Ninja and clcache
+
+We recommend the [Ninja](https://ninja-build.org/) build system for better
+build parallelization, and the optional
+[clcache](https://github.com/frerich/clcache/) compiler cache which keeps
+track of past compilations to avoid running them over and over again
+(in a way similar to the Unix-specific "ccache").
+
+Activate your conda build environment to first install those utilities:
+
+```shell
+activate arrow-dev
+
+conda install -c conda-forge ninja
+pip install git+https://github.com/frerich/clcache.git
+```
+
+Change working directory in cmd.exe to the root directory of Arrow and
+do an out of source build by generating Ninja files:
+
+```shell
+cd cpp
+mkdir build
+cd build
+cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release ..
+cmake --build . --config Release
+```
+
 ## Building with NMake
 
 Activate your conda build environment:
 
-```
+```shell
 activate arrow-dev
 ```
 
 Change working directory in cmd.exe to the root directory of Arrow and
 do an out of source build using `nmake`:
 
-```
+```shell
 cd cpp
 mkdir build
 cd build
 cmake -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=Release ..
+cmake --build . --config Release
 nmake
 ```
 
 When using conda, only release builds are currently supported.
 
-## Build using Visual Studio (MSVC) Solution Files
+## Building using Visual Studio (MSVC) Solution Files
 
 Activate your conda build environment:
 
-```
+```shell
 activate arrow-dev
 ```
 
 Change working directory in cmd.exe to the root directory of Arrow and
 do an out of source build by generating a MSVC solution:
 
-```
+```shell
 cd cpp
 mkdir build
 cd build
@@ -171,10 +203,11 @@ cmake --build . --config Release
 
 ## Debug build
 
-To build Debug version of Arrow you should have pre-insalled Debug version of
-boost libs.
+To build Debug version of Arrow you should have pre-installed a Debug version
+of boost libs.
 
-It's recommended to configure cmake build with following variables for Debug 
build:
+It's recommended to configure cmake build with the following variables for
+Debug build:
 
 `-DARROW_BOOST_USE_SHARED=OFF` - enables static linking with boost debug libs 
and
 simplifies run-time loading of 3rd parties. (Recommended)
@@ -185,7 +218,7 @@ simplifies run-time loading of 3rd parties. (Recommended)
 
 Command line to build Arrow in Debug might look as following:
 
-```
+```shell
 cd cpp
 mkdir build
 cd build


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Update build docs for Windows

[jira] [Resolved] (ARROW-2239) [C++] Update build docs for Windows

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2239.
-
Resolution: Fixed

Issue resolved by pull request 1722
[https://github.com/apache/arrow/pull/1722]

> [C++] Update build docs for Windows
> ---
>
> Key: ARROW-2239
> URL: https://issues.apache.org/jira/browse/ARROW-2239
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Documentation
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We should update the C++ build docs for Windows to recommend use of Ninja and 
> clcache for faster builds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types

2018-03-08 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2289:
--
Labels: pull-request-available  (was: )

> [GLib] Add  Numeric, Integer and FloatingPoint data types
> -
>
> Key: ARROW-2289
> URL: https://issues.apache.org/jira/browse/ARROW-2289
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Affects Versions: 0.8.0
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391540#comment-16391540
 ] 

ASF GitHub Bot commented on ARROW-2289:
---

kou opened a new pull request #1726: ARROW-2289: [GLib] Add Numeric, Integer, 
FloatingPoint data types
URL: https://github.com/apache/arrow/pull/1726
 
 
   They are useful to detect numeric data types.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [GLib] Add  Numeric, Integer and FloatingPoint data types
> -
>
> Key: ARROW-2289
> URL: https://issues.apache.org/jira/browse/ARROW-2289
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Affects Versions: 0.8.0
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types

2018-03-08 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-2289:
---

 Summary: [GLib] Add  Numeric, Integer and FloatingPoint data types
 Key: ARROW-2289
 URL: https://issues.apache.org/jira/browse/ARROW-2289
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Affects Versions: 0.8.0
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
 Fix For: 0.9.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2038) [Python] Follow-up bug fixes for s3fs Parquet support

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2038:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Follow-up bug fixes for s3fs Parquet support
> -
>
> Key: ARROW-2038
> URL: https://issues.apache.org/jira/browse/ARROW-2038
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> see discussion in 
> https://github.com/apache/arrow/pull/916#issuecomment-360558248



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1975) [C++] Add abi-compliance-checker to build process

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1975:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Add abi-compliance-checker to build process
> -
>
> Key: ARROW-1975
> URL: https://issues.apache.org/jira/browse/ARROW-1975
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> I would like to check our baseline modules with 
> https://lvc.github.io/abi-compliance-checker/ to ensure that version upgrades 
> are much smoother and that we don't break the ABI in patch releases. 
> As we're still pre-1.0, I accept that there will be breakage, but I would like 
> to keep it to a minimum. Currently the biggest pain with Arrow is that you 
> always need to pin it in Python with {{==0.x.y}}, otherwise segfaults are 
> inevitable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1988) [Python] Extend flavor=spark in Parquet writing to handle INT types

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1988:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Extend flavor=spark in Parquet writing to handle INT types
> ---
>
> Key: ARROW-1988
> URL: https://issues.apache.org/jira/browse/ARROW-1988
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> See the relevant code sections at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L139
> We should cater for them in the {{pyarrow}} code and also reach out to Spark 
> developers so that they are supported there in the long term.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2014) [Python] Document read_pandas method in pyarrow.parquet

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2014:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Document read_pandas method in pyarrow.parquet
> ---
>
> Key: ARROW-2014
> URL: https://issues.apache.org/jira/browse/ARROW-2014
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.10.0
>
>
> see discussion in https://github.com/apache/arrow/issues/1302



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1454) [Python] More informative error message when attempting to write an unsupported Arrow type to Parquet format

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1454:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] More informative error message when attempting to write an 
> unsupported Arrow type to Parquet format
> 
>
> Key: ARROW-1454
> URL: https://issues.apache.org/jira/browse/ARROW-1454
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> See https://github.com/pandas-dev/pandas/issues/17102#issuecomment-326746184



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1974:
---

Assignee: Antoine Pitrou  (was: Phillip Cloud)

> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2256) [C++] Fuzzer builds fail out of the box on Ubuntu 16.04 using LLVM apt repos

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2256:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Fuzzer builds fail out of the box on Ubuntu 16.04 using LLVM apt repos
> 
>
> Key: ARROW-2256
> URL: https://issues.apache.org/jira/browse/ARROW-2256
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> I did a clean upgrade to 16.04 on one of my machine and ran into the problem 
> described here:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=866087
> I think this can be resolved temporarily by symlinking the static library, 
> but we should document the problem so other devs know what to do when it 
> happens



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2263:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] test_cython.py fails if pyarrow is not in import path (e.g. with 
> inplace builds)
> -
>
> Key: ARROW-2263
> URL: https://issues.apache.org/jira/browse/ARROW-2263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> see 
> {code}
> $ py.test pyarrow/tests/test_cython.py 
> = test session starts 
> =
> platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg
> collected 1 item  
> 
> pyarrow/tests/test_cython.py F
>   [100%]
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0')
> @pytest.mark.skipif(
> 'ARROW_HOME' not in os.environ,
> reason='ARROW_HOME environment variable not defined')
> def test_cython_api(tmpdir):
> """
> Basic test for the Cython API.
> """
> pytest.importorskip('Cython')
> 
> ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib')
> 
> test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default)
> 
> with tmpdir.as_cwd():
> # Set up temporary workspace
> pyx_file = 'pyarrow_cython_example.pyx'
> shutil.copyfile(os.path.join(here, pyx_file),
> os.path.join(str(tmpdir), pyx_file))
> # Create setup.py file
> if os.name == 'posix':
> compiler_opts = ['-std=c++11']
> else:
> compiler_opts = []
> setup_code = setup_template.format(pyx_file=pyx_file,
>compiler_opts=compiler_opts,
>test_ld_path=test_ld_path)
> with open('setup.py', 'w') as f:
> f.write(setup_code)
> 
> # Compile extension module
> subprocess.check_call([sys.executable, 'setup.py',
> >  'build_ext', '--inplace'])
> pyarrow/tests/test_cython.py:90: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _
> popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace'],)
> kwargs = {}, retcode = 1
> cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> 
> The arguments are the same as for the call function.  Example:
> 
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', 
> '--inplace']' returned non-zero exit status 1.
> ../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: 
> CalledProcessError
>  Captured stderr call 
> -
> Traceback (most recent call last):
>   File "setup.py", line 7, in <module>
> import pyarrow as pa
> ModuleNotFoundError: No module named 'pyarrow'
> == 1 failed in 0.23 seconds 
> ===
> {code}
> I encountered this bit of brittleness in a fresh install where I had not run 
> {{setup.py develop}} nor {{setup.py install}} on my local pyarrow dev area



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2038) [Python] Follow-up bug fixes for s3fs Parquet support

2018-03-08 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391805#comment-16391805
 ] 

Wes McKinney commented on ARROW-2038:
-

Moving this to 0.10.0, but please feel free to look sooner

> [Python] Follow-up bug fixes for s3fs Parquet support
> -
>
> Key: ARROW-2038
> URL: https://issues.apache.org/jira/browse/ARROW-2038
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> see discussion in 
> https://github.com/apache/arrow/pull/916#issuecomment-360558248



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1974:
---

Assignee: Antoine Pitrou  (was: Wes McKinney)

> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1974:
---

Assignee: Wes McKinney  (was: Antoine Pitrou)

> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1425:
---

Assignee: Wes McKinney  (was: Li Jin)

> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow, which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]
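
For illustration, a minimal pandas sketch of the ambiguity being described (the
timezone names are arbitrary examples, not taken from the ticket):

{code}
import pandas as pd

# The same naive wall-clock value denotes different instants depending on
# whether it is (re)interpreted as UTC or as the Spark session's local zone.
naive = pd.Timestamp("2018-03-08 12:00:00")
as_utc = naive.tz_localize("UTC")
as_session_local = naive.tz_localize("America/New_York")
assert as_utc != as_session_local
{code}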



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2289:
---

Assignee: Wes McKinney  (was: Kouhei Sutou)

> [GLib] Add  Numeric, Integer and FloatingPoint data types
> -
>
> Key: ARROW-2289
> URL: https://issues.apache.org/jira/browse/ARROW-2289
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Affects Versions: 0.8.0
>Reporter: Kouhei Sutou
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1425:
---

Assignee: Li Jin  (was: Wes McKinney)

> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Li Jin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow, which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1996) [Python] pyarrow.read_serialized cannot read concatenated records

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1996:
---

Assignee: Antoine Pitrou  (was: Wes McKinney)

> [Python] pyarrow.read_serialized cannot read concatenated records
> -
>
> Key: ARROW-1996
> URL: https://issues.apache.org/jira/browse/ARROW-1996
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Linux
>Reporter: Richard Shin
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The following code
> {quote}import pyarrow as pa
> f = pa.OSFile('arrow_test', 'w')
>  pa.serialize_to(12, f)
>  pa.serialize_to(23, f)
>  f.close()
> f = pa.OSFile('arrow_test', 'r')
>  print(pa.read_serialized(f).deserialize())
>  print(pa.read_serialized(f).deserialize())
>  f.close()
> {quote}
> gives the following result:
> {quote}$ python pyarrow_test.py
>  First: 12
>  Traceback (most recent call last):
>  File "pyarrow_test.py", line 10, in 
>  print('Second: {}'.format(pa.read_serialized(f).deserialize()))
>  File "pyarrow/serialization.pxi", line 347, in pyarrow.lib.read_serialized 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:79159)
>  File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8270)
>  pyarrow.lib.ArrowInvalid: Expected schema message in stream, was null or 
> length 0
> {quote}
> I would have expected read_serialized to successfully read the second value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1996) [Python] pyarrow.read_serialized cannot read concatenated records

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1996:
---

Assignee: Wes McKinney  (was: Antoine Pitrou)

> [Python] pyarrow.read_serialized cannot read concatenated records
> -
>
> Key: ARROW-1996
> URL: https://issues.apache.org/jira/browse/ARROW-1996
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Linux
>Reporter: Richard Shin
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The following code
> {quote}import pyarrow as pa
> f = pa.OSFile('arrow_test', 'w')
>  pa.serialize_to(12, f)
>  pa.serialize_to(23, f)
>  f.close()
> f = pa.OSFile('arrow_test', 'r')
>  print(pa.read_serialized(f).deserialize())
>  print(pa.read_serialized(f).deserialize())
>  f.close()
> {quote}
> gives the following result:
> {quote}$ python pyarrow_test.py
>  First: 12
>  Traceback (most recent call last):
>  File "pyarrow_test.py", line 10, in 
>  print('Second: {}'.format(pa.read_serialized(f).deserialize()))
>  File "pyarrow/serialization.pxi", line 347, in pyarrow.lib.read_serialized 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:79159)
>  File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8270)
>  pyarrow.lib.ArrowInvalid: Expected schema message in stream, was null or 
> length 0
> {quote}
> I would have expected read_serialized to successfully read the second value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2289:
---

Assignee: Kouhei Sutou  (was: Wes McKinney)

> [GLib] Add  Numeric, Integer and FloatingPoint data types
> -
>
> Key: ARROW-2289
> URL: https://issues.apache.org/jira/browse/ARROW-2289
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Affects Versions: 0.8.0
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1425:
---

Assignee: Li Jin  (was: Heimir Thor Sverrisson)

> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Li Jin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow, which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391917#comment-16391917
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173292620
 
 

 ##
 File path: src/parquet/schema.h
 ##
 @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
 
   NodePtr field(int i) const { return fields_[i]; }
+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned
 
 Review comment:
   Couple of questions:
   
   * I see [this language regarding the iteration 
order](http://en.cppreference.com/w/cpp/container/unordered_multimap) of the 
values for a particular key in the multimap:
   
   > every group of elements whose keys compare equivalent (compare equal with 
key_eq() as the comparator) forms a contiguous subrange in the iteration order
   
   Does the `iteration order` here mean that the values are iterated over in 
the order in which they were inserted?
   
   * Why did you choose to return the first one instead of returning `-1` (or 
maybe `-2`) for the `std::string` overload? Do we not want to provide a way to 
indicate that column indexes and column names are not 1:1 in the C++ API? Maybe 
that already exists.
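
For readers following along, a minimal Python sketch of the lookup semantics under
discussion (smallest index wins for duplicate names, negative value when absent);
this is illustrative only, not the parquet-cpp implementation:

{code}
def field_index(field_names, name):
    # Return the smallest index of `name`, or -1 if it is not present.
    indices = [i for i, n in enumerate(field_names) if n == name]
    return min(indices) if indices else -1

names = ["a", "__index_level_0__", "b", "__index_level_0__"]
assert field_index(names, "__index_level_0__") == 1  # first occurrence wins
assert field_index(names, "missing") == -1
{code}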


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391924#comment-16391924
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

pitrou commented on a change in pull request #447: ARROW-1974: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173294153
 
 

 ##
 File path: src/parquet/schema.h
 ##
 @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
 
   NodePtr field(int i) const { return fields_[i]; }
+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned
 
 Review comment:
   1) That's a good point. The fact that the container is unordered means it 
isn't guaranteed to retain insertion order, even for values which map to the 
same key (I would expect a straightforward implementation to maintain that 
order, though). I should probably remove the sentence above.
   
   2) Because doing otherwise seems like it could break compatibility. Not sure 
how strongly you feel about it. The `std::string` overloads aren't used anymore 
in the parquet-cpp codebase, AFAICT.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391948#comment-16391948
 ] 

ASF GitHub Bot commented on ARROW-1974:
---

cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173300635
 
 

 ##
 File path: src/parquet/schema.h
 ##
 @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
 
   NodePtr field(int i) const { return fields_[i]; }
+  // Get the index of a field by its name, or negative value if not found
+  // If several fields share the same name, the smallest index is returned
 
 Review comment:
   > it could break compatibility
   
   True, though IIUC wouldn't this potentially segfault if you tried to use the 
result to index into something?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Segfault when working with Arrow tables with duplicate columns
> ---
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2290) [C++/Python] Add ability to set codec options for lz4 codec

2018-03-08 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2290:
---

 Summary: [C++/Python] Add ability to set codec options for lz4 
codec
 Key: ARROW-2290
 URL: https://issues.apache.org/jira/browse/ARROW-2290
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Wes McKinney


The LZ4 library has many parameters, currently we do not expose these in C++ or 
Python



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392031#comment-16392031
 ] 

ASF GitHub Bot commented on ARROW-2282:
---

wesm commented on issue #1720: ARROW-2282: [Python] Create StringArray from 
buffers
URL: https://github.com/apache/arrow/pull/1720#issuecomment-371648673
 
 
   rebased


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Create StringArray from buffers
> 
>
> Key: ARROW-2282
> URL: https://issues.apache.org/jira/browse/ARROW-2282
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> While we will add more general-purpose functionality in 
> https://issues.apache.org/jira/browse/ARROW-2281, the interface is more 
> complicated than the constructor that explicitly states all arguments:  
> {{StringArray(int64_t length, const std::shared_ptr& value_offsets, 
> …}}
> Thus I will also expose this explicit constructor.
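
A hypothetical usage sketch of such an explicit constructor on the Python side
(names and signature are assumptions, not taken from the pull request):

{code}
import pyarrow as pa
import struct

# Assumed: a from_buffers-style constructor taking length, value-offset and
# data buffers, mirroring the C++ StringArray constructor quoted above.
offsets = pa.py_buffer(struct.pack("<3i", 0, 3, 6))  # int32 offsets for "foo", "bar"
data = pa.py_buffer(b"foobar")
arr = pa.StringArray.from_buffers(2, offsets, data)
assert arr.to_pylist() == ["foo", "bar"]
{code}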



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2290) [C++/Python] Add ability to set codec options for lz4 codec

2018-03-08 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392101#comment-16392101
 ] 

Lawrence Chan commented on ARROW-2290:
--

For what it's worth, this isn't lz4-specific; I just happen to be working with 
that at the moment.

> [C++/Python] Add ability to set codec options for lz4 codec
> ---
>
> Key: ARROW-2290
> URL: https://issues.apache.org/jira/browse/ARROW-2290
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
>
> The LZ4 library has many parameters, currently we do not expose these in C++ 
> or Python



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-03-08 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392124#comment-16392124
 ] 

Lawrence Chan edited comment on ARROW-300 at 3/8/18 11:46 PM:
--

What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression. My current workaround uses a fixed length byte array but it's 
pretty clunky to do this efficiently, at least in the parquet-cpp 
implementation. There may also be some alignment concerns with that approach 
that I'm just ignoring right now.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.


was (Author: llchan):
What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression. My current workaround uses a fixed length byte array but it's 
pretty clunky to do this efficiently, at least in the parquet-cpp 
implementation. There may also be some alignment concerns with that latter 
approach that I'm just ignoring right now.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> It may be useful, if data is to be sent over the wire, to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-03-08 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392124#comment-16392124
 ] 

Lawrence Chan commented on ARROW-300:
-

What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression. My current workaround uses a fixed length byte array but it's 
pretty clunky to do this efficiently, at least in the parquet-cpp 
implementation. There may also be some alignment concerns with that latter 
approach that I'm just ignoring right now.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.
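
As a rough illustration of the storage-type gap mentioned above (values are
illustrative only):

{code}
import pyarrow as pa

# Arrow holds 8-bit integers natively, whereas Parquet's physical types would
# widen them to INT32; a compressed Arrow IPC file avoids that widening.
values = list(range(100)) * 10
as_int8 = pa.array(values, type=pa.int8())    # 1 byte per value
as_int32 = pa.array(values, type=pa.int32())  # 4 bytes per value
{code}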

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> It may be useful, if data is to be sent over the wire, to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2181) [Python] Add concat_tables to API reference, add documentation on use

2018-03-08 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned ARROW-2181:
---

Assignee: Bryan Cutler

> [Python] Add concat_tables to API reference, add documentation on use
> -
>
> Key: ARROW-2181
> URL: https://issues.apache.org/jira/browse/ARROW-2181
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 0.9.0
>
>
> This omission of documentation was mentioned on the mailing list on February 
> 13. The documentation should illustrate the contrast between 
> {{Table.from_batches}} and {{concat_tables}}.
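
A minimal sketch of the contrast, assuming standard pyarrow APIs (the data is
illustrative, not from the ticket):

{code}
import pyarrow as pa

batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["x"])

# Table.from_batches assembles one table from record batches sharing a schema.
table = pa.Table.from_batches([batch, batch])

# concat_tables concatenates already-built tables with the same schema.
combined = pa.concat_tables([table, table])
assert combined.num_rows == 2 * table.num_rows
{code}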



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-03-08 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392124#comment-16392124
 ] 

Lawrence Chan edited comment on ARROW-300 at 3/9/18 2:00 AM:
-

What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.


was (Author: llchan):
What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression. My current workaround uses a fixed length byte array but it's 
pretty clunky to do this efficiently, at least in the parquet-cpp 
implementation. There may also be some alignment concerns with that approach 
that I'm just ignoring right now.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> It may be useful, if data is to be sent over the wire, to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-03-08 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392262#comment-16392262
 ] 

Wes McKinney commented on ARROW-300:


We haven't done any work on this yet. I think the first step would be to 
propose additional metadata (in the Flatbuffers files) for record batches to 
indicate the style of compression. 

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> It may be useful, if data is to be sent over the wire, to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format

2018-03-08 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392124#comment-16392124
 ] 

Lawrence Chan edited comment on ARROW-300 at 3/9/18 2:09 AM:
-

What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression. I tried to hack it up with FixedLenByteArray but there are a 
slew of complications with that, not to mention alignment concerns etc.

Anyways I'm happy to help on this, but I'm not familiar enough with the code 
base to place it in the right spot. If we make a branch with some 
TODOs/placeholders I can probably plug in more easily.


was (Author: llchan):
What did we decide with this? Imho there's still a use case for compressed 
arrow files due to the limited storage types in parquet. I don't really love 
the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away 
with compression.

Happy to help, but I'm not familiar enough with the code base to place it in 
the right spot. If we make a branch with some TODOs/placeholders I can probably 
plug in more easily.

> [Format] Add buffer compression option to IPC file format
> -
>
> Key: ARROW-300
> URL: https://issues.apache.org/jira/browse/ARROW-300
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> It may be useful, if data is to be sent over the wire, to compress the data 
> buffers themselves as they're being written in the file layout.
> I would propose that we keep this extremely simple with a global buffer 
> compression setting in the file Footer. Probably only two compressors worth 
> supporting out of the box would be zlib (higher compression ratios) and lz4 
> (better performance).
> What does everyone think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2150) [Python] array equality defaults to identity

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2150:
---

Assignee: Wes McKinney

> [Python] array equality defaults to identity
> 
>
> Key: ARROW-2150
> URL: https://issues.apache.org/jira/browse/ARROW-2150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 0.9.0
>
>
> I'm not sure this is deliberate, but it doesn't look very desirable to me:
> {code}
> >>> pa.array([1,2,3], type=pa.int32()) == pa.array([1,2,3], type=pa.int32())
> False
> {code}
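
For contrast, value comparison can be requested explicitly (a minimal sketch,
assuming Array.equals compares contents):

{code}
import pyarrow as pa

a = pa.array([1, 2, 3], type=pa.int32())
b = pa.array([1, 2, 3], type=pa.int32())
assert a.equals(b)   # same contents
assert a is not b    # distinct objects, so identity-based == is False
{code}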



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2150) [Python] array equality defaults to identity

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392323#comment-16392323
 ] 

ASF GitHub Bot commented on ARROW-2150:
---

wesm opened a new pull request #1729: ARROW-2150: [Python] Raise 
NotImplementedError when comparing with pyarrow.Array for now
URL: https://github.com/apache/arrow/pull/1729
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] array equality defaults to identity
> 
>
> Key: ARROW-2150
> URL: https://issues.apache.org/jira/browse/ARROW-2150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I'm not sure this is deliberate, but it doesn't look very desirable to me:
> {code}
> >>> pa.array([1,2,3], type=pa.int32()) == pa.array([1,2,3], type=pa.int32())
> False
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2150) [Python] array equality defaults to identity

2018-03-08 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2150:
--
Labels: pull-request-available  (was: )

> [Python] array equality defaults to identity
> 
>
> Key: ARROW-2150
> URL: https://issues.apache.org/jira/browse/ARROW-2150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I'm not sure this is deliberate, but it doesn't look very desirable to me:
> {code}
> >>> pa.array([1,2,3], type=pa.int32()) == pa.array([1,2,3], type=pa.int32())
> False
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2263:

Fix Version/s: (was: 0.10.0)
   0.9.0

> [Python] test_cython.py fails if pyarrow is not in import path (e.g. with 
> inplace builds)
> -
>
> Key: ARROW-2263
> URL: https://issues.apache.org/jira/browse/ARROW-2263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> see 
> {code}
> $ py.test pyarrow/tests/test_cython.py 
> = test session starts 
> =
> platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg
> collected 1 item  
> 
> pyarrow/tests/test_cython.py F
>   [100%]
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0')
> @pytest.mark.skipif(
> 'ARROW_HOME' not in os.environ,
> reason='ARROW_HOME environment variable not defined')
> def test_cython_api(tmpdir):
> """
> Basic test for the Cython API.
> """
> pytest.importorskip('Cython')
> 
> ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib')
> 
> test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default)
> 
> with tmpdir.as_cwd():
> # Set up temporary workspace
> pyx_file = 'pyarrow_cython_example.pyx'
> shutil.copyfile(os.path.join(here, pyx_file),
> os.path.join(str(tmpdir), pyx_file))
> # Create setup.py file
> if os.name == 'posix':
> compiler_opts = ['-std=c++11']
> else:
> compiler_opts = []
> setup_code = setup_template.format(pyx_file=pyx_file,
>compiler_opts=compiler_opts,
>test_ld_path=test_ld_path)
> with open('setup.py', 'w') as f:
> f.write(setup_code)
> 
> # Compile extension module
> subprocess.check_call([sys.executable, 'setup.py',
> >  'build_ext', '--inplace'])
> pyarrow/tests/test_cython.py:90: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _
> popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace'],)
> kwargs = {}, retcode = 1
> cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> 
> The arguments are the same as for the call function.  Example:
> 
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', 
> '--inplace']' returned non-zero exit status 1.
> ../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: 
> CalledProcessError
>  Captured stderr call 
> -
> Traceback (most recent call last):
>   File "setup.py", line 7, in 
> import pyarrow as pa
> ModuleNotFoundError: No module named 'pyarrow'
> == 1 failed in 0.23 seconds 
> ===
> {code}
> I encountered this bit of brittleness in a fresh install where I had not run 
> {{setup.py develop}} nor {{setup.py install}} on my local pyarrow dev area



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2263:
---

Assignee: Wes McKinney

> [Python] test_cython.py fails if pyarrow is not in import path (e.g. with 
> inplace builds)
> -
>
> Key: ARROW-2263
> URL: https://issues.apache.org/jira/browse/ARROW-2263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> see 
> {code}
> $ py.test pyarrow/tests/test_cython.py 
> = test session starts 
> =
> platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg
> collected 1 item  
> 
> pyarrow/tests/test_cython.py F
>   [100%]
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0')
> @pytest.mark.skipif(
> 'ARROW_HOME' not in os.environ,
> reason='ARROW_HOME environment variable not defined')
> def test_cython_api(tmpdir):
> """
> Basic test for the Cython API.
> """
> pytest.importorskip('Cython')
> 
> ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib')
> 
> test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default)
> 
> with tmpdir.as_cwd():
> # Set up temporary workspace
> pyx_file = 'pyarrow_cython_example.pyx'
> shutil.copyfile(os.path.join(here, pyx_file),
> os.path.join(str(tmpdir), pyx_file))
> # Create setup.py file
> if os.name == 'posix':
> compiler_opts = ['-std=c++11']
> else:
> compiler_opts = []
> setup_code = setup_template.format(pyx_file=pyx_file,
>compiler_opts=compiler_opts,
>test_ld_path=test_ld_path)
> with open('setup.py', 'w') as f:
> f.write(setup_code)
> 
> # Compile extension module
> subprocess.check_call([sys.executable, 'setup.py',
> >  'build_ext', '--inplace'])
> pyarrow/tests/test_cython.py:90: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _
> popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace'],)
> kwargs = {}, retcode = 1
> cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> 
> The arguments are the same as for the call function.  Example:
> 
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', 
> '--inplace']' returned non-zero exit status 1.
> ../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: 
> CalledProcessError
>  Captured stderr call 
> -
> Traceback (most recent call last):
>   File "setup.py", line 7, in 
> import pyarrow as pa
> ModuleNotFoundError: No module named 'pyarrow'
> == 1 failed in 0.23 seconds 
> ===
> {code}
> I encountered this bit of brittleness in a fresh install where I had not run 
> {{setup.py develop}} nor {{setup.py install}} on my local pyarrow dev area



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)

2018-03-08 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2263:
--
Labels: pull-request-available  (was: )

> [Python] test_cython.py fails if pyarrow is not in import path (e.g. with 
> inplace builds)
> -
>
> Key: ARROW-2263
> URL: https://issues.apache.org/jira/browse/ARROW-2263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see 
> {code}
> $ py.test pyarrow/tests/test_cython.py 
> = test session starts 
> =
> platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg
> collected 1 item  
> 
> pyarrow/tests/test_cython.py F
>   [100%]
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0')
> @pytest.mark.skipif(
> 'ARROW_HOME' not in os.environ,
> reason='ARROW_HOME environment variable not defined')
> def test_cython_api(tmpdir):
> """
> Basic test for the Cython API.
> """
> pytest.importorskip('Cython')
> 
> ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib')
> 
> test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default)
> 
> with tmpdir.as_cwd():
> # Set up temporary workspace
> pyx_file = 'pyarrow_cython_example.pyx'
> shutil.copyfile(os.path.join(here, pyx_file),
> os.path.join(str(tmpdir), pyx_file))
> # Create setup.py file
> if os.name == 'posix':
> compiler_opts = ['-std=c++11']
> else:
> compiler_opts = []
> setup_code = setup_template.format(pyx_file=pyx_file,
>compiler_opts=compiler_opts,
>test_ld_path=test_ld_path)
> with open('setup.py', 'w') as f:
> f.write(setup_code)
> 
> # Compile extension module
> subprocess.check_call([sys.executable, 'setup.py',
> >  'build_ext', '--inplace'])
> pyarrow/tests/test_cython.py:90: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _
> popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace'],)
> kwargs = {}, retcode = 1
> cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> 
> The arguments are the same as for the call function.  Example:
> 
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', 
> '--inplace']' returned non-zero exit status 1.
> ../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: 
> CalledProcessError
>  Captured stderr call 
> -
> Traceback (most recent call last):
>   File "setup.py", line 7, in 
> import pyarrow as pa
> ModuleNotFoundError: No module named 'pyarrow'
> == 1 failed in 0.23 seconds 
> ===
> {code}
> I encountered this bit of brittleness in a fresh install where I had not run 
> {{setup.py develop}} nor {{setup.py install}} on my local pyarrow dev area



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392332#comment-16392332
 ] 

ASF GitHub Bot commented on ARROW-2263:
---

wesm commented on issue #1730: ARROW-2263: [Python] Prepend local pyarrow/ path 
to PYTHONPATH in test_cython.py
URL: https://github.com/apache/arrow/pull/1730#issuecomment-371700652
 
 
   This was bugging me -- turned out to be easy to fix. 
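
A rough sketch of the approach named in the PR title, assuming the test passes a
modified environment to the subprocess (the path below is a placeholder, not the
actual patch):

{code}
import os
import subprocess
import sys

pyarrow_dir = "/path/to/arrow/python"  # hypothetical in-tree location of pyarrow/
env = dict(os.environ)
env["PYTHONPATH"] = os.pathsep.join(
    p for p in [pyarrow_dir, env.get("PYTHONPATH", "")] if p)
subprocess.check_call(
    [sys.executable, "setup.py", "build_ext", "--inplace"], env=env)
{code}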


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] test_cython.py fails if pyarrow is not in import path (e.g. with 
> inplace builds)
> -
>
> Key: ARROW-2263
> URL: https://issues.apache.org/jira/browse/ARROW-2263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see 
> {code}
> $ py.test pyarrow/tests/test_cython.py 
> = test session starts 
> =
> platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg
> collected 1 item  
> 
> pyarrow/tests/test_cython.py F
>   [100%]
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0')
> @pytest.mark.skipif(
> 'ARROW_HOME' not in os.environ,
> reason='ARROW_HOME environment variable not defined')
> def test_cython_api(tmpdir):
> """
> Basic test for the Cython API.
> """
> pytest.importorskip('Cython')
> 
> ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib')
> 
> test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default)
> 
> with tmpdir.as_cwd():
> # Set up temporary workspace
> pyx_file = 'pyarrow_cython_example.pyx'
> shutil.copyfile(os.path.join(here, pyx_file),
> os.path.join(str(tmpdir), pyx_file))
> # Create setup.py file
> if os.name == 'posix':
> compiler_opts = ['-std=c++11']
> else:
> compiler_opts = []
> setup_code = setup_template.format(pyx_file=pyx_file,
>compiler_opts=compiler_opts,
>test_ld_path=test_ld_path)
> with open('setup.py', 'w') as f:
> f.write(setup_code)
> 
> # Compile extension module
> subprocess.check_call([sys.executable, 'setup.py',
> >  'build_ext', '--inplace'])
> pyarrow/tests/test_cython.py:90: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _
> popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace'],)
> kwargs = {}, retcode = 1
> cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> 
> The arguments are the same as for the call function.  Example:
> 
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', 
> '--inplace']' returned non-zero exit status 1.
> ../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: 
> CalledProcessError
>  Captured stderr call 
> -
> Traceback (most recent call last):
>   File "setup.py", line

[jira] [Commented] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392330#comment-16392330
 ] 

ASF GitHub Bot commented on ARROW-2263:
---

wesm opened a new pull request #1730: ARROW-2263: [Python] Prepend local 
pyarrow/ path to PYTHONPATH in test_cython.py
URL: https://github.com/apache/arrow/pull/1730
 
 
   This was bugging me -- turned out to be easy to fix. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] test_cython.py fails if pyarrow is not in import path (e.g. with 
> inplace builds)
> -
>
> Key: ARROW-2263
> URL: https://issues.apache.org/jira/browse/ARROW-2263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see 
> {code}
> $ py.test pyarrow/tests/test_cython.py 
> = test session starts 
> =
> platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg
> collected 1 item  
> 
> pyarrow/tests/test_cython.py F
>   [100%]
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0')
> @pytest.mark.skipif(
> 'ARROW_HOME' not in os.environ,
> reason='ARROW_HOME environment variable not defined')
> def test_cython_api(tmpdir):
> """
> Basic test for the Cython API.
> """
> pytest.importorskip('Cython')
> 
> ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib')
> 
> test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default)
> 
> with tmpdir.as_cwd():
> # Set up temporary workspace
> pyx_file = 'pyarrow_cython_example.pyx'
> shutil.copyfile(os.path.join(here, pyx_file),
> os.path.join(str(tmpdir), pyx_file))
> # Create setup.py file
> if os.name == 'posix':
> compiler_opts = ['-std=c++11']
> else:
> compiler_opts = []
> setup_code = setup_template.format(pyx_file=pyx_file,
>compiler_opts=compiler_opts,
>test_ld_path=test_ld_path)
> with open('setup.py', 'w') as f:
> f.write(setup_code)
> 
> # Compile extension module
> subprocess.check_call([sys.executable, 'setup.py',
> >  'build_ext', '--inplace'])
> pyarrow/tests/test_cython.py:90: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _
> popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace'],)
> kwargs = {}, retcode = 1
> cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> 
> The arguments are the same as for the call function.  Example:
> 
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', 
> '--inplace']' returned non-zero exit status 1.
> ../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: 
> CalledProcessError
>  Captured stderr call 
> -
> Traceback (most recent call last):
>   File "setup.py", line 7, in 
> im

[jira] [Assigned] (ARROW-2268) Remove MD5 checksums from release process

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2268:
---

Assignee: Wes McKinney

> Remove MD5 checksums from release process
> -
>
> Key: ARROW-2268
> URL: https://issues.apache.org/jira/browse/ARROW-2268
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> The ASF has changed its release policy for signatures and checksums to 
> contraindicate the use of MD5 checksums: 
> http://www.apache.org/dev/release-distribution#sigs-and-sums. We should 
> remove this from our various release scripts prior to the 0.9.0 release



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2268) Remove MD5 checksums from release process

2018-03-08 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2268:
--
Labels: pull-request-available  (was: )

> Remove MD5 checksums from release process
> -
>
> Key: ARROW-2268
> URL: https://issues.apache.org/jira/browse/ARROW-2268
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The ASF has changed its release policy for signatures and checksums to 
> contraindicate the use of MD5 checksums: 
> http://www.apache.org/dev/release-distribution#sigs-and-sums. We should 
> remove this from our various release scripts prior to the 0.9.0 release



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2167) [C++] Building Orc extensions fails with the default BUILD_WARNING_LEVEL=Production

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2167:
---

Assignee: Wes McKinney

> [C++] Building Orc extensions fails with the default 
> BUILD_WARNING_LEVEL=Production
> ---
>
> Key: ARROW-2167
> URL: https://issues.apache.org/jira/browse/ARROW-2167
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> Building orc_ep fails because there are a bunch of upstream warnings like not 
> providing {{override}} on virtual destructor subclasses, and using {{0}} as 
> the {{nullptr}} constant and the default {{BUILD_WARNING_LEVEL}} is 
> {{Production}} which includes {{-Wall}} and treats warnings as errors.
> I see that there are different possible options for {{BUILD_WARNING_LEVEL}} 
> so it's possible for developers to deal with this issue.
> It seems easier to let EPs build with whatever the default warning level is 
> for the project rather than force our defaults on those projects.
> Generally speaking, are we using our own CXX_FLAGS for EPs other than Orc?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2268) Remove MD5 checksums from release process

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392338#comment-16392338
 ] 

ASF GitHub Bot commented on ARROW-2268:
---

wesm opened a new pull request #1731: ARROW-2268: Drop usage of md5 checksums 
for source releases, verification scripts
URL: https://github.com/apache/arrow/pull/1731
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Remove MD5 checksums from release process
> -
>
> Key: ARROW-2268
> URL: https://issues.apache.org/jira/browse/ARROW-2268
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The ASF has changed its release policy for signatures and checksums to 
> contraindicate the use of MD5 checksums: 
> http://www.apache.org/dev/release-distribution#sigs-and-sums. We should 
> remove this from our various release scripts prior to the 0.9.0 release
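
For illustration only, a minimal Python sketch of the kind of step that replaces an .md5 file in a release script; the helper name and file layout here are hypothetical, and the actual scripts may simply call sha256sum/sha512sum and gpg instead.

{code}
import hashlib

def write_sha512(path):
    # Hypothetical helper: write a .sha512 file next to the artifact in the
    # "<digest>  <filename>" format understood by `shasum -c` / `sha512sum -c`.
    with open(path, 'rb') as f:
        digest = hashlib.sha512(f.read()).hexdigest()
    with open(path + '.sha512', 'w') as out:
        out.write('{}  {}\n'.format(digest, path))

write_sha512('apache-arrow-0.9.0.tar.gz')
{code}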



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1535) [Python] Enable sdist source tarballs to build assuming that Arrow C++ libraries are available on the host system

2018-03-08 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392341#comment-16392341
 ] 

Wes McKinney commented on ARROW-1535:
-

[~kou] in theory this should work now, but we should double check that things 
are still working on master

> [Python] Enable sdist source tarballs to build assuming that Arrow C++ 
> libraries are available on the host system
> -
>
> Key: ARROW-1535
> URL: https://issues.apache.org/jira/browse/ARROW-1535
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: Build, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2167) [C++] Building Orc extensions fails with the default BUILD_WARNING_LEVEL=Production

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2167.
-
Resolution: Won't Fix

This seems to be fixed in https://github.com/apache/arrow/pull/1597. Both 
CHECKIN and PRODUCTION warning levels build fine now

We are using the same CMAKE_CXX_FLAGS for EPs -- there are some additional 
suppressions for ORC. I suggest we deal with this on a case-by-case basis going 
forward.

> [C++] Building Orc extensions fails with the default 
> BUILD_WARNING_LEVEL=Production
> ---
>
> Key: ARROW-2167
> URL: https://issues.apache.org/jira/browse/ARROW-2167
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> Building orc_ep fails because there are a bunch of upstream warnings like not 
> providing {{override}} on virtual destructor subclasses, and using {{0}} as 
> the {{nullptr}} constant and the default {{BUILD_WARNING_LEVEL}} is 
> {{Production}} which includes {{-Wall}} and treats warnings as errors.
> I see that there are different possible options for {{BUILD_WARNING_LEVEL}} 
> so it's possible for developers to deal with this issue.
> It seems easier to let EPs build with whatever the default warning level is 
> for the project rather than force our defaults on those projects.
> Generally speaking, are we using our own CXX_FLAGS for EPs other than Orc?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (ARROW-2167) [C++] Building Orc extensions fails with the default BUILD_WARNING_LEVEL=Production

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-2167:
-

> [C++] Building Orc extensions fails with the default 
> BUILD_WARNING_LEVEL=Production
> ---
>
> Key: ARROW-2167
> URL: https://issues.apache.org/jira/browse/ARROW-2167
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> Building orc_ep fails because there are a bunch of upstream warnings like not 
> providing {{override}} on virtual destructor subclasses, and using {{0}} as 
> the {{nullptr}} constant and the default {{BUILD_WARNING_LEVEL}} is 
> {{Production}} which includes {{-Wall}} and treats warnings as errors.
> I see that there are different possible options for {{BUILD_WARNING_LEVEL}} 
> so it's possible for developers to deal with this issue.
> It seems easier to let EPs build with whatever the default warning level is 
> for the project rather than force our defaults on those projects.
> Generally speaking, are we using our own CXX_FLAGS for EPs other than Orc?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2167) [C++] Building Orc extensions fails with the default BUILD_WARNING_LEVEL=Production

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2167.
-
Resolution: Fixed

> [C++] Building Orc extensions fails with the default 
> BUILD_WARNING_LEVEL=Production
> ---
>
> Key: ARROW-2167
> URL: https://issues.apache.org/jira/browse/ARROW-2167
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> Building orc_ep fails because there are a bunch of upstream warnings like not 
> providing {{override}} on virtual destructor subclasses, and using {{0}} as 
> the {{nullptr}} constant and the default {{BUILD_WARNING_LEVEL}} is 
> {{Production}} which includes {{-Wall}} and treats warnings as errors.
> I see that there are different possible options for {{BUILD_WARNING_LEVEL}} 
> so it's possible for developers to deal with this issue.
> It seems easier to let EPs build with whatever the default warning level is 
> for the project rather than force our defaults on those projects.
> Generally speaking, are we using our own CXX_FLAGS for EPs other than Orc?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2291) cpp README missing instructions for libboost-regex-dev

2018-03-08 Thread Andy Grove (JIRA)
Andy Grove created ARROW-2291:
-

 Summary: cpp README missing instructions for libboost-regex-dev
 Key: ARROW-2291
 URL: https://issues.apache.org/jira/browse/ARROW-2291
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
 Environment: Ubuntu 16.04
Reporter: Andy Grove


After following the instructions in the README, I could not generate a makefile 
using CMake because of a missing dependency.

The README needs to be updated to include installing libboost-regex-dev.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2291) cpp README missing instructions for libboost-regex-dev

2018-03-08 Thread Andy Grove (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392366#comment-16392366
 ] 

Andy Grove commented on ARROW-2291:
---

Here is a PR to update the docs: https://github.com/apache/arrow/pull/1732

> cpp README missing instructions for libboost-regex-dev
> --
>
> Key: ARROW-2291
> URL: https://issues.apache.org/jira/browse/ARROW-2291
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
> Environment: Ubuntu 16.04
>Reporter: Andy Grove
>Priority: Trivial
>
> After following the instructions in the README, I could not generate a 
> makefile using CMake because of a missing dependency.
> The README needs to be updated to include installing libboost-regex-dev.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2291) [C++] README missing instructions for libboost-regex-dev

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2291:

Summary: [C++] README missing instructions for libboost-regex-dev  (was: 
cpp README missing instructions for libboost-regex-dev)

> [C++] README missing instructions for libboost-regex-dev
> 
>
> Key: ARROW-2291
> URL: https://issues.apache.org/jira/browse/ARROW-2291
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
> Environment: Ubuntu 16.04
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Trivial
>
> After following the instructions in the README, I could not generate a 
> makefile using CMake because of a missing dependency.
> The README needs to be updated to include installing libboost-regex-dev.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2291) cpp README missing instructions for libboost-regex-dev

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2291:
---

Assignee: Andy Grove

> cpp README missing instructions for libboost-regex-dev
> --
>
> Key: ARROW-2291
> URL: https://issues.apache.org/jira/browse/ARROW-2291
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
> Environment: Ubuntu 16.04
>Reporter: Andy Grove
>Assignee: Andy Grove
>Priority: Trivial
>
> After following the instructions in the README, I could not generate a 
> makefile using CMake because of a missing dependency.
> The README needs to be updated to include installing libboost-regex-dev.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392372#comment-16392372
 ] 

ASF GitHub Bot commented on ARROW-2263:
---

wesm closed pull request #1730: ARROW-2263: [Python] Prepend local pyarrow/ 
path to PYTHONPATH in test_cython.py
URL: https://github.com/apache/arrow/pull/1730
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/pyarrow/tests/test_cython.py 
b/python/pyarrow/tests/test_cython.py
index df5e70ee7..57dbeb554 100644
--- a/python/pyarrow/tests/test_cython.py
+++ b/python/pyarrow/tests/test_cython.py
@@ -24,6 +24,7 @@
 
 import pyarrow as pa
 
+import pyarrow.tests.util as test_util
 
 here = os.path.dirname(os.path.abspath(__file__))
 
@@ -85,9 +86,14 @@ def test_cython_api(tmpdir):
 with open('setup.py', 'w') as f:
 f.write(setup_code)
 
+# ARROW-2263: Make environment with this pyarrow/ package first on the
+# PYTHONPATH, for local dev environments
+subprocess_env = test_util.get_modified_env_with_pythonpath()
+
 # Compile extension module
 subprocess.check_call([sys.executable, 'setup.py',
-   'build_ext', '--inplace'])
+   'build_ext', '--inplace'],
+  env=subprocess_env)
 
 # Check basic functionality
 orig_path = sys.path[:]
diff --git a/python/pyarrow/tests/test_serialization.py 
b/python/pyarrow/tests/test_serialization.py
index c17408457..64aab0671 100644
--- a/python/pyarrow/tests/test_serialization.py
+++ b/python/pyarrow/tests/test_serialization.py
@@ -28,6 +28,8 @@
 import pyarrow as pa
 import numpy as np
 
+import pyarrow.tests.util as test_util
+
 try:
 import torch
 except ImportError:
@@ -624,18 +626,6 @@ def deserialize_regex(serialized, q):
 p.join()
 
 
-def _get_modified_env_with_pythonpath():
-# Prepend pyarrow root directory to PYTHONPATH
-env = os.environ.copy()
-existing_pythonpath = env.get('PYTHONPATH', '')
-
-module_path = os.path.abspath(
-os.path.dirname(os.path.dirname(pa.__file__)))
-
-env['PYTHONPATH'] = os.pathsep.join((module_path, existing_pythonpath))
-return env
-
-
 def test_deserialize_buffer_in_different_process():
 import tempfile
 import subprocess
@@ -645,7 +635,7 @@ def test_deserialize_buffer_in_different_process():
 f.write(b.to_pybytes())
 f.close()
 
-subprocess_env = _get_modified_env_with_pythonpath()
+subprocess_env = test_util.get_modified_env_with_pythonpath()
 
 dir_path = os.path.dirname(os.path.realpath(__file__))
 python_file = os.path.join(dir_path, 'deserialize_buffer.py')
diff --git a/python/pyarrow/tests/util.py b/python/pyarrow/tests/util.py
index a3ba9000c..8c8d23b0c 100644
--- a/python/pyarrow/tests/util.py
+++ b/python/pyarrow/tests/util.py
@@ -19,9 +19,12 @@
 Utility functions for testing
 """
 
+import contextlib
 import decimal
+import os
 import random
-import contextlib
+
+import pyarrow as pa
 
 
 def randsign():
@@ -91,3 +94,15 @@ def randdecimal(precision, scale):
 return decimal.Decimal(
 '{}.{}'.format(whole, str(fractional).rjust(scale, '0'))
 )
+
+
+def get_modified_env_with_pythonpath():
+# Prepend pyarrow root directory to PYTHONPATH
+env = os.environ.copy()
+existing_pythonpath = env.get('PYTHONPATH', '')
+
+module_path = os.path.abspath(
+os.path.dirname(os.path.dirname(pa.__file__)))
+
+env['PYTHONPATH'] = os.pathsep.join((module_path, existing_pythonpath))
+return env


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
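
As a side note, the following is a condensed, self-contained sketch of the approach taken in the patch above: prepend the directory containing the local pyarrow/ package to PYTHONPATH so that a child process (standing in here for the Cython build driven by setup.py) can import pyarrow even for inplace builds that were never installed.

{code}
import os
import subprocess
import sys

import pyarrow as pa

# Build a modified environment with the local pyarrow/ package first on PYTHONPATH.
env = os.environ.copy()
module_path = os.path.abspath(os.path.dirname(os.path.dirname(pa.__file__)))
env['PYTHONPATH'] = os.pathsep.join((module_path, env.get('PYTHONPATH', '')))

# Any subprocess started with this environment can now import pyarrow.
subprocess.check_call([sys.executable, '-c', 'import pyarrow'], env=env)
{code}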


> [Python] test_cython.py fails if pyarrow is not in import path (e.g. with 
> inplace builds)
> -
>
> Key: ARROW-2263
> URL: https://issues.apache.org/jira/browse/ARROW-2263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see 
> {code}
> $ py.test pyarrow/tests/test_cython.py 
> = test session starts 
> =
> platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-

[jira] [Resolved] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2263.
-
Resolution: Fixed

Issue resolved by pull request 1730
[https://github.com/apache/arrow/pull/1730]

> [Python] test_cython.py fails if pyarrow is not in import path (e.g. with 
> inplace builds)
> -
>
> Key: ARROW-2263
> URL: https://issues.apache.org/jira/browse/ARROW-2263
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> see 
> {code}
> $ py.test pyarrow/tests/test_cython.py 
> = test session starts 
> =
> platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0
> rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg
> collected 1 item  
> 
> pyarrow/tests/test_cython.py F
>   [100%]
> == FAILURES 
> ===
> ___ test_cython_api 
> ___
> tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0')
> @pytest.mark.skipif(
> 'ARROW_HOME' not in os.environ,
> reason='ARROW_HOME environment variable not defined')
> def test_cython_api(tmpdir):
> """
> Basic test for the Cython API.
> """
> pytest.importorskip('Cython')
> 
> ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib')
> 
> test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default)
> 
> with tmpdir.as_cwd():
> # Set up temporary workspace
> pyx_file = 'pyarrow_cython_example.pyx'
> shutil.copyfile(os.path.join(here, pyx_file),
> os.path.join(str(tmpdir), pyx_file))
> # Create setup.py file
> if os.name == 'posix':
> compiler_opts = ['-std=c++11']
> else:
> compiler_opts = []
> setup_code = setup_template.format(pyx_file=pyx_file,
>compiler_opts=compiler_opts,
>test_ld_path=test_ld_path)
> with open('setup.py', 'w') as f:
> f.write(setup_code)
> 
> # Compile extension module
> subprocess.check_call([sys.executable, 'setup.py',
> >  'build_ext', '--inplace'])
> pyarrow/tests/test_cython.py:90: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _
> popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace'],)
> kwargs = {}, retcode = 1
> cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 
> 'build_ext', '--inplace']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> 
> The arguments are the same as for the call function.  Example:
> 
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', 
> '--inplace']' returned non-zero exit status 1.
> ../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: 
> CalledProcessError
>  Captured stderr call 
> -
> Traceback (most recent call last):
>   File "setup.py", line 7, in <module>
> import pyarrow as pa
> ModuleNotFoundError: No module named 'pyarrow'
> == 1 failed in 0.23 seconds 
> ===
> {code}
> I encountered this bit of brittleness in a fresh install where I had not run 
> {{setup.py develop}} nor {{setup.py install}} on my local pyarrow dev area



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1940.
-
Resolution: Fixed

Issue resolved by pull request 1728
[https://github.com/apache/arrow/pull/1728]

> [Python] Extra metadata gets added after multiple conversions between 
> pd.DataFrame and pa.Table
> ---
>
> Key: ARROW-1940
> URL: https://issues.apache.org/jira/browse/ARROW-1940
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Dima Ryazanov
>Assignee: Phillip Cloud
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
> Attachments: fail.py
>
>
> We have a unit test that verifies that loading a dataframe from a .parq file 
> and saving it back with no changes produces the same result as the original 
> file. It started failing with pyarrow 0.8.0.
> After digging into it, I discovered that after the first conversion from 
> pd.DataFrame to pa.Table, the table contains the following metadata (among 
> other things):
> {code}
> "column_indexes": [{"metadata": null, "field_name": null, "name": null, 
> "numpy_type": "object", "pandas_type": "bytes"}]
> {code}
> However, after converting it to pd.DataFrame and back into a pa.Table for the 
> second time, the metadata gets an encoding field:
> {code}
> "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, 
> "name": null, "numpy_type": "object", "pandas_type": "unicode"}]
> {code}
> See the attached file for a test case.
> So specifically, it appears that dataframe->table->dataframe->table 
> conversion produces a different result from just dataframe->table - which I 
> think is unexpected.
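
A minimal sketch of the round trip being compared (the attached fail.py is the authoritative reproduction; this assumes pandas and pyarrow 0.8.0+ are importable):

{code}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': [1, 2, 3]})
t1 = pa.Table.from_pandas(df)              # dataframe -> table
t2 = pa.Table.from_pandas(t1.to_pandas())  # dataframe -> table -> dataframe -> table
# Before the fix, the second conversion gained an {"encoding": "UTF-8"} entry in
# the column_indexes metadata, so the two tables' pandas metadata differed.
print(t1.schema.metadata == t2.schema.metadata)
{code}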



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392375#comment-16392375
 ] 

ASF GitHub Bot commented on ARROW-1940:
---

wesm commented on a change in pull request #1728: ARROW-1940: [Python] Extra 
metadata gets added after multiple conversions between pd.DataFrame and pa.Table
URL: https://github.com/apache/arrow/pull/1728#discussion_r173361703
 
 

 ##
 File path: cpp/src/arrow/python/helpers.cc
 ##
 @@ -116,7 +116,8 @@ static Status InferDecimalPrecisionAndScale(PyObject* 
python_decimal, int32_t* p
   DCHECK_NE(scale, NULLPTR);
 
   // TODO(phillipc): Make sure we perform PyDecimal_Check(python_decimal) as a 
DCHECK
-  OwnedRef as_tuple(PyObject_CallMethod(python_decimal, "as_tuple", ""));
+  OwnedRef as_tuple(PyObject_CallMethod(python_decimal, 
const_cast<char*>("as_tuple"),
+const_cast<char*>("")));
 
 Review comment:
   see also the `cpp_PyObject_CallMethod` wrapper for this issue in io.cc


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Extra metadata gets added after multiple conversions between 
> pd.DataFrame and pa.Table
> ---
>
> Key: ARROW-1940
> URL: https://issues.apache.org/jira/browse/ARROW-1940
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Dima Ryazanov
>Assignee: Phillip Cloud
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
> Attachments: fail.py
>
>
> We have a unit test that verifies that loading a dataframe from a .parq file 
> and saving it back with no changes produces the same result as the original 
> file. It started failing with pyarrow 0.8.0.
> After digging into it, I discovered that after the first conversion from 
> pd.DataFrame to pa.Table, the table contains the following metadata (among 
> other things):
> {code}
> "column_indexes": [{"metadata": null, "field_name": null, "name": null, 
> "numpy_type": "object", "pandas_type": "bytes"}]
> {code}
> However, after converting it to pd.DataFrame and back into a pa.Table for the 
> second time, the metadata gets an encoding field:
> {code}
> "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, 
> "name": null, "numpy_type": "object", "pandas_type": "unicode"}]
> {code}
> See the attached file for a test case.
> So specifically, it appears that dataframe->table->dataframe->table 
> conversion produces a different result from just dataframe->table - which I 
> think is unexpected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392377#comment-16392377
 ] 

ASF GitHub Bot commented on ARROW-1940:
---

wesm closed pull request #1728: ARROW-1940: [Python] Extra metadata gets added 
after multiple conversions between pd.DataFrame and pa.Table
URL: https://github.com/apache/arrow/pull/1728
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/helpers.cc b/cpp/src/arrow/python/helpers.cc
index 429068dd1..13dcc4661 100644
--- a/cpp/src/arrow/python/helpers.cc
+++ b/cpp/src/arrow/python/helpers.cc
@@ -116,7 +116,8 @@ static Status InferDecimalPrecisionAndScale(PyObject* 
python_decimal, int32_t* p
   DCHECK_NE(scale, NULLPTR);
 
   // TODO(phillipc): Make sure we perform PyDecimal_Check(python_decimal) as a 
DCHECK
-  OwnedRef as_tuple(PyObject_CallMethod(python_decimal, "as_tuple", ""));
+  OwnedRef as_tuple(PyObject_CallMethod(python_decimal, 
const_cast("as_tuple"),
const_cast<char*>("as_tuple"),
+const_cast<char*>("")));
   DCHECK(PyTuple_Check(as_tuple.obj()));
 
@@ -241,7 +242,8 @@ bool PyDecimal_Check(PyObject* obj) {
 
 bool PyDecimal_ISNAN(PyObject* obj) {
   DCHECK(PyDecimal_Check(obj)) << "obj is not an instance of decimal.Decimal";
-  OwnedRef is_nan(PyObject_CallMethod(obj, "is_nan", ""));
+  OwnedRef is_nan(
+  PyObject_CallMethod(obj, const_cast<char*>("is_nan"), 
const_cast<char*>("")));
   return PyObject_IsTrue(is_nan.obj()) == 1;
 }
 
diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index 0bc47fc0d..97ea51d7e 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -18,6 +18,7 @@
 import ast
 import collections
 import json
+import operator
 import re
 
 import pandas.core.internals as _int
@@ -99,8 +100,8 @@ def get_logical_type(arrow_type):
 np.float32: 'float32',
 np.float64: 'float64',
 'datetime64[D]': 'date',
-np.str_: 'unicode',
-np.bytes_: 'bytes',
+np.unicode_: 'string' if not PY2 else 'unicode',
+np.bytes_: 'bytes' if not PY2 else 'string',
 }
 
 
@@ -615,6 +616,22 @@ def table_to_blockmanager(options, table, memory_pool, 
nthreads=1,
 
 
 def _backwards_compatible_index_name(raw_name, logical_name):
+"""Compute the name of an index column that is compatible with older
+versions of :mod:`pyarrow`.
+
+Parameters
+--
+raw_name : str
+logical_name : str
+
+Returns
+---
+result : str
+
+Notes
+-
+* Part of :func:`~pyarrow.pandas_compat.table_to_blockmanager`
+"""
 # Part of table_to_blockmanager
 pattern = r'^__index_level_\d+__$'
 if raw_name == logical_name and re.match(pattern, raw_name) is not None:
@@ -623,8 +640,57 @@ def _backwards_compatible_index_name(raw_name, 
logical_name):
 return logical_name
 
 
+_pandas_logical_type_map = {
+'date': 'datetime64[D]',
+'unicode': np.unicode_,
+'bytes': np.bytes_,
+'string': np.str_,
+'empty': np.object_,
+'mixed': np.object_,
+}
+
+
+def _pandas_type_to_numpy_type(pandas_type):
+"""Get the numpy dtype that corresponds to a pandas type.
+
+Parameters
+--
+pandas_type : str
+The result of a call to pandas.lib.infer_dtype.
+
+Returns
+---
+dtype : np.dtype
+The dtype that corresponds to `pandas_type`.
+"""
+try:
+return _pandas_logical_type_map[pandas_type]
+except KeyError:
+return np.dtype(pandas_type)
+
+
 def _reconstruct_columns_from_metadata(columns, column_indexes):
-# Part of table_to_blockmanager
+"""Construct a pandas MultiIndex from `columns` and column index metadata
+in `column_indexes`.
+
+Parameters
+--
+columns : List[pd.Index]
+The columns coming from a pyarrow.Table
+column_indexes : List[Dict[str, str]]
+The column index metadata deserialized from the JSON schema metadata
+in a :class:`~pyarrow.Table`.
+
+Returns
+---
+result : MultiIndex
+The index reconstructed using `column_indexes` metadata with levels of
+the correct type.
+
+Notes
+-
+* Part of :func:`~pyarrow.pandas_compat.table_to_blockmanager`
+"""
 
 # Get levels and labels, and provide sane defaults if the index has a
 # single level to avoid if/else spaghetti.
@@ -635,21 +701,28 @@ def _reconstruct_columns_from_metadata(columns, 
column_indexes):
 
 # Convert each level to the dtype provided in the metadata
 levels_dtypes = [
-(level, col_index.get('numpy_type', level.dtype))
+(level, col_index.get('pandas_type', str(level.

[jira] [Updated] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table

2018-03-08 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1940:
--
Labels: pull-request-available  (was: )

> [Python] Extra metadata gets added after multiple conversions between 
> pd.DataFrame and pa.Table
> ---
>
> Key: ARROW-1940
> URL: https://issues.apache.org/jira/browse/ARROW-1940
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Dima Ryazanov
>Assignee: Phillip Cloud
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
> Attachments: fail.py
>
>
> We have a unit test that verifies that loading a dataframe from a .parq file 
> and saving it back with no changes produces the same result as the original 
> file. It started failing with pyarrow 0.8.0.
> After digging into it, I discovered that after the first conversion from 
> pd.DataFrame to pa.Table, the table contains the following metadata (among 
> other things):
> {code}
> "column_indexes": [{"metadata": null, "field_name": null, "name": null, 
> "numpy_type": "object", "pandas_type": "bytes"}]
> {code}
> However, after converting it to pd.DataFrame and back into a pa.Table for the 
> second time, the metadata gets an encoding field:
> {code}
> "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, 
> "name": null, "numpy_type": "object", "pandas_type": "unicode"}]
> {code}
> See the attached file for a test case.
> So specifically, it appears that dataframe->table->dataframe->table 
> conversion produces a different result from just dataframe->table - which I 
> think is unexpected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2289.
-
Resolution: Fixed

Issue resolved by pull request 1726
[https://github.com/apache/arrow/pull/1726]

> [GLib] Add  Numeric, Integer and FloatingPoint data types
> -
>
> Key: ARROW-2289
> URL: https://issues.apache.org/jira/browse/ARROW-2289
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Affects Versions: 0.8.0
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392386#comment-16392386
 ] 

ASF GitHub Bot commented on ARROW-2289:
---

wesm closed pull request #1726: ARROW-2289: [GLib] Add Numeric, Integer, 
FloatingPoint data types
URL: https://github.com/apache/arrow/pull/1726
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/c_glib/arrow-glib/basic-data-type.cpp 
b/c_glib/arrow-glib/basic-data-type.cpp
index a5f7aed1b..82abfa35c 100644
--- a/c_glib/arrow-glib/basic-data-type.cpp
+++ b/c_glib/arrow-glib/basic-data-type.cpp
@@ -315,9 +315,39 @@ garrow_boolean_data_type_new(void)
 }
 
 
+G_DEFINE_ABSTRACT_TYPE(GArrowNumericDataType,\
+   garrow_numeric_data_type, \
+   GARROW_TYPE_FIXED_WIDTH_DATA_TYPE)
+
+static void
+garrow_numeric_data_type_init(GArrowNumericDataType *object)
+{
+}
+
+static void
+garrow_numeric_data_type_class_init(GArrowNumericDataTypeClass *klass)
+{
+}
+
+
+G_DEFINE_ABSTRACT_TYPE(GArrowIntegerDataType,\
+   garrow_integer_data_type, \
+   GARROW_TYPE_NUMERIC_DATA_TYPE)
+
+static void
+garrow_integer_data_type_init(GArrowIntegerDataType *object)
+{
+}
+
+static void
+garrow_integer_data_type_class_init(GArrowIntegerDataTypeClass *klass)
+{
+}
+
+
 G_DEFINE_TYPE(GArrowInt8DataType,\
   garrow_int8_data_type, \
-  GARROW_TYPE_DATA_TYPE)
+  GARROW_TYPE_INTEGER_DATA_TYPE)
 
 static void
 garrow_int8_data_type_init(GArrowInt8DataType *object)
@@ -349,7 +379,7 @@ garrow_int8_data_type_new(void)
 
 G_DEFINE_TYPE(GArrowUInt8DataType,\
   garrow_uint8_data_type, \
-  GARROW_TYPE_DATA_TYPE)
+  GARROW_TYPE_INTEGER_DATA_TYPE)
 
 static void
 garrow_uint8_data_type_init(GArrowUInt8DataType *object)
@@ -381,7 +411,7 @@ garrow_uint8_data_type_new(void)
 
 G_DEFINE_TYPE(GArrowInt16DataType,\
   garrow_int16_data_type, \
-  GARROW_TYPE_DATA_TYPE)
+  GARROW_TYPE_INTEGER_DATA_TYPE)
 
 static void
 garrow_int16_data_type_init(GArrowInt16DataType *object)
@@ -413,7 +443,7 @@ garrow_int16_data_type_new(void)
 
 G_DEFINE_TYPE(GArrowUInt16DataType,\
   garrow_uint16_data_type, \
-  GARROW_TYPE_DATA_TYPE)
+  GARROW_TYPE_INTEGER_DATA_TYPE)
 
 static void
 garrow_uint16_data_type_init(GArrowUInt16DataType *object)
@@ -445,7 +475,7 @@ garrow_uint16_data_type_new(void)
 
 G_DEFINE_TYPE(GArrowInt32DataType,\
   garrow_int32_data_type, \
-  GARROW_TYPE_DATA_TYPE)
+  GARROW_TYPE_INTEGER_DATA_TYPE)
 
 static void
 garrow_int32_data_type_init(GArrowInt32DataType *object)
@@ -477,7 +507,7 @@ garrow_int32_data_type_new(void)
 
 G_DEFINE_TYPE(GArrowUInt32DataType,\
   garrow_uint32_data_type, \
-  GARROW_TYPE_DATA_TYPE)
+  GARROW_TYPE_INTEGER_DATA_TYPE)
 
 static void
 garrow_uint32_data_type_init(GArrowUInt32DataType *object)
@@ -509,7 +539,7 @@ garrow_uint32_data_type_new(void)
 
 G_DEFINE_TYPE(GArrowInt64DataType,\
   garrow_int64_data_type, \
-  GARROW_TYPE_DATA_TYPE)
+  GARROW_TYPE_INTEGER_DATA_TYPE)
 
 static void
 garrow_int64_data_type_init(GArrowInt64DataType *object)
@@ -541,7 +571,7 @@ garrow_int64_data_type_new(void)
 
 G_DEFINE_TYPE(GArrowUInt64DataType,\
   garrow_uint64_data_type, \
-  GARROW_TYPE_DATA_TYPE)
+  GARROW_TYPE_INTEGER_DATA_TYPE)
 
 static void
 garrow_uint64_data_type_init(GArrowUInt64DataType *object)
@@ -571,9 +601,24 @@ garrow_uint64_data_type_new(void)
 }
 
 
+G_DEFINE_ABSTRACT_TYPE(GArrowFloatingPointDataType,\
+   garrow_floating_point_data_type,\
+   GARROW_TYPE_NUMERIC_DATA_TYPE)
+
+static void
+garrow_floating_point_data_type_init(GArrowFloatingPointDataType *object)
+{
+}
+
+static void
+garrow_floating_point_data_type_class_init(GArrowFloatingPointDataTypeClass 
*klass)
+{
+}
+
+
 G_DEFINE_TYPE(GArrowFloatDataType,\
   garrow_float_data_type, \
-  GARROW_TYPE_DATA_TYPE)
+  GARROW_TYPE_FLOATING_POINT_DATA_TYPE)
 
 static void
 garrow_float_data_type_init(GArrowFloatDataType *object)
@@ -605,7 +650,7 @@ garrow_float_data_type_new(void)
 
 G

[jira] [Created] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer

2018-03-08 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2292:
---

 Summary: [Python] More consistent / intuitive name for 
pyarrow.frombuffer
 Key: ARROW-2292
 URL: https://issues.apache.org/jira/browse/ARROW-2292
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could 
call {{from_buffer}} something like {{py_buffer}} instead?
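
For context, a small sketch of the current pairing (the rename itself is only a proposal at this point):

{code}
import pyarrow as pa

# frombuffer() wraps an object that supports the Python buffer protocol, while
# foreign_buffer() wraps a raw address/size owned elsewhere; renaming the former
# to something like py_buffer() would make that pairing clearer.
buf = pa.frombuffer(b'some bytes')
print(buf.size)          # 10
print(buf.to_pybytes())  # b'some bytes'
{code}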



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2292:

Description: Now that we have {{pyarrow.foreign_buffer}}, things are a bit 
odd. We could call {{frombuffer}} something like {{py_buffer}} instead?  (was: 
Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could 
call {{from_buffer}} something like {{py_buffer}} instead?)

> [Python] More consistent / intuitive name for pyarrow.frombuffer
> 
>
> Key: ARROW-2292
> URL: https://issues.apache.org/jira/browse/ARROW-2292
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could 
> call {{frombuffer}} something like {{py_buffer}} instead?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer

2018-03-08 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392389#comment-16392389
 ] 

Wes McKinney commented on ARROW-2292:
-

cc [~pitrou]

> [Python] More consistent / intuitive name for pyarrow.frombuffer
> 
>
> Key: ARROW-2292
> URL: https://issues.apache.org/jira/browse/ARROW-2292
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could 
> call {{frombuffer}} something like {{py_buffer}} instead?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2270) [Python] ForeignBuffer doesn't tie Python object lifetime to C++ buffer lifetime

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392395#comment-16392395
 ] 

ASF GitHub Bot commented on ARROW-2270:
---

wesm closed pull request #1714: ARROW-2270: [Python] Fix lifetime of 
ForeignBuffer base object
URL: https://github.com/apache/arrow/pull/1714
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/io.cc b/cpp/src/arrow/python/io.cc
index 801a32574..36c193dbf 100644
--- a/cpp/src/arrow/python/io.cc
+++ b/cpp/src/arrow/python/io.cc
@@ -216,5 +216,19 @@ Status PyOutputStream::Write(const void* data, int64_t 
nbytes) {
   return file_->Write(data, nbytes);
 }
 
+// --
+// Foreign buffer
+
+Status PyForeignBuffer::Make(const uint8_t* data, int64_t size, PyObject* base,
+ std::shared_ptr<Buffer>* out) {
+  PyForeignBuffer* buf = new PyForeignBuffer(data, size, base);
+  if (buf == NULL) {
+return Status::OutOfMemory("could not allocate foreign buffer object");
+  } else {
+*out = std::shared_ptr<Buffer>(buf);
+return Status::OK();
+  }
+}
+
 }  // namespace py
 }  // namespace arrow
diff --git a/cpp/src/arrow/python/io.h b/cpp/src/arrow/python/io.h
index 696055610..5c76fe9fe 100644
--- a/cpp/src/arrow/python/io.h
+++ b/cpp/src/arrow/python/io.h
@@ -81,6 +81,27 @@ class ARROW_EXPORT PyOutputStream : public io::OutputStream {
 
 // TODO(wesm): seekable output files
 
+// A Buffer subclass that keeps a PyObject reference throughout its
+// lifetime, such that the Python object is kept alive as long as the
+// C++ buffer is still needed.
+// Keeping the reference in a Python wrapper would be incorrect as
+// the Python wrapper can get destroyed even though the wrapped C++
+// buffer is still alive (ARROW-2270).
+class ARROW_EXPORT PyForeignBuffer : public Buffer {
+ public:
+  static Status Make(const uint8_t* data, int64_t size, PyObject* base,
+ std::shared_ptr<Buffer>* out);
+
+ private:
+  PyForeignBuffer(const uint8_t* data, int64_t size, PyObject* base)
+  : Buffer(data, size) {
+Py_INCREF(base);
+base_.reset(base);
+  }
+
+  OwnedRefNoGIL base_;
+};
+
 }  // namespace py
 }  // namespace arrow
 
diff --git a/python/doc/source/api.rst b/python/doc/source/api.rst
index a71e92b0b..3db1a04b6 100644
--- a/python/doc/source/api.rst
+++ b/python/doc/source/api.rst
@@ -186,6 +186,7 @@ Tables and Record Batches
 
column
chunked_array
+   concat_tables
ChunkedArray
Column
RecordBatch
@@ -213,6 +214,7 @@ Input / Output and Shared Memory
compress
decompress
frombuffer
+   foreign_buffer
Buffer
ResizableBuffer
BufferReader
diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py
index 28ac98ea0..225dfd0b2 100644
--- a/python/pyarrow/__init__.py
+++ b/python/pyarrow/__init__.py
@@ -86,7 +86,7 @@ def parse_version(root):
 from pyarrow.lib import TimestampType
 
 # Buffers, allocation
-from pyarrow.lib import (Buffer, ForeignBuffer, ResizableBuffer, compress,
+from pyarrow.lib import (Buffer, ResizableBuffer, foreign_buffer, compress,
  decompress, allocate_buffer, frombuffer)
 
 from pyarrow.lib import (MemoryPool, total_allocated_bytes,
diff --git a/python/pyarrow/includes/libarrow.pxd 
b/python/pyarrow/includes/libarrow.pxd
index 456fcca36..22c39a865 100644
--- a/python/pyarrow/includes/libarrow.pxd
+++ b/python/pyarrow/includes/libarrow.pxd
@@ -904,6 +904,11 @@ cdef extern from "arrow/python/api.h" namespace 
"arrow::py" nogil:
 @staticmethod
 CStatus FromPyObject(object obj, shared_ptr[CBuffer]* out)
 
+cdef cppclass PyForeignBuffer(CBuffer):
+@staticmethod
+CStatus Make(const uint8_t* data, int64_t size, object base,
+ shared_ptr[CBuffer]* out)
+
 cdef cppclass PyReadableFile(RandomAccessFile):
 PyReadableFile(object fo)
 
diff --git a/python/pyarrow/io.pxi b/python/pyarrow/io.pxi
index 611c8a86d..15ecd0164 100644
--- a/python/pyarrow/io.pxi
+++ b/python/pyarrow/io.pxi
@@ -726,18 +726,6 @@ cdef class Buffer:
 return self.size
 
 
-cdef class ForeignBuffer(Buffer):
-
-def __init__(self, addr, size, base):
-cdef:
-intptr_t c_addr = addr
-int64_t c_size = size
-self.base = base
-cdef shared_ptr[CBuffer] buffer = make_shared[CBuffer](
-<const uint8_t*> c_addr, c_size)
-self.init(<shared_ptr[CBuffer]> buffer)
-
-
 cdef class ResizableBuffer(Buffer):
 
 cdef void init_rz(self, const shared_ptr[CResizableBuffer]& buffer):
@@ -861,6 +849,21 @@ def frombuffer(object obj):
 return pyarrow_wrap_buffer(buf)
 
 
+def for

[jira] [Resolved] (ARROW-2270) [Python] ForeignBuffer doesn't tie Python object lifetime to C++ buffer lifetime

2018-03-08 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2270.
-
   Resolution: Fixed
Fix Version/s: 0.9.0

Issue resolved by pull request 1714
[https://github.com/apache/arrow/pull/1714]

> [Python] ForeignBuffer doesn't tie Python object lifetime to C++ buffer 
> lifetime
> 
>
> Key: ARROW-2270
> URL: https://issues.apache.org/jira/browse/ARROW-2270
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{ForeignBuffer}} keeps the reference to the Python base object in the Python 
> wrapper class, not in the C++ buffer instance, meaning if the C++ buffer gets 
> passed around but the Python wrapper gets destroyed, the reference to the 
> original Python base object will be released.
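
A minimal usage sketch of the behaviour the fix guarantees, assuming the foreign_buffer(address, size, base) signature added in pull request 1714:

{code}
import numpy as np
import pyarrow as pa

data = np.arange(16, dtype=np.uint8)
buf = pa.foreign_buffer(data.ctypes.data, data.nbytes, base=data)

# With the fix, the C++ buffer itself keeps a reference to `data`, so the
# memory stays valid even when no other Python reference to `data` remains.
del data
print(buf.to_pybytes()[:4])  # b'\x00\x01\x02\x03'
{code}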



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2270) [Python] ForeignBuffer doesn't tie Python object lifetime to C++ buffer lifetime

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392393#comment-16392393
 ] 

ASF GitHub Bot commented on ARROW-2270:
---

wesm commented on issue #1714: ARROW-2270: [Python] Fix lifetime of 
ForeignBuffer base object
URL: https://github.com/apache/arrow/pull/1714#issuecomment-371710696
 
 
   I added this new function to the API documentation. Merging


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] ForeignBuffer doesn't tie Python object lifetime to C++ buffer 
> lifetime
> 
>
> Key: ARROW-2270
> URL: https://issues.apache.org/jira/browse/ARROW-2270
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{ForeignBuffer}} keeps the reference to the Python base object in the Python 
> wrapper class, not in the C++ buffer instance, meaning if the C++ buffer gets 
> passed around but the Python wrapper gets destroyed, the reference to the 
> original Python base object will be released.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1535) [Python] Enable sdist source tarballs to build assuming that Arrow C++ libraries are available on the host system

2018-03-08 Thread Kouhei Sutou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392403#comment-16392403
 ] 

Kouhei Sutou commented on ARROW-1535:
-

I've confirmed that this works well on master:

{code}
% python3 setup.py sdist
% pip3 install dist/pyarrow-*.tar.gz
% python3 -c 'import pyarrow'
{code}

> [Python] Enable sdist source tarballs to build assuming that Arrow C++ 
> libraries are available on the host system
> -
>
> Key: ARROW-1535
> URL: https://issues.apache.org/jira/browse/ARROW-1535
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: Build, pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)