[jira] [Created] (ARROW-2301) [Python] Add source distribution publishing instructions to package / release management documentation

2018-03-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2301:
---

 Summary: [Python] Add source distribution publishing instructions 
to package / release management documentation
 Key: ARROW-2301
 URL: https://issues.apache.org/jira/browse/ARROW-2301
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


We wish to start publishing source tarballs for Python on PyPI



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396542#comment-16396542
 ] 

Wes McKinney commented on ARROW-2227:
-

This seems to be an off-by-one error. In builder.cc we are comparing with 
{{INT32_MAX - 1}}; in python/numpy_to_arrow.cc we are comparing with 
{{INT32_MAX}}. I made this a blocker for 0.9.0 as I think we can fix it by 
changing the bound in numpy_to_arrow.cc. It's getting late here tonight, so I 
will try to fix it tomorrow morning before we cut an RC -- the test case here 
is pretty swappy; we should be able to construct a better test case with some 
very large strings followed by some length-1 strings to hit the edge case at 
INT32_MAX.
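
For illustration, a sketch of the kind of test case described above (hedged; 
the exact chunk boundaries depend on the final fix):

{code:python}
import pandas as pd
import pyarrow as pa

# Two ~1 GiB strings plus a few 1-byte strings: the running byte offset
# crosses INT32_MAX near the end of the column without needing much more
# than 2 GiB of data.
big = 'x' * (2**30)
df = pd.DataFrame({'strings': [big, big] + ['y'] * 10})
table = pa.Table.from_pandas(df)   # should chunk instead of raising
{code}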

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Blocker
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArray cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.
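
After the fix, the column should come back chunked; a hedged sketch of the 
expected behavior (0.8/0.9-era attribute names assumed):

{code:python}
t = pa.Table.from_pandas(y)
col = t.column('x')
# The >2 GiB of string data should be split across multiple chunks, each
# below the 2^31 - 1 byte limit of a single BinaryArray.
assert col.data.num_chunks >= 2
{code}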



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2227:

Priority: Blocker  (was: Major)

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Blocker
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArray cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396530#comment-16396530
 ] 

ASF GitHub Bot commented on ARROW-2122:
---

wesm commented on issue #1707: ARROW-2122: [Python] Pyarrow fails to serialize 
dataframe with timestamp.
URL: https://github.com/apache/arrow/pull/1707#issuecomment-372543159
 
 
   This needs a little more scrutiny (cc @jreback @cpcloud @pitrou) before we 
commit to something for 0.9.0 that we might have to break later. I'm going to 
move this issue off 0.9.0 so we aren't blocking.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Pyarrow fails to serialize dataframe with timestamp.
> -
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Albert Shieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The bug can be reproduced as follows.
> {code:python}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:python}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> 
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  
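
A plausible direction for the fix (an assumption, not the merged patch): 
treat a missing timezone in the serialized block metadata as naive 
datetime64 data instead of constructing DatetimeTZDtype with tz=None.

{code:python}
import numpy as np
from pandas import DatetimeTZDtype

def _block_dtype(timezone):
    # Hypothetical helper: only build a tz-aware dtype when a zone is present
    if timezone is None:
        return np.dtype('datetime64[ns]')
    return DatetimeTZDtype('ns', tz=timezone)
{code}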



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2122:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Pyarrow fails to serialize dataframe with timestamp.
> -
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Albert Shieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The bug can be reproduced as follows.
> {code:python}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:python}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> 
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1643) [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect to HDFS

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1643:
---

Assignee: Ehsan Totoni

> [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect 
> to HDFS
> -
>
> Key: ARROW-1643
> URL: https://issues.apache.org/jira/browse/ARROW-1643
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Ehsan Totoni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options

2018-03-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396525#comment-16396525
 ] 

Wes McKinney commented on ARROW-2082:
-

Moved to 0.10.0. My best guess is that this bug lies in parquet-cpp. Let's try 
to fix this soon

> [Python] SegFault in pyarrow.parquet.write_table with specific options
> --
>
> Key: ARROW-2082
> URL: https://issues.apache.org/jira/browse/ARROW-2082
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: tested on MacOS High Sierra with python 3.6 and Ubuntu 
> Xenial (Python 3.5)
>Reporter: Clément Bouscasse
>Priority: Major
> Fix For: 0.10.0
>
>
> I originally filed an issue in the pandas project but we've tracked it down 
> to arrow itself, when called via pandas in specific circumstances:
> [https://github.com/pandas-dev/pandas/issues/19493]
> basically using
> {code:python}
>  df.to_parquet('filename.parquet', flavor='spark'){code}
> gives a seg fault if `df` contains a datetime column.
> Under the covers,  pandas translates this to the following call:
> {code:python}
> pq.write_table(table, 'output.parquet', flavor='spark', compression='snappy', 
> coerce_timestamps='ms')
> {code}
> which gives me an instant crash.
> There is a repro on the github ticket.
>  
>  
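
A minimal repro along the lines described (hedged; the exact frame from the 
pandas ticket is not reproduced here):

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'ts': pd.date_range('2018-01-01', periods=5, freq='s')})
table = pa.Table.from_pandas(df)
# flavor='spark' plus coerce_timestamps='ms' is the crashing combination
pq.write_table(table, 'output.parquet', flavor='spark',
               compression='snappy', coerce_timestamps='ms')
{code}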



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2082:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] SegFault in pyarrow.parquet.write_table with specific options
> --
>
> Key: ARROW-2082
> URL: https://issues.apache.org/jira/browse/ARROW-2082
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: tested on MacOS High Sierra with python 3.6 and Ubuntu 
> Xenial (Python 3.5)
>Reporter: Clément Bouscasse
>Priority: Major
> Fix For: 0.10.0
>
>
> I originally filed an issue in the pandas project but we've tracked it down 
> to arrow itself, when called via pandas in specific circumstances:
> [https://github.com/pandas-dev/pandas/issues/19493]
> basically using
> {code:python}
>  df.to_parquet('filename.parquet', flavor='spark'){code}
> gives a seg fault if `df` contains a datetime column.
> Under the covers,  pandas translates this to the following call:
> {code:python}
> pq.write_table(table, 'output.parquet', flavor='spark', compression='snappy', 
> coerce_timestamps='ms')
> {code}
> which gives me an instant crash.
> There is a repro on the github ticket.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options

2018-03-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396514#comment-16396514
 ] 

Wes McKinney commented on ARROW-2082:
-

Here's the backtrace for this:

{code}
#0  0x7fffece34769 in arrow::PoolBuffer::Reserve (this=0x139c180, 
capacity=1024) at ../src/arrow/buffer.cc:101
#1  0x7fffece34b2f in arrow::PoolBuffer::Resize (this=0x139c180, 
new_size=1024, shrink_to_fit=true) at ../src/arrow/buffer.cc:112
#2  0x7fffcb5fc506 in parquet::AllocateBuffer (pool=0x7fffed519300 
, size=1024) at ../src/parquet/util/memory.cc:501
#3  0x7fffcb5fc75e in parquet::InMemoryOutputStream::InMemoryOutputStream 
(this=0x1487090, pool=0x7fffed519300 , initial_capacity=1024) at 
../src/parquet/util/memory.cc:423
#4  0x7fffcb5335ca in 
parquet::PlainEncoder >::PlainEncoder 
(this=0x7fff9170, descr=0x1104060, pool=0x7fffed519300 )
at ../src/parquet/encoding-internal.h:188
#5  0x7fffcb5defa2 in 
parquet::TypedRowGroupStatistics 
>::PlainEncode (this=0xbbee60, src=@0xbbeec8: -729020189051312384, 
dst=0x7fff9258)
at ../src/parquet/statistics.cc:228
#6  0x7fffcb5def07 in 
parquet::TypedRowGroupStatistics 
>::EncodeMin (this=0xbbee60) at ../src/parquet/statistics.cc:204
#7  0x7fffcb5df1c3 in 
parquet::TypedRowGroupStatistics 
>::Encode (this=0xbbee60) at ../src/parquet/statistics.cc:219
#8  0x7fffcb5348f7 in 
parquet::TypedColumnWriter 
>::GetPageStatistics (this=0x81d2b0) at ../src/parquet/column_writer.cc:520
#9  0x7fffcb52ca76 in parquet::ColumnWriter::AddDataPage (this=0x81d2b0) at 
../src/parquet/column_writer.cc:386
#10 0x7fffcb52c0eb in parquet::ColumnWriter::FlushBufferedDataPages 
(this=0x81d2b0) at ../src/parquet/column_writer.cc:447
#11 0x7fffcb52ddb0 in parquet::ColumnWriter::Close (this=0x81d2b0) at 
../src/parquet/column_writer.cc:431
#12 0x7fffcb4d6657 in parquet::arrow::(anonymous 
namespace)::ArrowColumnWriter::Close (this=0x7fff9b48) at 
../src/parquet/arrow/writer.cc:347
#13 0x7fffcb4e758e in parquet::arrow::FileWriter::Impl::WriteColumnChunk 
(this=0x15adee0, data=warning: RTTI symbol not found for class 
'std::_Sp_counted_ptr_inplace'
warning: RTTI symbol not found for class 
'std::_Sp_counted_ptr_inplace'
std::shared_ptr (count 2, weak 0) 0x1717cc0, offset=0, size=5)
at ../src/parquet/arrow/writer.cc:982
#14 0x7fffcb4d507b in parquet::arrow::FileWriter::WriteColumnChunk 
(this=0x125bc30, data=warning: RTTI symbol not found for class 
'std::_Sp_counted_ptr_inplace'
warning: RTTI symbol not found for class 
'std::_Sp_counted_ptr_inplace'
std::shared_ptr (count 2, weak 0) 0x1717cc0, offset=0, size=5)
at ../src/parquet/arrow/writer.cc:1011
#15 0x7fffcb4d5ba6 in parquet::arrow::FileWriter::WriteTable 
(this=0x125bc30, table=..., chunk_size=5) at ../src/parquet/arrow/writer.cc:1086
{code}

Not sure what's going wrong yet

> [Python] SegFault in pyarrow.parquet.write_table with specific options
> --
>
> Key: ARROW-2082
> URL: https://issues.apache.org/jira/browse/ARROW-2082
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: tested on MacOS High Sierra with python 3.6 and Ubuntu 
> Xenial (Python 3.5)
>Reporter: Clément Bouscasse
>Priority: Major
> Fix For: 0.9.0
>
>
> I originally filed an issue in the pandas project but we've tracked it down 
> to arrow itself, when called via pandas in specific circumstances:
> [https://github.com/pandas-dev/pandas/issues/19493]
> basically using
> {code:python}
>  df.to_parquet('filename.parquet', flavor='spark'){code}
> gives a seg fault if `df` contains a datetime column.
> Under the covers,  pandas translates this to the following call:
> {code:python}
> pq.write_table(table, 'output.parquet', flavor='spark', compression='snappy', 
> coerce_timestamps='ms')
> {code}
> which gives me an instant crash.
> There is a repro on the github ticket.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2300) [Python] python/testing/test_hdfs.sh no longer works

2018-03-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396510#comment-16396510
 ] 

Wes McKinney commented on ARROW-2300:
-

[~cpcloud] we might want to set up a minimal HDFS image that isn't dependent on 
any Impala-related docker images

> [Python] python/testing/test_hdfs.sh no longer works
> 
>
> Key: ARROW-2300
> URL: https://issues.apache.org/jira/browse/ARROW-2300
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> Tried this on a fresh Ubuntu 16.04 install:
> {code}
> $ ./test_hdfs.sh 
> + docker build -t arrow-hdfs-test -f hdfs/Dockerfile .
> Sending build context to Docker daemon  36.86kB
> Step 1/6 : FROM cpcloud86/impala:metastore
> manifest for cpcloud86/impala:metastore not found
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2300) [Python] python/testing/test_hdfs.sh no longer works

2018-03-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2300:
---

 Summary: [Python] python/testing/test_hdfs.sh no longer works
 Key: ARROW-2300
 URL: https://issues.apache.org/jira/browse/ARROW-2300
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


Tried this on a fresh Ubuntu 16.04 install:

{code}
$ ./test_hdfs.sh 
+ docker build -t arrow-hdfs-test -f hdfs/Dockerfile .
Sending build context to Docker daemon  36.86kB
Step 1/6 : FROM cpcloud86/impala:metastore
manifest for cpcloud86/impala:metastore not found
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1643) [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect to HDFS

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396506#comment-16396506
 ] 

ASF GitHub Bot commented on ARROW-1643:
---

wesm commented on issue #1668: ARROW-1643: [Python] Accept hdfs:// prefixes in 
parquet.read_table and attempt to connect to HDFS
URL: https://github.com/apache/arrow/pull/1668#issuecomment-372538988
 
 
   +1. I fixed the Windows problem. Let me see if the HDFS docker setup still 
works while the build is running


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect 
> to HDFS
> -
>
> Key: ARROW-1643
> URL: https://issues.apache.org/jira/browse/ARROW-1643
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1425:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Li Jin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]
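
An illustration of the semantic gap (hedged example; the zone names are 
arbitrary):

{code:python}
import pandas as pd

naive = pd.Timestamp('2018-03-12 00:00:00')             # wall clock, zone unknown
as_utc = naive.tz_localize('UTC')                       # naive fields read as UTC
as_session = naive.tz_localize('America/Los_Angeles')   # Spark's session-local view
assert as_utc != as_session                             # different instants
{code}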



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2142.
-
Resolution: Fixed

Issue resolved by pull request 1635
[https://github.com/apache/arrow/pull/1635]

> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement  conversion.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396433#comment-16396433
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

wesm closed pull request #1635: ARROW-2142: [Python] Allow conversion from 
Numpy struct array
URL: https://github.com/apache/arrow/pull/1635
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc
index bda1946c6..ad2335fa1 100644
--- a/cpp/src/arrow/array-test.cc
+++ b/cpp/src/arrow/array-test.cc
@@ -3256,4 +3256,68 @@ TEST_P(DecimalTest, WithNulls) {
 
 INSTANTIATE_TEST_CASE_P(DecimalTest, DecimalTest, ::testing::Range(1, 38));
 
+// --
+// Test rechunking
+
+TEST(TestRechunkArraysConsistently, Trivial) {
+  std::vector<ArrayVector> groups, rechunked;
+  rechunked = internal::RechunkArraysConsistently(groups);
+  ASSERT_EQ(rechunked.size(), 0);
+
+  std::shared_ptr<Array> a1, a2, b1;
+  ArrayFromVector<Int32Type, int32_t>({}, &a1);
+  ArrayFromVector<Int32Type, int32_t>({}, &a2);
+  ArrayFromVector<Int32Type, int32_t>({}, &b1);
+
+  groups = {{a1, a2}, {}, {b1}};
+  rechunked = internal::RechunkArraysConsistently(groups);
+  ASSERT_EQ(rechunked.size(), 3);
+}
+
+TEST(TestRechunkArraysConsistently, Plain) {
+  std::shared_ptr<Array> expected;
+  std::shared_ptr<Array> a1, a2, a3, b1, b2, b3, b4;
+  ArrayFromVector<Int32Type, int32_t>({1, 2, 3}, &a1);
+  ArrayFromVector<Int32Type, int32_t>({4, 5}, &a2);
+  ArrayFromVector<Int32Type, int32_t>({6, 7, 8, 9}, &a3);
+
+  ArrayFromVector<Int32Type, int32_t>({41, 42}, &b1);
+  ArrayFromVector<Int32Type, int32_t>({43, 44, 45}, &b2);
+  ArrayFromVector<Int32Type, int32_t>({46, 47}, &b3);
+  ArrayFromVector<Int32Type, int32_t>({48, 49}, &b4);
+
+  ArrayVector a{a1, a2, a3};
+  ArrayVector b{b1, b2, b3, b4};
+
+  std::vector<ArrayVector> groups{a, b}, rechunked;
+  rechunked = internal::RechunkArraysConsistently(groups);
+  ASSERT_EQ(rechunked.size(), 2);
+  auto ra = rechunked[0];
+  auto rb = rechunked[1];
+
+  ASSERT_EQ(ra.size(), 5);
+  ArrayFromVector<Int32Type, int32_t>({1, 2}, &expected);
+  ASSERT_ARRAYS_EQUAL(*ra[0], *expected);
+  ArrayFromVector<Int32Type, int32_t>({3}, &expected);
+  ASSERT_ARRAYS_EQUAL(*ra[1], *expected);
+  ArrayFromVector<Int32Type, int32_t>({4, 5}, &expected);
+  ASSERT_ARRAYS_EQUAL(*ra[2], *expected);
+  ArrayFromVector<Int32Type, int32_t>({6, 7}, &expected);
+  ASSERT_ARRAYS_EQUAL(*ra[3], *expected);
+  ArrayFromVector<Int32Type, int32_t>({8, 9}, &expected);
+  ASSERT_ARRAYS_EQUAL(*ra[4], *expected);
+
+  ASSERT_EQ(rb.size(), 5);
+  ArrayFromVector<Int32Type, int32_t>({41, 42}, &expected);
+  ASSERT_ARRAYS_EQUAL(*rb[0], *expected);
+  ArrayFromVector<Int32Type, int32_t>({43}, &expected);
+  ASSERT_ARRAYS_EQUAL(*rb[1], *expected);
+  ArrayFromVector<Int32Type, int32_t>({44, 45}, &expected);
+  ASSERT_ARRAYS_EQUAL(*rb[2], *expected);
+  ArrayFromVector<Int32Type, int32_t>({46, 47}, &expected);
+  ASSERT_ARRAYS_EQUAL(*rb[3], *expected);
+  ArrayFromVector<Int32Type, int32_t>({48, 49}, &expected);
+  ASSERT_ARRAYS_EQUAL(*rb[4], *expected);
+}
+
 }  // namespace arrow
diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc
index 83142dfef..bd2b40c1a 100644
--- a/cpp/src/arrow/array.cc
+++ b/cpp/src/arrow/array.cc
@@ -20,6 +20,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 
@@ -752,6 +754,85 @@ std::shared_ptr<Array> MakeArray(const 
std::shared_ptr<ArrayData>& data) {
   return out;
 }
 
+// --
+// Misc APIs
+
+namespace internal {
+
+std::vector<ArrayVector> RechunkArraysConsistently(
+    const std::vector<ArrayVector>& groups) {
+  if (groups.size() <= 1) {
+    return groups;
+  }
+  int64_t total_length = 0;
+  for (const auto& array : groups.front()) {
+    total_length += array->length();
+  }
+#ifndef NDEBUG
+  for (const auto& group : groups) {
+    int64_t group_length = 0;
+    for (const auto& array : group) {
+      group_length += array->length();
+    }
+    DCHECK_EQ(group_length, total_length)
+        << "Array groups should have the same total number of elements";
+  }
+#endif
+  if (total_length == 0) {
+    return groups;
+  }
+
+  // Set up result vectors
+  std::vector<ArrayVector> rechunked_groups(groups.size());
+
+  // Set up progress counters
+  std::vector<ArrayVector::const_iterator> current_arrays;
+  std::vector<int64_t> array_offsets;
+  for (const auto& group : groups) {
+    current_arrays.emplace_back(group.cbegin());
+    array_offsets.emplace_back(0);
+  }
+
+  // Scan all array vectors at once, rechunking along the way
+  int64_t start = 0;
+  while (start < total_length) {
+    // First compute max possible length for next chunk
+    int64_t chunk_length = std::numeric_limits<int64_t>::max();
+    for (size_t i = 0; i < groups.size(); i++) {
+      auto& arr_it = 
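
The digest truncates the diff mid-function. For clarity, a Python sketch of 
the rechunking idea (plain lists stand in for zero-copy array slices):

{code:python}
def rechunk_consistently(groups):
    # Cut every group at the union of all chunk boundaries, so that
    # chunk i has the same length in every group.
    cuts = sorted({sum(len(a) for a in g[:i + 1])
                   for g in groups for i in range(len(g))})
    out = []
    for g in groups:
        flat = [x for a in g for x in a]
        out.append([flat[lo:hi] for lo, hi in zip([0] + cuts, cuts)])
    return out

print(rechunk_consistently([[[1, 2, 3], [4, 5]], [[41, 42], [43, 44, 45]]]))
# [[[1, 2], [3], [4, 5]], [[41, 42], [43], [44, 45]]]
{code}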

[jira] [Commented] (ARROW-2299) [Go] Go language implementation

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396429#comment-16396429
 ] 

ASF GitHub Bot commented on ARROW-2299:
---

wesm commented on issue #1739: DONOTMERGE ARROW-2299: [Go] Import Go arrow 
implementation from influxdata/arrow
URL: https://github.com/apache/arrow/pull/1739#issuecomment-372526683
 
 
   Thanks @stuartcarnie! Steps from here:
   
   - [x] PMC vote to accept code donation
   - [ ] Receive software grant from InfluxData, Inc.
   - [ ] IP Clearance vote on Incubator general mailing list
   
   See http://incubator.apache.org/ip-clearance/arrow-go-library.html for 
status on the IP Clearance process


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Go] Go language implementation
> ---
>
> Key: ARROW-2299
> URL: https://issues.apache.org/jira/browse/ARROW-2299
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Go
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2299) [Go] Go language implementation

2018-03-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2299:
--
Labels: pull-request-available  (was: )

> [Go] Go language implementation
> ---
>
> Key: ARROW-2299
> URL: https://issues.apache.org/jira/browse/ARROW-2299
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Go
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2299) [Go] Go language implementation

2018-03-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2299:
---

 Summary: [Go] Go language implementation
 Key: ARROW-2299
 URL: https://issues.apache.org/jira/browse/ARROW-2299
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Go
Reporter: Wes McKinney
 Fix For: 0.10.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2141) [Python] Conversion from Numpy object array to varsize binary unimplemented

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396203#comment-16396203
 ] 

ASF GitHub Bot commented on ARROW-2141:
---

BryanCutler commented on issue #1689: ARROW-2141: [Python] Support variable 
length binary conversion from Pandas
URL: https://github.com/apache/arrow/pull/1689#issuecomment-372487846
 
 
   No problem to hold off on this for a while; it seems there are some 
issues that @pitrou pointed out that need a deeper look.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy object array to varsize binary unimplemented
> ---
>
> Key: ARROW-2141
> URL: https://issues.apache.org/jira/browse/ARROW-2141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> {code:python}
> >>> arr = np.array([b'xx'], dtype=np.object)
> >>> pa.array(arr, type=pa.binary(2))
> 
> [
>   b'xx'
> ]
> >>> pa.array(arr, type=pa.binary())
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.binary())
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1098 code: 
> compute::Cast(, *arr, type_, options, )
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1022 code: 
> Cast(ctx, Datum(array.data()), out_type, options, _out)
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1009 code: 
> GetCastFunction(*value.type(), out_type, options, )
> No cast implemented from binary to binary
> {code}
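
The desired behavior, sketched (hedged):

{code:python}
import numpy as np
import pyarrow as pa

arr = np.array([b'x', b'yy', b'zzz'], dtype=object)
# Fixed-width binary already converts; after ARROW-2141 the variable-width
# case should produce a BinaryArray rather than raising NotImplemented.
pa.array(arr, type=pa.binary())
{code}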



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395741#comment-16395741
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

wesm commented on issue #1635: ARROW-2142: [Python] Allow conversion from Numpy 
struct array
URL: https://github.com/apache/arrow/pull/1635#issuecomment-372429670
 
 
   Having a last look at this


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement  conversion.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395732#comment-16395732
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

wesm commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-372427204
 
 
   see ARROW-2298 for adding an option about NaN conversions


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] NaN values silently casted to int64 when passing explicit schema for 
> conversion in Table.from_pandas
> -
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value, the 
> NaN is improperly cast. Since pandas casts these to floats, when converted to 
> a table the NaN is reinterpreted as an integer. This seems like a bug, since a 
> known limitation in pandas (the inability to have null-valued integer data) 
> is taking precedence over arrow's ability to store these as an IntArray 
> with nulls.
>  
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1,
>   2,
>   -9223372036854775808
> ]{code}
>  
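
The behavior expected once fixed, sketched (hedged; 0.9-era attribute names 
assumed):

{code:python}
table = pa.Table.from_pandas(df, schema=schema)
chunk = table[0].data.chunk(0)
assert chunk.null_count == 1   # the NaN becomes a null, not a garbage int64
{code}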



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2298) [Python] Add option to not consider NaN to be null when converting to an integer Arrow type

2018-03-12 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2298:
---

 Summary: [Python] Add option to not consider NaN to be null when 
converting to an integer Arrow type
 Key: ARROW-2298
 URL: https://issues.apache.org/jira/browse/ARROW-2298
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


Follow-on work to ARROW-2135



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2135.
-
Resolution: Fixed

Issue resolved by pull request 1681
[https://github.com/apache/arrow/pull/1681]

> [Python] NaN values silently casted to int64 when passing explicit schema for 
> conversion in Table.from_pandas
> -
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value, the 
> NaN is improperly cast. Since pandas casts these to floats, when converted to 
> a table the NaN is reinterpreted as an integer. This seems like a bug, since a 
> known limitation in pandas (the inability to have null-valued integer data) 
> is taking precedence over arrow's ability to store these as an IntArray 
> with nulls.
>  
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1,
>   2,
>   -9223372036854775808
> ]{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395731#comment-16395731
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

wesm closed pull request #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/numpy-internal.h 
b/cpp/src/arrow/python/numpy-internal.h
index 8d4308065..7672861d4 100644
--- a/cpp/src/arrow/python/numpy-internal.h
+++ b/cpp/src/arrow/python/numpy-internal.h
@@ -68,6 +68,9 @@ class Ndarray1DIndexer {
   int64_t stride_;
 };
 
+// Handling of Numpy Types by their static numbers
+// (the NPY_TYPES enum and related defines)
+
 static inline std::string GetNumPyTypeName(int npy_type) {
 #define TYPE_CASE(TYPE, NAME) \
   case NPY_##TYPE:\
@@ -79,14 +82,20 @@ static inline std::string GetNumPyTypeName(int npy_type) {
 TYPE_CASE(INT16, "int16")
 TYPE_CASE(INT32, "int32")
 TYPE_CASE(INT64, "int64")
-#if (NPY_INT64 != NPY_LONGLONG)
+#if !NPY_INT32_IS_INT
+TYPE_CASE(INT, "intc")
+#endif
+#if !NPY_INT64_IS_LONG_LONG
 TYPE_CASE(LONGLONG, "longlong")
 #endif
 TYPE_CASE(UINT8, "uint8")
 TYPE_CASE(UINT16, "uint16")
 TYPE_CASE(UINT32, "uint32")
 TYPE_CASE(UINT64, "uint64")
-#if (NPY_UINT64 != NPY_ULONGLONG)
+#if !NPY_INT32_IS_INT
+TYPE_CASE(UINT, "uintc")
+#endif
+#if !NPY_INT64_IS_LONG_LONG
 TYPE_CASE(ULONGLONG, "ulonglong")
 #endif
 TYPE_CASE(FLOAT16, "float16")
@@ -100,9 +109,48 @@ static inline std::string GetNumPyTypeName(int npy_type) {
   }
 
 #undef TYPE_CASE
-  return "unrecognized type in GetNumPyTypeName";
+  std::stringstream ss;
+  ss << "unrecognized type (" << npy_type << ") in GetNumPyTypeName";
+  return ss.str();
 }
 
+#define TYPE_VISIT_INLINE(TYPE) \
+  case NPY_##TYPE:              \
+    return visitor->template Visit<NPY_##TYPE>(arr);
+
+template <typename VISITOR>
+inline Status VisitNumpyArrayInline(PyArrayObject* arr, VISITOR* visitor) {
+  switch (PyArray_TYPE(arr)) {
+TYPE_VISIT_INLINE(BOOL);
+TYPE_VISIT_INLINE(INT8);
+TYPE_VISIT_INLINE(UINT8);
+TYPE_VISIT_INLINE(INT16);
+TYPE_VISIT_INLINE(UINT16);
+TYPE_VISIT_INLINE(INT32);
+TYPE_VISIT_INLINE(UINT32);
+TYPE_VISIT_INLINE(INT64);
+TYPE_VISIT_INLINE(UINT64);
+#if !NPY_INT32_IS_INT
+TYPE_VISIT_INLINE(INT);
+TYPE_VISIT_INLINE(UINT);
+#endif
+#if !NPY_INT64_IS_LONG_LONG
+TYPE_VISIT_INLINE(LONGLONG);
+TYPE_VISIT_INLINE(ULONGLONG);
+#endif
+TYPE_VISIT_INLINE(FLOAT16);
+TYPE_VISIT_INLINE(FLOAT32);
+TYPE_VISIT_INLINE(FLOAT64);
+TYPE_VISIT_INLINE(DATETIME);
+TYPE_VISIT_INLINE(OBJECT);
+  }
+  std::stringstream ss;
+  ss << "NumPy type not implemented: " << GetNumPyTypeName(PyArray_TYPE(arr));
+  return Status::NotImplemented(ss.str());
+}
+
+#undef TYPE_VISIT_INLINE
+
 }  // namespace py
 }  // namespace arrow
 
diff --git a/cpp/src/arrow/python/numpy_interop.h 
b/cpp/src/arrow/python/numpy_interop.h
index 8c569e232..0715c66c5 100644
--- a/cpp/src/arrow/python/numpy_interop.h
+++ b/cpp/src/arrow/python/numpy_interop.h
@@ -43,6 +43,31 @@
 #include 
 #include 
 
+// A bit subtle. Numpy has 5 canonical integer types:
+// (or, rather, type pairs: signed and unsigned)
+//   NPY_BYTE, NPY_SHORT, NPY_INT, NPY_LONG, NPY_LONGLONG
+// It also has 4 fixed-width integer aliases.
+// When mapping Arrow integer types to these 4 fixed-width aliases,
+// we always miss one of the canonical types (even though it may
+// have the same width as one of the aliases).
+// Which one depends on the platform...
+// On a LP64 system, NPY_INT64 maps to NPY_LONG and
+// NPY_LONGLONG needs to be handled separately.
+// On a LLP64 system, NPY_INT32 maps to NPY_LONG and
+// NPY_INT needs to be handled separately.
+
+#if NPY_BITSOF_LONG == 32 && NPY_BITSOF_LONGLONG == 64
+#define NPY_INT64_IS_LONG_LONG 1
+#else
+#define NPY_INT64_IS_LONG_LONG 0
+#endif
+
+#if NPY_BITSOF_INT == 32 && NPY_BITSOF_LONG == 64
+#define NPY_INT32_IS_INT 1
+#else
+#define NPY_INT32_IS_INT 0
+#endif
+
 namespace arrow {
 namespace py {
 
diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc 
b/cpp/src/arrow/python/numpy_to_arrow.cc
index 04a71c1f6..6ddc4a7be 100644
--- a/cpp/src/arrow/python/numpy_to_arrow.cc
+++ b/cpp/src/arrow/python/numpy_to_arrow.cc
@@ -84,6 +84,38 @@ inline bool PyObject_is_integer(PyObject* obj) {
   return !PyBool_Check(obj) && PyArray_IsIntegerScalar(obj);
 }
 
+Status CheckFlatNumpyArray(PyArrayObject* numpy_array, int np_type) {
+  if (PyArray_NDIM(numpy_array) != 1) {
+return Status::Invalid("only handle 1-dimensional arrays");
+  }
+
+  const int 

[jira] [Updated] (ARROW-1974) [Python] Segfault when writing Arrow table with duplicate columns

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1974:

Summary: [Python] Segfault when writing Arrow table with duplicate columns  
(was: [Python] Segfault when working with Arrow tables with duplicate columns)

> [Python] Segfault when writing Arrow table with duplicate columns
> -
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1674) [Format] Add bit width metadata to Bool logical type

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395697#comment-16395697
 ] 

ASF GitHub Bot commented on ARROW-1674:
---

wesm commented on issue #1201: ARROW-1674: [Format, C++] Add support for byte 
length booleans in Tensors
URL: https://github.com/apache/arrow/pull/1201#issuecomment-372421411
 
 
   I'm closing this PR until we have a chance to address the underlying issue 
(distinguishing byte-size boolean vs uint8) in more detail


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Format] Add bit width metadata to Bool logical type
> 
>
> Key: ARROW-1674
> URL: https://issues.apache.org/jira/browse/ARROW-1674
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>
> Some libraries represent boolean data as a single byte per value as a vector 
> of int8/uint8 1's and 0's. It would be useful to be able to retain this 
> metadata as an optional field on the {{Bool}} table in {{Schema.fbs}}
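
The two representations at issue, for illustration (hedged):

{code:python}
import numpy as np
import pyarrow as pa

byte_bools = np.array([1, 0, 1], dtype=np.uint8)  # one byte per value
bit_bools = pa.array([True, False, True])         # Arrow packs 8 values per byte
# The proposal: record the byte-per-value layout on the Bool metadata so
# such data can round-trip without repacking.
{code}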



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1674) [Format] Add bit width metadata to Bool logical type

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395698#comment-16395698
 ] 

ASF GitHub Bot commented on ARROW-1674:
---

wesm closed pull request #1201: ARROW-1674: [Format, C++] Add support for byte 
length booleans in Tensors
URL: https://github.com/apache/arrow/pull/1201
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc
index 2ec86c369..552e1d7ca 100644
--- a/cpp/src/arrow/compare.cc
+++ b/cpp/src/arrow/compare.cc
@@ -740,19 +740,27 @@ bool TensorEquals(const Tensor& left, const Tensor& 
right) {
 are_equal = false;
   } else {
const auto& type = static_cast<const FixedWidthType&>(*left.type());
+// Type::BOOL strided tensors are currently not supported
+DCHECK_GT(type.bit_width() / CHAR_BIT, 0);
 are_equal =
-StridedTensorContentEquals(0, 0, 0, type.bit_width() / 8, left, 
right);
+StridedTensorContentEquals(0, 0, 0, type.bit_width() / CHAR_BIT, 
left, right);
   }
 } else {
   const auto& size_meta = dynamic_cast<const FixedWidthType&>(*left.type());
-  const int byte_width = size_meta.bit_width() / CHAR_BIT;
-  DCHECK_GT(byte_width, 0);
 
   const uint8_t* left_data = left.data()->data();
   const uint8_t* right_data = right.data()->data();
 
-  are_equal = memcmp(left_data, right_data,
- static_cast<size_t>(byte_width * left.size())) == 0;
+  if (size_meta.bit_width() == 1) {
+int64_t bytes = (left.size() + CHAR_BIT - 1) / CHAR_BIT;
+are_equal = memcmp(left_data, right_data,
+   static_cast<size_t>(bytes)) == 0;
+  } else {
+const int byte_width = size_meta.bit_width() / CHAR_BIT;
+DCHECK_GT(byte_width, 0);
+are_equal = memcmp(left_data, right_data,
+   static_cast<size_t>(byte_width * left.size())) == 0;
+  }
 }
   }
   return are_equal;
diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc 
b/cpp/src/arrow/ipc/ipc-read-write-test.cc
index adf34a9eb..fbbcf3dd4 100644
--- a/cpp/src/arrow/ipc/ipc-read-write-test.cc
+++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc
@@ -728,14 +728,25 @@ TEST_F(TestTensorRoundTrip, BasicRoundtrip) {
 
   std::vector<int64_t> values;
   test::randint(size, 0, 100, &values);
+  std::vector<bool> bool_values;
+  test::randbool(size, &bool_values);
+  std::vector<uint8_t> bool8_values;
+  test::randint(size, 0, 1, &bool8_values);
 
   auto data = test::GetBufferFromVector(values);
+  std::shared_ptr<Buffer> bool_data;
+  ASSERT_OK(test::GetBitmapFromVector(bool_values, &bool_data));
+  auto bool8_data = test::GetBufferFromVector(bool8_values);
 
   Tensor t0(int64(), data, shape, strides, dim_names);
   Tensor tzero(int64(), data, {}, {}, {});
+  Tensor tbool(boolean(), bool_data, {}, {}, {});
+  Tensor tbool8(boolean8(), bool8_data, {}, {}, {});
 
   CheckTensorRoundTrip(t0);
   CheckTensorRoundTrip(tzero);
+  CheckTensorRoundTrip(tbool);
+  CheckTensorRoundTrip(tbool8);
 
   int64_t serialized_size;
   ASSERT_OK(GetTensorSize(t0, &serialized_size));
diff --git a/cpp/src/arrow/ipc/metadata-internal.cc 
b/cpp/src/arrow/ipc/metadata-internal.cc
index 162afb94b..48e23061c 100644
--- a/cpp/src/arrow/ipc/metadata-internal.cc
+++ b/cpp/src/arrow/ipc/metadata-internal.cc
@@ -249,9 +249,15 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const 
void* type_data,
 case flatbuf::Type_Utf8:
   *out = utf8();
   return Status::OK();
-case flatbuf::Type_Bool:
-  *out = boolean();
+case flatbuf::Type_Bool: {
+  auto bool_type = static_cast<const flatbuf::Bool*>(type_data);
+  if (bool_type->is_byte()) {
+*out = boolean8();
+  } else {
+*out = boolean();
+  }
   return Status::OK();
+}
 case flatbuf::Type_Decimal: {
   auto dec_type = static_cast<const flatbuf::Decimal*>(type_data);
   *out = decimal(dec_type->precision(), dec_type->scale());
@@ -458,6 +464,14 @@ static Status TypeToFlatbuffer(FBB& fbb, const DataType& 
type,
 static Status TensorTypeToFlatbuffer(FBB& fbb, const DataType& type,
  flatbuf::Type* out_type, Offset* offset) {
   switch (type.id()) {
+case Type::BOOL:
+  *out_type = flatbuf::Type_Bool;
+  *offset = flatbuf::CreateBool(fbb).Union();
+  break;
+case Type::BOOL8:
+  *out_type = flatbuf::Type_Bool;
+  *offset = flatbuf::CreateBool(fbb, true).Union();
+  break;
 case Type::UINT8:
   INT_TO_FB_CASE(8, false);
 case Type::INT8:
diff --git a/cpp/src/arrow/tensor.h b/cpp/src/arrow/tensor.h
index 4e4c6b8d5..cc622948f 100644
--- a/cpp/src/arrow/tensor.h
+++ b/cpp/src/arrow/tensor.h
@@ -32,6 +32,8 @@ namespace arrow {
 
 static inline bool 

[jira] [Updated] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2142:

Fix Version/s: 0.9.0

> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement  conversion.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2118) [Python] Improve error message when calling parquet.read_table on an empty file

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2118.
-
Resolution: Fixed

Issue resolved by pull request 1735
[https://github.com/apache/arrow/pull/1735]

> [Python] Improve error message when calling parquet.read_table on an empty 
> file
> ---
>
> Key: ARROW-2118
> URL: https://issues.apache.org/jira/browse/ARROW-2118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently it raises an exception about memory mapping failing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2118) [Python] Improve error message when calling parquet.read_table on an empty file

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395685#comment-16395685
 ] 

ASF GitHub Bot commented on ARROW-2118:
---

wesm commented on issue #1735: ARROW-2118: [C++] Fix misleading error when 
memory mapping a zero-length file
URL: https://github.com/apache/arrow/pull/1735#issuecomment-372419557
 
 
   +1


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Improve error message when calling parquet.read_table on an empty 
> file
> ---
>
> Key: ARROW-2118
> URL: https://issues.apache.org/jira/browse/ARROW-2118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently it raises an exception about memory mapping failing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2118) [Python] Improve error message when calling parquet.read_table on an empty file

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395688#comment-16395688
 ] 

ASF GitHub Bot commented on ARROW-2118:
---

wesm closed pull request #1735: ARROW-2118: [C++] Fix misleading error when 
memory mapping a zero-length file
URL: https://github.com/apache/arrow/pull/1735
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc
index d44d90cbe..02cc4dbbd 100644
--- a/cpp/src/arrow/io/file.cc
+++ b/cpp/src/arrow/io/file.cc
@@ -624,16 +624,22 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer {
   is_mutable_ = false;
 }
 
-void* result = mmap(nullptr, static_cast<size_t>(file_->size()), 
prot_flags, map_mode,
-file_->fd(), 0);
-if (result == MAP_FAILED) {
-  std::stringstream ss;
-  ss << "Memory mapping file failed, errno: " << errno;
-  return Status::IOError(ss.str());
+size_ = file_->size();
+
+void* result = nullptr;
+
+// Memory mapping fails when file size is 0
+if (size_ > 0) {
+  result =
+  mmap(nullptr, static_cast<size_t>(size_), prot_flags, map_mode, 
file_->fd(), 0);
+  if (result == MAP_FAILED) {
+std::stringstream ss;
+ss << "Memory mapping file failed: " << std::strerror(errno);
+return Status::IOError(ss.str());
+  }
 }
 
data_ = mutable_data_ = reinterpret_cast<uint8_t*>(result);
-size_ = file_->size();
 
 position_ = 0;
 
diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc
index 2a4acab59..53218ca85 100644
--- a/cpp/src/arrow/io/io-file-test.cc
+++ b/cpp/src/arrow/io/io-file-test.cc
@@ -467,6 +467,16 @@ class TestMemoryMappedFile : public ::testing::Test, public MemoryMapFixture {

 TEST_F(TestMemoryMappedFile, InvalidUsages) {}

+TEST_F(TestMemoryMappedFile, ZeroSizeFile) {
+  std::string path = "io-memory-map-zero-size";
+  std::shared_ptr<MemoryMappedFile> result;
+  ASSERT_OK(InitMemoryMap(0, path, &result));
+
+  int64_t size = 0;
+  ASSERT_OK(result->Tell(&size));
+  ASSERT_EQ(0, size);
+}
+
 TEST_F(TestMemoryMappedFile, WriteRead) {
   const int64_t buffer_size = 1024;
   std::vector<uint8_t> buffer(buffer_size);
diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py
index fe680133b..8557340e0 100644
--- a/python/pyarrow/tests/test_io.py
+++ b/python/pyarrow/tests/test_io.py
@@ -592,6 +592,14 @@ def test_memory_map_writer(tmpdir):
         assert f.read(3) == b'foo'


+def test_memory_zero_length(tmpdir):
+    path = os.path.join(str(tmpdir), guid())
+    f = open(path, 'wb')
+    f.close()
+    with pa.memory_map(path, mode='r+b') as memory_map:
+        assert memory_map.size() == 0
+
+
 def test_os_file_writer(tmpdir):
     SIZE = 4096
     arr = np.random.randint(0, 256, size=SIZE).astype('u1')


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Improve error message when calling parquet.read_table on an empty 
> file
> ---
>
> Key: ARROW-2118
> URL: https://issues.apache.org/jira/browse/ARROW-2118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently it raises an exception about memory mapping failing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
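
For context, a minimal sketch of the behavior this patch establishes, written
against the Python API (assuming a pyarrow build that includes the fix):
memory-mapping a zero-length file yields an empty map instead of a misleading
mmap error.

{code:python}
import os
import tempfile

import pyarrow as pa

# Create a zero-length file, then memory-map it.
path = os.path.join(tempfile.mkdtemp(), 'empty-file')
open(path, 'wb').close()

with pa.memory_map(path, mode='r+b') as mm:
    # Before this fix, opening the map raised "Memory mapping file failed".
    assert mm.size() == 0
{code}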


[jira] [Commented] (ARROW-1643) [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect to HDFS

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395678#comment-16395678
 ] 

ASF GitHub Bot commented on ARROW-1643:
---

wesm commented on issue #1668: ARROW-1643: [Python] Accept hdfs:// prefixes in 
parquet.read_table and attempt to connect to HDFS
URL: https://github.com/apache/arrow/pull/1668#issuecomment-372418909
 
 
   I will take a look and see if we can get this into 0.9.0


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect 
> to HDFS
> -
>
> Key: ARROW-1643
> URL: https://issues.apache.org/jira/browse/ARROW-1643
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
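
For context, a sketch of the usage ARROW-1643 proposes; the URL form and
connection defaults shown here are assumptions, and PR #1668 defines the real
behavior.

{code:python}
import pyarrow.parquet as pq

# read_table would parse the hdfs:// prefix and open an HDFS connection
# on the caller's behalf instead of requiring an explicit filesystem object.
table = pq.read_table('hdfs://namenode:8020/data/myfile.parquet')
{code}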


[jira] [Updated] (ARROW-1643) [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect to HDFS

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1643:

Fix Version/s: (was: 0.10.0)
   0.9.0

> [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect 
> to HDFS
> -
>
> Key: ARROW-1643
> URL: https://issues.apache.org/jira/browse/ARROW-1643
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1918) [JS] Integration portion of verify-release-candidate.sh fails

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1918:

Fix Version/s: (was: JS-0.3.1)
   JS-0.4.0

> [JS] Integration portion of verify-release-candidate.sh fails
> -
>
> Key: ARROW-1918
> URL: https://issues.apache.org/jira/browse/ARROW-1918
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.8.0
>Reporter: Wes McKinney
>Priority: Major
> Fix For: JS-0.4.0
>
>
> I'm going to temporarily disable this in my fixes in ARROW-1917



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1870) [JS] Enable build scripts to work with NodeJS 6.10.2 LTS

2018-03-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395662#comment-16395662
 ] 

Wes McKinney commented on ARROW-1870:
-

The LTS NodeJS release is now 8.10.0 and all is working there

> [JS] Enable build scripts to work with NodeJS 6.10.2 LTS
> 
>
> Key: ARROW-1870
> URL: https://issues.apache.org/jira/browse/ARROW-1870
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1870) [JS] Enable build scripts to work with NodeJS 6.10.2 LTS

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1870:

Fix Version/s: (was: JS-0.3.1)

> [JS] Enable build scripts to work with NodeJS 6.10.2 LTS
> 
>
> Key: ARROW-1870
> URL: https://issues.apache.org/jira/browse/ARROW-1870
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1501) [JS] JavaScript integration tests

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1501:

Fix Version/s: (was: JS-0.3.1)
   JS-0.4.0

> [JS] JavaScript integration tests
> -
>
> Key: ARROW-1501
> URL: https://issues.apache.org/jira/browse/ARROW-1501
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
> Fix For: JS-0.4.0
>
>
> Tracking JIRA for integration test-related issues



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-1870) [JS] Enable build scripts to work with NodeJS 6.10.2 LTS

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-1870.
---
Assignee: Wes McKinney

> [JS] Enable build scripts to work with NodeJS 6.10.2 LTS
> 
>
> Key: ARROW-1870
> URL: https://issues.apache.org/jira/browse/ARROW-1870
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2293) [JS] Print release vote e-mail template when making source release

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395659#comment-16395659
 ] 

ASF GitHub Bot commented on ARROW-2293:
---

wesm closed pull request #1738: ARROW-2293: [JS] Print release vote e-mail 
template when making source release
URL: https://github.com/apache/arrow/pull/1738
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/dev/release/js-source-release.sh b/dev/release/js-source-release.sh
index 53b31af62f..292869db69 100755
--- a/dev/release/js-source-release.sh
+++ b/dev/release/js-source-release.sh
@@ -21,9 +21,38 @@ set -e
 
 SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
 
+function print_help_and_exit {
+cat <<EOF
+Usage: $0 [-h] [-p] <js-version> <rc-num>
+
+  -h  Print this help message and exit
+  -p  If present, publish the release candidate (default: does not publish anything)
+EOF
+exit 0
+}
+
+publish=0
+while getopts ":hp" opt; do
+  case $opt in
+    p)
+      publish=1
+      ;;
+    h)
+      print_help_and_exit
+      ;;
+    *)
+      echo "Unknown option: -$OPTARG"
+      print_help_and_exit
+      ;;
+  esac
+done
+
+shift $(($OPTIND - 1))
+
 if [ "$#" -ne 2 ]; then
-  echo "Usage: $0 <js-version> <rc-num>"
-  exit
+  print_help_and_exit
 fi
 
 js_version=$1
@@ -32,6 +61,17 @@ rc=$2
 tag=apache-arrow-js-${js_version}
 tagrc=${tag}-rc${rc}
 
+# Reset instructions
+current_git_rev=$(git rev-parse HEAD)
+function print_reset_instructions {
+cat <<EOF
+To revert your local repository to its state before this script ran:
+
+  git reset --hard ${current_git_rev}
+EOF
+}
+
 sha1sum $tarball > ${tarball}.sha1
 sha256sum $tarball > ${tarball}.sha256
 sha512sum $tarball > ${tarball}.sha512
 
-# check out the arrow RC folder
-svn co --depth=empty https://dist.apache.org/repos/dist/dev/arrow js-rc-tmp
+if [[ $publish == 1 ]]; then
+  # check out the arrow RC folder
+  svn co --depth=empty https://dist.apache.org/repos/dist/dev/arrow js-rc-tmp
 
-# add the release candidate for the tag
-mkdir -p js-rc-tmp/${tagrc}
-cp ${tarball}* js-rc-tmp/${tagrc}
-svn add js-rc-tmp/${tagrc}
-svn ci -m 'Apache Arrow JavaScript ${version} RC${rc}' js-rc-tmp/${tagrc}
+  # add the release candidate for the tag
+  mkdir -p js-rc-tmp/${tagrc}
+  cp ${tarball}* js-rc-tmp/${tagrc}
+  svn add js-rc-tmp/${tagrc}
+  svn ci -m 'Apache Arrow JavaScript ${version} RC${rc}' js-rc-tmp/${tagrc}
+fi
 
 cd -
 
@@ -100,3 +142,61 @@ echo "Success! The release candidate is available here:"
 echo "  https://dist.apache.org/repos/dist/dev/arrow/${tagrc};
 echo ""
 echo "Commit SHA1: ${release_hash}"
+echo ""
+echo "The following draft email has been created to send to the "
+echo "d...@arrow.apache.org mailing list"
+echo ""
+
+# Create the email template for the release candidate to be sent to the 
mailing lists.
+MESSAGE=$(cat <<__EOF__
+To: d...@arrow.apache.org
+Subject: [VOTE] Release Apache Arrow JS ${js_version} - RC${rc}
+
+Hello all,
+
+I'd like to propose the following release candidate (rc${rc}) of Apache Arrow
+JavaScript version ${js_version}.
+
+The source release rc${rc} is hosted at [1].
+
+This release candidate is based on commit
+${release_hash}
+
+Please download, verify checksums and signatures, run the unit tests, and vote
+on the release. The easiest way is to use the JavaScript-specific release
+verification script dev/release/js-verify-release-candidate.sh.
+
+The vote will be open for at least 72 hours and will close once
+enough PMCs have approved the release.
+
+[ ] +1 Release this as Apache Arrow JavaScript ${js_version}
+[ ] +0
+[ ] -1 Do not release this as Apache Arrow JavaScript ${js_version} because...
+
+
+How to validate a release signature:
+https://httpd.apache.org/dev/verification.html
+
+[1]: https://dist.apache.org/repos/dist/dev/arrow/${tagrc}/
+[2]: https://github.com/apache/arrow/tree/${release_hash}
+
+__EOF__
+)
+
+
+echo "----------------------------------------------------------------------"
+echo
+echo "${MESSAGE}"
+echo
+echo "----------------------------------------------------------------------"
+echo
+
+
+# Print reset instructions if this was a dry-run
+if [[ $publish == 0 ]]; then
+  echo
+  echo "This was a dry run, nothing has been published."
+  echo "To publish, re-run this script with the -p flag."
+  echo
+  print_reset_instructions
+fi


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JS] Print release vote e-mail template when making source release
> --
>
> Key: ARROW-2293
> URL: 

[jira] [Resolved] (ARROW-2293) [JS] Print release vote e-mail template when making source release

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2293.
-
   Resolution: Fixed
Fix Version/s: JS-0.3.1

Issue resolved by pull request 1738
[https://github.com/apache/arrow/pull/1738]

> [JS] Print release vote e-mail template when making source release
> --
>
> Key: ARROW-2293
> URL: https://issues.apache.org/jira/browse/ARROW-2293
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: JS-0.3.1
>
>
> This would help with streamlining the source release process. See 
> https://github.com/apache/parquet-cpp/blob/master/dev/release/release-candidate
>  for an example



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2293) [JS] Print release vote e-mail template when making source release

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395657#comment-16395657
 ] 

ASF GitHub Bot commented on ARROW-2293:
---

wesm commented on issue #1738: ARROW-2293: [JS] Print release vote e-mail 
template when making source release
URL: https://github.com/apache/arrow/pull/1738#issuecomment-372413251
 
 
   thanks @TheNeuralBit!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JS] Print release vote e-mail template when making source release
> --
>
> Key: ARROW-2293
> URL: https://issues.apache.org/jira/browse/ARROW-2293
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>
> This would help with streamlining the source release process. See 
> https://github.com/apache/parquet-cpp/blob/master/dev/release/release-candidate
>  for an example



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395650#comment-16395650
 ] 

ASF GitHub Bot commented on ARROW-2282:
---

wesm commented on issue #1720: ARROW-2282: [Python] Create StringArray from 
buffers
URL: https://github.com/apache/arrow/pull/1720#issuecomment-372412839
 
 
   +1, thanks @xhochy!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Create StringArray from buffers
> 
>
> Key: ARROW-2282
> URL: https://issues.apache.org/jira/browse/ARROW-2282
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> While we will add a more general-purpose functionality in 
> https://issues.apache.org/jira/browse/ARROW-2281, the interface is more 
> complicated than the constructor that explicitly states all arguments:  
> {{StringArray(int64_t length, const std::shared_ptr<Buffer>& value_offsets, 
> …}}
> Thus I will also expose this explicit constructor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
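
For context, a short usage sketch of the constructor this PR adds: rebuilding
a StringArray from the buffers of an existing array (buffer order per
Array.buffers(): null bitmap, value offsets, data).

{code:python}
import pyarrow as pa

arr = pa.array(["a", None, "b"])
null_bitmap, value_offsets, data = arr.buffers()

# Reassemble the array from its raw buffers.
copy = pa.StringArray.from_buffers(len(arr), value_offsets, data,
                                   null_bitmap, arr.null_count)
assert copy.to_pylist() == ["a", None, "b"]
{code}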


[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395652#comment-16395652
 ] 

ASF GitHub Bot commented on ARROW-2282:
---

wesm closed pull request #1720: ARROW-2282: [Python] Create StringArray from 
buffers
URL: https://github.com/apache/arrow/pull/1720
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index e785c0ec5c..1e6bc22d39 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -774,8 +774,41 @@ cdef class UnionArray(Array):
         return pyarrow_wrap_array(out)

 cdef class StringArray(Array):
-    pass

+    @staticmethod
+    def from_buffers(int length, Buffer value_offsets, Buffer data,
+                     Buffer null_bitmap=None, int null_count=-1,
+                     int offset=0):
+        """
+        Construct a StringArray from value_offsets and data buffers.
+        If there are nulls in the data, also a null_bitmap and the matching
+        null_count must be passed.
+
+        Parameters
+        ----------
+        length : int
+        value_offsets : Buffer
+        data : Buffer
+        null_bitmap : Buffer, optional
+        null_count : int, default 0
+        offset : int, default 0
+
+        Returns
+        -------
+        string_array : StringArray
+        """
+        cdef shared_ptr[CBuffer] c_null_bitmap
+        cdef shared_ptr[CArray] out
+
+        if null_bitmap is not None:
+            c_null_bitmap = null_bitmap.buffer
+        else:
+            null_count = 0
+
+        out.reset(new CStringArray(
+            length, value_offsets.buffer, data.buffer, c_null_bitmap,
+            null_count, offset))
+        return pyarrow_wrap_array(out)

 cdef class BinaryArray(Array):
     pass
 pass
diff --git a/python/pyarrow/includes/libarrow.pxd 
b/python/pyarrow/includes/libarrow.pxd
index 456fcca360..09a6065bcd 100644
--- a/python/pyarrow/includes/libarrow.pxd
+++ b/python/pyarrow/includes/libarrow.pxd
@@ -367,6 +367,12 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil:
         const uint8_t* GetValue(int i, int32_t* length)

     cdef cppclass CStringArray" arrow::StringArray"(CBinaryArray):
+        CStringArray(int64_t length, shared_ptr[CBuffer] value_offsets,
+                     shared_ptr[CBuffer] data,
+                     shared_ptr[CBuffer] null_bitmap,
+                     int64_t null_count,
+                     int64_t offset)
+
         c_string GetString(int i)

     cdef cppclass CStructArray" arrow::StructArray"(CArray):
diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py
index f034d78b39..c710f7cdbe 100644
--- a/python/pyarrow/tests/test_array.py
+++ b/python/pyarrow/tests/test_array.py
@@ -258,6 +258,36 @@ def test_union_from_sparse():
     assert result.to_pylist() == [b'a', 1, b'b', b'c', 2, 3, b'd']


+def test_string_from_buffers():
+    array = pa.array(["a", None, "b", "c"])
+
+    buffers = array.buffers()
+    copied = pa.StringArray.from_buffers(
+        len(array), buffers[1], buffers[2], buffers[0], array.null_count,
+        array.offset)
+    assert copied.to_pylist() == ["a", None, "b", "c"]
+
+    copied = pa.StringArray.from_buffers(
+        len(array), buffers[1], buffers[2], buffers[0])
+    assert copied.to_pylist() == ["a", None, "b", "c"]
+
+    sliced = array[1:]
+    buffers = sliced.buffers()
+    copied = pa.StringArray.from_buffers(
+        len(sliced), buffers[1], buffers[2], buffers[0], -1, sliced.offset)
+    assert copied.to_pylist() == [None, "b", "c"]
+    assert copied.null_count == 1
+
+    # Slice but exclude all null entries so that we don't need to pass
+    # the null bitmap.
+    sliced = array[2:]
+    buffers = sliced.buffers()
+    copied = pa.StringArray.from_buffers(
+        len(sliced), buffers[1], buffers[2], None, -1, sliced.offset)
+    assert copied.to_pylist() == ["b", "c"]
+    assert copied.null_count == 0
+
+
 def _check_cast_case(case, safe=True):
     in_data, in_type, out_data, out_type = case
 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Create StringArray from buffers
> 
>
> Key: ARROW-2282
> URL: https://issues.apache.org/jira/browse/ARROW-2282
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>

[jira] [Updated] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2227:

Fix Version/s: (was: 0.10.0)
   0.9.0

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArray cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
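
Until the automatic chunking lands, a minimal workaround sketch: split an
oversized string column into chunks manually so no single chunk approaches the
2 GiB BinaryArray limit. The chunk length below is an arbitrary assumption,
and pa.chunked_array is assumed to be available in the pyarrow build in use.

{code:python}
import pyarrow as pa

def chunked_string_array(values, chunk_len=10000000):
    # Build several smaller arrays instead of one array that would
    # overflow the int32 offset space of a single BinaryArray chunk.
    chunks = [pa.array(values[i:i + chunk_len])
              for i in range(0, len(values), chunk_len)]
    return pa.chunked_array(chunks)
{code}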


[jira] [Resolved] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2292.
-
Resolution: Fixed

Issue resolved by pull request 1736
[https://github.com/apache/arrow/pull/1736]

> [Python] More consistent / intuitive name for pyarrow.frombuffer
> 
>
> Key: ARROW-2292
> URL: https://issues.apache.org/jira/browse/ARROW-2292
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could 
> call {{frombuffer}} something like {{py_buffer}} instead?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2240) [Python] Array initialization with leading numpy nan fails with exception

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2240.
-
Resolution: Fixed

Issue resolved by pull request 1686
[https://github.com/apache/arrow/pull/1686]

> [Python] Array initialization with leading numpy nan fails with exception
> -
>
> Key: ARROW-2240
> URL: https://issues.apache.org/jira/browse/ARROW-2240
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Florian Jetter
>Assignee: Phillip Cloud
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
>  
> Arrow initialization fails for string arrays with leading numpy NAN
> {code:java}
> import pyarrow as pa
> import numpy as np
> pa.array([np.nan, 'str'])
> # Py3: ArrowException: Unknown error: must be real number, not str
> # Py2: ArrowException: Unknown error: a float is required{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
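
A sketch of the behavior expected after this fix (assuming a pyarrow build
that includes it): a leading NaN is treated as a null rather than forcing a
failed float conversion of the following string.

{code:python}
import numpy as np
import pyarrow as pa

# The NaN becomes a null and the array is inferred as a string type.
arr = pa.array([np.nan, 'str'])
assert arr.to_pylist() == [None, 'str']
{code}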


[jira] [Commented] (ARROW-2240) [Python] Array initialization with leading numpy nan fails with exception

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395599#comment-16395599
 ] 

ASF GitHub Bot commented on ARROW-2240:
---

wesm closed pull request #1686: ARROW-2240: [Python] Array initialization with 
leading numpy nan fails with exception
URL: https://github.com/apache/arrow/pull/1686
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc
index d2f900f6ae..595499de79 100644
--- a/cpp/src/arrow/python/builtin_convert.cc
+++ b/cpp/src/arrow/python/builtin_convert.cc
@@ -88,7 +88,7 @@ class ScalarVisitor {

   Status Visit(PyObject* obj) {
     ++total_count_;
-    if (obj == Py_None) {
+    if (obj == Py_None || internal::PyFloat_IsNaN(obj)) {
       ++none_count_;
     } else if (PyBool_Check(obj)) {
       ++bool_count_;
@@ -412,9 +412,10 @@ class TypedConverterVisitor : public TypedConverter<BuilderType> {
     RETURN_NOT_OK(this->typed_builder_->Reserve(size));
     // Iterate over the items adding each one
     if (PySequence_Check(obj)) {
+      auto self = static_cast<Derived*>(this);
       for (int64_t i = 0; i < size; ++i) {
         OwnedRef ref(PySequence_GetItem(obj, i));
-        RETURN_NOT_OK(static_cast<Derived*>(this)->AppendSingle(ref.obj()));
+        RETURN_NOT_OK(self->AppendSingle(ref.obj()));
       }
     } else {
       return Status::TypeError("Object is not a sequence");
@@ -424,7 +425,8 @@ class TypedConverterVisitor : public TypedConverter<BuilderType> {

   // Append a missing item (default implementation)
   Status AppendNull() { return this->typed_builder_->AppendNull(); }
-  bool IsNull(PyObject* obj) const { return obj == Py_None; }
+
+  bool IsNull(PyObject* obj) const { return internal::PandasObjectIsNull(obj); }
 };

 class NullConverter : public TypedConverterVisitor<NullBuilder, NullConverter> {
@@ -438,7 +440,9 @@ class NullConverter : public TypedConverterVisitor<NullBuilder, NullConverter> {
 class BoolConverter : public TypedConverterVisitor<BooleanBuilder, BoolConverter> {
  public:
   // Append a non-missing item
-  Status AppendItem(PyObject* obj) { return typed_builder_->Append(obj == Py_True); }
+  Status AppendItem(PyObject* obj) {
+    return typed_builder_->Append(PyObject_IsTrue(obj) == 1);
+  }
 };

 class Int8Converter : public TypedConverterVisitor<Int8Builder, Int8Converter> {
@@ -851,11 +855,6 @@ class DecimalConverter
     RETURN_NOT_OK(internal::DecimalFromPythonDecimal(obj, type, &value));
     return typed_builder_->Append(value);
   }
-
-  bool IsNull(PyObject* obj) const {
-    return obj == Py_None || obj == numpy_nan || internal::PyFloat_isnan(obj) ||
-           (internal::PyDecimal_Check(obj) && internal::PyDecimal_ISNAN(obj));
-  }
 };
 
 // Dynamic constructor for sequence converters
diff --git a/cpp/src/arrow/python/helpers.cc b/cpp/src/arrow/python/helpers.cc
index 429068dd1a..5840401315 100644
--- a/cpp/src/arrow/python/helpers.cc
+++ b/cpp/src/arrow/python/helpers.cc
@@ -225,10 +225,6 @@ Status UInt64FromPythonInt(PyObject* obj, uint64_t* out) {
   return Status::OK();
 }
 
-bool PyFloat_isnan(PyObject* obj) {
-  return PyFloat_Check(obj) && std::isnan(PyFloat_AS_DOUBLE(obj));
-}
-
 bool PyDecimal_Check(PyObject* obj) {
   // TODO(phillipc): Is this expensive?
   OwnedRef Decimal;
@@ -281,6 +277,15 @@ Status DecimalMetadata::Update(PyObject* object) {
   return Update(precision, scale);
 }
 
+bool PyFloat_IsNaN(PyObject* obj) {
+  return PyFloat_Check(obj) && std::isnan(PyFloat_AsDouble(obj));
+}
+
+bool PandasObjectIsNull(PyObject* obj) {
+  return obj == Py_None || obj == numpy_nan || PyFloat_IsNaN(obj) ||
+         (internal::PyDecimal_Check(obj) && internal::PyDecimal_ISNAN(obj));
+}
+
 }  // namespace internal
 }  // namespace py
 }  // namespace arrow
diff --git a/cpp/src/arrow/python/helpers.h b/cpp/src/arrow/python/helpers.h
index 6be0e49b18..b9f505a160 100644
--- a/cpp/src/arrow/python/helpers.h
+++ b/cpp/src/arrow/python/helpers.h
@@ -82,8 +82,11 @@ Status DecimalFromPythonDecimal(PyObject* python_decimal, 
const DecimalType& arr
 // \brief Check whether obj is an integer, independent of Python versions.
 bool IsPyInteger(PyObject* obj);
 
+// \brief Use pandas missing value semantics to check if a value is null
+bool PandasObjectIsNull(PyObject* obj);
+
 // \brief Check whether obj is nan
-bool PyFloat_isnan(PyObject* obj);
+bool PyFloat_IsNaN(PyObject* obj);
 
 // \brief Check whether obj is an instance of Decimal
 bool PyDecimal_Check(PyObject* obj);
diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc b/cpp/src/arrow/python/numpy_to_arrow.cc
index 04a71c1f64..9e3534d628 100644
--- 

[jira] [Commented] (ARROW-2118) [Python] Improve error message when calling parquet.read_table on an empty file

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395586#comment-16395586
 ] 

ASF GitHub Bot commented on ARROW-2118:
---

wesm commented on a change in pull request #1735: ARROW-2118: [C++] Fix 
misleading error when memory mapping a zero-length file
URL: https://github.com/apache/arrow/pull/1735#discussion_r173889736
 
 

 ##
 File path: cpp/src/arrow/io/file.cc
 ##
 @@ -624,16 +624,22 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer {
       is_mutable_ = false;
     }

-    void* result = mmap(nullptr, static_cast<size_t>(file_->size()), prot_flags,
-                        map_mode, file_->fd(), 0);
-    if (result == MAP_FAILED) {
-      std::stringstream ss;
-      ss << "Memory mapping file failed, errno: " << errno;
-      return Status::IOError(ss.str());
+    size_ = file_->size();
+
+    void* result = nullptr;
+
+    // Memory mapping fails when file size is 0
+    if (size_ > 0) {
+      result =
+          mmap(nullptr, static_cast<size_t>(size_), prot_flags, map_mode, file_->fd(), 0);
+      if (result == MAP_FAILED) {
+        std::stringstream ss;
+        ss << "Memory mapping file failed, errno: " << errno;
 
 Review comment:
   OK, this code is unchanged from the original version, but I'll add this


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Improve error message when calling parquet.read_table on an empty 
> file
> ---
>
> Key: ARROW-2118
> URL: https://issues.apache.org/jira/browse/ARROW-2118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently it raises an exception about memory mapping failing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2296) [C++] Add num_rows to file footer

2018-03-12 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395564#comment-16395564
 ] 

Lawrence Chan edited comment on ARROW-2296 at 3/12/18 5:37 PM:
---

Yeah, I was thinking somewhere in the Footer struct, so we don't need to walk 
all the batches to sum them up.

Also they are indeed in the existing RecordBatch metadata, but the current 
implementation is inside a .cc file and I'd have to either copy+paste or modify 
my build to expose more of the existing code. Maybe we could expose something 
like this on the RecordBatchFileReader?
{code:cpp}
Status ReadRecordBatchMessage(int i, const flatbuf::RecordBatch** metadata) const;
{code}
Then it'd be possible to read the length fields without copying a bunch of 
code. Not sure if this is a good idea though, since it seems that we don't 
usually expose the flatbuffers through the public API. Maybe just a 
{code:cpp}
int64_t num_rows() const;
{code}
is all I really want, and that can read the new Footer field once it's in 
there, and walk the batches in the current format?


was (Author: llchan):
Yeah, I was thinking somewhere in the Footer struct, so we don't need to walk 
all the batches to sum them up.

Also they are indeed in the existing RecordBatch metadata, but the current 
implementation is inside a .cc file and I'd have to either copy+paste or modify 
my build to expose more of the existing code. Maybe we could expose something 
like this on the RecordBatchFileReader?
{code:cpp}
Status ReadRecordBatchMessage(int i, const flatbuf::RecordBatch** metadata) const;
{code}
Then it'd be possible to read the length fields without copying some of the 
other stuff. Not sure if this is a good idea though, since it seems that we 
don't usually expose the flatbuffers through the public API. Maybe just a 
{code:cpp}
int64_t num_rows() const;
{code}
is all I really want, and that can read the new Footer field once it's in 
there, and walk the batches in the current format?

> [C++] Add num_rows to file footer
> -
>
> Key: ARROW-2296
> URL: https://issues.apache.org/jira/browse/ARROW-2296
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Lawrence Chan
>Priority: Minor
>
> Maybe I'm overlooking something, but I don't see anything on the API surface 
> to get the number of rows in an Arrow file without reading all the record 
> batches. This is useful when we want to read into contiguous buffers, because 
> it allows us to allocate the right sizes up front.
> I'd like to propose that we add `num_rows` as a field in the file footer so 
> it's easy to query without reading the whole file.
> Meanwhile, before we get that added to the official format fbs, it would be 
> nice to have a method that iterates over the record batch headers and sums up 
> the lengths without reading the actual record batch body.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
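
For context, a sketch of the interim approach discussed above, written against
the Python API: walk the record batches and sum their lengths. Note this loads
each batch, which is exactly the cost the proposed footer field (or a
metadata-only reader method) would avoid.

{code:python}
import pyarrow as pa

def file_num_rows(path):
    with pa.OSFile(path, 'rb') as source:
        reader = pa.RecordBatchFileReader(source)
        # Sum the row counts batch by batch; a num_rows footer field
        # would make this a single metadata lookup instead.
        return sum(reader.get_batch(i).num_rows
                   for i in range(reader.num_record_batches))
{code}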


[jira] [Commented] (ARROW-2293) [JS] Print release vote e-mail template when making source release

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395544#comment-16395544
 ] 

ASF GitHub Bot commented on ARROW-2293:
---

wesm commented on a change in pull request #1738: ARROW-2293: [JS] Print 
release vote e-mail template when making source release
URL: https://github.com/apache/arrow/pull/1738#discussion_r173878832
 
 

 ##
 File path: dev/release/js-source-release.sh
 ##
 @@ -100,3 +142,61 @@ echo "Success! The release candidate is available here:"
 echo "  https://dist.apache.org/repos/dist/dev/arrow/${tagrc};
 echo ""
 echo "Commit SHA1: ${release_hash}"
+echo ""
+echo "The following draft email has been created to send to the "
+echo "d...@arrow.apache.org mailing list"
+echo ""
+
+# Create the email template for the release candidate to be sent to the 
mailing lists.
+MESSAGE=$(cat <<__EOF__
+To: d...@arrow.apache.org
+Subject: [VOTE] Release Apache Arrow JS ${js_version} - RC${rc}
+
+Hello all,
+
+I'd like to propose the following release candidate (rc${rc}) of Apache Arrow
+JavaScript version ${js_version}.
+
+The source release rc${rc} is hosted at [1].
+
+This release candidate is based on commit
+${release_hash}
+
+Please download, verify checksums and signatures, run the unit tests, and vote
+on the release. The easiest way is to use the JavaScript-specific release
+verification script dev/release/js-verify-release-candidate.sh.
+
+The vote will be open for at least 24 hours and will close once
+enough PMCs have approved the release.
 
 Review comment:
   We should change the time window in the template to the default "The vote 
will be open for at least 72 hours". The release manager can change this on an 
ad hoc basis if we wish to keep doing faster votes while things are moving 
quickly


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JS] Print release vote e-mail template when making source release
> --
>
> Key: ARROW-2293
> URL: https://issues.apache.org/jira/browse/ARROW-2293
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>
> This would help with streamlining the source release process. See 
> https://github.com/apache/parquet-cpp/blob/master/dev/release/release-candidate
>  for an example



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2297) [JS] babel-jest is not listed as a dev dependency

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2297.
-
   Resolution: Fixed
Fix Version/s: 0.9.0

Issue resolved by pull request 1737
[https://github.com/apache/arrow/pull/1737]

> [JS] babel-jest is not listed as a dev dependency
> -
>
> Key: ARROW-2297
> URL: https://issues.apache.org/jira/browse/ARROW-2297
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> babel-jest is not listed as a dev dependency, leading to the following error 
> on new clones of arrow js:
> {noformat}
> [10:21:08] Starting 'test:ts'...
> ● Validation Error:
>   Module ./node_modules/babel-jest/build/index.js in the transform option was 
> not found.
>   Configuration Documentation:
>   https://facebook.github.io/jest/docs/configuration.html
> [10:21:09] 'test:ts' errored after 306 ms
> [10:21:09] Error: exited with error code: 1
> at ChildProcess.onexit 
> (/tmp/arrow/js/node_modules/end-of-stream/index.js:39:36)
> at emitTwo (events.js:126:13)
> at ChildProcess.emit (events.js:214:7)
> at Process.ChildProcess._handle.onexit (internal/child_process.js:198:12)
> [10:21:09] 'test' errored after 311 ms
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2297) [JS] babel-jest is not listed as a dev dependency

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395542#comment-16395542
 ] 

ASF GitHub Bot commented on ARROW-2297:
---

wesm closed pull request #1737: ARROW-2297: [JS] babel-jest is not listed as a 
dev dependency
URL: https://github.com/apache/arrow/pull/1737
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/js/package.json b/js/package.json
index af3c97f5e3..9bd9d13994 100644
--- a/js/package.json
+++ b/js/package.json
@@ -67,6 +67,7 @@
 "@types/glob": "5.0.35",
 "@types/jest": "22.1.0",
 "ast-types": "0.10.1",
+"babel-jest": "22.4.1",
 "benchmark": "2.1.4",
 "coveralls": "3.0.0",
 "del": "3.0.0",


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JS] babel-jest is not listed as a dev dependency
> -
>
> Key: ARROW-2297
> URL: https://issues.apache.org/jira/browse/ARROW-2297
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> babel-jest is not listed as a dev dependency, leading to the following error 
> on new clones of arrow js:
> {noformat}
> [10:21:08] Starting 'test:ts'...
> ● Validation Error:
>   Module ./node_modules/babel-jest/build/index.js in the transform option was 
> not found.
>   Configuration Documentation:
>   https://facebook.github.io/jest/docs/configuration.html
> [10:21:09] 'test:ts' errored after 306 ms
> [10:21:09] Error: exited with error code: 1
> at ChildProcess.onexit 
> (/tmp/arrow/js/node_modules/end-of-stream/index.js:39:36)
> at emitTwo (events.js:126:13)
> at ChildProcess.emit (events.js:214:7)
> at Process.ChildProcess._handle.onexit (internal/child_process.js:198:12)
> [10:21:09] 'test' errored after 311 ms
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2118) [Python] Improve error message when calling parquet.read_table on an empty file

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395540#comment-16395540
 ] 

ASF GitHub Bot commented on ARROW-2118:
---

wesm commented on a change in pull request #1735: ARROW-2118: [C++] Fix 
misleading error when memory mapping a zero-length file
URL: https://github.com/apache/arrow/pull/1735#discussion_r173877848
 
 

 ##
 File path: cpp/src/arrow/io/file.cc
 ##
 @@ -624,16 +624,22 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer {
       is_mutable_ = false;
     }

-    void* result = mmap(nullptr, static_cast<size_t>(file_->size()), prot_flags,
-                        map_mode, file_->fd(), 0);
-    if (result == MAP_FAILED) {
-      std::stringstream ss;
-      ss << "Memory mapping file failed, errno: " << errno;
-      return Status::IOError(ss.str());
+    size_ = file_->size();
+
+    void* result = nullptr;
+
+    // Memory mapping fails when file size is 0
 
 Review comment:
   In theory, if the length is zero, this `data_` member should never be used 
for anything, so I might argue that having an ASAN / valgrind failure for 
passing nullptr someplace it shouldn't go would be a good thing


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Improve error message when calling parquet.read_table on an empty 
> file
> ---
>
> Key: ARROW-2118
> URL: https://issues.apache.org/jira/browse/ARROW-2118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently it raises an exception about memory mapping failing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-987) [JS] Implement JSON writer for Integration tests

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-987:
---
Fix Version/s: (was: JS-0.3.1)
   JS-0.4.0

> [JS] Implement JSON writer for Integration tests
> 
>
> Key: ARROW-987
> URL: https://issues.apache.org/jira/browse/ARROW-987
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Brian Hulette
>Priority: Major
> Fix For: JS-0.4.0
>
>
> Rather than storing generated binary files in the repo, we could just run the 
> integration tests on the JS implementation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395528#comment-16395528
 ] 

ASF GitHub Bot commented on ARROW-2292:
---

wesm commented on issue #1736: ARROW-2292: [Python] Rename frombuffer() to 
py_buffer()
URL: https://github.com/apache/arrow/pull/1736#issuecomment-372388943
 
 
   I tweaked to use the deprecation wrapper-maker and use FutureWarning (so 
this will show up even in non-interactive code -- this is the convention we've 
been using in pandas for visible deprecations so people should be used to it by 
now)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] More consistent / intuitive name for pyarrow.frombuffer
> 
>
> Key: ARROW-2292
> URL: https://issues.apache.org/jira/browse/ARROW-2292
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could 
> call {{frombuffer}} something like {{py_buffer}} instead?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
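
A minimal sketch of the deprecation pattern described above, not the actual
pyarrow implementation (the memoryview stand-in is purely illustrative):

{code:python}
import warnings

def py_buffer(obj):
    # Stand-in for the real pyarrow.py_buffer(); wraps a buffer-like object.
    return memoryview(obj)

def frombuffer(obj):
    # Deprecated alias kept for backwards compatibility. FutureWarning is
    # shown by default, even outside interactive sessions.
    warnings.warn("frombuffer() is deprecated, use py_buffer() instead",
                  FutureWarning, stacklevel=2)
    return py_buffer(obj)

buf = frombuffer(b"arrow")  # emits FutureWarning, returns the wrapped buffer
{code}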


[jira] [Commented] (ARROW-2293) [JS] Print release vote e-mail template when making source release

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395484#comment-16395484
 ] 

ASF GitHub Bot commented on ARROW-2293:
---

TheNeuralBit commented on issue #1738: ARROW-2293: [JS] Print release vote 
e-mail template when making source release
URL: https://github.com/apache/arrow/pull/1738#issuecomment-372377306
 
 
   @wesm I welcome any suggestions to modify the template, might as well get it 
right now. The current template is essentially copy-pasted from your 0.3.1 rc0 
email


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JS] Print release vote e-mail template when making source release
> --
>
> Key: ARROW-2293
> URL: https://issues.apache.org/jira/browse/ARROW-2293
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>
> This would help with streamlining the source release process. See 
> https://github.com/apache/parquet-cpp/blob/master/dev/release/release-candidate
>  for an example



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2293) [JS] Print release vote e-mail template when making source release

2018-03-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2293:
--
Labels: pull-request-available  (was: )

> [JS] Print release vote e-mail template when making source release
> --
>
> Key: ARROW-2293
> URL: https://issues.apache.org/jira/browse/ARROW-2293
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>
> This would help with streamlining the source release process. See 
> https://github.com/apache/parquet-cpp/blob/master/dev/release/release-candidate
>  for an example



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395459#comment-16395459
 ] 

ASF GitHub Bot commented on ARROW-2282:
---

xhochy commented on a change in pull request #1720: ARROW-2282: [Python] Create 
StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720#discussion_r173858008
 
 

 ##
 File path: python/pyarrow/tests/test_array.py
 ##
 @@ -258,6 +258,26 @@ def test_union_from_sparse():
     assert result.to_pylist() == [b'a', 1, b'b', b'c', 2, 3, b'd']


+def test_string_from_buffers():
+    array = pa.array(["a", None, "b", "c"])
+
+    buffers = array.buffers()
+    copied = pa.StringArray.from_buffers(
+        len(array), buffers[1], buffers[2], buffers[0], array.null_count,
+        array.offset)
+    assert copied.to_pylist() == ["a", None, "b", "c"]
+
+    copied = pa.StringArray.from_buffers(
+        len(array), buffers[1], buffers[2], buffers[0])
+    assert copied.to_pylist() == ["a", None, "b", "c"]
+
+    sliced = array[1:]
+    copied = pa.StringArray.from_buffers(
+        len(sliced), buffers[1], buffers[2], buffers[0], -1, sliced.offset)
+    buffers = array.buffers()
+    assert copied.to_pylist() == [None, "b", "c"]
 
 Review comment:
   Done and worked out of the box :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Create StringArray from buffers
> 
>
> Key: ARROW-2282
> URL: https://issues.apache.org/jira/browse/ARROW-2282
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> While we will add a more general-purpose functionality in 
> https://issues.apache.org/jira/browse/ARROW-2281, the interface is more 
> complicated than the constructor that explicitly states all arguments:  
> {{StringArray(int64_t length, const std::shared_ptr<Buffer>& value_offsets, 
> …}}
> Thus I will also expose this explicit constructor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2118) [Python] Improve error message when calling parquet.read_table on an empty file

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395445#comment-16395445
 ] 

ASF GitHub Bot commented on ARROW-2118:
---

cpcloud commented on a change in pull request #1735: ARROW-2118: [C++] Fix 
misleading error when memory mapping a zero-length file
URL: https://github.com/apache/arrow/pull/1735#discussion_r173854248
 
 

 ##
 File path: cpp/src/arrow/io/file.cc
 ##
 @@ -624,16 +624,22 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer {
       is_mutable_ = false;
     }

-    void* result = mmap(nullptr, static_cast<size_t>(file_->size()), prot_flags,
-                        map_mode, file_->fd(), 0);
-    if (result == MAP_FAILED) {
-      std::stringstream ss;
-      ss << "Memory mapping file failed, errno: " << errno;
-      return Status::IOError(ss.str());
+    size_ = file_->size();
+
+    void* result = nullptr;
+
+    // Memory mapping fails when file size is 0
+    if (size_ > 0) {
+      result =
+          mmap(nullptr, static_cast<size_t>(size_), prot_flags, map_mode, file_->fd(), 0);
+      if (result == MAP_FAILED) {
+        std::stringstream ss;
+        ss << "Memory mapping file failed, errno: " << errno;
 
 Review comment:
   It might be more useful to end users to use `strerror(errno)` here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Improve error message when calling parquet.read_table on an empty 
> file
> ---
>
> Key: ARROW-2118
> URL: https://issues.apache.org/jira/browse/ARROW-2118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently it raises an exception about memory mapping failing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2118) [Python] Improve error message when calling parquet.read_table on an empty file

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395447#comment-16395447
 ] 

ASF GitHub Bot commented on ARROW-2118:
---

cpcloud commented on a change in pull request #1735: ARROW-2118: [C++] Fix 
misleading error when memory mapping a zero-length file
URL: https://github.com/apache/arrow/pull/1735#discussion_r173854438
 
 

 ##
 File path: cpp/src/arrow/io/file.cc
 ##
 @@ -624,16 +624,22 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer {
       is_mutable_ = false;
     }

-    void* result = mmap(nullptr, static_cast<size_t>(file_->size()), prot_flags,
-                        map_mode, file_->fd(), 0);
-    if (result == MAP_FAILED) {
-      std::stringstream ss;
-      ss << "Memory mapping file failed, errno: " << errno;
-      return Status::IOError(ss.str());
+    size_ = file_->size();
+
+    void* result = nullptr;
+
+    // Memory mapping fails when file size is 0
+    if (size_ > 0) {
+      result =
+          mmap(nullptr, static_cast<size_t>(size_), prot_flags, map_mode, file_->fd(), 0);
+      if (result == MAP_FAILED) {
+        std::stringstream ss;
+        ss << "Memory mapping file failed, errno: " << errno;
 
 Review comment:
That's in the `<cstring>` header.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Improve error message when calling parquet.read_table on an empty 
> file
> ---
>
> Key: ARROW-2118
> URL: https://issues.apache.org/jira/browse/ARROW-2118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently it raises an exception about memory mapping failing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395442#comment-16395442
 ] 

ASF GitHub Bot commented on ARROW-2292:
---

pitrou commented on a change in pull request #1736: ARROW-2292: [Python] Rename 
frombuffer() to py_buffer()
URL: https://github.com/apache/arrow/pull/1736#discussion_r173853843
 
 

 ##
 File path: python/pyarrow/io.pxi
 ##
 @@ -849,6 +850,15 @@ def frombuffer(object obj):
     return pyarrow_wrap_buffer(buf)


+def frombuffer(object obj):
+    """
+    Deprecated alias for `py_buffer`.
+    """
+    warnings.warn("pa.frombuffer() is deprecated, use pa.py_buffer() instead",
+                  DeprecationWarning)
 
 Review comment:
   Not by default, though it will show up in interactive settings such as 
IPython. I can switch to FutureWarning if you prefer.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] More consistent / intuitive name for pyarrow.frombuffer
> 
>
> Key: ARROW-2292
> URL: https://issues.apache.org/jira/browse/ARROW-2292
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could 
> call {{frombuffer}} something like {{py_buffer}} instead?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395440#comment-16395440
 ] 

ASF GitHub Bot commented on ARROW-2292:
---

wesm commented on a change in pull request #1736: ARROW-2292: [Python] Rename 
frombuffer() to py_buffer()
URL: https://github.com/apache/arrow/pull/1736#discussion_r173853157
 
 

 ##
 File path: python/pyarrow/io.pxi
 ##
 @@ -849,6 +850,15 @@ def frombuffer(object obj):
     return pyarrow_wrap_buffer(buf)


+def frombuffer(object obj):
+    """
+    Deprecated alias for `py_buffer`.
+    """
+    warnings.warn("pa.frombuffer() is deprecated, use pa.py_buffer() instead",
+                  DeprecationWarning)
 
 Review comment:
   I thought DeprecationWarning does not show up by default. FutureWarning 
instead?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] More consistent / intuitive name for pyarrow.frombuffer
> 
>
> Key: ARROW-2292
> URL: https://issues.apache.org/jira/browse/ARROW-2292
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could 
> call {{frombuffer}} something like {{py_buffer}} instead?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2297) [JS] babel-jest is not listed as a dev dependency

2018-03-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2297:
--
Labels: pull-request-available  (was: )

> [JS] babel-jest is not listed as a dev dependency
> -
>
> Key: ARROW-2297
> URL: https://issues.apache.org/jira/browse/ARROW-2297
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
>
> babel-jest is not listed as a dev dependency, leading to the following error 
> on new clones of arrow js:
> {noformat}
> [10:21:08] Starting 'test:ts'...
> ● Validation Error:
>   Module ./node_modules/babel-jest/build/index.js in the transform option was 
> not found.
>   Configuration Documentation:
>   https://facebook.github.io/jest/docs/configuration.html
> [10:21:09] 'test:ts' errored after 306 ms
> [10:21:09] Error: exited with error code: 1
> at ChildProcess.onexit 
> (/tmp/arrow/js/node_modules/end-of-stream/index.js:39:36)
> at emitTwo (events.js:126:13)
> at ChildProcess.emit (events.js:214:7)
> at Process.ChildProcess._handle.onexit (internal/child_process.js:198:12)
> [10:21:09] 'test' errored after 311 ms
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2297) [JS] babel-jest is not listed as a dev dependency

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395369#comment-16395369
 ] 

ASF GitHub Bot commented on ARROW-2297:
---

TheNeuralBit opened a new pull request #1737: ARROW-2297: [JS] babel-jest is 
not listed as a dev dependency
URL: https://github.com/apache/arrow/pull/1737
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JS] babel-jest is not listed as a dev dependency
> -
>
> Key: ARROW-2297
> URL: https://issues.apache.org/jira/browse/ARROW-2297
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
>
> babel-jest is not listed as a dev dependency, leading to the following error 
> on new clones of arrow js:
> {noformat}
> [10:21:08] Starting 'test:ts'...
> ● Validation Error:
>   Module ./node_modules/babel-jest/build/index.js in the transform option was 
> not found.
>   Configuration Documentation:
>   https://facebook.github.io/jest/docs/configuration.html
> [10:21:09] 'test:ts' errored after 306 ms
> [10:21:09] Error: exited with error code: 1
> at ChildProcess.onexit 
> (/tmp/arrow/js/node_modules/end-of-stream/index.js:39:36)
> at emitTwo (events.js:126:13)
> at ChildProcess.emit (events.js:214:7)
> at Process.ChildProcess._handle.onexit (internal/child_process.js:198:12)
> [10:21:09] 'test' errored after 311 ms
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2118) [Python] Improve error message when calling parquet.read_table on an empty file

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395351#comment-16395351
 ] 

ASF GitHub Bot commented on ARROW-2118:
---

pitrou commented on a change in pull request #1735: ARROW-2118: [C++] Fix 
misleading error when memory mapping a zero-length file
URL: https://github.com/apache/arrow/pull/1735#discussion_r173823696
 
 

 ##
 File path: cpp/src/arrow/io/file.cc
 ##
 @@ -624,16 +624,22 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer 
{
   is_mutable_ = false;
 }
 
-void* result = mmap(nullptr, static_cast<size_t>(file_->size()), 
prot_flags, map_mode,
-file_->fd(), 0);
-if (result == MAP_FAILED) {
-  std::stringstream ss;
-  ss << "Memory mapping file failed, errno: " << errno;
-  return Status::IOError(ss.str());
+size_ = file_->size();
+
+void* result = nullptr;
+
+// Memory mapping fails when file size is 0
 
 Review comment:
   Is it desirable to set `data_` to a dummy non-null result when size is 0? 
For example, a private static 0-length array. Some code may trip on a null 
pointer.
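
As a user-facing illustration of the zero-length case under discussion, a
minimal repro sketch (the path is arbitrary, and pa.memory_map /
NativeFile.read are assumed to behave as in current pyarrow):

{code:python}
import pyarrow as pa

path = "/tmp/empty.bin"    # arbitrary scratch path
open(path, "wb").close()   # create a zero-length file

# Before the fix, mapping the empty file failed with a misleading
# "Memory mapping file failed, errno" IOError; with it, the mapping
# behaves as an empty buffer.
mm = pa.memory_map(path)
assert mm.read() == b""
{code}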


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Improve error message when calling parquet.read_table on an empty 
> file
> ---
>
> Key: ARROW-2118
> URL: https://issues.apache.org/jira/browse/ARROW-2118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently it raises an exception about memory mapping failing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-12 Thread Chris Ellison (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395348#comment-16395348
 ] 

Chris Ellison commented on ARROW-2227:
--

Yeah, it's a contrived example, but think of a large data frame storing street 
addresses, usernames, etc.

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-12 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395334#comment-16395334
 ] 

Antoine Pitrou edited comment on ARROW-2227 at 3/12/18 2:47 PM:


{quote}Just wanted to mention, in case it was missed, that this example isn't a 
single large 2 GiB string. Each row in the data frame is a single byte, so it 
is a large array of small bytes.{quote}

Oh, I see. I had misread the example (and my crash is on a different use case 
then). It's quite a weird way of storing binary strings, though? Your column is 
a column of Python objects, which under the hood appear to be numpy.int64 
objects... So you're paying a huge overhead because of all those objects.

(to put in perspective, I have 16 GB RAM, but creating your dataframe swaps 
out...)


was (Author: pitrou):
{quote}Just wanted to mention, in case it was missed, that this example isn't a 
single large 2 GiB string. Each row in the data frame is a single byte, so it 
is a large array of small bytes.{quote}

Oh, I see. I had misread the example (and my crash is on a different use case 
then). It's quite a weird way of storing binary strings, though? Your column is 
a column of Python objects, which under the hood appear to be numpy.int64 
objects... So you're paying a huge overhead because of all those objects.

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-12 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395334#comment-16395334
 ] 

Antoine Pitrou commented on ARROW-2227:
---

{quote}Just wanted to mention, in case it was missed, that this example isn't a 
single large 2 GiB string. Each row in the data frame is a single byte, so it 
is a large array of small bytes.{quote}

Oh, I see. I had misread the example (and my crash is on a different use case 
then). It's quite a weird way of storing binary strings, though? Your column is 
a column of Python objects, which under the hood appear to be numpy.int64 
objects... So you're paying a huge overhead because of all those objects.

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-12 Thread Chris Ellison (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395321#comment-16395321
 ] 

Chris Ellison commented on ARROW-2227:
--

Just wanted to mention, in case it was missed, that this example isn't a single 
large 2 GiB string. Each row in the data frame is a single byte, so it is a 
large array of small bytes. 

Is 0.10.0 far away? Out of the box, using pyarrow for "big data" isn't really 
possible (assuming you have string data) until this is fixed. My hack fix was 
to change pandas_compat.py:dataframe_to_arrays() so that it accepts a 
dictionary mapping column names to chunk sizes; I then manually (and somewhat 
crudely) create a chunked_array, which is passed to Table.from_arrays 
(mimicking what appears in table.pxi).
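
A hedged sketch of that kind of workaround against the public API (the helper
name and the 1 GiB budget are illustrative, and pa.chunked_array plus
Table.from_arrays accepting chunked columns are assumed from current pyarrow,
not 0.8.0):

{code:python}
import pyarrow as pa

LIMIT = 1 << 30  # stay well under the 2 GiB single-chunk offset limit

def chunked_string_column(values, limit=LIMIT):
    # Split the values across several Arrow arrays so that no single
    # chunk's value data approaches the 32-bit offset limit.
    chunks, current, size = [], [], 0
    for v in values:
        nbytes = len(v.encode("utf-8"))
        if nbytes > limit:
            # a single value this large needs a large binary type (ARROW-750)
            raise ValueError("single string exceeds the per-chunk byte limit")
        if current and size + nbytes > limit:
            chunks.append(pa.array(current, type=pa.string()))
            current, size = [], 0
        current.append(v)
        size += nbytes
    chunks.append(pa.array(current, type=pa.string()))
    return pa.chunked_array(chunks)

col = chunked_string_column("street address %d" % i for i in range(1000))
table = pa.Table.from_arrays([col], names=["x"])
{code}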

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer

2018-03-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2292:
--
Labels: pull-request-available  (was: )

> [Python] More consistent / intuitive name for pyarrow.frombuffer
> 
>
> Key: ARROW-2292
> URL: https://issues.apache.org/jira/browse/ARROW-2292
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could 
> call {{frombuffer}} something like {{py_buffer}} instead?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395313#comment-16395313
 ] 

ASF GitHub Bot commented on ARROW-2292:
---

pitrou opened a new pull request #1736: ARROW-2292: [Python] Rename 
frombuffer() to py_buffer()
URL: https://github.com/apache/arrow/pull/1736
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] More consistent / intuitive name for pyarrow.frombuffer
> 
>
> Key: ARROW-2292
> URL: https://issues.apache.org/jira/browse/ARROW-2292
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could 
> call {{frombuffer}} something like {{py_buffer}} instead?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer

2018-03-12 Thread Antoine Pitrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-2292:
-

Assignee: Antoine Pitrou

> [Python] More consistent / intuitive name for pyarrow.frombuffer
> 
>
> Key: ARROW-2292
> URL: https://issues.apache.org/jira/browse/ARROW-2292
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could 
> call {{frombuffer}} something like {{py_buffer}} instead?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2141) [Python] Conversion from Numpy object array to varsize binary unimplemented

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395292#comment-16395292
 ] 

ASF GitHub Bot commented on ARROW-2141:
---

wesm commented on issue #1689: ARROW-2141: [Python] Support variable length 
binary conversion from Pandas
URL: https://github.com/apache/arrow/pull/1689#issuecomment-372324447
 
 
   I moved this JIRA to 0.10.0 so we can give this situation a working over


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy object array to varsize binary unimplemented
> ---
>
> Key: ARROW-2141
> URL: https://issues.apache.org/jira/browse/ARROW-2141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> {code:python}
> >>> arr = np.array([b'xx'], dtype=np.object)
> >>> pa.array(arr, type=pa.binary(2))
> 
> [
>   b'xx'
> ]
> >>> pa.array(arr, type=pa.binary())
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.binary())
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1098 code: 
> compute::Cast(, *arr, type_, options, )
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1022 code: 
> Cast(ctx, Datum(array.data()), out_type, options, _out)
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1009 code: 
> GetCastFunction(*value.type(), out_type, options, )
> No cast implemented from binary to binary
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2141) [Python] Conversion from Numpy object array to varsize binary unimplemented

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2141:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Conversion from Numpy object array to varsize binary unimplemented
> ---
>
> Key: ARROW-2141
> URL: https://issues.apache.org/jira/browse/ARROW-2141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> {code:python}
> >>> arr = np.array([b'xx'], dtype=np.object)
> >>> pa.array(arr, type=pa.binary(2))
> 
> [
>   b'xx'
> ]
> >>> pa.array(arr, type=pa.binary())
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.binary())
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1098 code: 
> compute::Cast(, *arr, type_, options, )
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1022 code: 
> Cast(ctx, Datum(array.data()), out_type, options, _out)
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1009 code: 
> GetCastFunction(*value.type(), out_type, options, )
> No cast implemented from binary to binary
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-12 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395288#comment-16395288
 ] 

Wes McKinney commented on ARROW-2227:
-

OK, moving to 0.10.0.

We should raise an exception for a single string exceeding 2GB. We'll need to 
add a type for large binary / large strings to support this, see ARROW-750
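
For context, the large binary / large string types eventually landed with
64-bit offsets in later pyarrow releases; a tiny sketch assuming a modern
pyarrow (not 0.9/0.10):

{code:python}
import pyarrow as pa

# large_string / large_binary use 64-bit offsets, lifting the 2 GiB
# per-array value-data limit
arr = pa.array(["x"], type=pa.large_string())
{code}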

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2227:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2141) [Python] Conversion from Numpy object array to varsize binary unimplemented

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2141:

Fix Version/s: 0.9.0

> [Python] Conversion from Numpy object array to varsize binary unimplemented
> ---
>
> Key: ARROW-2141
> URL: https://issues.apache.org/jira/browse/ARROW-2141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code:python}
> >>> arr = np.array([b'xx'], dtype=np.object)
> >>> pa.array(arr, type=pa.binary(2))
> 
> [
>   b'xx'
> ]
> >>> pa.array(arr, type=pa.binary())
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.binary())
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1098 code: 
> compute::Cast(, *arr, type_, options, )
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1022 code: 
> Cast(ctx, Datum(array.data()), out_type, options, _out)
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1009 code: 
> GetCastFunction(*value.type(), out_type, options, )
> No cast implemented from binary to binary
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-12 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395246#comment-16395246
 ] 

Antoine Pitrou commented on ARROW-2227:
---

By the way, is the enhancement requested in this ticket even doable with the 
current memory layout? If we create a chunked array and split the binary string 
into chunks, each string chunk will be visible as a separate logical array 
element, so the user won't see a 2GB string anymore, but for example two 1GB 
strings...
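
A tiny illustration of the concern, assuming current pyarrow semantics:
chunking changes how many logical values the column reports, so one large
string split across chunks would surface as several smaller strings:

{code:python}
import pyarrow as pa

whole = pa.chunked_array([pa.array(["aabb"])])                  # one logical value
split = pa.chunked_array([pa.array(["aa"]), pa.array(["bb"])])  # two logical values
assert (len(whole), len(split)) == (1, 2)
{code}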

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2141) [Python] Conversion from Numpy object array to varsize binary unimplemented

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395176#comment-16395176
 ] 

ASF GitHub Bot commented on ARROW-2141:
---

pitrou commented on a change in pull request #1689: ARROW-2141: [Python] 
Support variable length binary conversion from Pandas
URL: https://github.com/apache/arrow/pull/1689#discussion_r173776925
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -164,18 +163,26 @@ static Status AppendObjectBinaries(PyArrayObject* arr, 
PyArrayObject* mask,
 if ((have_mask && mask_values[offset]) || PandasObjectIsNull(obj)) {
   RETURN_NOT_OK(builder->AppendNull());
   continue;
-} else if (!PyBytes_Check(obj)) {
+} else if (PyBytes_Check(obj)) {
 
 Review comment:
   We're probably doing this kind of dance (taking a bytes object and 
extracting a pointer and size) in other places already. I think it would be 
nice to factor that out somewhere (see e.g. `src/arrow/python/helpers.h`). That 
would also allow supporting bytearray in other places.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy object array to varsize binary unimplemented
> ---
>
> Key: ARROW-2141
> URL: https://issues.apache.org/jira/browse/ARROW-2141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([b'xx'], dtype=np.object)
> >>> pa.array(arr, type=pa.binary(2))
> 
> [
>   b'xx'
> ]
> >>> pa.array(arr, type=pa.binary())
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.binary())
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1098 code: 
> compute::Cast(, *arr, type_, options, )
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1022 code: 
> Cast(ctx, Datum(array.data()), out_type, options, _out)
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1009 code: 
> GetCastFunction(*value.type(), out_type, options, )
> No cast implemented from binary to binary
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-12 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395173#comment-16395173
 ] 

Antoine Pitrou commented on ARROW-2227:
---

The snippet produces a core dump here (**). I think it is related to code 
touched by ARROW-2141 and especially the comment I posted here about an 
incorrect cast to 32-bit: 
https://github.com/apache/arrow/pull/1689/files#r173777819 .

(as a sidenote, I don't know why ARROW-2141 is needed to allow conversion from 
Numpy when conversion from Pandas is already implemented. I suspect different 
code paths are taken?)

(re-sidenote, where does the 2GB limit stem from? the desire to have shorter 
offset arrays?)

I would recommend deferring this to 0.10.0 so that we can sanitize the whole 
situation. There seem to be separate code paths converting Python bytes objects 
to Arrow data, with slightly different strategies...

(**) gdb backtrace:
{code}
#0  __memcpy_avx_unaligned () at 
../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:245
#1  0x7fffc23dc5e8 in arrow::BufferBuilder::UnsafeAppend 
(this=0x7fffb538, data=0x7fff3f01d030, length=-2147483648)
at /home/antoine/arrow/cpp/src/arrow/buffer.h:285
#2  0x7fffc23dc236 in arrow::BufferBuilder::Append (this=0x7fffb538, 
data=0x7fff3f01d030, length=-2147483648)
at /home/antoine/arrow/cpp/src/arrow/buffer.h:255
#3  0x7fffc242b30f in arrow::TypedBufferBuilder::Append 
(this=0x7fffb538, 
arithmetic_values=0x7fff3f01d030 'x' ..., 
num_elements=-2147483648) at /home/antoine/arrow/cpp/src/arrow/buffer.h:332
#4  0x7fffc23d7f3e in arrow::BinaryBuilder::Append (this=0x7fffb4a0, 
value=0x7fff3f01d030 'x' ..., length=-2147483648)
at /home/antoine/arrow/cpp/src/arrow/builder.cc:1343
#5  0x7fffc258d4b8 in arrow::BinaryBuilder::Append (this=0x7fffb4a0, 
value=0x7fff3f01d030 'x' ..., length=-2147483648)
at /home/antoine/arrow/cpp/src/arrow/builder.h:675
#6  0x7fffc1f84923 in arrow::py::AppendObjectStrings (arr=0x77ecbee0, 
mask=0x0, offset=0, builder=0x7fffb4a0, 
end_offset=0x7fffb480, have_bytes=0x7fffb430) at 
/home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:233
#7  0x7fffc1f88d62 in arrow::py::NumPyConverter::ConvertObjectStrings 
(this=0x7fffbdd0)
at /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:859
#8  0x7fffc1f8c22b in arrow::py::NumPyConverter::ConvertObjectsInfer 
(this=0x7fffbdd0)
at /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1034
#9  0x7fffc1f8d99c in arrow::py::NumPyConverter::ConvertObjects 
(this=0x7fffbdd0)
at /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1135
#10 0x7fffc1f85a2a in arrow::py::NumPyConverter::Convert 
(this=0x7fffbdd0)
at /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:504
#11 0x7fffc1f9038d in arrow::py::NdarrayToArrow (pool=0x7fffc2a2d680 
, 
ao=0x77ecbee0, mo=0x77d5ce90 <_Py_NoneStruct>, 
use_pandas_null_sentinels=true, type=..., out=0x7fffc0c0)
at /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1577
{code}
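
The length=-2147483648 in frames #1-#6 above is consistent with the 32-bit
cast mentioned in the linked review comment: a byte length of 2**31 wraps to
the minimum int32. A quick sanity check of that wrap-around:

{code:python}
import ctypes

# 2**31 does not fit in a signed 32-bit integer and wraps to INT32_MIN
assert ctypes.c_int32(2**31).value == -2147483648
{code}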

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2141) [Python] Conversion from Numpy object array to varsize binary unimplemented

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395150#comment-16395150
 ] 

ASF GitHub Bot commented on ARROW-2141:
---

pitrou commented on a change in pull request #1689: ARROW-2141: [Python] 
Support variable length binary conversion from Pandas
URL: https://github.com/apache/arrow/pull/1689#discussion_r17349
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -164,18 +163,26 @@ static Status AppendObjectBinaries(PyArrayObject* arr, 
PyArrayObject* mask,
 if ((have_mask && mask_values[offset]) || PandasObjectIsNull(obj)) {
   RETURN_NOT_OK(builder->AppendNull());
   continue;
-} else if (!PyBytes_Check(obj)) {
+} else if (PyBytes_Check(obj)) {
+  const int32_t length = static_cast<int32_t>(PyBytes_GET_SIZE(obj));
+  if (ARROW_PREDICT_FALSE(builder->value_data_length() + length >
+  kBinaryMemoryLimit)) {
+break;
 
 Review comment:
   After reading the code a bit more carefully, I understand... though there is 
still a problem: what if `length` is larger than `kBinaryMemoryLimit`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy object array to varsize binary unimplemented
> ---
>
> Key: ARROW-2141
> URL: https://issues.apache.org/jira/browse/ARROW-2141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([b'xx'], dtype=np.object)
> >>> pa.array(arr, type=pa.binary(2))
> 
> [
>   b'xx'
> ]
> >>> pa.array(arr, type=pa.binary())
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.binary())
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1098 code: 
> compute::Cast(, *arr, type_, options, )
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1022 code: 
> Cast(ctx, Datum(array.data()), out_type, options, _out)
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1009 code: 
> GetCastFunction(*value.type(), out_type, options, )
> No cast implemented from binary to binary
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2141) [Python] Conversion from Numpy object array to varsize binary unimplemented

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395151#comment-16395151
 ] 

ASF GitHub Bot commented on ARROW-2141:
---

pitrou commented on a change in pull request #1689: ARROW-2141: [Python] 
Support variable length binary conversion from Pandas
URL: https://github.com/apache/arrow/pull/1689#discussion_r173777819
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -164,18 +163,26 @@ static Status AppendObjectBinaries(PyArrayObject* arr, 
PyArrayObject* mask,
 if ((have_mask && mask_values[offset]) || PandasObjectIsNull(obj)) {
   RETURN_NOT_OK(builder->AppendNull());
   continue;
-} else if (!PyBytes_Check(obj)) {
+} else if (PyBytes_Check(obj)) {
+  const int32_t length = static_cast<int32_t>(PyBytes_GET_SIZE(obj));
 
 Review comment:
   The cast is entirely wrong :-(


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy object array to varsize binary unimplemented
> ---
>
> Key: ARROW-2141
> URL: https://issues.apache.org/jira/browse/ARROW-2141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([b'xx'], dtype=np.object)
> >>> pa.array(arr, type=pa.binary(2))
> 
> [
>   b'xx'
> ]
> >>> pa.array(arr, type=pa.binary())
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.binary())
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1098 code: 
> compute::Cast(, *arr, type_, options, )
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1022 code: 
> Cast(ctx, Datum(array.data()), out_type, options, _out)
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1009 code: 
> GetCastFunction(*value.type(), out_type, options, )
> No cast implemented from binary to binary
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2141) [Python] Conversion from Numpy object array to varsize binary unimplemented

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395149#comment-16395149
 ] 

ASF GitHub Bot commented on ARROW-2141:
---

pitrou commented on a change in pull request #1689: ARROW-2141: [Python] 
Support variable length binary conversion from Pandas
URL: https://github.com/apache/arrow/pull/1689#discussion_r173777205
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -164,18 +163,26 @@ static Status AppendObjectBinaries(PyArrayObject* arr, 
PyArrayObject* mask,
 if ((have_mask && mask_values[offset]) || PandasObjectIsNull(obj)) {
   RETURN_NOT_OK(builder->AppendNull());
   continue;
-} else if (!PyBytes_Check(obj)) {
+} else if (PyBytes_Check(obj)) {
+  const int32_t length = static_cast<int32_t>(PyBytes_GET_SIZE(obj));
+  if (ARROW_PREDICT_FALSE(builder->value_data_length() + length >
+  kBinaryMemoryLimit)) {
+break;
 
 Review comment:
   I have trouble parsing this... If we don't know how to convert this, we 
should fail, not simply break from the loop, no?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy object array to varsize binary unimplemented
> ---
>
> Key: ARROW-2141
> URL: https://issues.apache.org/jira/browse/ARROW-2141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([b'xx'], dtype=np.object)
> >>> pa.array(arr, type=pa.binary(2))
> 
> [
>   b'xx'
> ]
> >>> pa.array(arr, type=pa.binary())
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.binary())
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1098 code: 
> compute::Cast(, *arr, type_, options, )
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1022 code: 
> Cast(ctx, Datum(array.data()), out_type, options, _out)
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1009 code: 
> GetCastFunction(*value.type(), out_type, options, )
> No cast implemented from binary to binary
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2141) [Python] Conversion from Numpy object array to varsize binary unimplemented

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395144#comment-16395144
 ] 

ASF GitHub Bot commented on ARROW-2141:
---

pitrou commented on a change in pull request #1689: ARROW-2141: [Python] 
Support variable length binary conversion from Pandas
URL: https://github.com/apache/arrow/pull/1689#discussion_r173776925
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -164,18 +163,26 @@ static Status AppendObjectBinaries(PyArrayObject* arr, 
PyArrayObject* mask,
 if ((have_mask && mask_values[offset]) || PandasObjectIsNull(obj)) {
   RETURN_NOT_OK(builder->AppendNull());
   continue;
-} else if (!PyBytes_Check(obj)) {
+} else if (PyBytes_Check(obj)) {
 
 Review comment:
   We're probably doing this kind of dance (taking a bytes object and 
extracting a pointer and size) in other places already. I think it would be 
nice to factor that out somewhere (see e.g. `src/arrow/python/helpers.h`). 
That would also allow supporting bytearray in other places.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy object array to varsize binary unimplemented
> ---
>
> Key: ARROW-2141
> URL: https://issues.apache.org/jira/browse/ARROW-2141
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([b'xx'], dtype=np.object)
> >>> pa.array(arr, type=pa.binary(2))
> 
> [
>   b'xx'
> ]
> >>> pa.array(arr, type=pa.binary())
> Traceback (most recent call last):
>   File "", line 1, in 
>     pa.array(arr, type=pa.binary())
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1098 code: 
> compute::Cast(, *arr, type_, options, )
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1022 code: 
> Cast(ctx, Datum(array.data()), out_type, options, _out)
> /home/antoine/arrow/cpp/src/arrow/compute/kernels/cast.cc:1009 code: 
> GetCastFunction(*value.type(), out_type, options, )
> No cast implemented from binary to binary
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)