[ https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852939#comment-16852939 ]

Robin Kåveland edited comment on ARROW-5430 at 5/31/19 11:48 AM:
-----------------------------------------------------------------

Okay, I must admit to being a bit stumped here. I followed the trail from 
{{_sequence_to_array}} to find out where the {{Unknown error}} status is coming 
from, and I'm quite sure it must be 
[CIntFromPythonImpl|https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/helpers.cc#L179].
 Here, we call some CPython APIs, namely 
[PyLong_AsLong|https://docs.python.org/3/c-api/long.html#c.PyLong_AsLong] and 
{{PyLong_AsLongLong}}, both of which return {{-1}} on overflow. Then, when we 
get {{-1}}, we call {{RETURN_IF_PYERROR}}. 
[This|https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/common.h#L36-L51]
 block of code looks like it could be the right place to make the change. But 
now I'm very much on thin ice as I don't know much C++ at all and I'm also not 
very familiar with the CPython C API.

I'm guessing the right "fix" would be something like adding a branch to 
{{ConvertPyError}} that checks 
{{PyErr_ExceptionMatches(PyExc_OverflowError)}}?
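For what it's worth, the boundary can be illustrated without Arrow at all: {{pd.util.hash_array}} produces unsigned 64-bit values, and anything above 2^63 - 1 no longer fits in a signed C long (on x86_64), which is exactly when CPython raises {{OverflowError}}. A stdlib-only sketch (the {{to_bytes}} call is just a stand-in with a comparable range check, not what pyarrow actually does):

```python
# Values above the signed 64-bit range trigger OverflowError in CPython;
# that is the exception type the proposed
# PyErr_ExceptionMatches(PyExc_OverflowError) branch would catch.
C_LONG_MAX = 2**63 - 1  # assuming a 64-bit C long, as on x86_64

big_hash = 2**64 - 1  # pd.util.hash_array yields uint64 values up to here

# int.to_bytes with signed=True applies a comparable range check and
# raises the same exception type as an overflowing PyLong_AsLongLong.
try:
    big_hash.to_bytes(8, "big", signed=True)
except OverflowError as exc:
    print(type(exc).__name__)  # OverflowError
```

So any partition key drawn from the upper half of the uint64 hash space will hit this path on read.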



> [Python] Can read but not write parquet partitioned on large ints
> -----------------------------------------------------------------
>
>                 Key: ARROW-5430
>                 URL: https://issues.apache.org/jira/browse/ARROW-5430
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>         Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
>            Reporter: Robin Kåveland
>            Priority: Minor
>              Labels: parquet
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:java}
> import numpy as np
> import pandas as pd
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
> df.to_parquet('can_write.parq', partition_cols=['user'])
> # But not read
> pd.read_parquet('can_write.parq'){code}
> Expected behaviour:
>  * Either the write fails
>  * Or the read succeeds
> Actual behaviour: The read fails with the following error:
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 129, in read
>     **kwargs).to_pandas()
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py",
>  line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File 
> "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py",
>  line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to 
> C long{code}
> I set the priority to minor here because it's easy enough to work around this 
> in user code unless you really need the 64 bit hash (and you probably 
> shouldn't be partitioning on that anyway).
> I could take a stab at writing a patch for this if there's interest?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
