[ https://issues.apache.org/jira/browse/ARROW-5430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852939#comment-16852939 ]
Robin Kåveland edited comment on ARROW-5430 at 5/31/19 11:48 AM:
-----------------------------------------------------------------

Okay, I must admit to being a bit stumped here. I followed the trail from {{_sequence_to_array}} to find out where the {{ArrowUnknown}} is coming from, and I'm quite sure it must be [CIntFromPythonImpl|https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/helpers.cc#L179]. Here we call some CPython APIs, namely [PyLong_AsLong|https://docs.python.org/3/c-api/long.html#c.PyLong_AsLong] and {{PyLong_AsLongLong}}, both of which return {{-1}} on overflow. Then we call {{RETURN_IF_PYERROR}} in the case where we get {{-1}}. [This|https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/common.h#L36-L51] block of code looks like it could be the right place to make the change. But now I'm very much on thin ice, as I don't know much C++ at all, and I'm also not very familiar with the CPython C API. I'm guessing the right "fix" would be something like adding a branch to {{ConvertPyError}} that checks {{PyErr_ExceptionMatches(PyExc_OverflowError)}}?
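For context, the failing conversion can be mirrored in pure Python: {{PyLong_AsLongLong}} only accepts values in the signed 64-bit range, while {{pd.util.hash_array}} produces uint64 hashes that can exceed it. A minimal sketch (the helper name {{fits_in_c_longlong}} is made up for illustration, it is not an Arrow API):

{code:python}
# Pure-Python mirror of the range check that PyLong_AsLongLong performs.
# fits_in_c_longlong is a hypothetical helper name, not part of Arrow.
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1

def fits_in_c_longlong(value: int) -> bool:
    """True if `value` fits in a signed 64-bit C integer."""
    return INT64_MIN <= value <= INT64_MAX

# pd.util.hash_array yields uint64 values; anything above INT64_MAX makes
# PyLong_AsLongLong return -1 and set an OverflowError in CPython:
print(fits_in_c_longlong(13))         # True: small login counts are fine
print(fits_in_c_longlong(2**64 - 1))  # False: a large uint64 hash overflows
{code}

That overflow is what bubbles up as the "Python int too large to convert to C long" message in the traceback below.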
> [Python] Can read but not write parquet partitioned on large ints
> -----------------------------------------------------------------
>
>                 Key: ARROW-5430
>                 URL: https://issues.apache.org/jira/browse/ARROW-5430
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>        Environment: Mac OSX 10.14.4, Python 3.7.1, x86_64.
>            Reporter: Robin Kåveland
>            Priority: Minor
>              Labels: parquet
>
> Here's a contrived example that reproduces this issue using pandas:
> {code:python}
> import numpy as np
> import pandas as pd
> real_usernames = np.array(['anonymize', 'me'])
> usernames = pd.util.hash_array(real_usernames)
> login_count = [13, 9]
> df = pd.DataFrame({'user': usernames, 'logins': login_count})
> df.to_parquet('can_write.parq', partition_cols=['user'])
> # But not read
> pd.read_parquet('can_write.parq')
> {code}
> Expected behaviour:
> * Either the write fails
> * Or the read succeeds
> Actual behaviour: the read fails with the following error:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet
>     return impl.read(path, columns=columns, **kwargs)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read
>     **kwargs).to_pandas()
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1152, in read_table
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/filesystem.py", line 181, in read_parquet
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 1014, in read
>     use_pandas_metadata=use_pandas_metadata)
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 587, in read
>     dictionary = partitions.levels[i].dictionary
>   File "/Users/robinkh/code/venvs/datamunge/lib/python3.7/site-packages/pyarrow/parquet.py", line 642, in dictionary
>     dictionary = lib.array(integer_keys)
>   File "pyarrow/array.pxi", line 173, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 36, in pyarrow.lib._sequence_to_array
>   File "pyarrow/error.pxi", line 104, in pyarrow.lib.check_status
> pyarrow.lib.ArrowException: Unknown error: Python int too large to convert to C long
> {code}
> I set the priority to minor here because it's easy enough to work around this in user code unless you really need the 64-bit hash (and you probably shouldn't be partitioning on that anyway).
> I could take a stab at writing a patch for this if there's interest?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
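On the "easy enough to work around" point above, one user-side option is to reinterpret the uint64 hashes as signed 64-bit integers before partitioning, so the partition keys never leave the range a C long long can hold. A sketch, not part of any patch; the function name {{to_signed_int64}} is made up for illustration:

{code:python}
# Sketch of a user-code workaround: reinterpret a uint64 hash as a signed
# int64 (two's complement) so partition keys stay readable after round-trip.
# to_signed_int64 is an illustrative name, not a pandas/pyarrow API.
U64 = 2**64

def to_signed_int64(u: int) -> int:
    """Map a uint64 value onto the signed 64-bit range, two's-complement style."""
    if not 0 <= u < U64:
        raise ValueError("expected a uint64 value")
    return u - U64 if u >= 2**63 else u

# Hashes at or above 2**63 become negative but stay within int64:
print(to_signed_int64(2**64 - 1))  # -1
print(to_signed_int64(42))         # 42
{code}

In the reproduction above, the same reinterpretation can be done on the whole array at once, e.g. {{usernames.view('int64')}}, before building the DataFrame.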