[ https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krisztian Szucs updated ARROW-11400:
------------------------------------
    Fix Version/s:     (was: 3.0.0)
                   4.0.0

> [Python] Pickled ParquetFileFragment has invalid partition_expression with
> dictionary type in pyarrow 2.0
> --------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11400
>                 URL: https://issues.apache.org/jira/browse/ARROW-11400
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: dataset, pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> From https://github.com/dask/dask/pull/7066#issuecomment-767156623
> Simplified reproducer:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
>
> table = pa.table({'part': ['A', 'B']*5, 'col': range(10)})
> pq.write_to_dataset(table, "test_partitioned_parquet",
>                     partition_cols=["part"])
>
> # with partitioning_kwargs = {} there is no error
> partitioning_kwargs = {"max_partition_dictionary_size": -1}
> dataset = ds.dataset(
>     "test_partitioned_parquet/", format="parquet",
>     partitioning=ds.HivePartitioning.discover(**partitioning_kwargs)
> )
> frag = list(dataset.get_fragments())[0]
> {code}
> Querying this fragment works fine, but after serialization/deserialization
> with pickle, it gives errors (and with the original data example I actually
> got a segfault as well):
> {code}
> In [16]: import pickle
>
> In [17]: frag2 = pickle.loads(pickle.dumps(frag))
>
> In [19]: frag2.partition_expression
> ...
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16:
> invalid continuation byte
>
> In [20]: frag2.to_table(schema=schema, columns=columns)
> Out[20]:
> pyarrow.Table
> col: int64
> part: dictionary<values=string, indices=int32, ordered=0>
>
> In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas()
> ...
> ~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in
> pyarrow.lib.table_to_blocks()
>
> ArrowException: Unknown error: Wrapping ɻ� failed
> {code}
> It seems the issue was specifically with a partition expression with
> dictionary type.
> Also when using an integer column as the partition column, you get wrong
> values (but silently in this case):
> {code:python}
> In [42]: frag.partition_expression
> Out[42]:
> <pyarrow.dataset.Expression (part == [
>   1,
>   2
> ][0]:dictionary<values=int32, indices=int32, ordered=0>)>
>
> In [43]: frag2.partition_expression
> Out[43]:
> <pyarrow.dataset.Expression (part == [
>   170145232,
>   32754
> ][0]:dictionary<values=int32, indices=int32, ordered=0>)>
> {code}
> Now, it seems this is fixed in master. But since I don't remember whether it
> was fixed intentionally ([~bkietz]?), it would be good to add some tests for
> it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)