[jira] [Updated] (ARROW-11400) [Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11400:

Fix Version/s: (was: 3.0.0)
   4.0.0

> [Python] Pickled ParquetFileFragment has invalid partition_expresion with 
> dictionary type in pyarrow 2.0
> 
>
> Key: ARROW-11400
> URL: https://issues.apache.org/jira/browse/ARROW-11400
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset, pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> From https://github.com/dask/dask/pull/7066#issuecomment-767156623
> Simplified reproducer:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> table = pa.table({'part': ['A', 'B']*5, 'col': range(10)})
> pq.write_to_dataset(table, "test_partitioned_parquet", partition_cols=["part"])
> # with partitioning_kwargs = {} there is no error
> partitioning_kwargs = {"max_partition_dictionary_size": -1}
> dataset = ds.dataset(
>     "test_partitioned_parquet/", format="parquet",
>     partitioning=ds.HivePartitioning.discover(**partitioning_kwargs)
> )
> frag = list(dataset.get_fragments())[0]
> {code}
> Querying this fragment works fine, but after serialization/deserialization 
> with pickle, it gives errors (and with the original data example I actually 
> got a segfault as well):
> {code}
> In [16]: import pickle
> In [17]: frag2 = pickle.loads(pickle.dumps(frag))
> In [19]: frag2.partition_expression
> ...
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16: 
> invalid continuation byte
> In [20]: frag2.to_table(schema=schema, columns=columns)
> Out[20]: 
> pyarrow.Table
> col: int64
> part: dictionary
> In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas()
> ...
> ~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in 
> pyarrow.lib.table_to_blocks()
> ArrowException: Unknown error: Wrapping ɻ� failed
> {code}
> It seems the issue was specifically with a partition expression with 
> dictionary type. 
> Also when using an integer column as the partition column, you get wrong
> values (but silently in this case):
> {code:python}
> In [42]: frag.partition_expression
> Out[42]: <pyarrow.dataset.Expression (part == [
>   1,
>   2
> ][0]:dictionary)>
> In [43]: frag2.partition_expression
> Out[43]: <pyarrow.dataset.Expression (part == [
>   170145232,
>   32754
> ][0]:dictionary)>
> {code}
> Now, it seems this is fixed in master. But since I don't remember it being
> fixed intentionally ([~bkietz]?), it would be good to add some tests for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11400) [Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0

2021-04-18 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11400:
--
Fix Version/s: (was: 4.0.0)
   3.0.0



[jira] [Updated] (ARROW-11400) [Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0

2021-04-18 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-11400:

Fix Version/s: (was: 3.0.0)
   4.0.0



[jira] [Updated] (ARROW-11400) [Python] Pickled ParquetFileFragment has invalid partition_expresion with dictionary type in pyarrow 2.0

2021-01-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11400:
---
Labels: dataset pull-request-available  (was: dataset)
