[ 
https://issues.apache.org/jira/browse/ARROW-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9476:
----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] HivePartitioning discovery with dictionary types fails for 
> multiple fields
> -----------------------------------------------------------------------------------------
>
>                 Key: ARROW-9476
>                 URL: https://issues.apache.org/jira/browse/ARROW-9476
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Apparently, ARROW-9288 did not fully fix the issue. With a single string 
> partition field, discovery with dictionary types now works fine. But once you 
> have multiple string fields, you get parsing errors.
> A reproducible example:
> {code}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds 
> foo_keys = np.array(['a', 'b', 'c'], dtype=object)
> bar_keys = np.array(['d', 'e', 'f'], dtype=object)
> N = 30
> table = pa.table({
>     'foo': foo_keys.repeat(10),
>     'bar': np.tile(np.tile(bar_keys, 5), 2),
>     'values': np.random.randn(N)
> })
> base_path = "test_partition_directories3"
> pq.write_to_dataset(table, base_path, partition_cols=["bar", "foo"])
> # works
> ds.dataset(base_path, partitioning="hive")
> # fails
> part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
> ds.dataset(base_path, partitioning=part)
> {code}
> cc [~bkietz]
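
For context on the report above: writing with partition_cols=["bar", "foo"] yields nested directories of the form bar=<value>/foo=<value>, and discovery with dictionary types has to collect the distinct values of every field, not only the first one. The following is a minimal pure-Python sketch of that idea, not Arrow's C++ implementation; the helper names parse_hive_path and collect_dictionaries are hypothetical and used only for illustration.

```python
from itertools import product


def parse_hive_path(path):
    """Split a hive-style relative path into (key, value) pairs."""
    pairs = []
    for segment in path.strip("/").split("/"):
        if "=" not in segment:
            continue  # not a partition directory (e.g. a data file name)
        key, _, value = segment.partition("=")
        pairs.append((key, value))
    return pairs


def collect_dictionaries(paths):
    """Gather the unique values seen per partition field, in first-seen order."""
    dictionaries = {}
    for path in paths:
        for key, value in parse_hive_path(path):
            values = dictionaries.setdefault(key, [])
            if value not in values:
                values.append(value)
    return dictionaries


# Directory layout the reproducer should produce: "bar" first, then "foo".
paths = [f"bar={b}/foo={f}" for b, f in product("def", "abc")]
dictionaries = collect_dictionaries(paths)
print(dictionaries)
```

Each field ends up with its own value list, which is the per-field bookkeeping that the multi-field case in this report depends on.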



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
