[ https://issues.apache.org/jira/browse/ARROW-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-9476:
----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] HivePartitioning discovery with dictionary types fails for
> multiple fields
> -----------------------------------------------------------------------------------------
>
>                 Key: ARROW-9476
>                 URL: https://issues.apache.org/jira/browse/ARROW-9476
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Apparently, ARROW-9288 was not fully / correctly fixing the issue. With a
> single string partition field, it now works fine. But once you have multiple
> string fields, you get parsing errors.
> A reproducible example:
> {code}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> foo_keys = np.array(['a', 'b', 'c'], dtype=object)
> bar_keys = np.array(['d', 'e', 'f'], dtype=object)
> N = 30
> table = pa.table({
>     'foo': foo_keys.repeat(10),
>     'bar': np.tile(np.tile(bar_keys, 5), 2),
>     'values': np.random.randn(N)
> })
> base_path = "test_partition_directories3"
> pq.write_to_dataset(table, base_path, partition_cols=["bar", "foo"])
> # works
> ds.dataset(base_path, partitioning="hive")
> # fails
> part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
> ds.dataset(base_path, partitioning=part)
> {code}
> cc [~bkietz]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)