[ https://issues.apache.org/jira/browse/ARROW-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche reassigned ARROW-8088:
--------------------------------------------

    Assignee: Ben Kietzman  (was: Joris Van den Bossche)

> [C++][Dataset] Partition columns with specified dictionary type result in all nulls
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-8088
>                 URL: https://issues.apache.org/jira/browse/ARROW-8088
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++ - Dataset
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Kietzman
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.17.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When specifying an explicit schema for the Partitioning, and when using a
> dictionary type, the materialization of the partition keys goes wrong: you
> don't get an error, but you get columns with all nulls.
> Python example:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
>
> foo_keys = [0, 1]
> bar_keys = ['a', 'b', 'c']
> N = 30
> df = pd.DataFrame({
>     'foo': np.array(foo_keys, dtype='i4').repeat(15),
>     'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
>     'values': np.random.randn(N)
> })
> pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
> {code}
> When reading with discovery, all is fine:
> {code:python}
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().schema
> values: double
> bar: string
> foo: int32
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().to_pandas().head(2)
>      values bar  foo
> 0  2.505903   a    0
> 1 -1.760135   a    0
> {code}
> But when specifying the partition columns to be dictionary type with explicit
> {{HivePartitioning}}, you get no error but all null values:
> {code:python}
> >>> partitioning = ds.HivePartitioning(pa.schema([
> ...     ("foo", pa.dictionary(pa.int32(), pa.int64())),
> ...     ("bar", pa.dictionary(pa.int32(), pa.string()))
> ... ]))
> >>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().schema
> values: double
> foo: dictionary<values=int64, indices=int32, ordered=0>
> bar: dictionary<values=string, indices=int32, ordered=0>
> >>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().to_pandas().head(2)
>      values  foo  bar
> 0  2.505903  NaN  NaN
> 1 -1.760135  NaN  NaN
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)