Joris Van den Bossche created ARROW-8088:
--------------------------------------------

             Summary: [C++][Dataset] Partition columns with specified dictionary type result in all nulls
                 Key: ARROW-8088
                 URL: https://issues.apache.org/jira/browse/ARROW-8088
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++ - Dataset
            Reporter: Joris Van den Bossche


When an explicit schema is specified for the Partitioning and a partition field uses a dictionary type, the partition keys are materialized incorrectly: no error is raised, but the resulting columns contain only nulls.

Python example:

{code}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
N = 30

# 30 rows: 'foo' takes 2 distinct values, 'bar' takes 3
df = pd.DataFrame({
    'foo': np.array(foo_keys, dtype='i4').repeat(15),
    'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
    'values': np.random.randn(N)
})

# Writes a Hive-style directory layout: test_order/foo=<value>/bar=<value>/
pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
{code}

When reading with discovery, all is fine:

{code}
>>> ds.dataset("test_order", format="parquet",
...            partitioning="hive").to_table().schema
values: double
bar: string
foo: int32

>>> ds.dataset("test_order", format="parquet",
...            partitioning="hive").to_table().to_pandas().head(2)
     values bar  foo
0  2.505903   a    0
1 -1.760135   a    0
{code}

But when the partition columns are declared as dictionary type through an explicit {{HivePartitioning}}, there is no error, yet all values are null:

{code}
>>> partitioning = ds.HivePartitioning(pa.schema([
...     ("foo", pa.dictionary(pa.int32(), pa.int64())),
...     ("bar", pa.dictionary(pa.int32(), pa.string()))
... ]))

>>> ds.dataset("test_order", format="parquet",
...            partitioning=partitioning).to_table().schema
values: double
foo: dictionary<values=int64, indices=int32, ordered=0>
bar: dictionary<values=string, indices=int32, ordered=0>

>>> ds.dataset("test_order", format="parquet",
...            partitioning=partitioning).to_table().to_pandas().head(2)
     values  foo  bar
0  2.505903  NaN  NaN
1 -1.760135  NaN  NaN
{code}
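
For reference, the expected result can be sketched in plain Python without Arrow. This is only an illustration of the intended semantics, not Arrow's implementation: the helper names ({{parse_hive_path}}, {{dictionary_encode}}) and the file names ({{part-0.parquet}}) are hypothetical, and the casts stand in for the dictionary value types declared in the schema. Each {{key=value}} path segment should end up as a non-null entry in a dictionary-encoded column.

```python
from typing import Any, Callable, Dict, List, Tuple


def parse_hive_path(path: str) -> Dict[str, str]:
    """Collect the key=value segments of a Hive-style path as raw strings."""
    keys: Dict[str, str] = {}
    for segment in path.split("/"):
        if "=" in segment:
            name, _, raw = segment.partition("=")
            keys[name] = raw
    return keys


def dictionary_encode(values: List[Any]) -> Tuple[List[Any], List[int]]:
    """Return (dictionary, indices), with dictionary entries in first-seen order."""
    dictionary: List[Any] = []
    positions: Dict[Any, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# Hypothetical file paths following the layout of the "test_order" dataset.
paths = [
    "test_order/foo=0/bar=a/part-0.parquet",
    "test_order/foo=0/bar=b/part-0.parquet",
    "test_order/foo=1/bar=a/part-0.parquet",
    "test_order/foo=1/bar=c/part-0.parquet",
]

# Assumed stand-ins for the declared dictionary value types (int64, string).
casts: Dict[str, Callable[[str], Any]] = {"foo": int, "bar": str}

# One partition value per file, cast to the declared value type.
columns: Dict[str, List[Any]] = {name: [] for name in casts}
for path in paths:
    keys = parse_hive_path(path)
    for name, cast in casts.items():
        columns[name].append(cast(keys[name]))

# Dictionary-encode each partition column.
encoded = {name: dictionary_encode(values) for name, values in columns.items()}

for name, (dictionary, indices) in encoded.items():
    print(name, dictionary, indices)
```

In this sketch every index points at a real dictionary entry, which is what the explicit {{HivePartitioning}} case would be expected to produce instead of nulls.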