[ https://issues.apache.org/jira/browse/ARROW-8087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-8087:
-----------------------------------------
    Fix Version/s: 0.17.0

> [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema
> ------------------------------------------------------------------------------
>
>                 Key: ARROW-8087
>                 URL: https://issues.apache.org/jira/browse/ARROW-8087
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset
>            Reporter: Joris Van den Bossche
>            Priority: Major
>             Fix For: 0.17.0
>
>
> Currently, when reading a partitioned dataset with hive partitioning, it
> seems that the partition columns get sorted alphabetically when they are
> appended to the schema (while the old ParquetDataset implementation keeps
> the order in which they appear in the paths).
> For a regular partitioning this order is consistent for all fragments.
> So for example, for the typical NYC Taxi data example, with datasets the
> schema ends with columns "month, year", while the ParquetDataset appends them
> as "year, month".
> Python example:
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
>
> foo_keys = [0, 1]
> bar_keys = ['a', 'b', 'c']
> N = 30
> df = pd.DataFrame({
>     'foo': np.array(foo_keys, dtype='i4').repeat(15),
>     'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
>     'values': np.random.randn(N)
> })
> pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
> {code}
> {code}
> >>> pq.read_table("test_order").schema
> values: double
> foo: dictionary<values=int64, indices=int32, ordered=0>
> bar: dictionary<values=string, indices=int32, ordered=0>
> >>> ds.dataset("test_order", format="parquet", partitioning="hive").schema
> values: double
> bar: string
> foo: int32
> {code}
> so "foo, bar" vs "bar, foo" (the fact that the columns are dictionary-typed
> in the first case is a separate matter)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
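
The difference between the two orderings in the report can be illustrated without Arrow itself. The following is only a sketch in plain Python, not the Arrow implementation: the helper name hive_keys_in_path_order and the file name part-0.parquet are hypothetical, and it assumes every directory segment of the form key=value is a partition field.

```python
from pathlib import PurePosixPath


def hive_keys_in_path_order(path):
    """Return hive partition keys in the order they appear in the path.

    Hypothetical helper for illustration only; it treats every path
    segment containing '=' as a key=value partition directory.
    """
    keys = []
    for segment in PurePosixPath(path).parts:
        key, sep, _value = segment.partition("=")
        if sep and key not in keys:
            keys.append(key)
    return keys


# Assumed layout produced by partition_cols=['foo', 'bar'];
# the file name is made up.
path = "test_order/foo=0/bar=a/part-0.parquet"

path_order = hive_keys_in_path_order(path)
print(path_order)          # keys in the order present in the path
print(sorted(path_order))  # alphabetical order, as seen in the datasets schema
```

Keeping the keys in a list rather than a sorted container is what preserves the "foo, bar" order from the directory layout; sorting them gives the "bar, foo" order described in the issue.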