Joris Van den Bossche created ARROW-8087:
--------------------------------------------

             Summary: [C++][Dataset] Order of keys with HivePartitioning is 
lost in resulting schema
                 Key: ARROW-8087
                 URL: https://issues.apache.org/jira/browse/ARROW-8087
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++ - Dataset
            Reporter: Joris Van den Bossche


Currently, when reading a dataset with hive partitioning, the partition columns 
appear to be sorted alphabetically when they are appended to the schema, 
whereas the old ParquetDataset implementation keeps the order in which the keys 
appear in the paths.  
For a regular partitioning, that path order is the same for all fragments.

For example, with the typical NYC Taxi data, the schema produced by the 
datasets API ends with the columns "month, year", while ParquetDataset appends 
them as "year, month".

Python example:

{code}
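import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
N = 30

# 30 rows: 'foo' takes 2 distinct values, 'bar' takes 3
df = pd.DataFrame({
    'foo': np.array(foo_keys, dtype='i4').repeat(15),
    'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
    'values': np.random.randn(N)
})

# Partition on 'foo' first, then 'bar', giving paths like foo=0/bar=a/...
pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])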
{code}

{code}
>>> pq.read_table("test_order").schema
values: double
foo: dictionary<values=int64, indices=int32, ordered=0>
bar: dictionary<values=string, indices=int32, ordered=0>

>>> ds.dataset("test_order", format="parquet", partitioning="hive").schema
values: double
bar: string
foo: int32
{code}

So the order is "foo, bar" vs "bar, foo" (the fact that the partition columns 
are dictionary-typed in the first case is a separate matter).
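
To make the two orderings concrete, here is a minimal sketch in plain Python. 
It is not Arrow's implementation; the helper name hive_keys_in_path_order and 
the file name in the sample path are hypothetical, chosen only to mirror the 
"test_order" example above.

```python
def hive_keys_in_path_order(path):
    """Return the hive partition keys in the order they appear in `path`."""
    keys = []
    for segment in path.split("/"):
        # A hive-style segment looks like "key=value"; other segments are skipped.
        key, sep, _value = segment.partition("=")
        if sep and key:
            keys.append(key)
    return keys


# Hypothetical fragment path, shaped like those written by the example above.
path = "test_order/foo=0/bar=a/part-0.parquet"

path_order = hive_keys_in_path_order(path)  # order as present in the path
sorted_order = sorted(path_order)           # alphabetical order

print(path_order)
print(sorted_order)
```

The first list corresponds to the order ParquetDataset keeps, the second to the 
alphabetical order that the datasets schema appears to end up with.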



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
