[ https://issues.apache.org/jira/browse/ARROW-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375905#comment-17375905 ]
Joris Van den Bossche commented on ARROW-8655:
----------------------------------------------

I should maybe have been more explicit, but I think it is fine if the above (e.g. the {{FileSystemDataset.partitioning}} attribute I proposed in the PR) only works for datasets that were created through the factory function. That covers many typical use cases (specifically the dask use case), and it is also the only case where this information can reliably be known. I think it is fine that this will not work (i.e. returns None) for e.g. union datasets.

The same holds for the "partition_expression": if the dataset is created through discovery with a Directory/HivePartitioning, we know that the partition expression will only ever include equalities. In general this is not true, but again I think that is fine (although it is maybe a reason not to make this an attribute on the FileFragment, and to keep it as a function that extracts the information from the partition expression).

> Maybe there is something we can add to the dataset factory so that calling
> Finish (or perhaps adding a new property that can be accessed after calling
> Finish) could return dictionaries of everything it discovered.

Currently those dictionaries are accessible from the {{Partitioning}} object inside the {{Finish()}} call, but after calling {{Finish()}} you indeed can't access them, because the {{Partitioning}} object is stored neither on the returned dataset nor on the original factory object. Making it available on the FileSystemDatasetFactory instead of attaching it to the returned FileSystemDataset (as I am doing in the PR -> https://github.com/apache/arrow/pull/10661) is an option as well; the Python layer could then handle it (and attach it to the cython Dataset class).

[~westonpace] that's maybe something to comment on the PR if you prefer that way.

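To make the "only equalities" point concrete, here is a minimal plain-Python sketch of what Hive-style discovery derives from file paths: one `field == value` pair per `key=value` directory segment, plus the per-field "dictionaries" of unique values. This is an illustration only and does not use Arrow's API; the helper names `hive_partition_keys` and `discovered_dictionaries` and the example paths are hypothetical, and values are kept as strings here rather than having their types inferred.

```python
from collections import defaultdict


def hive_partition_keys(path, base_dir=""):
    """Return {field: value} parsed from the `key=value` directory segments of `path`.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    relative = path[len(base_dir):] if base_dir and path.startswith(base_dir) else path
    keys = {}
    # Only directory segments are considered; the last segment is the file name.
    for segment in relative.strip("/").split("/")[:-1]:
        field, sep, value = segment.partition("=")
        if sep and field:
            # Each segment contributes exactly one equality: field == value.
            keys[field] = value
    return keys


def discovered_dictionaries(paths, base_dir=""):
    """Collect the sorted unique values seen for each partition field."""
    values = defaultdict(set)
    for path in paths:
        for field, value in hive_partition_keys(path, base_dir).items():
            values[field].add(value)
    return {field: sorted(seen) for field, seen in values.items()}


# Example paths are made up for this sketch.
paths = [
    "data/year=2020/month=1/part-0.parquet",
    "data/year=2020/month=2/part-0.parquet",
    "data/year=2021/month=1/part-0.parquet",
]
per_file_keys = [hive_partition_keys(p, base_dir="data") for p in paths]
dictionaries = discovered_dictionaries(paths, base_dir="data")
```

In this sketch the per-file keys play the role of the information extracted from a fragment's partition expression, and `dictionaries` corresponds to what is only known at discovery time, which is why it would have to be kept on the dataset or on the factory to remain accessible afterwards.
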
> [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-8655
>                 URL: https://issues.apache.org/jira/browse/ARROW-8655
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-dask-integration, pull-request-available
>             Fix For: 5.0.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}}
> classes that describe a partitioning used in the discovery phase. But once a
> dataset object is created, it doesn't know any more about this, it just has
> partition expressions for the fragments. And the partition keys are added to
> the schema, but you can't directly know which columns of the schema
> originated from the partitions.
> However, there can be use cases where it would be useful that a dataset still
> "knows" from what kind of partitioning it was created:
> - The "read CSV write back Parquet" use case, where the CSV was already
> partitioned and you want to automatically preserve that partitioning for
> parquet (kind of roundtripping the partitioning on read/write)
> - To convert the dataset to other representation, eg conversion to pandas, it
> can be useful to know what columns were partition columns (eg for pandas,
> those columns might be good candidates to be set as the index of the
> pandas/dask DataFrame). I can imagine conversions to other systems can use
> similar information.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)