[ 
https://issues.apache.org/jira/browse/ARROW-8655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375905#comment-17375905
 ] 

Joris Van den Bossche commented on ARROW-8655:
----------------------------------------------

I should maybe have been more explicit, but I think it is fine if the above 
(e.g. the {{FileSystemDataset.partitioning}} attribute I proposed in the PR) 
only works for datasets that were created through the factory function, since 
that covers many typical use cases (specifically the dask use case) and is 
indeed also the only case where this information can be reliably known. I 
think it is fine that this will not work (i.e. return None) for e.g. union 
datasets. 

Similarly for the "partition_expression": if the dataset is created through 
discovery with a Directory/HivePartitioning, we know that the partition 
expression will only ever include equalities. 
Indeed, in general this will not be true, but again I think that is fine 
(although that's maybe a reason not to make this an attribute on the 
FileFragment, but to keep it as a function that extracts the information from 
the partition expression).
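
To make the equality point concrete, here is a minimal sketch in plain Python. It does not use Arrow's expression classes: {{Eq}}, {{Gt}}, {{And}} and {{extract_partition_keys}} are hypothetical names invented for this illustration. It shows why a standalone extraction function copes with the general case, where an attribute promising "the partition keys" could not.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union


# Toy expression nodes (assumptions for illustration, not Arrow's classes).
@dataclass(frozen=True)
class Eq:
    field: str
    value: Any


@dataclass(frozen=True)
class Gt:
    field: str
    value: Any


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


Expr = Union[Eq, Gt, And]


def extract_partition_keys(expr: Expr) -> Dict[str, Any]:
    """Collect `field == value` pairs from a conjunction; skip anything else."""
    if isinstance(expr, Eq):
        return {expr.field: expr.value}
    if isinstance(expr, And):
        keys = extract_partition_keys(expr.left)
        keys.update(extract_partition_keys(expr.right))
        return keys
    # Non-equality nodes carry no single key value, so they contribute nothing.
    return {}


# Shape produced by discovery with a Directory/Hive-style partitioning.
hive_expr = And(Eq("year", 2021), Eq("month", 7))
# A general expression that is not purely equalities.
general_expr = And(Eq("year", 2021), Gt("month", 6))

hive_keys = extract_partition_keys(hive_expr)
general_keys = extract_partition_keys(general_expr)
```

For the discovered case every field comes back; for the general case only the equality part is recovered, which is the reason a function seems a safer fit than an attribute on the fragment.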

> Maybe there is something we can add to the dataset factory so that calling 
> Finish (or perhaps adding a new property that can be accessed after calling 
> Finish) could return dictionaries of everything it discovered.

Currently those dictionaries are accessible from the {{Partitioning}} object 
inside the {{Finish()}} call, but indeed after calling {{Finish()}} you can't 
access them because the {{Partitioning}} object is not stored on either the 
returned dataset or the original factory object. 
Making it available on the FileSystemDatasetFactory instead of attaching it to 
the returned FileSystemDataset (as I am doing in the PR -> 
https://github.com/apache/arrow/pull/10661) is an option as well, and then the 
Python layer could handle it (and attach it to the Cython Dataset class). 
[~westonpace] that's maybe something to comment on in the PR if you prefer 
that way.
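
The two options can be sketched side by side with a toy model in plain Python. None of this is Arrow's API: {{ToyPartitioning}}, {{ToyDataset}} and {{ToyDatasetFactory}} are hypothetical stand-ins, and the directory layout is an assumed example. The sketch only illustrates where the partitioning object (with the dictionaries it discovered) could be kept once finish has run.

```python
from typing import Dict, List, Optional


class ToyPartitioning:
    """Hypothetical stand-in for a directory-style partitioning."""

    def __init__(self, field_names: List[str]) -> None:
        self.field_names = list(field_names)
        # Unique values seen per partition field, filled in during discovery.
        self.dictionaries: Dict[str, List[str]] = {}

    def observe(self, path: str) -> None:
        # Directory segments map positionally onto the field names.
        segments = path.split("/")[:-1]
        for name, value in zip(self.field_names, segments):
            seen = self.dictionaries.setdefault(name, [])
            if value not in seen:
                seen.append(value)


class ToyDataset:
    def __init__(
        self, paths: List[str], partitioning: Optional[ToyPartitioning] = None
    ) -> None:
        self.paths = list(paths)
        # None when the dataset was not built by the factory (e.g. a union).
        self.partitioning = partitioning


class ToyDatasetFactory:
    def __init__(self, paths: List[str], partitioning: ToyPartitioning) -> None:
        self.paths = list(paths)
        self._partitioning = partitioning
        # Only populated once finish() has run.
        self.partitioning: Optional[ToyPartitioning] = None

    def finish(self) -> ToyDataset:
        for path in self.paths:
            self._partitioning.observe(path)
        # Option 1: keep the partitioning on the factory.
        self.partitioning = self._partitioning
        # Option 2: attach it to the returned dataset.
        return ToyDataset(self.paths, self._partitioning)


factory = ToyDatasetFactory(
    ["2021/07/part-0.parquet", "2021/08/part-0.parquet"],
    ToyPartitioning(["year", "month"]),
)
dataset = factory.finish()
```

With either option the discovered dictionaries stay reachable after finish, while a dataset constructed without the factory simply reports None.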

> [C++][Dataset][Python][R] Preserve partitioning information for a discovered 
> Dataset
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-8655
>                 URL: https://issues.apache.org/jira/browse/ARROW-8655
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-dask-integration, pull-request-available
>             Fix For: 5.0.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}} 
> classes that describe a partitioning used in the discovery phase. But once a 
> dataset object is created, it no longer knows about this; it just has 
> partition expressions for the fragments. The partition keys are added to 
> the schema, but you can't directly know which columns of the schema 
> originated from the partitions.
> However, there are use cases where it would be useful for a dataset to still 
> "know" from what kind of partitioning it was created:
> - The "read CSV, write back Parquet" use case, where the CSV was already 
> partitioned and you want to automatically preserve that partitioning for 
> Parquet (kind of roundtripping the partitioning on read/write)
> - To convert the dataset to another representation, e.g. conversion to 
> pandas, it can be useful to know which columns were partition columns (e.g. 
> for pandas, those columns might be good candidates to be set as the index of 
> the pandas/dask DataFrame). I can imagine conversions to other systems can 
> use similar information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
