davlee1972 commented on issue #38485: URL: https://github.com/apache/arrow/issues/38485#issuecomment-2715377508
Ok, I found a workaround, but it would be better if this was handled automatically by `pyarrow.dataset.dataset()`: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset

`dataset()` has a `partition_base_dir` parameter now:

```
partition_base_dir: str
    For the purposes of applying the partitioning, paths will be stripped of the
    partition_base_dir. Files not matching the partition_base_dir prefix will be
    skipped for partitioning discovery. The ignored files will still be part of
    the Dataset, but will not have partition information.
```

Can we automatically determine `partition_base_dir` when `dataset()` is called with a list of files and a partitioning?

```
import pyarrow.dataset as ds
import pyarrow as pa

my_dataset = ds.dataset(
    [
        "c:/temp/abc/xyz/usa/20230926/IndexConstituents20230926.csv",
        "c:/temp/abc/xyz/france/20230927/IndexConstituents20230927.csv",
        "c:/temp/abc/xyz/germany/20230928/IndexConstituents20230928.csv",
    ],
    partitioning=ds.partitioning(
        pa.schema([("country", pa.string()), ("date_as_int", pa.int32())])
    ),
    partition_base_dir="c:/temp/abc/xyz",
)
```

A `dataset()` call like the one above should be able to figure out and pass `partition_base_dir` on its own: count the number of partitioning columns and walk that many directory levels back from each file path. `c:/temp/abc/xyz/usa/20230926` with 2 partitioning columns becomes `c:/temp/abc/xyz`.
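For illustration, here is a minimal sketch of that inference, assuming forward-slash paths like the ones above; `infer_partition_base_dir` is a hypothetical helper for this proposal, not an existing pyarrow API:

```
import os

def infer_partition_base_dir(paths, num_partition_fields):
    """Hypothetical helper: strip the file name plus num_partition_fields
    directory levels from each path, and require that all files agree on
    the remaining prefix."""
    bases = set()
    for path in paths:
        parent = os.path.dirname(path)  # drop the file name
        for _ in range(num_partition_fields):
            parent = os.path.dirname(parent)  # walk back one partition level
        bases.add(parent)
    if len(bases) != 1:
        raise ValueError(f"Files do not share a common partition base dir: {bases}")
    return bases.pop()

paths = [
    "c:/temp/abc/xyz/usa/20230926/IndexConstituents20230926.csv",
    "c:/temp/abc/xyz/france/20230927/IndexConstituents20230927.csv",
    "c:/temp/abc/xyz/germany/20230928/IndexConstituents20230928.csv",
]
print(infer_partition_base_dir(paths, 2))  # -> c:/temp/abc/xyz
```

If the files don't agree on a common prefix at that depth, the helper refuses to guess, which is probably the safe behavior for `dataset()` as well.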