davlee1972 commented on issue #38485:
URL: https://github.com/apache/arrow/issues/38485#issuecomment-2715377508

   OK, I found a workaround, but it would be better if this were handled automatically by pyarrow.dataset.dataset().
   
   
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset
   dataset() now has a partition_base_dir parameter:
   ```
   partition_base_dir: str
   For the purposes of applying the partitioning, paths will be stripped of the 
partition_base_dir. Files not matching the partition_base_dir prefix will be 
skipped for partitioning discovery. The ignored files will still be part of the 
Dataset, but will not have partition information.
   ```
   Can dataset() determine partition_base_dir automatically when it is called with a list of files and a partitioning?
   
   ```
   import pyarrow.dataset as ds
   import pyarrow as pa

   my_dataset = ds.dataset(
       [
           "c:/temp/abc/xyz/usa/20230926/IndexConstituents20230926.csv",
           "c:/temp/abc/xyz/france/20230927/IndexConstituents20230927.csv",
           "c:/temp/abc/xyz/germany/20230928/IndexConstituents20230928.csv",
       ],
       partitioning=ds.partitioning(
           pa.schema([("country", pa.string()), ("date_as_int", pa.int32())])
       ),
       partition_base_dir="c:/temp/abc/xyz",
   )
   ```
   A dataset() call like the one above should figure out partition_base_dir on its own and pass it along.
   
   You can count the partitioning columns and walk back that many directory levels from each file's parent directory: c:/temp/abc/xyz/usa/20230926 with 2 partitioning columns becomes c:/temp/abc/xyz.
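   Here is a minimal sketch of that inference, assuming a hypothetical helper named infer_partition_base_dir (not part of the pyarrow API):
   ```
   import posixpath

   def infer_partition_base_dir(paths, num_partition_fields):
       # Hypothetical helper: for each file, drop the file name plus one
       # directory level per partitioning column, then check that every
       # file agrees on the same base directory.
       bases = set()
       for path in paths:
           base = posixpath.dirname(path)  # strip the file name
           for _ in range(num_partition_fields):
               base = posixpath.dirname(base)  # strip one partition level
           bases.add(base)
       if len(bases) != 1:
           raise ValueError(f"No common partition base dir: {sorted(bases)}")
       return bases.pop()

   paths = [
       "c:/temp/abc/xyz/usa/20230926/IndexConstituents20230926.csv",
       "c:/temp/abc/xyz/france/20230927/IndexConstituents20230927.csv",
       "c:/temp/abc/xyz/germany/20230928/IndexConstituents20230928.csv",
   ]
   print(infer_partition_base_dir(paths, 2))  # prints c:/temp/abc/xyz
   ```
   dataset() already knows the number of partition fields from the partitioning schema, so it could run the same walk-back itself instead of requiring an explicit partition_base_dir.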

