[ https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631792#comment-17631792 ]
Weston Pace commented on ARROW-15716:
-------------------------------------

I am pretty sure the operator is always OR, based on:
{quote}
ultimate goal is to create a single expression which would filter all unique partitions that had data written into them.
{quote}
An OR expression would give you all the partitions that had data written. In fact, partition expressions are always disjoint (e.g. x == 7 vs. x == 8), so ANDing any of the returned expressions will always give you an empty set.

> [Dataset][Python] Parse a list of fragment paths to gather filters
> ------------------------------------------------------------------
>
>                 Key: ARROW-15716
>                 URL: https://issues.apache.org/jira/browse/ARROW-15716
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Lance Dacey
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Minor
>
> Is it possible for partitioning.parse() to be updated to parse a list of paths instead of just a single path?
> I am passing the .paths from file_visitor to downstream tasks to process data which was recently saved, but I can run into problems with this if I overwrite data with delete_matching in order to consolidate small files, since the paths won't exist.
> Here is the output of my current approach to use filters instead of reading the paths directly:
> {code:python}
> # Fragments saved during write_dataset
> ['dev/dataset/fragments/date_id=20210813/data-0.parquet',
>  'dev/dataset/fragments/date_id=20210114/data-2.parquet',
>  'dev/dataset/fragments/date_id=20210114/data-1.parquet',
>  'dev/dataset/fragments/date_id=20210114/data-0.parquet']
>
> # Run partitioning.parse() on each fragment
> [<pyarrow.compute.Expression (date_id == 20210813)>,
>  <pyarrow.compute.Expression (date_id == 20210114)>,
>  <pyarrow.compute.Expression (date_id == 20210114)>,
>  <pyarrow.compute.Expression (date_id == 20210114)>]
>
> # Format those expressions into a list of tuples
> [('date_id', 'in', [20210114, 20210813])]
>
> # Convert to an expression which is used as a filter in .to_table()
> is_in(date_id, {value_set=int64:[
>   20210114,
>   20210813
> ], skip_nulls=false})
> {code}
> My hope would be to do something like filt_exp = partitioning.parse(paths), which would return a dataset expression.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
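As a rough sketch of the combination logic being discussed: the helper below (`paths_to_filter` is a hypothetical name, not a pyarrow API) gathers hive-style key=value segments from a list of fragment paths with only the standard library, standing in for calling partitioning.parse() on each path. Collecting the values into one "in" tuple is equivalent to OR-combining the per-path expressions; AND-combining them would always be empty, since the partition expressions are disjoint.

```python
from collections import defaultdict

def paths_to_filter(paths):
    """Collect hive-style key=value partition segments from fragment paths
    into a single DNF-style filter, e.g. [('date_id', 'in', [...])].

    A stdlib stand-in for partitioning.parse() per path; real code would
    OR the resulting pyarrow expressions instead. Values are cast to int
    because the example partition field is int64.
    """
    values = defaultdict(set)
    for path in paths:
        for segment in path.split("/"):
            if "=" in segment:  # only key=value segments are partition keys
                key, _, value = segment.partition("=")
                values[key].add(int(value))
    return [(key, "in", sorted(vals)) for key, vals in values.items()]

paths = [
    "dev/dataset/fragments/date_id=20210813/data-0.parquet",
    "dev/dataset/fragments/date_id=20210114/data-2.parquet",
    "dev/dataset/fragments/date_id=20210114/data-1.parquet",
    "dev/dataset/fragments/date_id=20210114/data-0.parquet",
]
print(paths_to_filter(paths))  # [('date_id', 'in', [20210114, 20210813])]
```

The returned tuple list matches the intermediate form shown in the issue and can be handed to the dataset filter machinery the reporter already uses.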