[ https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631783#comment-17631783 ]
Vibhatha Lakmal Abeykoon commented on ARROW-15716:
--------------------------------------------------

[~ldacey] Can we guarantee that the operator will always be `or` or `and`? Could it be a mix of those operators, for example when you want something like a band-pass filter? I could be misunderstanding the objective here, but I'm curious. Or should we expose a UDF and let the user decide how it should be applied?

cc [~westonpace]

> [Dataset][Python] Parse a list of fragment paths to gather filters
> ------------------------------------------------------------------
>
>                 Key: ARROW-15716
>                 URL: https://issues.apache.org/jira/browse/ARROW-15716
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Lance Dacey
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Minor
>
> Is it possible for partitioning.parse() to be updated to parse a list of
> paths instead of just a single path?
> I am passing the .paths from file_visitor to downstream tasks to process data
> which was recently saved, but I can run into problems with this if I
> overwrite data with delete_matching in order to consolidate small files since
> the paths won't exist.
> Here is the output of my current approach to use filters instead of reading
> the paths directly:
> {code:python}
> # Fragments saved during write_dataset
> ['dev/dataset/fragments/date_id=20210813/data-0.parquet',
>  'dev/dataset/fragments/date_id=20210114/data-2.parquet',
>  'dev/dataset/fragments/date_id=20210114/data-1.parquet',
>  'dev/dataset/fragments/date_id=20210114/data-0.parquet']
> # Run partitioning.parse() on each fragment
> [<pyarrow.compute.Expression (date_id == 20210813)>,
>  <pyarrow.compute.Expression (date_id == 20210114)>,
>  <pyarrow.compute.Expression (date_id == 20210114)>,
>  <pyarrow.compute.Expression (date_id == 20210114)>]
> # Format those expressions into a list of tuples
> [('date_id', 'in', [20210114, 20210813])]
> # Convert to an expression which is used as a filter in .to_table()
> is_in(date_id, {value_set=int64:[
>   20210114,
>   20210813
> ], skip_nulls=false})
> {code}
> My hope would be to do something like filt_exp = partitioning.parse(paths)
> which would return a dataset expression.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
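
For illustration only, the path-to-filter steps quoted in the issue can be sketched in plain Python without calling pyarrow. This is a minimal sketch, not the reporter's code and not a pyarrow API: the helper names `parse_hive_path` and `paths_to_filters` are hypothetical, and it assumes hive-style `key=value` directory segments whose values are all integers or all strings. It answers the `or`/`and` question in one possible way: keys within a single path are combined with `and`, distinct paths are combined with `or`, which matches the DNF convention documented for the `filters` argument of `pyarrow.parquet.read_table`.

```python
def parse_hive_path(path):
    """Return {key: value} for every 'key=value' directory segment in path."""
    parts = {}
    for segment in path.split("/")[:-1]:  # the last element is the file name
        if "=" in segment:
            key, _, value = segment.partition("=")
            # Assumption: all-digit values are integer partition keys.
            parts[key] = int(value) if value.isdigit() else value
    return parts


def paths_to_filters(paths):
    """Build DNF-style filters: OR across distinct partitions, AND within one."""
    seen = []
    for path in paths:
        parts = parse_hive_path(path)
        if parts and parts not in seen:  # drop duplicates, keep first-seen order
            seen.append(parts)
    keys = {key for parts in seen for key in parts}
    if len(keys) == 1:
        # A single partition key collapses to one 'in' filter.
        key = next(iter(keys))
        return [(key, "in", sorted(parts[key] for parts in seen))]
    # Several keys: one AND-group per distinct partition, OR-ed together.
    return [[(k, "==", v) for k, v in parts.items()] for parts in seen]


paths = [
    "dev/dataset/fragments/date_id=20210813/data-0.parquet",
    "dev/dataset/fragments/date_id=20210114/data-2.parquet",
    "dev/dataset/fragments/date_id=20210114/data-1.parquet",
    "dev/dataset/fragments/date_id=20210114/data-0.parquet",
]
print(paths_to_filters(paths))
# [('date_id', 'in', [20210114, 20210813])]
```

A mixed filter such as the band-pass case mentioned in the comment cannot be derived from paths alone, since each path only contributes equality conditions; that part would still have to come from the user.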