[jira] [Commented] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters

Vibhatha Lakmal Abeykoon (Jira) Thu, 10 Nov 2022 08:51:07 -0800


    [ https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631783#comment-17631783 ]


Vibhatha Lakmal Abeykoon commented on ARROW-15716:
--------------------------------------------------

[~ldacey] Can we guarantee that the operator will always be a single `or` or 
a single `and`? Or could it be a mix of the two, for example when you want 
something like a band-pass filter?

I could be misunderstanding the objective here, so I'm just curious. 
Alternatively, should we expose a UDF and let the user decide how the 
expressions are combined?
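
To make the "mix of operators" case concrete, here is a minimal sketch of a 
band-pass style filter written as an OR of AND-groups (disjunctive normal 
form), the same list-of-lists-of-tuples shape that the `filters` argument of 
pyarrow.parquet accepts. The `OPS` table and `matches` helper are hypothetical 
names used only for illustration; they are a pure-Python stand-in, not how 
Arrow evaluates expressions.

```python
import operator

# Comparison operators supported by this illustrative evaluator.
OPS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    "in": lambda value, allowed: value in allowed,
}


def matches(row, dnf):
    """Return True if `row` satisfies any AND-group in `dnf` (an OR of ANDs)."""
    return any(
        all(OPS[op](row[column], operand) for column, op, operand in group)
        for group in dnf
    )


# Two "bands" of date_id values: each inner list is ANDed, the outer list is ORed.
band_pass = [
    [("date_id", ">=", 20210101), ("date_id", "<=", 20210131)],
    [("date_id", ">=", 20210801), ("date_id", "<=", 20210831)],
]

rows = [{"date_id": 20210114}, {"date_id": 20210420}, {"date_id": 20210813}]
kept = [row["date_id"] for row in rows if matches(row, band_pass)]
print(kept)
```

A single `or` or a single `and` cannot express this filter, because each band 
needs an `and` and the bands are joined with `or`.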

cc [~westonpace]

> [Dataset][Python] Parse a list of fragment paths to gather filters
> ------------------------------------------------------------------
>
>                 Key: ARROW-15716
>                 URL: https://issues.apache.org/jira/browse/ARROW-15716
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Lance Dacey
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Minor
>
> Is it possible for partitioning.parse() to be updated to parse a list of 
> paths instead of just a single path? 
> I am passing the .paths from file_visitor to downstream tasks to process data 
> which was recently saved, but I can run into problems with this if I 
> overwrite data with delete_matching in order to consolidate small files since 
> the paths won't exist. 
> Here is the output of my current approach to use filters instead of reading 
> the paths directly:
> {code:python}
> # Fragments saved during write_dataset 
> ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-0.parquet']
> # Run partitioning.parse() on each fragment 
> [<pyarrow.compute.Expression (date_id == 20210813)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>]
> # Format those expressions into a list of tuples
> [('date_id', 'in', [20210114, 20210813])]
> # Convert to an expression which is used as a filter in .to_table()
> is_in(date_id, {value_set=int64:[
>   20210114,
>   20210813
> ], skip_nulls=false})
> {code}
> My hope would be to do something like filt_exp = partitioning.parse(paths) 
> which would return a dataset expression.
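
For illustration, the "list of tuples" step in the description can be sketched 
without Arrow at all. The helper name `paths_to_filters` is hypothetical, and 
the sketch assumes hive-style `key=value` directory segments with integer 
values, as in the example paths; it is not the behaviour of 
partitioning.parse().

```python
from collections import defaultdict
from pathlib import PurePosixPath


def paths_to_filters(paths):
    """Collect hive-style key=value path segments into 'in' filter tuples."""
    values = defaultdict(set)
    for path in paths:
        # Skip the file name; only directory segments carry partition values.
        for segment in PurePosixPath(path).parts[:-1]:
            key, sep, value = segment.partition("=")
            if sep:
                # Assumption: partition values are integers, as in the example.
                values[key].add(int(value))
    return [(key, "in", sorted(vals)) for key, vals in values.items()]


paths = [
    "dev/dataset/fragments/date_id=20210813/data-0.parquet",
    "dev/dataset/fragments/date_id=20210114/data-2.parquet",
    "dev/dataset/fragments/date_id=20210114/data-1.parquet",
    "dev/dataset/fragments/date_id=20210114/data-0.parquet",
]
filters = paths_to_filters(paths)
print(filters)
```

With a single partition key this reproduces the tuple list shown in the 
description. With several keys, one `in` tuple per key ANDed together can 
match combinations that were never written, so an exact filter would need an 
OR over per-path AND-groups instead.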



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
