[jira] [Commented] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters

Weston Pace (Jira) Tue, 08 Nov 2022 22:30:29 -0800


    [ https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630776#comment-17630776 ]


Weston Pace commented on ARROW-15716:
-------------------------------------

[~vibhatha] I'm not sure that
{{new_table = dataset.to_table(filter=filter_expressions[0])}}
will work.  Won't that create a table from just the first partition?  I think
[~ldacey] was asking for something like
{{filter=filter_expressions[0] | filter_expressions[1] | ... | filter_expressions[N]}}.
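
A minimal sketch of that OR-combination, in case it helps. It relies only on
the fact that pyarrow.compute.Expression overloads the | operator; the helper
name any_of is hypothetical and not part of pyarrow, and filter_expressions
and dataset are the names used earlier in this thread.

```python
from functools import reduce
import operator


def any_of(filters):
    """OR together objects that support `|`, e.g. pyarrow.compute.Expression."""
    filters = list(filters)
    if not filters:
        # An empty OR has no obvious meaning as a filter, so refuse it.
        raise ValueError("need at least one filter expression")
    return reduce(operator.or_, filters)


# Hypothetical usage with the names from this thread (not executed here):
#   combined = any_of(filter_expressions)
#   new_table = dataset.to_table(filter=combined)
```

Duplicate expressions (several fragments in the same partition) are harmless
here, although de-duplicating them first would keep the combined filter short.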



> [Dataset][Python] Parse a list of fragment paths to gather filters
> ------------------------------------------------------------------
>
>                 Key: ARROW-15716
>                 URL: https://issues.apache.org/jira/browse/ARROW-15716
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 7.0.0
>            Reporter: Lance Dacey
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Minor
>
> Is it possible for partitioning.parse() to be updated to parse a list of 
> paths instead of just a single path? 
> I am passing the .paths from file_visitor to downstream tasks to process data 
> that was recently saved. This breaks if I later overwrite data with 
> delete_matching to consolidate small files, because the recorded paths no 
> longer exist. 
> Here is the output of my current approach, which uses filters instead of 
> reading the paths directly:
> {code:python}
> # Fragments saved during write_dataset 
> ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-0.parquet']
> # Run partitioning.parse() on each fragment 
> [<pyarrow.compute.Expression (date_id == 20210813)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>]
> # Format those expressions into a list of tuples
> [('date_id', 'in', [20210114, 20210813])]
> # Convert to an expression which is used as a filter in .to_table()
> is_in(date_id, {value_set=int64:[
>   20210114,
>   20210813
> ], skip_nulls=false})
> {code}
> My hope would be to do something like filt_exp = partitioning.parse(paths) 
> which would return a dataset expression.
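
For illustration, the path-to-tuple step in the description above could be
sketched roughly as follows. This is not partitioning.parse(); it is plain
string parsing that assumes hive-style directory names (field=value) and
integer partition values, as in the date_id example. The helper names
partition_values and paths_to_filter are hypothetical.

```python
from pathlib import PurePosixPath


def partition_values(paths, field):
    """Collect the distinct values of one hive-style `field=value` segment."""
    values = set()
    for path in paths:
        # Only directory components can carry partition keys, so skip the file name.
        for part in PurePosixPath(path).parent.parts:
            key, sep, value = part.partition("=")
            if sep and key == field:
                values.add(int(value))  # assumption: integer partition values
    return sorted(values)


def paths_to_filter(paths, field):
    """Build a list of (field, 'in', values) tuples like the one shown above."""
    return [(field, "in", partition_values(paths, field))]


paths = [
    "dev/dataset/fragments/date_id=20210813/data-0.parquet",
    "dev/dataset/fragments/date_id=20210114/data-2.parquet",
    "dev/dataset/fragments/date_id=20210114/data-1.parquet",
    "dev/dataset/fragments/date_id=20210114/data-0.parquet",
]
filters = paths_to_filter(paths, "date_id")
```

Because the values are collected into a set, several fragments from the same
partition contribute a single entry to the resulting list.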



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
