[ 
https://issues.apache.org/jira/browse/ARROW-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202405#comment-17202405
 ] 

Joris Van den Bossche commented on ARROW-10100:
-----------------------------------------------

>From discussion at 
>https://github.com/dask/dask/pull/6534#issuecomment-698723009 (dask using the 
>dataset API in their parquet reader), it might be useful to somehow "subset" 
>or read a subset of a ParquetFileFragment for a specific set of row group ids.

Use cases:

* Read only a set of row groups ids (this is similar as 
{{ParquetFile.read_row_groups}}), eg because you want to control the size of 
the resulting table by reading subsets of row groups
* Get a ParquetFileFragment with a subset of row groups (eg based on a filter) 
to then eg get the statistics of only those row groups

The first case could for example be solved by adding a {{row_groups}} keyword 
to {{ParquetFileFragment.to_table}} (but, this is then a keyword specific to 
the parquet format, and we should then probably also add it to {{scan}} et al).

The second case is something you can in principle do yourself manually by 
recreating a fragment with {{fragment.format.make_fragment(fragment.path, ..., 
row_groups=[...])}}. However, this is a) a bit cumbersome and b) statistics 
might need to be parsed again?  
The statistics of a set of filtered row groups could also be obtained by using 
{{split_by_row_group(filter)}} (and then get the statistics of each of the 
fragments), but if you then want a single fragment, you need to recreate a 
fragment with the obtained row group ids.

So one idea I have now (but mostly brainstorming here). Would it be useful to 
have a method to create a "subsetted" ParquetFileFragment, either based on a 
list of row group ids ({{fragment.subset(row_groups=[...])}} or either based on 
a filter ({{fragment.subset(filter=...)}}, which would be equivalent as 
split_by_row_group+recombining into a single fragment) ?

cc [~bkietz] [~rjzamora]



> [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of 
> row group ids
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10100
>                 URL: https://issues.apache.org/jira/browse/ARROW-10100
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Joris Van den Bossche
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to