[ https://issues.apache.org/jira/browse/ARROW-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202405#comment-17202405 ]
Joris Van den Bossche commented on ARROW-10100:
-----------------------------------------------

From discussion at https://github.com/dask/dask/pull/6534#issuecomment-698723009 (dask using the dataset API in their parquet reader), it might be useful to be able to "subset", or read a subset of, a ParquetFileFragment for a specific set of row group ids.

Use cases:

* Read only a set of row group ids (this is similar to {{ParquetFile.read_row_groups}}), e.g. because you want to control the size of the resulting table by reading subsets of row groups
* Get a ParquetFileFragment with a subset of row groups (e.g. based on a filter), to then e.g. get the statistics of only those row groups

The first case could for example be solved by adding a {{row_groups}} keyword to {{ParquetFileFragment.to_table}} (but this is then a keyword specific to the parquet format, and we should then probably also add it to {{scan}} et al).

The second case is something you can in principle do yourself manually by recreating a fragment with {{fragment.format.make_fragment(fragment.path, ..., row_groups=[...])}}. However, this is a) a bit cumbersome and b) statistics might need to be parsed again?

The statistics of a set of filtered row groups could also be obtained by using {{split_by_row_group(filter)}} (and then getting the statistics of each of the fragments), but if you then want a single fragment, you need to recreate a fragment with the obtained row group ids.

So one idea I have now (but mostly brainstorming here): would it be useful to have a method to create a "subsetted" ParquetFileFragment, based either on a list of row group ids ({{fragment.subset(row_groups=[...])}}) or on a filter ({{fragment.subset(filter=...)}}, which would be equivalent to split_by_row_group + recombining into a single fragment)?

cc [~bkietz] [~rjzamora]

> [C++][Dataset] Ability to read/subset a ParquetFileFragment with given set of row group ids
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10100
>                 URL: https://issues.apache.org/jira/browse/ARROW-10100
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Joris Van den Bossche
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)