[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney reassigned PARQUET-1166:
-------------------------------------

    Assignee: Xianjin YE

> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -----------------------------------------------------------------
>
>                 Key: PARQUET-1166
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1166
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Xianjin YE
>            Assignee: Xianjin YE
>            Priority: Major
>             Fix For: cpp-1.5.0
>
>
> Hi, I'd like to propose a new API to better support splittable reading of
> Parquet files.
> The intent of this API is to allow selectively reading RowGroups (normally
> contiguous, but they can be arbitrary as long as the row_group_indices are
> sorted and unique, [1, 3, 5] for example).
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
>
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, a Parquet file can be split by RowGroup, and the splits can
> be processed by multiple tasks (possibly on different hosts, like the Map
> tasks in MapReduce).
> [~wesmckinn][~xhochy] What do you think?
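
The splitting idea in the proposal could be sketched roughly as follows. This is only an illustration of the caller side, not part of parquet-cpp: `IsSortedAndUnique` and `SplitRowGroups` are hypothetical helper names, and the sketch assumes the caller already knows the number of row groups in the file. Each resulting index list is the kind of argument a task might pass as `row_group_indices` to the proposed `GetRecordBatchReader`.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper (not a parquet-cpp API): returns true when the indices
// are strictly increasing, which is the same as "sorted and unique".
bool IsSortedAndUnique(const std::vector<int>& row_group_indices) {
  return std::adjacent_find(row_group_indices.begin(), row_group_indices.end(),
                            [](int a, int b) { return a >= b; }) ==
         row_group_indices.end();
}

// Hypothetical helper (not a parquet-cpp API): splits row groups
// [0, num_row_groups) into at most num_tasks contiguous, near-equal ranges.
// Each inner vector is sorted and unique by construction.
std::vector<std::vector<int>> SplitRowGroups(int num_row_groups, int num_tasks) {
  std::vector<std::vector<int>> splits;
  if (num_row_groups <= 0 || num_tasks <= 0) return splits;

  // Never create more splits than there are row groups.
  const int tasks = std::min(num_row_groups, num_tasks);
  const int base = num_row_groups / tasks;
  const int extra = num_row_groups % tasks;  // first `extra` splits get one more

  int next = 0;
  for (int t = 0; t < tasks; ++t) {
    const int count = base + (t < extra ? 1 : 0);
    std::vector<int> indices;
    indices.reserve(count);
    for (int i = 0; i < count; ++i) indices.push_back(next++);
    splits.push_back(std::move(indices));
  }
  return splits;
}
```

With 7 row groups and 3 tasks, for example, this yields the splits [0, 1, 2], [3, 4] and [5, 6]. A non-contiguous selection such as [1, 3, 5] would also satisfy `IsSortedAndUnique` and so would meet the constraint stated in the proposal.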