[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412375#comment-16412375
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on a change in pull request #445: PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r176897588
 
 

 ##########
 File path: src/parquet/arrow/reader.h
 ##########
 @@ -149,6 +154,21 @@ class PARQUET_EXPORT FileReader {
   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);
 
+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices, the
+  ///    ordering in row_group_indices matters.
+  /// \returns error Status if row_group_indices contains invalid index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices,
+  ///     whose columns are selected by column_indices. The ordering in row_group_indices
+  ///     and column_indices matter.
+  /// \returns error Status if either row_group_indices or column_indices contains invalid
+  ///    index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
 
 Review comment:
   >  My main critique of these APIs is that we will want to provide for setting the number of rows to be read for each call to `ReadNext`
   
   Ah, I did consider this while implementing it. However, the [`RecordBatch`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166) interface in Arrow doesn't expose that, and I'd like to keep the implementation details hidden in `parquet/arrow/reader`. To enable this, I'd like to propose a new method on `RecordBatch`.
   What do you think? @wesm 
   
   > Could you please open a JIRA about improving this code in this regard?
   
   Will do.
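
To make the batch-size point concrete, here is a minimal sketch of how a reader could hand out at most a fixed number of rows per call. It uses only the standard library: `BatchPlanner`, `RowRange`, and `Next` are hypothetical names chosen for illustration and are not part of Arrow or parquet-cpp. The sketch also assumes that a batch never spans two row groups, which is a simplification rather than something the PR specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical description of one batch: which row group it comes from,
// where it starts inside that row group, and how many rows it holds.
struct RowRange {
  int row_group;
  int64_t offset;
  int64_t length;
};

// Hypothetical planner (not an Arrow or parquet-cpp class). It walks the
// selected row groups in order and yields ranges of at most batch_size rows.
class BatchPlanner {
 public:
  BatchPlanner(std::vector<int64_t> rows_per_group, int64_t batch_size)
      : rows_per_group_(std::move(rows_per_group)), batch_size_(batch_size) {
    assert(batch_size_ > 0);  // a non-positive batch size would never advance
  }

  // Shaped like a ReadNext-style call: returns false once all rows are consumed.
  bool Next(RowRange* out) {
    // Skip row groups that are empty or already fully consumed.
    while (group_ < rows_per_group_.size() && offset_ >= rows_per_group_[group_]) {
      ++group_;
      offset_ = 0;
    }
    if (group_ >= rows_per_group_.size()) return false;

    const int64_t remaining = rows_per_group_[group_] - offset_;
    out->row_group = static_cast<int>(group_);
    out->offset = offset_;
    out->length = std::min(batch_size_, remaining);
    offset_ += out->length;
    return true;
  }

 private:
  std::vector<int64_t> rows_per_group_;
  int64_t batch_size_;
  std::size_t group_ = 0;
  int64_t offset_ = 0;
};
```

With row groups of 5 and 3 rows and a batch size of 2, repeated calls to `Next` yield lengths 2, 2, 1, 2, 1. The last batch of each row group is short because of the no-spanning assumption.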
   
   



> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -----------------------------------------------------------------
>
>                 Key: PARQUET-1166
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1166
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Xianjin YE
>            Assignee: Xianjin YE
>            Priority: Major
>             Fix For: cpp-1.5.0
>
>
> Hi, I'd like to propose a new API to better support splittable reading of 
> Parquet files.
> The intent of this API is to allow selectively reading RowGroups (normally 
> contiguous, but they can be arbitrary as long as the row_group_idxes are sorted 
> and unique, [1, 3, 5] for example). 
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
>
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, a Parquet file can be split by RowGroup and the pieces 
> processed by multiple tasks (possibly on different hosts, like Map tasks in MapReduce).
> [~wesmckinn] [~xhochy] What do you think?
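
As a rough illustration of the splitting idea in the issue description, the sketch below divides a file's row groups into contiguous chunks, one per task, and checks a selection against the "sorted and unique" constraint mentioned above. `IsValidSelection` and `SplitRowGroups` are hypothetical helpers written against the standard library only; they are not part of parquet-cpp, and the even-split policy is an assumption, not something the proposal prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical check: true when every index lies in [0, num_row_groups) and the
// list is strictly increasing, i.e. sorted and unique as the issue describes.
bool IsValidSelection(const std::vector<int>& indices, int num_row_groups) {
  for (std::size_t i = 0; i < indices.size(); ++i) {
    if (indices[i] < 0 || indices[i] >= num_row_groups) return false;
    if (i > 0 && indices[i] <= indices[i - 1]) return false;
  }
  return true;
}

// Hypothetical helper: splits row groups 0..num_row_groups-1 into num_tasks
// contiguous chunks whose sizes differ by at most one. Chunks may be empty
// when there are more tasks than row groups.
std::vector<std::vector<int>> SplitRowGroups(int num_row_groups, int num_tasks) {
  std::vector<std::vector<int>> splits;
  if (num_row_groups <= 0 || num_tasks <= 0) return splits;

  const int base = num_row_groups / num_tasks;
  const int extra = num_row_groups % num_tasks;  // first `extra` tasks get one more
  int next = 0;
  for (int t = 0; t < num_tasks; ++t) {
    const int count = base + (t < extra ? 1 : 0);
    std::vector<int> chunk;
    for (int k = 0; k < count; ++k) chunk.push_back(next++);
    splits.push_back(chunk);
  }
  assert(next == num_row_groups);  // every row group assigned exactly once
  return splits;
}
```

Each chunk could then be passed as `row_group_indices` to the proposed `GetRecordBatchReader` by a separate task. For 7 row groups and 3 tasks the chunks are [0, 1, 2], [3, 4], and [5, 6].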



