[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412375#comment-16412375 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on a change in pull request #445: PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r176897588

## File path: src/parquet/arrow/reader.h ##
@@ -149,6 +154,21 @@ class PARQUET_EXPORT FileReader {
   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);

+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices, the
+  ///    ordering in row_group_indices matters.
+  /// \returns error Status if row_group_indices contains invalid index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices,
+  ///     whose columns are selected by column_indices. The ordering in row_group_indices
+  ///     and column_indices matter.
+  /// \returns error Status if either row_group_indices or column_indices contains invalid
+  ///    index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);

Review comment:
> My main critique of these APIs is that we will want to provide for setting the number of rows to be read for each call to `ReadNext`

Ah, I did consider this while implementing it. However, the [`RecordBatch`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166) interface in arrow doesn't expose that, and I'd like to hide implementation details in `parquet/arrow/reader`. To enable this, I'd like to propose a new method on `RecordBatch`. What do you think? @wesm

> Could you please open a JIRA about improving this code in this regard?

Will do.
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -----------------------------------------------------------------
>
>                 Key: PARQUET-1166
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1166
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cpp
>            Reporter: Xianjin YE
>            Assignee: Xianjin YE
>            Priority: Major
>             Fix For: cpp-1.5.0
>
>
> Hi, I'd like to propose a new API to better support splittable reading of a
> Parquet file.
> The intent of this API is that we can selectively read RowGroups (normally
> contiguous, but they can be arbitrary as long as the row_group_idxes are sorted
> and unique, [1, 3, 5] for example).
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
>
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, a Parquet file can be split into RowGroups that are processed
> by multiple tasks (possibly on different hosts, like the Map tasks in MapReduce).
> [~wesmckinn][~xhochy] What do you think?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
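The splitting use case from the issue description can be sketched without any Parquet dependency. The helper below is hypothetical (`SplitRowGroups` is not part of parquet-cpp); it assumes each task receives a contiguous, sorted, unique list of row-group indices, which a task could then pass as `row_group_indices` to the proposed `GetRecordBatchReader`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not part of parquet-cpp: partition the row groups
// [0, num_row_groups) into contiguous, sorted, unique index lists, one per
// task. Earlier tasks receive one extra row group when the division is uneven.
std::vector<std::vector<int>> SplitRowGroups(int num_row_groups, int num_tasks) {
  assert(num_row_groups >= 0 && num_tasks > 0);
  std::vector<std::vector<int>> splits(num_tasks);
  const int base = num_row_groups / num_tasks;
  const int extra = num_row_groups % num_tasks;
  int next = 0;
  for (int task = 0; task < num_tasks; ++task) {
    const int count = base + (task < extra ? 1 : 0);
    for (int i = 0; i < count; ++i) {
      splits[task].push_back(next++);
    }
  }
  return splits;
}
```

With more tasks than row groups, the trailing tasks simply get an empty list; whether an empty `row_group_indices` is acceptable to the real API is not stated in this thread.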
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411969#comment-16411969 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

wesm commented on a change in pull request #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r176849315

## File path: src/parquet/arrow/reader.h ##
@@ -149,6 +154,21 @@ class PARQUET_EXPORT FileReader {
   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);

+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices, the
+  ///    ordering in row_group_indices matters.
+  /// \returns error Status if row_group_indices contains invalid index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices,
+  ///     whose columns are selected by column_indices. The ordering in row_group_indices
+  ///     and column_indices matter.
+  /// \returns error Status if either row_group_indices or column_indices contains invalid
+  ///    index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);

Review comment:
Could you please open a JIRA about improving this code in this regard?
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411968#comment-16411968 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

wesm commented on a change in pull request #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r176849252

## File path: src/parquet/arrow/reader.h ##
@@ -149,6 +154,21 @@ class PARQUET_EXPORT FileReader {
   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);

+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices, the
+  ///    ordering in row_group_indices matters.
+  /// \returns error Status if row_group_indices contains invalid index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices,
+  ///     whose columns are selected by column_indices. The ordering in row_group_indices
+  ///     and column_indices matter.
+  /// \returns error Status if either row_group_indices or column_indices contains invalid
+  ///    index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);

Review comment:
My main critique of these APIs is that we will want to provide for setting the number of rows to be read for each call to `ReadNext`, for example 1,000,000 rows at a time. Right now this is returning a whole row group at a time.
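To make the critique above concrete, here is a minimal sketch of how a per-call row limit could carve row groups into batches. Everything in it is hypothetical (`BatchSlice`, `PlanBatches`, and the rule that a batch never spans two row groups are assumptions of this sketch, not anything in parquet-cpp or Arrow); it only computes which rows each `ReadNext` call would cover.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one batch: which selected row group it comes
// from, the starting row inside that row group, and how many rows it holds.
struct BatchSlice {
  int row_group;
  int64_t offset;
  int64_t length;
};

// Plan the batches a reader would emit if each ReadNext call returned at most
// max_rows rows. Assumption of this sketch: a batch never spans two row groups.
std::vector<BatchSlice> PlanBatches(const std::vector<int64_t>& rows_per_group,
                                    int64_t max_rows) {
  assert(max_rows > 0);
  std::vector<BatchSlice> plan;
  for (int g = 0; g < static_cast<int>(rows_per_group.size()); ++g) {
    for (int64_t offset = 0; offset < rows_per_group[g]; offset += max_rows) {
      const int64_t length = std::min(max_rows, rows_per_group[g] - offset);
      plan.push_back({g, offset, length});
    }
  }
  return plan;
}
```

For example, row groups of 2,500,000 and 1,000,000 rows with a limit of 1,000,000 rows per call yield four batches: three from the first row group (the last holding 500,000 rows) and one from the second.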
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411391#comment-16411391 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

xhochy commented on issue #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-375667014

Looks good from my side, if @wesm does not object, I'll merge tomorrow.
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404402#comment-16404402 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on issue #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-374107728

ping @wesm @xhochy
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392324#comment-16392324 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on issue #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-371698673

> I can't speak for my company. I will have to check with my manager and technical leader. However, a big company always wants something in return: reputation, business benefits, etc.

Sorry for the delay (busy setting up cluster Spark app profiling). I checked with my manager; the general response is:

> With a limited amount of dev resources, we will contribute internal features back if suitable, but should not actively work on community issues.

However, I will try to figure out whether there are issues I can work on in my spare time.
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392304#comment-16392304 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r173353428

## File path: src/parquet/arrow/reader.cc ##
@@ -152,6 +153,64 @@ class SingleRowGroupIterator : public FileColumnIterator {
   bool done_;
 };

+class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
+ public:
+    explicit RowGroupRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       FileReader* reader)
+      : row_group_indices_(row_group_indices),
+        column_indices_(column_indices),
+        file_reader_(reader),
+        next_row_group_(0) {
+      file_reader_->GetSchema(column_indices_, &schema_);
+    }
+
+    ~RowGroupRecordBatchReader() {}
+
+    std::shared_ptr<::arrow::Schema> schema() const override {
+      return schema_;
+    }
+
+    Status ReadNext(std::shared_ptr<::arrow::RecordBatch> *out) override {
+      if (table_ != nullptr) { // one row group has been loaded
+        std::shared_ptr<::arrow::RecordBatch> tmp;
+        table_batch_reader_->ReadNext(&tmp);
+        if (tmp != nullptr) { // some column chunks are left in table
+          *out = tmp;
+          return Status::OK();
+        } else { // the entire table is consumed
+          table_batch_reader_.reset();
+          table_.reset();
+        }
+      }
+
+      // all row groups has been consumed
+      if (next_row_group_ == row_group_indices_.size()) {
+        *out = nullptr;
+        return Status::OK();
+      }
+
+      RETURN_NOT_OK(file_reader_->ReadRowGroup(row_group_indices_[next_row_group_],

Review comment:
I am most concerned about this one. We have to read an entire row group, but the caller may consume only the first N RecordBatches. I suspect this is not optimal.
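The concern in this review comment can be illustrated with a dependency-free model of the same control flow. This is only a sketch under stated assumptions: `LazyRowGroupReader` and its `Loader` callback are hypothetical stand-ins (the callback plays the role of `FileReader::ReadRowGroup`), a "row group" is a plain `std::vector<int>`, and a "batch" is a chunk of at most `batch_size` values. It is not the code from the pull request.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical model of a reader that materializes one whole row group at a
// time and then hands it out in smaller batches.
class LazyRowGroupReader {
 public:
  using Loader = std::function<std::vector<int>(int row_group)>;

  LazyRowGroupReader(std::vector<int> row_group_indices, std::size_t batch_size,
                     Loader loader)
      : row_group_indices_(std::move(row_group_indices)),
        batch_size_(batch_size),
        loader_(std::move(loader)) {
    assert(batch_size_ > 0);
  }

  // Fills *out with the next batch; returns false once every selected row
  // group has been consumed.
  bool ReadNext(std::vector<int>* out) {
    // Skip exhausted (or empty) row groups, loading the next one only now.
    while (position_ == current_.size()) {
      if (next_row_group_ == row_group_indices_.size()) return false;
      current_ = loader_(row_group_indices_[next_row_group_++]);
      position_ = 0;
      ++row_groups_loaded_;
    }
    const std::size_t end = std::min(position_ + batch_size_, current_.size());
    out->assign(current_.begin() + static_cast<std::ptrdiff_t>(position_),
                current_.begin() + static_cast<std::ptrdiff_t>(end));
    position_ = end;
    return true;
  }

  // Number of whole row groups materialized so far.
  std::size_t row_groups_loaded() const { return row_groups_loaded_; }

 private:
  std::vector<int> row_group_indices_;
  std::size_t batch_size_;
  Loader loader_;
  std::vector<int> current_;
  std::size_t position_ = 0;
  std::size_t next_row_group_ = 0;
  std::size_t row_groups_loaded_ = 0;
};
```

In this model the first `ReadNext` call pays for loading an entire row group even if the caller stops after one batch, which is the cost the comment worries about; later row groups are not touched until the current one is exhausted.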
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392306#comment-16392306 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r173352829

## File path: src/parquet/arrow/reader.h ##
@@ -149,6 +152,13 @@ class PARQUET_EXPORT FileReader {
   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);

+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);

Review comment:
Of course, will do.
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392305#comment-16392305 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r173352853

## File path: src/parquet/arrow/reader.cc ##
@@ -152,6 +153,64 @@ class SingleRowGroupIterator : public FileColumnIterator {
   bool done_;
 };

+class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
+ public:
+    explicit RowGroupRecordBatchReader(const std::vector<int>& row_group_indices,

Review comment:
will do
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390078#comment-16390078 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

wesm commented on a change in pull request #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r172741404

## File path: src/parquet/arrow/reader.h ##
@@ -149,6 +152,13 @@ class PARQUET_EXPORT FileReader {
   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);

+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);

Review comment:
Can you add brief doxygen comments to these new methods?
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386004#comment-16386004 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on issue #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-370403357

Ping @wesm @xhochy, do you have any comments?
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380268#comment-16380268 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r171237865

## File path: src/parquet/arrow/writer.h ##
@@ -31,7 +31,6 @@ namespace arrow {
 class Array;
 class MemoryPool;
 class PrimitiveArray;
-class RowBatch;

Review comment:
RowBatch is no longer used; it was renamed to RecordBatch.
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380267#comment-16380267 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy opened a new pull request #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445

Ping @xhochy @wesm. Sorry for the delay; I finally got some time to work on this feature. This is still a work in progress, but I'd like to get feedback before going any further.
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266954#comment-16266954 ]

Xianjin YE commented on PARQUET-1166:
-------------------------------------

All right then, I will send a PR soon and will try to reuse Arrow's code whenever possible.
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266926#comment-16266926 ]

Wes McKinney commented on PARQUET-1166:
---------------------------------------

Sounds good to me. This is actually already basically the intent of ARROW-1012.