[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392324#comment-16392324 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on issue #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-371698673

> I can't speak for my company. I will have to check with my manager and technical leader. However, big companies always want something in return: reputation, business benefits, etc.

Sorry for the delay (I was busy setting up Spark app profiling on our cluster). I checked with my manager, and the general response was:

> With a limited amount of dev resources, we will contribute back internal features where suitable, but we should not actively work on community issues.

However, I will try to figure out whether there are issues I can work on in my spare time.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -----------------------------------------------------------------
>
> Key: PARQUET-1166
> URL: https://issues.apache.org/jira/browse/PARQUET-1166
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Xianjin YE
> Priority: Major
>
> Hi, I'd like to propose a new API to better support splittable reading of
> Parquet files.
> The intent of this API is to allow selective reading of RowGroups (normally
> contiguous, but they can be arbitrary as long as the row_group_idxes are
> sorted and unique, e.g. [1, 3, 5]).
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
>
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, we can split a Parquet file into RowGroups that can be
> processed by multiple tasks (possibly on different hosts, like Map tasks in
> MapReduce).
> [~wesmckinn] [~xhochy] What do you think?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
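The splitting use case described above (distributing sorted, unique row-group indices across multiple tasks) can be sketched in plain Python. This is only an illustration of the partitioning idea, not part of the proposed parquet-cpp API:

```python
def split_row_groups(row_group_indices, num_tasks):
    """Partition sorted, unique row-group indices into at most
    num_tasks contiguous splits, one per task."""
    n = len(row_group_indices)
    splits = []
    start = 0
    for t in range(num_tasks):
        # Spread any remainder over the first n % num_tasks splits.
        size = n // num_tasks + (1 if t < n % num_tasks else 0)
        if size == 0:
            continue
        splits.append(row_group_indices[start:start + size])
        start += size
    return splits

# A file with 10 row groups split across 3 tasks:
print(split_row_groups(list(range(10)), 3))
# Non-contiguous indices like [1, 3, 5] work the same way:
print(split_row_groups([1, 3, 5], 2))
```

Each split's indices stay sorted and unique, so each task can hand its list directly to a per-task `GetRecordBatchReader` call.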
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392304#comment-16392304 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r173353428

## File path: src/parquet/arrow/reader.cc
## @@ -152,6 +153,64 @@ class SingleRowGroupIterator : public FileColumnIterator { bool done_; };

+class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
+ public:
+  explicit RowGroupRecordBatchReader(const std::vector<int>& row_group_indices,
+                                     const std::vector<int>& column_indices,
+                                     FileReader* reader)
+      : row_group_indices_(row_group_indices),
+        column_indices_(column_indices),
+        file_reader_(reader),
+        next_row_group_(0) {
+    file_reader_->GetSchema(column_indices_, &schema_);
+  }
+
+  ~RowGroupRecordBatchReader() {}
+
+  std::shared_ptr<::arrow::Schema> schema() const override { return schema_; }
+
+  Status ReadNext(std::shared_ptr<::arrow::RecordBatch>* out) override {
+    if (table_ != nullptr) {  // one row group has been loaded
+      std::shared_ptr<::arrow::RecordBatch> tmp;
+      table_batch_reader_->ReadNext(&tmp);
+      if (tmp != nullptr) {  // some column chunks are left in the table
+        *out = tmp;
+        return Status::OK();
+      } else {  // the entire table is consumed
+        table_batch_reader_.reset();
+        table_.reset();
+      }
+    }
+
+    // all row groups have been consumed
+    if (next_row_group_ == row_group_indices_.size()) {
+      *out = nullptr;
+      return Status::OK();
+    }
+
+    RETURN_NOT_OK(file_reader_->ReadRowGroup(row_group_indices_[next_row_group_],

Review comment: This is the part I am most concerned about. We have to read one entire row group here, even though the caller may consume only the first N RecordBatches. I suspect this is not optimal.
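The two-level iteration in the `ReadNext` diff above (drain the batches of the current row-group table, then lazily load the next row group) can be sketched in plain Python. `load_row_group` is a hypothetical stand-in for `FileReader::ReadRowGroup` plus the `TableBatchReader`; it is assumed to return the list of record batches for one row group:

```python
class RowGroupBatchReader:
    """Plain-Python sketch of the ReadNext state machine: drain the
    batches left over from the current row group, then eagerly load
    the next row group in the requested index list."""

    def __init__(self, row_group_indices, load_row_group):
        self._indices = row_group_indices
        self._load = load_row_group          # stand-in for ReadRowGroup
        self._next_row_group = 0
        self._pending = []                   # batches left in current group

    def read_next(self):
        if self._pending:                    # current row group not drained
            return self._pending.pop(0)
        if self._next_row_group == len(self._indices):
            return None                      # all row groups consumed
        idx = self._indices[self._next_row_group]
        self._next_row_group += 1
        self._pending = list(self._load(idx))  # loads one whole row group
        return self.read_next()

# Hypothetical file with batches per row group:
data = {0: ["rg0-b0", "rg0-b1"], 2: ["rg2-b0"]}
reader = RowGroupBatchReader([0, 2], lambda i: data[i])
batches = []
while (b := reader.read_next()) is not None:
    batches.append(b)
print(batches)  # all batches of row groups 0 and 2, in order
```

The sketch makes the reviewer's concern visible: `load_row_group` materializes an entire row group at once, even if the caller stops after the first batch.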
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392306#comment-16392306 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r173352829

## File path: src/parquet/arrow/reader.h
## @@ -149,6 +152,13 @@ class PARQUET_EXPORT FileReader {

   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);

+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);

Review comment: Of course, will do.
[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
[ https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392305#comment-16392305 ]

ASF GitHub Bot commented on PARQUET-1166:
-----------------------------------------

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r173352853

## File path: src/parquet/arrow/reader.cc
## @@ -152,6 +153,64 @@ class SingleRowGroupIterator : public FileColumnIterator { bool done_; };

+class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
+ public:
+  explicit RowGroupRecordBatchReader(const std::vector<int>& row_group_indices,

Review comment: Will do.
Re: Date for next Parquet sync
Actually, because of daylight saving time (the US switches a week before Europe), the usual time difference will be one hour less next week.

https://www.timeanddate.com/worldclock/meetingdetails.html?year=2018&month=3&day=13&hour=17&min=0&sec=0&p1=224&p2=50&p3=195

Location                          Local Time                                Time Zone / UTC Offset
San Francisco (USA - California)  Tuesday, March 13, 2018 at 10:00:00 am    PDT, UTC-7 hours
Budapest (Hungary)                Tuesday, March 13, 2018 at 6:00:00 pm     CET, UTC+1 hour
Paris (France - Île-de-France)    Tuesday, March 13, 2018 at 6:00:00 pm     CET, UTC+1 hour
Corresponding UTC (GMT)           Tuesday, March 13, 2018 at 17:00:00

On Thu, Mar 8, 2018 at 4:12 PM, Julien Le Dem wrote:
> or 10am PST but it's a little late for the team in Budapest.
>
> On Thu, Mar 8, 2018 at 4:11 PM, Julien Le Dem wrote:
>> I'm sorry, it turns out I now have a conflict at this particular time.
>> Maybe Wednesday?
>>
>> On Mon, Mar 5, 2018 at 10:55 AM, Lars Volker wrote:
>>> Hi All,
>>>
>>> It has been almost 3 weeks since the last sync and there are a bunch of
>>> ongoing discussions on the mailing list. Let's find a date for the next
>>> Parquet community sync. Last time we met on a Wednesday, so this time it
>>> should be Tuesday.
>>>
>>> I propose to meet next Tuesday, March 13th, at 6pm CET / 9am PST. That
>>> allows us to get back to the biweekly cadence without overlapping with
>>> the Arrow sync, which happens this week.
>>>
>>> Please speak up if that time does not work for you.
>>>
>>> Cheers, Lars
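The table above can be double-checked programmatically. This is an illustrative sketch using Python's standard `zoneinfo` module (3.9+), not anything the sync itself depends on: the meeting is pinned to 17:00 UTC, the US entered DST on March 11 while Europe stayed on standard time until March 25, hence the one-hour shift:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib in Python 3.9+

# The sync is pinned to 17:00 UTC on 2018-03-13.
utc_time = datetime(2018, 3, 13, 17, 0, tzinfo=timezone.utc)

sf = utc_time.astimezone(ZoneInfo("America/Los_Angeles"))
budapest = utc_time.astimezone(ZoneInfo("Europe/Budapest"))

# San Francisco is already on PDT (UTC-7); Budapest is still on CET (UTC+1).
print(sf.strftime("%H:%M %Z"))        # 10:00 PDT
print(budapest.strftime("%H:%M %Z"))  # 18:00 CET
```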
Re: Date for next Parquet sync
or 10am PST but it's a little late for the team in Budapest. On Thu, Mar 8, 2018 at 4:11 PM, Julien Le Dem wrote: > I'm sorry, it turns out I now have a conflict at this particular time. > Maybe Wednesday? > > On Mon, Mar 5, 2018 at 10:55 AM, Lars Volker wrote: > >> Hi All, >> >> It has been almost 3 weeks since the last sync and there are a bunch of >> ongoing discussions on the mailing list. Let's find a date for the next >> Parquet community sync. Last time we met on a Wednesday, so this time it >> should be Tuesday. >> >> I propose to meet next Tuesday, March 13th, at 6pm CET / 9am PST. That >> allows us to get back to the biweekly cadence without overlapping with the >> Arrow sync, which happens this week. >> >> Please speak up if that time does not work for you. >> >> Cheers, Lars >> > >
Re: Date for next Parquet sync
I'm sorry, it turns out I now have a conflict at this particular time. Maybe Wednesday? On Mon, Mar 5, 2018 at 10:55 AM, Lars Volker wrote: > Hi All, > > It has been almost 3 weeks since the last sync and there are a bunch of > ongoing discussions on the mailing list. Let's find a date for the next > Parquet community sync. Last time we met on a Wednesday, so this time it > should be Tuesday. > > I propose to meet next Tuesday, March 13th, at 6pm CET / 9am PST. That > allows us to get back to the biweekly cadence without overlapping with the > Arrow sync, which happens this week. > > Please speak up if that time does not work for you. > > Cheers, Lars >
Re: Parquet repositories moved to Apache GitBox service
Thanks, Uwe! On Thu, Mar 8, 2018 at 2:05 PM, Uwe L. Korn wrote: > The parquet-mr and parquet-format repositories are now moved to GitBox. Thus > the remotes of these repos changed to: > > https://gitbox.apache.org/repos/asf?p=parquet-format.git > https://gitbox.apache.org/repos/asf?p=parquet-mr.git > > You will also be able now to push to the GitHub remote (e.g. to use the "Let > maintainers push to this PR" feature), therefore you need to activate the > linking of your ASF and GitHub accounts. I hope to find time tomorrow to > update the merge and release scripts. > > Uwe
[jira] [Created] (PARQUET-1241) Use LZ4 frame format
Lawrence Chan created PARQUET-1241:
-----------------------------------

Summary: Use LZ4 frame format
Key: PARQUET-1241
URL: https://issues.apache.org/jira/browse/PARQUET-1241
Project: Parquet
Issue Type: Improvement
Components: parquet-cpp, parquet-format
Reporter: Lawrence Chan

The parquet-format spec doesn't currently specify whether lz4-compressed data should be framed or not. We should choose one and make it explicit in the spec, as the two formats are not interoperable. After some discussions with others [1], we think it would be beneficial to use the framed format, which adds a small header in exchange for more self-contained decompression as well as a richer feature set (checksums, parallel decompression, etc.). The current Arrow implementation compresses using the lz4 block format, and this would need to be updated when we add the spec clarification. If backwards compatibility is a concern, I would suggest adding an additional LZ4_FRAMED compression type, but that may be more noise than anything.

[1] https://github.com/dask/fastparquet/issues/314
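One practical difference between the two formats: framed LZ4 data starts with a fixed 4-byte magic number (0x184D2204, little-endian, per the LZ4 frame format spec), while block-format data carries no header at all. A minimal sniff check can be sketched in Python; this is a heuristic for illustration, not part of any Parquet API:

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # magic number from the LZ4 frame format spec

def looks_like_lz4_frame(buf: bytes) -> bool:
    """Heuristic: framed LZ4 data begins with the little-endian frame
    magic; raw block-format data has no such header."""
    if len(buf) < 4:
        return False
    (magic,) = struct.unpack("<I", buf[:4])
    return magic == LZ4_FRAME_MAGIC

# A framed stream starts with bytes 04 22 4D 18 on disk.
framed_prefix = struct.pack("<I", LZ4_FRAME_MAGIC)
print(looks_like_lz4_frame(framed_prefix + b"..."))  # True
print(looks_like_lz4_frame(b"\x00\x01\x02\x03"))     # False
```

This is also why the two formats are not interoperable: a block-format decompressor fed a framed stream would misinterpret the header bytes as compressed data, and vice versa.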
Parquet repositories moved to Apache GitBox service
The parquet-mr and parquet-format repositories have now been moved to GitBox. The remotes of these repos have therefore changed to:

https://gitbox.apache.org/repos/asf?p=parquet-format.git
https://gitbox.apache.org/repos/asf?p=parquet-mr.git

You will now also be able to push to the GitHub remote (e.g. to use the "Let maintainers push to this PR" feature); for this you need to link your ASF and GitHub accounts. I hope to find time tomorrow to update the merge and release scripts.

Uwe
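For an existing clone, the remote can be repointed with `git remote set-url`. A sketch, assuming the remote is named `origin` (adjust if yours differs):

```shell
# Point an existing parquet-format clone at the new GitBox remote.
git remote set-url origin "https://gitbox.apache.org/repos/asf?p=parquet-format.git"

# Likewise for a parquet-mr clone:
# git remote set-url origin "https://gitbox.apache.org/repos/asf?p=parquet-mr.git"

# Verify the new URLs took effect.
git remote -v
```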