[GitHub] [arrow-rs] Ted-Jiang commented on pull request #1977: Enable serialized_reader read specific Page by passing row ranges.

GitBox Fri, 01 Jul 2022 08:44:23 -0700


Ted-Jiang commented on PR #1977:
URL: https://github.com/apache/arrow-rs/pull/1977#issuecomment-1172476897


   > I've had a quick review, unfortunately I think this is missing a key 
detail. In particular the arrow writer must read the same records from each of 
its columns. As written this simply skips reading pruned pages from columns. 
There is no relationship between the page boundaries across columns within a 
parquet, and therefore this will return different rows for each of the columns.
   
   Thanks @tustvold, you are right. Maybe I made the title confusing😭. As you 
mentioned in [#1791 
(review)](https://github.com/apache/arrow-rs/pull/1791#pullrequestreview-996352857):
   
   >Pass row selection down to RecordReader
   >Add a skip_next_page to PageReader
   >Add a skip_values to ColumnValueDecoder
   
   This PR only covers the `skip_next_page` part: the iterator will return only 
the metadata of the pages that are needed. Making every column return the same 
records (row alignment) is something I would prefer to support in the next PR. 
I would rather keep the two separate to avoid a huge PR and conflicts. If you 
prefer to combine them, I will mark this PR as in progress and keep developing.
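   
   To illustrate why page skipping alone is not enough, here is a minimal 
sketch. `PageRows`, `select_pages` and `covered_rows` are hypothetical names 
made up for this example and are not part of the parquet crate. It assumes the 
pages of a column are sorted by first row and that the requested range is 
non-empty.

```rust
use std::ops::Range;

/// Hypothetical page descriptor: the page holds rows
/// `first_row..first_row + num_rows` of its column chunk.
#[derive(Debug, Clone, PartialEq)]
struct PageRows {
    first_row: usize,
    num_rows: usize,
}

/// Keep only the pages that overlap the requested (non-empty) row range.
/// This mirrors the page-level pruning only; it is an illustration,
/// not the code in this PR.
fn select_pages(pages: &[PageRows], rows: &Range<usize>) -> Vec<PageRows> {
    pages
        .iter()
        .filter(|p| p.first_row < rows.end && p.first_row + p.num_rows > rows.start)
        .cloned()
        .collect()
}

/// Rows actually decoded if the selected pages are read in full.
/// Assumes `pages` is sorted by `first_row` and contiguous.
fn covered_rows(pages: &[PageRows]) -> Option<Range<usize>> {
    let first = pages.first()?;
    let last = pages.last()?;
    Some(first.first_row..last.first_row + last.num_rows)
}
```

   With column A split into pages of rows 0..100, 100..200, 200..300 and 
column B into pages of rows 0..150, 150..300, a request for rows 120..180 keeps 
one page of A (covering 100..200) but both pages of B (covering 0..300). Reading 
those pages in full yields different rows per column, which is why the row 
selection also has to reach the record reader.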
   
   > As described in [#1791 
(review)](https://github.com/apache/arrow-rs/pull/1791#pullrequestreview-996352857),
 you will need to extract the row selection in addition to the page selection, 
and push this into RecordReader and ColumnValueDecoder. This will also make the 
API clearer, as we aren't going behind their back and skipping pages at the 
block-level
   
   As above, the `row_ranges` will need to be passed to ColumnValueReader in a 
future PR.
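   
   As a rough sketch of what passing `row_ranges` down could look like, the 
ranges can be turned into alternating skip/read runs that a column reader 
consumes in order. `Step` and `to_steps` are hypothetical names for 
illustration only, and the sketch assumes the ranges are sorted, 
non-overlapping and within `total_rows`.

```rust
use std::ops::Range;

/// Hypothetical instruction for a column reader: skip or read N rows.
#[derive(Debug, PartialEq)]
enum Step {
    Skip(usize),
    Read(usize),
}

/// Convert row ranges into alternating skip/read runs.
/// Assumes `ranges` is sorted, non-overlapping and bounded by `total_rows`.
fn to_steps(ranges: &[Range<usize>], total_rows: usize) -> Vec<Step> {
    let mut steps = Vec::new();
    let mut cursor = 0;
    for r in ranges {
        // Rows between the previous range and this one are skipped.
        if r.start > cursor {
            steps.push(Step::Skip(r.start - cursor));
        }
        // Empty ranges contribute nothing to read.
        if r.end > r.start {
            steps.push(Step::Read(r.end - r.start));
        }
        cursor = r.end;
    }
    // Skip whatever remains after the last requested range.
    if cursor < total_rows {
        steps.push(Step::Skip(total_rows - cursor));
    }
    steps
}
```

   Because every column would receive the same sequence of steps, each one 
returns the same rows regardless of where its own page boundaries fall.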
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
