Ted-Jiang commented on PR #1977: URL: https://github.com/apache/arrow-rs/pull/1977#issuecomment-1172476897
> I've had a quick review, unfortunately I think this is missing a key detail. In particular the arrow writer must read the same records from each of its columns. As written this simply skips reading pruned pages from columns. There is no relationship between the page boundaries across columns within a parquet, and therefore this will return different rows for each of the columns.

Thanks @tustvold, you are right. Maybe I made the title confusing. As you mentioned in [#1791 (review)](https://github.com/apache/arrow-rs/pull/1791#pullrequestreview-996352857):

> Pass row selection down to RecordReader
> Add a skip_next_page to PageReader
> Add a skip_values to ColumnValueDecoder

This PR is only about the `skip_next_page` part: the iterator will return only the metadata of the pages that are needed. As for reading the same records from each column (row alignment), I would prefer to support that in the next PR. I would like to keep the two separate to avoid a huge PR and merge conflicts. If you prefer to combine them, I will mark this PR as in progress and keep developing.

> As described in [#1791 (review)](https://github.com/apache/arrow-rs/pull/1791#pullrequestreview-996352857), you will need to extract the row selection in addition to the page selection, and push this into RecordReader and ColumnValueDecoder. This will also make the API clearer, as we aren't going behind their back and skipping pages at the block-level

As above, we will need to pass the `row_ranges` to ColumnValueReader in the future.
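
To illustrate the row-alignment concern raised in the review, here is a minimal sketch of why skipping pruned pages per column is not enough on its own. It is not the arrow-rs API: `kept_row_ranges` and `intersect` are hypothetical helpers, and the page sizes and pruning decisions are made-up inputs. The sketch assumes each column's pages are described only by their row counts.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of arrow-rs): the row ranges covered by
/// the pages of one column that survive pruning.
/// `page_row_counts[i]` is the number of rows in page `i`;
/// `keep[i]` says whether page `i` is kept.
fn kept_row_ranges(page_row_counts: &[usize], keep: &[bool]) -> Vec<Range<usize>> {
    let mut ranges: Vec<Range<usize>> = Vec::new();
    let mut start = 0;
    for (&count, &kept) in page_row_counts.iter().zip(keep) {
        let end = start + count;
        if kept {
            // Extend the previous range when this page directly follows it.
            let merged = match ranges.last_mut() {
                Some(last) if last.end == start => {
                    last.end = end;
                    true
                }
                _ => false,
            };
            if !merged {
                ranges.push(start..end);
            }
        }
        start = end;
    }
    ranges
}

/// Hypothetical helper: intersect two sorted, non-overlapping lists of
/// row ranges. The result is the set of rows both columns can return.
fn intersect(a: &[Range<usize>], b: &[Range<usize>]) -> Vec<Range<usize>> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        let start = a[i].start.max(b[j].start);
        let end = a[i].end.min(b[j].end);
        if start < end {
            out.push(start..end);
        }
        // Advance whichever range ends first.
        if a[i].end < b[j].end {
            i += 1;
        } else {
            j += 1;
        }
    }
    out
}

fn main() {
    // Column A: three pages of 4 rows; column B: two pages of 6 rows.
    // Both columns hold the same 12 rows, but page boundaries differ.
    let a = kept_row_ranges(&[4, 4, 4], &[false, true, true]); // rows 4..12
    let b = kept_row_ranges(&[6, 6], &[true, false]); // rows 0..6
    // Page skipping alone yields different rows per column; only the
    // intersection (rows 4..6) is consistent across both.
    println!("A: {:?}, B: {:?}, aligned: {:?}", a, b, intersect(&a, &b));
}
```

With these made-up inputs, column A would yield rows 4..12 and column B rows 0..6, so a reader that only skips pages would pair up unrelated values. A row selection such as the intersection computed here is the kind of information that would have to be pushed down to the record reader and value decoder so that each column skips values inside the pages it still reads.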
