alamb commented on code in PR #9118:
URL: https://github.com/apache/arrow-rs/pull/9118#discussion_r2678607392
##########
parquet/src/arrow/push_decoder/reader_builder/mod.rs:
##########
@@ -654,57 +660,6 @@ impl RowGroupReaderBuilder {
}
}
-/// Override the selection strategy if needed.
-///
-/// Some pages can be skipped during row-group construction if they are not read
-/// by the selections. This means that the data pages for those rows are never
-/// loaded and definition/repetition levels are never read. When using
-/// `RowSelections` selection works because `skip_records()` handles this
-/// case and skips the page accordingly.
-///
-/// However, with the current mask design, all values must be read and decoded
-/// and then a mask filter is applied. Thus if any pages are skipped during
-/// row-group construction, the data pages are missing and cannot be decoded.
-///
-/// A simple example:
-/// * the page size is 2, the mask is 100001, row selection should be read(1) skip(4) read(1)
-/// * the `ColumnChunkData` would be page1(10), page2(skipped), page3(01)
-///
-/// Using the row selection to skip(4), page2 won't be read at all, so in this
-/// case we can't decode all the rows and apply a mask. To correctly apply the
-/// bit mask, we need all 6 values be read, but page2 is not in memory.
-fn override_selector_strategy_if_needed(
Review Comment:
nice -- the idea is to avoid this function 👍
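For context, the removed comment's example maps directly onto the existing `RowSelection` API (a minimal illustration, not code from this PR):

```rust
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

fn main() {
    // The example from the removed doc comment: 6 rows stored in pages of
    // 2, with only the first and last row selected (mask 100001).
    let selection = RowSelection::from(vec![
        RowSelector::select(1), // row 0 (page 1)
        RowSelector::skip(4),   // rows 1..=4 (covers all of page 2)
        RowSelector::select(1), // row 5 (page 3)
    ]);

    // With selectors, `skip_records(4)` lets the decoder step over page 2
    // without loading it; a boolean mask instead requires decoding all six
    // values before filtering, so page 2 must be present in memory.
    assert_eq!(selection.row_count(), 2);
}
```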
##########
parquet/src/arrow/arrow_reader/selection.rs:
##########
@@ -800,14 +806,26 @@ impl MaskCursor {
let mut chunk_rows = 0;
let mut selected_rows = 0;
- // Advance until enough rows have been selected to satisfy the batch size,
- // or until the mask is exhausted. This mirrors the behaviour of the legacy
- // `RowSelector` queue-based iteration.
- while cursor < mask.len() && selected_rows < batch_size {
+ let mut page_end = mask.len();
+ if let Some(pages) = page_locations {
+ for loc in pages {
Review Comment:
I am also a little worried that this loop will take too long: it is O(N^2)
in the number of pages, since each call scans the full page list from the
start. Maybe we could add a `PageLocationIterator` to the cursor itself, so
each call knows where to pick up.
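A rough sketch of what I mean (struct and method names here are hypothetical, not the PR's actual types):

```rust
use std::iter::Peekable;

use parquet::format::PageLocation;

/// Hypothetical helper: owns a peekable iterator over page locations so
/// each lookup resumes where the previous one stopped, making the whole
/// scan O(N) in the number of pages instead of O(N^2).
struct PageBoundaryCursor<I: Iterator<Item = PageLocation>> {
    pages: Peekable<I>,
}

impl<I: Iterator<Item = PageLocation>> PageBoundaryCursor<I> {
    /// Returns the exclusive end (in rows) of the page containing `cursor`:
    /// the start of the first page beginning after `cursor`, falling back
    /// to `mask_len` for the last page.
    fn page_end(&mut self, cursor: usize, mask_len: usize) -> usize {
        while let Some(next) = self.pages.peek() {
            let start = usize::try_from(next.first_row_index).unwrap();
            if start <= cursor {
                // This page starts at or before the cursor: consume it and
                // keep looking for the first page that starts after it.
                self.pages.next();
            } else {
                return start;
            }
        }
        mask_len
    }
}
```

Since the cursor only ever moves forward, each page is consumed at most once across all calls, so the cost is amortized O(1) per call.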
##########
parquet/src/arrow/arrow_reader/selection.rs:
##########
@@ -800,14 +806,26 @@ impl MaskCursor {
let mut chunk_rows = 0;
let mut selected_rows = 0;
- // Advance until enough rows have been selected to satisfy the batch size,
- // or until the mask is exhausted. This mirrors the behaviour of the legacy
- // `RowSelector` queue-based iteration.
- while cursor < mask.len() && selected_rows < batch_size {
+ let mut page_end = mask.len();
+ if let Some(pages) = page_locations {
+ for loc in pages {
+ let page_start = loc.first_row_index.try_into().unwrap_or(usize::MAX);
Review Comment:
I think this could just be an `unwrap` -- if the page index location is
negative or doesn't fit in a `usize`, the file is invalid and we should have
stopped reading it earlier.
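Something like:

```rust
// Treat an out-of-range `first_row_index` as an invalid file rather than
// silently saturating to `usize::MAX`:
let page_start = usize::try_from(loc.first_row_index)
    .expect("first_row_index in the offset index must fit in usize");
```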
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -1407,6 +1420,7 @@ impl ParquetRecordBatchReader {
array_reader,
schema: Arc::new(schema),
read_plan,
+ page_offsets: page_offsets.map(|slice| Arc::new(slice.to_vec())),
Review Comment:
So I think this will effectively copy the entire `OffsetIndexMetaData`
structure, which I worry could be quite large.
I wonder if we need to find a way to avoid this (e.g. making the entire
thing Arc'd in
https://github.com/apache/arrow-rs/blob/67e04e758f1e62ec3d78d2f678daf433a4c54e30/parquet/src/file/metadata/mod.rs#L197-L196
somehow 🤔)
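A hedged sketch of the idea (type names approximate, not a concrete proposal): if the metadata stored the offset index behind an `Arc`, readers could share it with a pointer bump instead of copying:

```rust
use std::sync::Arc;

use parquet::file::metadata::ParquetOffsetIndex;

/// Hypothetical: metadata stores the offset index behind an Arc...
struct MetadataSketch {
    offset_index: Option<Arc<ParquetOffsetIndex>>,
}

/// ...so a reader holds a cheap `Arc::clone` (O(1)) instead of copying the
/// whole structure with `slice.to_vec()` (O(number of pages)).
fn share_offset_index(meta: &MetadataSketch) -> Option<Arc<ParquetOffsetIndex>> {
    meta.offset_index.as_ref().map(Arc::clone)
}
```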