zhuqi-lucas commented on PR #7409:
URL: https://github.com/apache/arrow-rs/pull/7409#issuecomment-2799913654
Thank you @alamb for review and good question, i try to explain:
1. Before the cache page PR, we will not use the low level skip_next_page to
trigger the dict page bug, because the only place we use the low level api is:
```rust
/// Skips over `num_records` records, where records are delimited by
repetition levels of 0
///
/// # Returns
///
/// Returns the number of records skipped
pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
let mut remaining_records = num_records;
while remaining_records != 0 {
if self.num_buffered_values == self.num_decoded_values {
let metadata = match self.page_reader.peek_next_page()? {
None => return Ok(num_records - remaining_records),
Some(metadata) => metadata,
};
// If dictionary, we must read it
if metadata.is_dict {
self.read_dictionary_page()?;
continue;
}
// If page has less rows than the remaining records to
// be skipped, skip entire page
let rows = metadata.num_rows.or_else(|| {
// If no repetition levels, num_levels == num_rows
self.rep_level_decoder
.is_none()
.then_some(metadata.num_levels)?
});
if let Some(rows) = rows {
if rows <= remaining_records {
self.page_reader.skip_next_page()?;
remaining_records -= rows;
continue;
}
}
..............
}
```
And the logic here, we already exclude the dict page case for above high
level API, so we will not hit the bug:
```rust
// If dictionary, we must read it
if metadata.is_dict {
self.read_dictionary_page()?;
continue;
}
```
So the following code will not trigger the dict page skip:
```rust
if let Some(rows) = rows {
if rows <= remaining_records {
self.page_reader.skip_next_page()?;
remaining_records -= rows;
continue;
}
}
```
2. But we will use the low level API after the cache page PR which will hit
the skip dict page bug.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]