zhuqi-lucas commented on PR #7409:
URL: https://github.com/apache/arrow-rs/pull/7409#issuecomment-2799913654

   Thank you @alamb for review and good question, i try to explain:
   
   1. Before the cache page PR, we will not use the low level skip_next_page to 
trigger the dict page bug, because the only place we use the low level api is:
   
   ```rust
       /// Skips over `num_records` records, where records are delimited by 
repetition levels of 0
       ///
       /// # Returns
       ///
       /// Returns the number of records skipped
       pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
           let mut remaining_records = num_records;
           while remaining_records != 0 {
               if self.num_buffered_values == self.num_decoded_values {
                   let metadata = match self.page_reader.peek_next_page()? {
                       None => return Ok(num_records - remaining_records),
                       Some(metadata) => metadata,
                   };
   
                   // If dictionary, we must read it
                   if metadata.is_dict {
                       self.read_dictionary_page()?;
                       continue;
                   }
   
                   // If page has less rows than the remaining records to
                   // be skipped, skip entire page
                   let rows = metadata.num_rows.or_else(|| {
                       // If no repetition levels, num_levels == num_rows
                       self.rep_level_decoder
                           .is_none()
                           .then_some(metadata.num_levels)?
                   });
   
                   if let Some(rows) = rows {
                       if rows <= remaining_records {
                           self.page_reader.skip_next_page()?;
                           remaining_records -= rows;
                           continue;
                       }
                   }
               ..............
   }
   ```
   
   
   And the logic here, we already exclude the dict page case for above high 
level API, so we will not hit the bug:
   ```rust
                  // If dictionary, we must read it
                   if metadata.is_dict {
                       self.read_dictionary_page()?;
                       continue;
                   }
   ```
   So the following code will not trigger the dict page skip:
   ```rust
                 if let Some(rows) = rows {
                       if rows <= remaining_records {
                           self.page_reader.skip_next_page()?;
                           remaining_records -= rows;
                           continue;
                       }
                   }
   ```
   
   
   2. But we will use the low level API after the cache page PR which will hit 
the skip dict page bug.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to