alamb commented on issue #9059: URL: https://github.com/apache/arrow-rs/issues/9059#issuecomment-3730782479
I spent some time with @lyang24 working on this issue via https://github.com/apache/arrow-rs/pull/9093, where I printed out the buffer capacity for reads. The capacities are almost all 0, and the reader does not then reserve enough space. You can see that most of the reads start with an empty buffer:

```
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
Read batch with 8192 rows and 105 columns
ByteViewArrayDecoder::read called with len 5120, current views capacity: 0
ByteViewArrayDecoder::read called with len 3072, current views capacity: 5120
ByteViewArrayDecoder::read called with len 6144, current views capacity: 0
ByteViewArrayDecoder::read called with len 2048, current views capacity: 6144
```

I tracked it down in a debugger; the default (zero-capacity) buffer is created here:

https://github.com/apache/arrow-rs/blob/02fa779a9cb122c5218293be3afb980832701683/parquet/src/arrow/record_reader/mod.rs#L75-L76

`ByteViewArrayDecoderPlain` does reserve capacity:

https://github.com/apache/arrow-rs/blob/e2b2b8f5ec9ccf10bf7d584cf94e56018e2ac800/parquet/src/arrow/array_reader/byte_view_array.rs#L332-L333

However, `ByteViewArrayDecoderDictionary::read` does not seem to reserve capacity (a standalone sketch of why that matters follows the test program below):

https://github.com/apache/arrow-rs/blob/e2b2b8f5ec9ccf10bf7d584cf94e56018e2ac800/parquet/src/arrow/array_reader/byte_view_array.rs#L441-L442

<details><summary>Whole Test Program</summary>
<p>

```rust
use std::fs::File;
use std::io::Read;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, FieldRef, Schema};
use bytes::Bytes;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() {
    let file_name = "/Users/andrewlamb/Downloads/hits/hits_0.parquet";
    println!("Opening file: {file_name}");
    let mut file = File::open(file_name).unwrap();
    let mut bytes = Vec::new();
    file.read_to_end(&mut bytes).unwrap();
    let bytes = Bytes::from(bytes);

    // Rewrite the schema so string/binary columns are read as Utf8View
    let schema = string_to_view_types(
        ParquetRecordBatchReaderBuilder::try_new(bytes.clone())
            .unwrap()
            .schema(),
    );
    //println!("Schema: {:?}", schema);

    let options = ArrowReaderOptions::new().with_schema(schema);
    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(bytes, options)
        .unwrap()
        .with_batch_size(8192)
        .build()
        .unwrap();

    for batch in reader {
        let batch = batch.unwrap();
        println!(
            "Read batch with {} rows and {} columns",
            batch.num_rows(),
            batch.num_columns()
        );
    }
    println!("Done");
}

// Hack because the clickbench files were written with the wrong logical type for strings
fn string_to_view_types(schema: &Arc<Schema>) -> Arc<Schema> {
    let fields: Vec<FieldRef> = schema
        .fields()
        .iter()
        .map(|field| {
            let existing_type = field.data_type();
            if existing_type == &DataType::Utf8 || existing_type == &DataType::Binary {
                Arc::new(Field::new(
                    field.name(),
                    DataType::Utf8View,
                    field.is_nullable(),
                ))
            } else {
                Arc::clone(field)
            }
        })
        .collect();
    Arc::new(Schema::new(fields))
}
```

</p>
</details>
