alamb commented on issue #9059: URL: https://github.com/apache/arrow-rs/issues/9059#issuecomment-3730782479
I spent some time with @lyang24 working on this issue via https://github.com/apache/arrow-rs/pull/9093, where I printed out the buffer capacity for reads. The capacities are almost all 0, and the reader does not then reserve enough space. You can see that most of the reads start with an empty buffer:

```
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
Read batch with 8192 rows and 105 columns
ByteViewArrayDecoder::read called with len 5120, current views capacity: 0
ByteViewArrayDecoder::read called with len 3072, current views capacity: 5120
ByteViewArrayDecoder::read called with len 6144, current views capacity: 0
ByteViewArrayDecoder::read called with len 2048, current views capacity: 6144
```

I tracked it down in a debugger; the default (zero-capacity) buffer is created here:

https://github.com/apache/arrow-rs/blob/02fa779a9cb122c5218293be3afb980832701683/parquet/src/arrow/record_reader/mod.rs#L75-L76

`ByteViewArrayDecoderPlain` does reserve capacity:

https://github.com/apache/arrow-rs/blob/e2b2b8f5ec9ccf10bf7d584cf94e56018e2ac800/parquet/src/arrow/array_reader/byte_view_array.rs#L332-L333

However, `ByteViewArrayDecoderDictionary::read` does not seem to reserve capacity (a standalone sketch of why that matters follows the test program below):

https://github.com/apache/arrow-rs/blob/e2b2b8f5ec9ccf10bf7d584cf94e56018e2ac800/parquet/src/arrow/array_reader/byte_view_array.rs#L441-L442

<details><summary>Whole Test Program</summary>
<p>

```rust
use std::fs::File;
use std::io::Read;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, FieldRef, Schema};
use bytes::Bytes;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() {
    let file_name = "/Users/andrewlamb/Downloads/hits/hits_0.parquet";
    println!("Opening file: {file_name}");
    let mut file = File::open(file_name).unwrap();
    let mut bytes = Vec::new();
    file.read_to_end(&mut bytes).unwrap();
    let bytes = Bytes::from(bytes);

    // Rewrite the schema so string/binary columns are read as Utf8View
    let schema = string_to_view_types(
        ParquetRecordBatchReaderBuilder::try_new(bytes.clone())
            .unwrap()
            .schema(),
    );
    //println!("Schema: {:?}", schema);

    let options = ArrowReaderOptions::new().with_schema(schema);
    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(bytes, options)
        .unwrap()
        .with_batch_size(8192)
        .build()
        .unwrap();

    for batch in reader {
        let batch = batch.unwrap();
        println!(
            "Read batch with {} rows and {} columns",
            batch.num_rows(),
            batch.num_columns()
        );
    }
    println!("Done");
}

// Hack because the clickbench files were written with the wrong logical type for strings
fn string_to_view_types(schema: &Arc<Schema>) -> Arc<Schema> {
    let fields: Vec<FieldRef> = schema
        .fields()
        .iter()
        .map(|field| {
            let existing_type = field.data_type();
            if existing_type == &DataType::Utf8 || existing_type == &DataType::Binary {
                Arc::new(Field::new(
                    field.name(),
                    DataType::Utf8View,
                    field.is_nullable(),
                ))
            } else {
                Arc::clone(field)
            }
        })
        .collect();
    Arc::new(Schema::new(fields))
}
```

</p>
</details>
