alamb commented on PR #9093:
URL: https://github.com/apache/arrow-rs/pull/9093#issuecomment-3730752608

   Update:
   
   I made a small test program (below) and printed out the capacities
   
   ```diff
   diff --git a/parquet/src/arrow/buffer/view_buffer.rs b/parquet/src/arrow/buffer/view_buffer.rs
   index 0343047da6..d87b494b46 100644
   --- a/parquet/src/arrow/buffer/view_buffer.rs
   +++ b/parquet/src/arrow/buffer/view_buffer.rs
   @@ -35,6 +35,7 @@ pub struct ViewBuffer {
    impl ViewBuffer {
        /// Create a new ViewBuffer with capacity for the specified number of views
        pub fn with_capacity(capacity: usize) -> Self {
   +        println!("Creating ViewBuffer with capacity {}", capacity);
            Self {
                views: Vec::with_capacity(capacity),
                buffers: Vec::new(),
   ```
   
   
   
   Here is what they are:
   
   ```
   Creating ViewBuffer with capacity 4165
   Creating ViewBuffer with capacity 10986
   Creating ViewBuffer with capacity 9772
   Creating ViewBuffer with capacity 35
   Creating ViewBuffer with capacity 36
   Creating ViewBuffer with capacity 26
   Creating ViewBuffer with capacity 1
   ....
   Creating ViewBuffer with capacity 32
   Creating ViewBuffer with capacity 16
   Creating ViewBuffer with capacity 218
   Creating ViewBuffer with capacity 154
   Creating ViewBuffer with capacity 154
   Creating ViewBuffer with capacity 27
   ```
   
   I think those are the number of rows in the *dictionary* (not the views themselves).
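
   To make that hypothesis concrete, here is a minimal, self-contained sketch (with `ViewBuffer` reduced to just its `views` Vec, and `decode_dictionary_page` a hypothetical stand-in for the dictionary decoding path) of why decoding each column's dictionary page would print capacities that track dictionary cardinality rather than the batch size:

   ```rust
   // Simplified stand-in for parquet's ViewBuffer: just the views Vec.
   struct ViewBuffer {
       views: Vec<u128>,
   }

   impl ViewBuffer {
       fn with_capacity(capacity: usize) -> Self {
           println!("Creating ViewBuffer with capacity {capacity}");
           Self {
               views: Vec::with_capacity(capacity),
           }
       }
   }

   // Hypothetical: decode a dictionary page with `num_entries` distinct
   // values into its own ViewBuffer, one view per entry.
   fn decode_dictionary_page(num_entries: usize) -> ViewBuffer {
       ViewBuffer::with_capacity(num_entries)
   }

   fn main() {
       // A column whose dictionary holds 4165 distinct values prints 4165,
       // independent of the 8192 row batch size.
       let _dict = decode_dictionary_page(4165);
   }
   ```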
   
   I also then printed out the actual capacities:
   
   ```diff
   diff --git a/parquet/src/arrow/array_reader/byte_view_array.rs b/parquet/src/arrow/array_reader/byte_view_array.rs
   index 8e690c574d..3ea6f08a29 100644
   --- a/parquet/src/arrow/array_reader/byte_view_array.rs
   +++ b/parquet/src/arrow/array_reader/byte_view_array.rs
   @@ -259,6 +259,8 @@ impl ByteViewArrayDecoder {
            len: usize,
            dict: Option<&ViewBuffer>,
        ) -> Result<usize> {
   +        println!("ByteViewArrayDecoder::read called with len {}, current 
views capacity: {}", len, out.views.capacity());
   +
            match self {
                ByteViewArrayDecoder::Plain(d) => d.read(out, len),
                ByteViewArrayDecoder::Dictionary(d) => {
   ```
   
   
   You can see that most of the reads start with an empty (zero-capacity) views buffer:
   
   
   ```
   ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
   ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
   ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
   ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
   ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
   ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
   ByteViewArrayDecoder::read called with len 8192, current views capacity: 0
   Read batch with 8192 rows and 105 columns
   ByteViewArrayDecoder::read called with len 5120, current views capacity: 0
   ByteViewArrayDecoder::read called with len 3072, current views capacity: 5120
   ByteViewArrayDecoder::read called with len 6144, current views capacity: 0
   ByteViewArrayDecoder::read called with len 2048, current views capacity: 6144
   
   ```
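
   The alternation is consistent with plain `Vec` growth: a read into a fresh buffer starts at capacity 0, and appending `len` views reserves at least `len` elements, so the next read into the same buffer reports at least the previous read's length. A minimal sketch of just that `Vec` behavior:

   ```rust
   fn main() {
       // Fresh buffer: Vec::new() performs no allocation.
       let mut views: Vec<u128> = Vec::new();
       assert_eq!(views.capacity(), 0); // "current views capacity: 0"

       // Append 5120 views, as a read with len 5120 would.
       views.extend(std::iter::repeat(0u128).take(5120));
       assert!(views.capacity() >= 5120); // next read sees capacity >= 5120
       println!("capacity after first read: {}", views.capacity());
   }
   ```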
   
   I tracked it down in a debugger and the default buffer is being created here:
   
   
https://github.com/apache/arrow-rs/blob/02fa779a9cb122c5218293be3afb980832701683/parquet/src/arrow/record_reader/mod.rs#L76-L75
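
   A minimal sketch of why that default starts empty, assuming the buffer is built via a derived `Default` (field types simplified here for illustration): a derived `Default` calls `Vec::new()` for each field, which allocates nothing, so every fresh `ViewBuffer` begins with `views` capacity 0 regardless of the batch size requested later.

   ```rust
   // Simplified ViewBuffer; the real buffers field holds arrow Buffers.
   #[derive(Default)]
   struct ViewBuffer {
       views: Vec<u128>,
       buffers: Vec<Vec<u8>>,
   }

   fn main() {
       // Default::default() delegates to Vec::new() for each field;
       // no allocation happens until something is pushed.
       let out = ViewBuffer::default();
       assert_eq!(out.views.capacity(), 0);
       assert_eq!(out.buffers.capacity(), 0);
   }
   ```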
   
   
   <details><summary>Whole Test Program</summary>
   <p>
   
   ```rust
   use std::fs::File;
   use std::io::Read;
   use std::sync::Arc;

   use arrow::datatypes::{DataType, Field, FieldRef, Schema};
   use bytes::Bytes;
   use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

   fn main() {
       let file_name = "/Users/andrewlamb/Downloads/hits/hits_0.parquet";
       println!("Opening file: {file_name}");
       let mut file = File::open(file_name).unwrap();
       let mut bytes = Vec::new();
       file.read_to_end(&mut bytes).unwrap();
       let bytes = Bytes::from(bytes);

       // Read the embedded schema and rewrite string/binary columns to Utf8View
       let schema = string_to_view_types(
           ParquetRecordBatchReaderBuilder::try_new(bytes.clone())
               .unwrap()
               .schema(),
       );

       //println!("Schema: {:?}", schema);

       let options = ArrowReaderOptions::new().with_schema(schema);
       let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(bytes, options)
           .unwrap()
           .with_batch_size(8192)
           .build()
           .unwrap();

       for batch in reader {
           let batch = batch.unwrap();
           println!(
               "Read batch with {} rows and {} columns",
               batch.num_rows(),
               batch.num_columns()
           );
       }

       println!("Done");
   }

   // Hack because the clickbench files were written with the wrong logical type for strings
   fn string_to_view_types(schema: &Arc<Schema>) -> Arc<Schema> {
       let fields: Vec<FieldRef> = schema
           .fields()
           .iter()
           .map(|field| {
               let existing_type = field.data_type();
               if existing_type == &DataType::Utf8 || existing_type == &DataType::Binary {
                   Arc::new(Field::new(
                       field.name(),
                       DataType::Utf8View,
                       field.is_nullable(),
                   ))
               } else {
                   Arc::clone(field)
               }
           })
           .collect();
       Arc::new(Schema::new(fields))
   }
   ```
   
   </p>
   </details> 

