m-mueller678 opened a new issue, #7667:
URL: https://github.com/apache/arrow-rs/issues/7667

   **Which part is this question about**
   library api: multithreaded reading without converting to arrow
   
   **Describe your question**
   I am trying to read a single file from multiple threads. The `ColumnReader`s
   return many protocol errors and sometimes panic. I am not sure whether this
   is a bug or whether I am using the library incorrectly. If the latter, I'd
   love to know what the correct way is.
   
   **Additional context**
   I am guessing that the issue is that multiple `RowGroupReader`s share the 
same file handle and interfere with each other by seeking. However, 
`SerializedFileReader::get_row_group` seems to be inviting me to do exactly 
this kind of sharing by taking a shared reference to the `SerializedFileReader`.
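
   The suspected interference can be illustrated with a std-only sketch. It is
   not parquet code and makes no claim about what the library does internally:
   it only shows that two handles obtained through `File::try_clone` share one
   cursor, while two independently opened handles do not. The names
   `cursor_demo` and `read_after_other_seeks` are hypothetical helpers made up
   for this illustration.

   ```rust
   use std::fs::File;
   use std::io::{Read, Seek, SeekFrom, Write};
   use std::path::Path;

   /// Rewinds `a`, seeks `b` to `offset`, then reads one byte through `a`.
   /// If the two handles share a cursor, the seek on `b` moves `a` as well.
   fn read_after_other_seeks(a: &mut File, b: &mut File, offset: u64) -> std::io::Result<u8> {
       a.seek(SeekFrom::Start(0))?;
       b.seek(SeekFrom::Start(offset))?;
       let mut byte = [0u8; 1];
       a.read_exact(&mut byte)?;
       Ok(byte[0])
   }

   /// Returns (byte read via a `try_clone` pair, byte read via two separate opens).
   fn cursor_demo(path: &Path) -> std::io::Result<(u8, u8)> {
       // Scratch file whose byte at position N is the ASCII digit N.
       File::create(path)?.write_all(b"0123456789")?;

       // Case 1: the second handle is a clone, so both share one cursor.
       let mut a = File::open(path)?;
       let mut cloned = a.try_clone()?;
       let shared = read_after_other_seeks(&mut a, &mut cloned, 5)?;

       // Case 2: two independent opens, each with its own cursor.
       let mut c = File::open(path)?;
       let mut d = File::open(path)?;
       let independent = read_after_other_seeks(&mut c, &mut d, 5)?;

       Ok((shared, independent))
   }
   ```

   Calling `cursor_demo(&std::env::temp_dir().join("cursor_demo.bin"))` from
   `main` returns `(b'5', b'0')`: the cloned pair reads from wherever the other
   handle last seeked, and the independently opened pair does not. If the row
   group readers end up on handles of the first kind, concurrent seeks would
   explain reads landing in the middle of unrelated pages.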
   
   In my example, I use the lineitem table from TPC-H, generated using 
[tpchgen-cli](https://github.com/clflushopt/tpchgen-rs):
   ```sh
   mkdir tpch-data
   cd tpch-data
   tpchgen-cli -s 1 --format=parquet
   cd ..
   ```
   
   Here is the code:
   ```rust
   use parquet::column::reader::ColumnReader;
   use parquet::file::metadata::RowGroupMetaData;
   use parquet::file::reader::{FileReader, RowGroupReader, SerializedFileReader};
   use rayon::prelude::*;
   
   fn find_col(metadata: &RowGroupMetaData, reader: &dyn RowGroupReader, name: &str) -> ColumnReader {
       for (i, x) in metadata.columns().iter().enumerate() {
           if x.column_descr().name() == name {
               return reader.get_column_reader(i).unwrap();
           }
       }
       panic!("column {name:?} not found");
   }
   
   fn main() {
       let reader =
           SerializedFileReader::new(std::fs::File::open("./tpch-data/lineitem.parquet").unwrap())
               .unwrap();
       let metadata = reader.metadata();
       (0..metadata.num_row_groups())
           .into_par_iter()
           .for_each(|i| {
               let metadata = &metadata.row_group(i);
               let reader = reader.get_row_group(i).unwrap();
               let ColumnReader::Int64ColumnReader(mut reader_l_quantity_112) =
                   find_col(metadata, &*reader, "l_quantity")
               else {
                   panic!()
               };
               let mut read_buffer_l_quantity_113 = Vec::new();
               loop {
                   let read_count_126 = reader_l_quantity_112
                       .read_records(10000, None, None, &mut read_buffer_l_quantity_113)
                       .unwrap()
                       .0;
                   if read_count_126 == 0 {
                       break;
                   }
               }
           })
   }
   ```
   
   Here are some of the errors I am seeing:
   ```
   thread '<unnamed>' panicked at src/bin/parquet_issue.rs:34:22:
   called `Result::unwrap()` on an `Err` value: External(ProtocolError { kind: Unknown, message: "cannot skip field type Stop" })
   
   thread '<unnamed>' panicked at src/bin/parquet_issue.rs:34:22:
   called `Result::unwrap()` on an `Err` value: External(ProtocolError { kind: Unknown, message: "missing required field PageHeader.type_" })
   
   thread '<unnamed>' panicked at $HOME/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/parquet-55.1.0/src/encodings/rle.rs:485:58:
   index out of bounds: the len is 50 but the index is 58
   ```
   

