samgqroberts opened a new issue, #6988:
URL: https://github.com/apache/arrow-rs/issues/6988

   **Describe the bug**
   If we create a RecordBatch with no columns (and no rows), serialize it to 
Parquet bytes via `parquet::arrow::ArrowWriter`, and attempt to deserialize it 
back via `parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder`, we 
get the error `"Repetition level must be defined for a primitive type"`.
   
   **To Reproduce**
   ```rust
   use arrow::array::RecordBatch;
   use arrow::array::RecordBatchOptions;
   use arrow::datatypes::Field;
   use arrow::datatypes::Schema;
   use bytes::Bytes;
   use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
   use parquet::arrow::ArrowWriter;
   use parquet::basic::Compression;
   use parquet::basic::ZstdLevel;
   use parquet::file::properties::WriterProperties;
   use std::sync::Arc;
   
   #[test]
   fn arrow_rs_repro() {
       // create empty record batch
       let empty_fields: Vec<Field> = vec![];
       let empty_schema = Arc::new(Schema::new(empty_fields));
       let empty_batch = RecordBatch::try_new_with_options(
           empty_schema,
           vec![],
           &RecordBatchOptions::default().with_row_count(Some(0)),
       )
       .unwrap();
   
       // write to parquet
       let mut parquet_bytes: Vec<u8> = Vec::new();
       let props = WriterProperties::builder()
           .set_compression(Compression::ZSTD(ZstdLevel::default()))
           .build();
       let mut writer =
           ArrowWriter::try_new(&mut parquet_bytes, empty_batch.schema(), Some(props)).unwrap();
       writer.write(&empty_batch).unwrap();
       writer.close().unwrap();
       assert_eq!(
           String::from_utf8_lossy(&parquet_bytes),
           "PAR1\u{15}\u{2}\u{19}\u{1c}H\u{c}arrow_schema\u{15}\0\0\u{16}\0\u{19}\u{c}\u{19}\u{1c}\u{18}\u{c}ARROW:schema\u{18}L/////zAAAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAAAAAA=\0\u{18}\u{19}parquet-rs version 53.3.0\u{19}\u{c}\0�\0\0\0PAR1"
       );
   
       // read from parquet
       let bytes = Bytes::from(parquet_bytes);
       let result = ParquetRecordBatchReaderBuilder::try_new(bytes);
       // REPRODUCTION: below fails with
       // called `Result::unwrap()` on an `Err` value:
       // General("Repetition level must be defined for a primitive type")
       result.unwrap();
   }
   ```
   
   **Expected behavior**
   We should be able to reconstruct our original no-column RecordBatch.
   
   **Additional context**
   Using arrow / parquet crate versions `53.3.0`
   
   I tested this behavior in PyArrow v18.1.0 (which is backed by arrow-cpp) and 
I found that:
   - The Parquet bytes produced by PyArrow via `pyarrow.parquet.write_table()` 
for a no-column `pa.Table` can be successfully read by 
`pyarrow.parquet.read_table()` as well as the Rust 
`ParquetRecordBatchReaderBuilder` approach above.
   - `pyarrow.parquet.read_table()` can successfully read the bytes produced by 
rust (`parquet_bytes` above).
   
   I did a little debugging and found two differences in the produced file 
metadata between PyArrow and arrow-rs:
   1. The file metadata in the PyArrow-produced Parquet bytes has a single 
SchemaElement with num_children: 0 and repetition_type: 0. The file metadata in 
the Rust-produced bytes also has a single SchemaElement with num_children: 0, 
but its repetition_type is unset. This discrepancy causes 
`schema::types::from_thrift_helper` to return the error for the Rust-produced 
bytes.
   2. The PyArrow file metadata has a single row group with 0s for 
total_byte_size, num_rows, etc., whereas the Rust file metadata has no row 
groups. I'm not sure this is important for this particular error.
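
For anyone who wants to repeat this comparison, here is a minimal std-only sketch of how the Thrift-encoded file metadata can be located in bytes such as `parquet_bytes`. It assumes the usual unencrypted layout (leading `PAR1`, then data, then metadata, then a 4-byte little-endian metadata length, then trailing `PAR1`). `footer_range` is a hypothetical helper name, not part of the parquet crate, and decoding the Thrift payload itself is out of scope here.

```rust
/// Magic bytes that open and close an unencrypted Parquet file.
const MAGIC: &[u8; 4] = b"PAR1";

/// Hypothetical helper: returns the byte range of the Thrift-encoded
/// file metadata inside `file`, or a message explaining why the footer
/// could not be located.
fn footer_range(file: &[u8]) -> Result<std::ops::Range<usize>, String> {
    // Smallest possible file: leading magic + 4-byte length + trailing magic.
    if file.len() < 12 {
        return Err(format!("file too short: {} bytes", file.len()));
    }
    if !file.starts_with(MAGIC) || !file.ends_with(MAGIC) {
        return Err("missing PAR1 magic".to_string());
    }
    // The metadata length sits just before the trailing magic.
    let len_start = file.len() - 8;
    let len_bytes = [
        file[len_start],
        file[len_start + 1],
        file[len_start + 2],
        file[len_start + 3],
    ];
    let meta_len = u32::from_le_bytes(len_bytes) as usize;
    // The metadata must fit between the leading magic and the length field.
    if meta_len > len_start - 4 {
        return Err(format!("metadata length {} exceeds file size", meta_len));
    }
    Ok(len_start - meta_len..len_start)
}
```

Slicing both the PyArrow-produced and the Rust-produced bytes with the returned range gives the two metadata payloads to compare side by side.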
   
   This perhaps points to two distinct bugs:
   1. Like PyArrow, `ParquetRecordBatchReaderBuilder` should be able to read 
the Rust-produced bytes, forgiving the lack of repetition_type in the 
SchemaElement at least for the case where there are 0 columns.
   2. Like PyArrow, `ArrowWriter` should be able to produce a Parquet file that 
properly specifies that repetition_type.
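
To make point 1 concrete, here is a rough sketch of what a more forgiving check could look like. This is not the actual `from_thrift_helper` code: `SchemaElement`, `Node`, and `convert_element` are hypothetical, simplified stand-ins written with std only, and the assumption is that the caller knows whether it is converting the root element.

```rust
/// Hypothetical, simplified stand-in for the Thrift `SchemaElement`;
/// the real struct carries more fields.
#[derive(Debug)]
struct SchemaElement {
    name: String,
    repetition_type: Option<i32>,
    num_children: Option<i32>,
}

/// Hypothetical result of converting one schema element.
#[derive(Debug, PartialEq)]
enum Node {
    Group { name: String, num_children: usize },
    Primitive { name: String, repetition_type: i32 },
}

/// Converts one element. The root is always treated as a group, even when
/// it has zero children and no repetition_type, which is the shape of the
/// Rust-produced file described above.
fn convert_element(element: &SchemaElement, is_root: bool) -> Result<Node, String> {
    let num_children = element.num_children.unwrap_or(0).max(0) as usize;
    if is_root || num_children > 0 {
        // Non-root groups are accepted as-is in this sketch.
        return Ok(Node::Group {
            name: element.name.clone(),
            num_children,
        });
    }
    // A non-root element without children is a leaf and still needs a
    // repetition_type.
    match element.repetition_type {
        Some(repetition_type) => Ok(Node::Primitive {
            name: element.name.clone(),
            repetition_type,
        }),
        None => Err("Repetition level must be defined for a primitive type".to_string()),
    }
}
```

With this shape, a root element named `arrow_schema` with `num_children: Some(0)` and no `repetition_type` converts to an empty group, while a non-root leaf without a `repetition_type` still produces the original error.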
   
   I'd also very much accept a "Hey you didn't see this config option over 
here" as a solution for my particular usage!


