samgqroberts opened a new issue, #6988:
URL: https://github.com/apache/arrow-rs/issues/6988
**Describe the bug**
If we create a RecordBatch with no columns (and no rows), serialize it to
Parquet bytes via `parquet::arrow::ArrowWriter`, and attempt to deserialize it
back via `parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder`, we
get the error `"Repetition level must be defined for a primitive type"`.
**To Reproduce**
```rust
use std::sync::Arc;

use arrow::array::{RecordBatch, RecordBatchOptions};
use arrow::datatypes::{Field, Schema};
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::basic::{Compression, ZstdLevel};
use parquet::file::properties::WriterProperties;

#[test]
fn arrow_rs_repro() {
    // create empty record batch (zero columns, zero rows)
    let empty_fields: Vec<Field> = vec![];
    let empty_schema = Arc::new(Schema::new(empty_fields));
    let empty_batch = RecordBatch::try_new_with_options(
        empty_schema,
        vec![],
        &RecordBatchOptions::default().with_row_count(Some(0)),
    )
    .unwrap();

    // write to parquet
    let mut parquet_bytes: Vec<u8> = Vec::new();
    let props = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::default()))
        .build();
    let mut writer =
        ArrowWriter::try_new(&mut parquet_bytes, empty_batch.schema(), Some(props)).unwrap();
    writer.write(&empty_batch).unwrap();
    writer.close().unwrap();
    assert_eq!(
        String::from_utf8_lossy(&parquet_bytes),
        "PAR1\u{15}\u{2}\u{19}\u{1c}H\u{c}arrow_schema\u{15}\0\0\u{16}\0\u{19}\u{c}\u{19}\u{1c}\u{18}\u{c}ARROW:schema\u{18}L/////zAAAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAAAAAA=\0\u{18}\u{19}parquet-rs version 53.3.0\u{19}\u{c}\0�\0\0\0PAR1"
    );

    // read from parquet
    let bytes = Bytes::from(parquet_bytes);
    let result = ParquetRecordBatchReaderBuilder::try_new(bytes);
    // REPRODUCTION: the unwrap below fails with:
    //   called `Result::unwrap()` on an `Err` value:
    //   General("Repetition level must be defined for a primitive type")
    result.unwrap();
}
```
**Expected behavior**
We should be able to reconstruct our original no-column RecordBatch.
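For concreteness, here's a sketch of the round trip I'd expect to succeed, using the same 53.3.0 APIs as the repro above (assuming `parquet_bytes` from that test):

```rust
// Sketch of the expected round trip; `parquet_bytes` comes from the repro above.
let bytes = Bytes::from(parquet_bytes);
let reader = ParquetRecordBatchReaderBuilder::try_new(bytes)
    .unwrap() // this is the call that currently errors
    .build()
    .unwrap();
// A zero-column file may yield zero batches; any batch that is produced
// should itself have zero columns.
for batch in reader {
    assert_eq!(batch.unwrap().num_columns(), 0);
}
```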
**Additional context**
Using arrow / parquet crate versions `53.3.0`
I tested this behavior in PyArrow v18.1.0 (which is backed by arrow-cpp) and found that:
- The Parquet bytes produced by PyArrow via `pyarrow.parquet.write_table()` for a no-column `pa.Table` can be successfully read both by `pyarrow.parquet.read_table()` and by the Rust `ParquetRecordBatchReaderBuilder` approach above.
- `pyarrow.parquet.read_table()` can successfully read the bytes produced by Rust (`parquet_bytes` above).
I did a little debugging and found two differences in the file metadata produced by PyArrow versus arrow-rs:
1. The file metadata in the PyArrow-produced Parquet bytes has a single SchemaElement with num_children: 0 and repetition_type: 0. The file metadata in the Rust-produced bytes also has the single SchemaElement with num_children: 0, but its repetition_type is unset. This discrepancy is what causes `schema::types::from_thrift_helper` to return the error for the Rust-produced bytes.
2. The PyArrow file metadata has a single row group with 0 for total_byte_size, num_rows, etc., whereas the Rust file metadata has no row groups at all. I'm not sure this matters for this particular error.
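To make difference 1 concrete, here's what I'd expect the parsed root schema node to look like for a zero-column file, sketched with the public `parquet::schema::types` builder (my assumption about the desired parse result, not what the reader currently produces):

```rust
use parquet::basic::Repetition;
use parquet::schema::types::Type;

// A Parquet root SchemaElement is always a group; with num_children: 0 it
// should parse as an empty group rather than be treated as a primitive.
// Repetition::REQUIRED corresponds to the repetition_type: 0 that PyArrow writes.
let root = Type::group_type_builder("schema")
    .with_repetition(Repetition::REQUIRED)
    .build()
    .unwrap();
assert!(root.is_group());
assert_eq!(root.get_fields().len(), 0);
```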
This perhaps points to two distinct bugs:
1. Like PyArrow, `ParquetRecordBatchReaderBuilder` should be able to read the Rust-produced bytes, forgiving the missing repetition_type in the SchemaElement, at least in the zero-column case.
2. Like PyArrow, `ArrowWriter` should produce a Parquet file that properly specifies that repetition_type.

I'd also very much accept a "Hey, you didn't see this config option over here" as a solution for my particular usage!