Hi,
If we have an Arrow RecordBatch per Parquet file created via
ParquetFileArrowReader, is it valid to concatenate these multiple batches?
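For reference, each per-file Record Batch is produced roughly like this (a simplified sketch; the file path and batch size here are just illustrative):

use std::fs::File;
use std::sync::Arc;
use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

// Read one Parquet file into a single RecordBatch.
let file = File::open("data/part-0001.parquet")?;
let file_reader = SerializedFileReader::new(file)?;
let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));
// The batch size is chosen large enough to cover the whole file in one batch.
let mut batch_reader = arrow_reader.get_record_reader(1024 * 1024)?;
let record_batch = batch_reader.next().unwrap()?;
// The schema passed to RecordBatch::concat below is the shared SchemaRef,
// e.g. taken from the first batch: let schema = record_batch.schema();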
Let's say we have 1000 Parquet files and have created a Vec<RecordBatch>
containing 1000 Record Batches. What we'd like to do is take chunks of,
say, 100 of these at a time and concatenate each chunk, producing a vector
of 10 Record Batches. Something like the following:
let combined_record_batches = record_batches
    .chunks(100)
    .map(|rb_chunk| RecordBatch::concat(&schema, rb_chunk).map_err(anyhow::Error::from))
    .collect::<anyhow::Result<Vec<_>>>()?;
Whilst the above works as far as concatenating goes, we've found that the
resulting Record Batches cannot be written back out as Parquet; they seem
to be corrupted somehow. That is, writing these concatenated Record Batches
with an ArrowWriter results in panics such as the following:
A thread panicked, PanicInfo { payload: Any { .. },
message: Some(index out of bounds: the len is 163840 but the index is 18446744073709387776),
location: Location { file: "/home/ahmed/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-14.0.0/src/arrow/levels.rs",
line: 504, col: 41 }, can_unwind: true }
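For context, the writing side is roughly the following (again a simplified sketch; the output path and the default writer properties are just for illustration):

use std::fs::File;
use parquet::arrow::ArrowWriter;

// Write the concatenated batches back out as a single Parquet file.
let out_file = File::create("data/combined.parquet")?;
let mut writer = ArrowWriter::try_new(out_file, schema.clone(), None)?;
for batch in &combined_record_batches {
    // The panic above is raised from within this write call.
    writer.write(batch)?;
}
writer.close()?;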
Thanks,
Ahmed Riza