Hi Andrew,

Thanks for checking. Turns out that this was my bad.  What I did
subsequently with the concatenated batches was naive and broken.

I was attempting to build a single Parquet file from the batches using the
ArrowWriter, in what I thought was a parallel manner.  I tried to
"parallelise" the following serial code:

            // Serial version: a single ArrowWriter encodes every batch into
            // one in-memory Parquet file.
            let cursor = InMemoryWriteableCursor::default();
            let mut writer = ArrowWriter::try_new(cursor.clone(), schema, None)?;
            for batch in &batches {
                writer.write(batch)?;
            }
            writer.close()?;

I realised that although the compiler accepted my parallel version of this
code, it was in fact incorrect, and that is what caused the corruption.

I can't see a way to produce a single Parquet file in parallel with the
current implementation.  I think parquet2 can do this, but I had trouble with
parquet2 as it couldn't handle the deeply nested Parquet we have.  I'll check
further into where parquet2 is falling over and raise an issue on the
parquet2 repository.
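
In the meantime, one workaround that does look sound is to give each chunk of
batches its own cursor and ArrowWriter and encode the chunks in parallel,
producing one in-memory Parquet file per chunk rather than a single file.  A
rough sketch of what I have in mind is below; the use of rayon, the chunk
size of 100, the write_chunks_in_parallel name, and the exact
InMemoryWriteableCursor import path and data() accessor are my assumptions
from memory of the parquet 14 API, so they may need adjusting:

    // Sketch only: encode each chunk of batches into its own in-memory
    // Parquet file, one rayon task per chunk, so no writer state is shared
    // between threads.
    use std::sync::Arc;

    use arrow::datatypes::Schema;
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;
    use parquet::errors::ParquetError;
    use parquet::file::writer::InMemoryWriteableCursor; // path as I remember it in parquet 14
    use rayon::prelude::*;

    fn write_chunks_in_parallel(
        schema: Arc<Schema>,
        batches: Vec<RecordBatch>,
    ) -> Result<Vec<Vec<u8>>, ParquetError> {
        batches
            .par_chunks(100)
            .map(|chunk| {
                // Each rayon task owns its own cursor and writer.
                let cursor = InMemoryWriteableCursor::default();
                let mut writer =
                    ArrowWriter::try_new(cursor.clone(), schema.clone(), None)?;
                for batch in chunk {
                    writer.write(batch)?;
                }
                writer.close()?;
                // data() copies the encoded bytes out of the cursor.
                Ok(cursor.data())
            })
            .collect()
    }

It does mean ending up with one Parquet file per chunk instead of a single
file, so it only helps if multiple output files are acceptable.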

Thanks,
Ahmed.

On Thu, May 19, 2022 at 12:21 PM Andrew Lamb <[email protected]> wrote:

> Hi Ahmed,
>
> It is valid to concatenate batches and the process you describe seems
> fine.
>
> Your description certainly sounds as if there is something wrong with
> `concat` that is producing incorrect RecordBatches -- would it be possible
> to provide more information and file a ticket in
> https://github.com/apache/arrow-rs/issues ?
>
>
> Andrew
>
> p.s. I wonder if you are using `StructArray` or `ListArray`s?
>
>
> On Thu, May 19, 2022 at 4:47 AM Ahmed Riza <[email protected]> wrote:
>
>> Hi,
>>
>> If we have an Arrow RecordBatch per Parquet file created via
>> ParquetFileArrowReader, is it valid to concatenate these multiple batches?
>>
>> Let's say we have 1000 Parquet files, and created a Vec<RecordBatch>
>> containing 1000 Record Batches. What we'd like to do is, take chunks of,
>> say, 100 of these at a time, and concatenate them to produce a vector of 10
>> Record Batches.  Something like the following:
>>
>>             let combined_record_batches = record_batches
>>                 .chunks(100)
>>                 .map(|rb_chunk| {
>>                     RecordBatch::concat(&schema, rb_chunk).map_err(anyhow::Error::from)
>>                 })
>>                 .collect::<anyhow::Result<Vec<_>>>()?;
>>
>> Whilst the above works as far as concatenating them goes, we've
>> found that the resulting Record Batches cannot be converted to Parquet as
>> they seem to be corrupted somehow.  That is, using an ArrowWriter and
>> writing these concatenated Record Batches results in panics such as the
>> following:
>>
>> A thread panicked, PanicInfo { payload: Any { .. }, message: Some(index
>> out of bounds: the len is 163840 but the index is 18446744073709387776),
>> location: Location { file: "/home/ahmed/.cargo/registr
>> y/src/github.com-1ecc6299db9ec823/parquet-14.0.0/src/arrow/levels.rs",
>> line: 504, col: 41 }, can_unwind: true }
>>
>> Thanks,
>> Ahmed Riza
>>
>

-- 
Ahmed Riza
