Raphael has proposed a PR[1] to improve this situation. Ahmed, I wonder if you have a chance to add your opinion.
[1] https://github.com/apache/arrow-rs/pull/1719

On Sat, May 21, 2022 at 6:42 AM Andrew Lamb <[email protected]> wrote:

> Thanks Ahmed, yes I can see that if you tried to write multiple
> RecordBatches to the same stream concurrently this would cause a problem.
>
> I filed [1] for the corrupt file and [2] for supporting parallel writes --
> if you are able to provide examples of the parallel code that compiles, as
> well as of what you did with parquet2, that would be most helpful. Either
> via email or directly on the ticket.
>
> Thanks again for the report,
> Andrew
>
> [1] https://github.com/apache/arrow-rs/issues/1717
> [2] https://github.com/apache/arrow-rs/issues/1718
>
> On Thu, May 19, 2022 at 9:45 AM Ahmed Riza <[email protected]> wrote:
>
>> Hi Andrew,
>>
>> Thanks for checking. Turns out that this was my bad. What I did
>> subsequently with the concatenated batches was naive and broken.
>>
>> I was attempting to build a single Parquet file from the batches, in
>> what I thought was a parallel manner, using the ArrowWriter. I tried to
>> "parallelise" the following serial code:
>>
>>     let cursor = InMemoryWriteableCursor::default();
>>     let mut writer = ArrowWriter::try_new(cursor.clone(), schema, None)?;
>>     for batch in &batches {
>>         writer.write(batch)?;
>>     }
>>     writer.close()?;
>>
>> I realised that although the compiler accepted my incorrect parallel
>> version of this code, it was in fact not sound, which is what caused the
>> corruption.
>>
>> I can't see a way to do this in parallel with the current implementation.
>> I think parquet2 can do this, but I had trouble with parquet2 as it
>> couldn't handle the deeply nested Parquet we have. I will check further
>> into where parquet2 is falling over and raise it with parquet2.
>>
>> Thanks,
>> Ahmed.
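One pattern that stays sound with the current writer is to parallelise across files rather than within one: give each thread its own buffer and its own ArrowWriter, so no writer state is ever shared. Below is a minimal sketch of that idea, assuming a recent parquet release where ArrowWriter accepts any impl Write + Send (in the 14.0.0 era the InMemoryWriteableCursor above played that role); write_in_parallel and the chunk size of 100 are illustrative, not part of the thread:

    use std::sync::Arc;
    use std::thread;

    use arrow::datatypes::Schema;
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;
    use parquet::errors::Result;

    // Each thread owns its writer and its output buffer, so the unsound
    // sharing that corrupted the single-stream version cannot occur.
    fn write_in_parallel(
        batches: Vec<RecordBatch>,
        schema: Arc<Schema>,
    ) -> Result<Vec<Vec<u8>>> {
        let handles: Vec<_> = batches
            .chunks(100)
            .map(|chunk| {
                let chunk = chunk.to_vec();
                let schema = schema.clone();
                thread::spawn(move || -> Result<Vec<u8>> {
                    let mut buffer = Vec::new();
                    let mut writer = ArrowWriter::try_new(&mut buffer, schema, None)?;
                    for batch in &chunk {
                        writer.write(batch)?;
                    }
                    writer.close()?;
                    Ok(buffer)
                })
            })
            .collect();
        // Each buffer holds one complete, independently valid Parquet file.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    }

The trade-off is one Parquet file per chunk instead of a single file; producing a single file in parallel is what [2] above tracks.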
>> On Thu, May 19, 2022 at 12:21 PM Andrew Lamb <[email protected]> wrote:
>>
>>> Hi Ahmed,
>>>
>>> It is valid to concatenate batches, and the process you describe seems
>>> fine.
>>>
>>> Your description certainly sounds as if there is something wrong with
>>> `concat` that is producing incorrect RecordBatches -- would it be
>>> possible to provide more information and file a ticket at
>>> https://github.com/apache/arrow-rs/issues ?
>>>
>>> Andrew
>>>
>>> p.s. I wonder if you are using `StructArray`s or `ListArray`s?
>>>
>>> On Thu, May 19, 2022 at 4:47 AM Ahmed Riza <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> If we have an Arrow RecordBatch per Parquet file, created via
>>>> ParquetFileArrowReader, is it valid to concatenate these multiple
>>>> batches?
>>>>
>>>> Let's say we have 1000 Parquet files and have created a Vec<RecordBatch>
>>>> containing 1000 RecordBatches. What we'd like to do is take chunks of,
>>>> say, 100 of these at a time and concatenate them to produce a vector of
>>>> 10 RecordBatches. Something like the following:
>>>>
>>>>     let combined_record_batches = record_batches
>>>>         .chunks(100)
>>>>         .map(|rb_chunk| RecordBatch::concat(&schema, rb_chunk))
>>>>         .collect::<Result<Vec<_>, _>>()?;
>>>>
>>>> Whilst the above works as far as concatenating them goes, we've found
>>>> that the resulting RecordBatches cannot be converted to Parquet, as
>>>> they seem to be corrupted somehow. That is, using an ArrowWriter and
>>>> writing these concatenated RecordBatches results in panics such as the
>>>> following:
>>>>
>>>>     A thread panicked, PanicInfo { payload: Any { .. }, message:
>>>>     Some(index out of bounds: the len is 163840 but the index is
>>>>     18446744073709387776), location: Location { file:
>>>>     "/home/ahmed/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-14.0.0/src/arrow/levels.rs",
>>>>     line: 504, col: 41 }, can_unwind: true }
>>>>
>>>> Thanks,
>>>> Ahmed Riza
>>
>> --
>> Ahmed Riza
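For reference, here is the chunk-and-concat pattern from the original question as a self-contained program: a sketch using arrow::compute::concat_batches (the successor to RecordBatch::concat in later arrow releases), with a single hypothetical Int64 column standing in for the real data:

    use std::sync::Arc;

    use arrow::array::{ArrayRef, Int64Array};
    use arrow::compute::concat_batches;
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::error::ArrowError;
    use arrow::record_batch::RecordBatch;

    fn main() -> Result<(), ArrowError> {
        let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));

        // Stand-in for the 1000 batches read back from the Parquet files.
        let batches = (0..1000)
            .map(|i| {
                let column: ArrayRef = Arc::new(Int64Array::from(vec![i as i64]));
                RecordBatch::try_new(schema.clone(), vec![column])
            })
            .collect::<Result<Vec<_>, _>>()?;

        // Concatenate 100 batches at a time, yielding 10 larger batches.
        let combined = batches
            .chunks(100)
            .map(|rb_chunk| concat_batches(&schema, rb_chunk))
            .collect::<Result<Vec<_>, _>>()?;

        assert_eq!(combined.len(), 10);
        Ok(())
    }

As the thread concludes, the concatenation itself is valid; the corruption came from the unsound parallel write, not from concat.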
