Thank you! Getting feedback about the API changes from a user would be very helpful.

Andrew

On Mon, May 23, 2022 at 6:44 AM Ahmed Riza <[email protected]> wrote:

> Thanks Andrew. Will take a deeper look. Can see API changes, in
> particular around the in-memory cursor we are currently using.
>
> Also need to create a minimal Parquet file to demonstrate the issues
> we've seen.
>
> Thanks,
> Ahmed
>
> On Mon, 23 May 2022, 11:28 Andrew Lamb <[email protected]> wrote:
>
>> Raphael has a proposed PR [1] to improve this situation.
>>
>> Ahmed, I wonder if you have a chance to add your opinion.
>>
>> [1] https://github.com/apache/arrow-rs/pull/1719
>>
>> On Sat, May 21, 2022 at 6:42 AM Andrew Lamb <[email protected]> wrote:
>>
>>> Thanks Ahmed. Yes, I can see that if you tried to write multiple
>>> RecordBatches to the same stream concurrently this would cause a
>>> problem.
>>>
>>> I filed [1] for the corrupt file and [2] for supporting parallel
>>> writes -- if you are able to provide an example of the parallel code
>>> that compiles, as well as what you did with parquet2, that would be
>>> most helpful, either via email or directly on the ticket.
>>>
>>> Thanks again for the report,
>>> Andrew
>>>
>>> [1] https://github.com/apache/arrow-rs/issues/1717
>>> [2] https://github.com/apache/arrow-rs/issues/1718
>>>
>>> On Thu, May 19, 2022 at 9:45 AM Ahmed Riza <[email protected]> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> Thanks for checking. It turns out that this was my bad. What I did
>>>> subsequently with the concatenated batches was naive and broken.
>>>>
>>>> I was attempting to build a single Parquet file from the batches in
>>>> what I thought was a parallel manner using the ArrowWriter. I tried
>>>> to "parallelise" the following serial code.
>>>>
>>>>     let cursor = InMemoryWriteableCursor::default();
>>>>     let mut writer = ArrowWriter::try_new(cursor.clone(), schema, None)?;
>>>>     for batch in batches {
>>>>         writer.write(batch)?;
>>>>     }
>>>>     writer.close()?;
>>>>
>>>> I realised that although the compiler accepted my incorrect parallel
>>>> version of this code, it was in fact not sound, which caused the
>>>> corruption.
>>>>
>>>> I can't see a way to do this in parallel with the current
>>>> implementation. I think parquet2 can do this, but I had trouble with
>>>> parquet2 as it couldn't handle the deeply nested Parquet we have.
>>>> Will check further as to where parquet2 is falling over and raise it
>>>> on parquet2.
>>>>
>>>> Thanks,
>>>> Ahmed
>>>>
>>>> On Thu, May 19, 2022 at 12:21 PM Andrew Lamb <[email protected]> wrote:
>>>>
>>>>> Hi Ahmed,
>>>>>
>>>>> It is valid to concatenate batches, and the process you describe
>>>>> seems fine.
>>>>>
>>>>> Your description certainly sounds as if there is something wrong
>>>>> with `concat` that is producing incorrect RecordBatches -- would it
>>>>> be possible to provide more information and file a ticket at
>>>>> https://github.com/apache/arrow-rs/issues ?
>>>>>
>>>>> Andrew
>>>>>
>>>>> P.S. I wonder if you are using `StructArray`s or `ListArray`s?
>>>>>
>>>>> On Thu, May 19, 2022 at 4:47 AM Ahmed Riza <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> If we have an Arrow RecordBatch per Parquet file created via
>>>>>> ParquetFileArrowReader, is it valid to concatenate these multiple
>>>>>> batches?
>>>>>>
>>>>>> Let's say we have 1000 Parquet files, and have created a
>>>>>> Vec<RecordBatch> containing 1000 RecordBatches. What we'd like to
>>>>>> do is take chunks of, say, 100 of these at a time, and concatenate
>>>>>> them to produce a vector of 10 RecordBatches.
>>>>>> Something like the following:
>>>>>>
>>>>>>     let combined_record_batches = record_batches
>>>>>>         .chunks(100)
>>>>>>         .map(|rb_chunk| RecordBatch::concat(&schema, rb_chunk))
>>>>>>         .collect::<anyhow::Result<Vec<_>>>()?;
>>>>>>
>>>>>> Whilst the above works as far as concatenating them goes, we've
>>>>>> found that the resulting RecordBatches cannot be converted to
>>>>>> Parquet, as they seem to be corrupted somehow. That is, using an
>>>>>> ArrowWriter and writing these concatenated RecordBatches results
>>>>>> in panics such as the following:
>>>>>>
>>>>>>     A thread panicked, PanicInfo { payload: Any { .. }, message:
>>>>>>     Some(index out of bounds: the len is 163840 but the index is
>>>>>>     18446744073709387776), location: Location { file:
>>>>>>     "/home/ahmed/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-14.0.0/src/arrow/levels.rs",
>>>>>>     line: 504, col: 41 }, can_unwind: true }
>>>>>>
>>>>>> Thanks,
>>>>>> Ahmed Riza
>>>>>
>>>>
>>>> --
>>>> Ahmed Riza
>>>
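One observation on the panic quoted in the thread: the out-of-bounds index 18446744073709387776 is exactly 2^64 - 163840, i.e. the value -163840 wrapped into a u64/usize. That pattern usually indicates an unchecked subtraction underflowing in release-mode index arithmetic (here, presumably in the levels computation in parquet's levels.rs) rather than a genuinely huge index. Which exact subtraction underflows is a guess; the wrap-around arithmetic itself is easy to verify:

```rust
fn main() {
    // len from the panic message.
    let len: i64 = 163840;

    // Subtracting past zero and casting to u64 reproduces the
    // "impossible" index reported in the panic.
    let wrapped = (0i64 - len) as u64;
    assert_eq!(wrapped, 18446744073709387776);
    println!("-{} wrapped as u64 = {}", len, wrapped);
}
```

Seeing `usize::MAX - small_number`-style indices in a panic is generally a strong hint to look for a `a - b` on unsigned values where `b > a`.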

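For anyone hitting the same limitation discussed above: since a single ArrowWriter could not safely be shared across threads at the time, one workaround is to write one Parquet file (or in-memory buffer) per chunk of batches, each on its own thread, and treat the results as a multi-file dataset. The sketch below shows only the fan-out/fan-in shape using std::thread; `encode_chunk` is a hypothetical stand-in (a placeholder for "open an ArrowWriter over a fresh buffer, write the chunk's batches, close it") so the example compiles with the standard library alone -- it is not arrow-rs API.

```rust
use std::thread;

// Hypothetical stand-in for encoding one chunk of record batches into
// one independent output buffer. Here a "batch" is just a Vec<u8>.
fn encode_chunk(chunk: Vec<Vec<u8>>) -> Vec<u8> {
    chunk.into_iter().flatten().collect()
}

fn main() {
    // Ten fake "record batches", split into chunks of two.
    let batches: Vec<Vec<u8>> = (0u8..10).map(|i| vec![i; 4]).collect();
    let chunks: Vec<Vec<Vec<u8>>> =
        batches.chunks(2).map(|c| c.to_vec()).collect();

    // Fan out: one writer (hence one output buffer) per chunk -- no
    // shared mutable writer, so no cross-thread corruption is possible.
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| thread::spawn(move || encode_chunk(chunk)))
        .collect();

    // Fan in: one encoded buffer per chunk, preserving chunk order.
    let files: Vec<Vec<u8>> = handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect();

    assert_eq!(files.len(), 5);
    assert_eq!(files[0].len(), 8); // two 4-byte batches per chunk
    println!("wrote {} independent buffers", files.len());
}
```

The key property is that each thread owns its writer and its buffer outright; the corruption described in the thread came from several threads feeding one writer/stream concurrently.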