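Did you check the offsets array? AFAIU, one way of constructing the chunks of a list array is to have every chunk reference the same global values array and give each chunk monotonically increasing offsets into it. In that case the shared values pointer is expected, and each chunk's data is the range of values its own offsets cover. If the chunks share the values array but the offsets of every chunk start at zero, that would be a bug.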
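To make the layout concrete, here is a minimal sketch in plain C++ that does not use Arrow at all. `ListChunk`, `chunk_values` and `rebased_offsets` are hypothetical names invented for this illustration; in Arrow C++ the corresponding pieces would be `arrow::ListArray::values()` and `arrow::ListArray::value_offset(i)`, and the assumption is that a chunk's real data is the range of the shared values array between its first and last offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one chunk of a list column: a pointer to a
// values buffer plus this chunk's own offsets (size = number of lists + 1).
struct ListChunk {
    const std::vector<int64_t>* values;  // may be shared by several chunks
    std::vector<int32_t> offsets;        // list i is [offsets[i], offsets[i+1])
};

// Copy out only the values that this chunk's offsets actually reference.
std::vector<int64_t> chunk_values(const ListChunk& chunk) {
    assert(!chunk.offsets.empty());
    const int32_t begin = chunk.offsets.front();
    const int32_t end = chunk.offsets.back();
    assert(begin >= 0 && begin <= end);
    assert(static_cast<std::size_t>(end) <= chunk.values->size());
    return std::vector<int64_t>(chunk.values->begin() + begin,
                                chunk.values->begin() + end);
}

// Shift the offsets so they start at zero and index into the result of
// chunk_values() instead of the shared buffer.
std::vector<int32_t> rebased_offsets(const ListChunk& chunk) {
    assert(!chunk.offsets.empty());
    std::vector<int32_t> out;
    out.reserve(chunk.offsets.size());
    for (int32_t offset : chunk.offsets) {
        out.push_back(offset - chunk.offsets.front());
    }
    return out;
}
```

For example, with a shared values buffer {10, 11, 12, 13, 14, 15}, a first chunk with offsets {0, 2, 3} and a second chunk with offsets {3, 3, 6}, `chunk_values` gives {10, 11, 12} for the first chunk and {13, 14, 15} for the second, even though both chunks point at the same buffer. Taking the whole values array for every chunk, without consulting the offsets, is what would produce duplicated data.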

On Wed, Nov 16, 2022 at 9:10 AM Arthur Passos <[email protected]> wrote:

> Hi Alan,
>
> In my case, *arrow::ListArray::values* seems to point to the same memory
> location for all chunks. It feels like I need to offset it by the chunk
> offset or something like that, but that would assume the
> *arrow::ListArray::values* method always points to the same memory
> location for all chunks, which doesn't seem to be the case for other files.
>
> Thanks for the ArrowWriterProperties tip.
>
> Best,
> Arthur
> ------------------------------
> *From:* Alan Souza via user <[email protected]>
> *Sent:* Wednesday, November 16, 2022 11:02
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [C++] Need an example on how to extract data from a column
> of type Array(int64) with multiple chunks
>
> Hello Arthur. I am using something like this:
>
>     auto chunked_column = table->GetColumnByName(col_name);
>     // I have only one chunk but this is not a problem
>     auto listArray = std::static_pointer_cast<arrow::LargeListArray>(
>         chunked_column->chunk(0));
>     auto array = std::static_pointer_cast<arrow::FloatArray>(
>         listArray->values());
>
> In this example I am using the LargeListArray, but it is similar for the
> ListArray.
>
> Not related to your issue, but it is necessary to customize the options of
> the ArrowWriterProperties to save all the type information, for instance:
>
>     parquet::ArrowWriterProperties::Builder builder;
>     builder.store_schema();
>
> Without this, the parquet file created by the arrow library has a
> ListArray instead of a LargeListArray on these columns.
>
> On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos
> <[email protected]> wrote:
>
> Hi Niranda
>
> Yes, one of the columns (there are over 50 columns in this file) is of
> type List<Int64>. You can see that in the parquet-tools inspect output
> below:
>
> arthur@arthur:~/parquet-validation$ parquet-tools inspect ~/Downloads/test_file.parquet | grep test_array_column -A 10
> path: test_array_column.list.element
> max_definition_level: 2
> max_repetition_level: 1
> physical_type: INT64
> logical_type: None
> converted_type (legacy): NONE
> compression: GZIP (space_saved: 56%)
>
> As far as I know, the arrow lib represents List columns with an array of
> offsets and one or more chunks of memory storing the nested column data.
> On my side, I have a very similar structure, so I would like to extract
> both the array of offsets and the nested column data with the least amount
> of copying possible.
>
> Best,
> Arthur
>
> ------------------------------
> *From:* Niranda Perera <[email protected]>
> *Sent:* Wednesday, November 16, 2022 10:28
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [C++] Need an example on how to extract data from a column
> of type Array(int64) with multiple chunks
>
> Hi Arthur,
>
> I'm not very clear about the use case here. Just to clarify, in your
> original parquet file, do you have List<int64> typed columns?
>
> On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]>
> wrote:
>
> Hi
>
> I am reading a parquet file with arrow::RecordBatchReader and the
> arrow::Table returned contains columns with two chunks
> (column->num_chunks() == 2). The column in question, although not limited
> to, is of type Array(Int64).
>
> I want to extract the data (nested column data) as well as the offsets
> from that column. I have found only one example
> <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121>
> of Array columns and it assumes the nested type is known at compile time
> AND the column has only one chunk.
>
> I have tried to loop over the Array(Int64) column chunks and grab the
> `values()` member, but for some reason, for that specific Parquet file, the
> values member points to the same memory location. Therefore, if I do
> something like the below, I end up with duplicated data:
>
> static std::shared_ptr<arrow::ChunkedArray>
> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
> {
>     arrow::ArrayVector array_vector;
>     array_vector.reserve(arrow_column->num_chunks());
>     for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
>     {
>         arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
>         std::shared_ptr<arrow::Array> chunk = list_chunk.values();
>         array_vector.emplace_back(std::move(chunk));
>     }
>     return std::make_shared<arrow::ChunkedArray>(array_vector);
> }
>
> I can provide more info, but to keep the initial request short and simple,
> I'll leave it at that.
>
> Thanks in advance,
> Arthur
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
