Hi Niranda,

Yes, one of the columns (the file has over 50 columns) is of type List<Int64>. You can see that in the parquet-tools inspect output below:

    arthur@arthur:~/parquet-validation$ parquet-tools inspect ~/Downloads/test_file.parquet | grep test_array_column -A 10
    path: test_array_column.list.element
    max_definition_level: 2
    max_repetition_level: 1
    physical_type: INT64
    logical_type: None
    converted_type (legacy): NONE
    compression: GZIP (space_saved: 56%)

As far as I know, the arrow lib represents List columns with an array of offsets and one or more chunks of memory storing the nested column data. On my side, I have a very similar structure, so I would like to extract both the array of offsets and the nested column data with the least amount of copying possible.

Best,
Arthur

________________________________
From: Niranda Perera <[email protected]>
Sent: Wednesday, November 16, 2022 10:28
To: [email protected] <[email protected]>
Subject: Re: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks

Hi Arthur,

I'm not very clear about the use case here. Just to clarify, in your original parquet file, do you have List<int64> typed columns?

On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]> wrote:

Hi

I am reading a parquet file with arrow::RecordBatchReader, and the arrow::Table returned contains columns with two chunks (column->num_chunks() == 2). The column in question (though it is not the only one) is of type Array(Int64). I want to extract the data (nested column data) as well as the offsets from that column.

I have found only one example <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121> of Array columns, and it assumes that the nested type is known at compile time AND that the column has only one chunk.

I have tried to loop over the Array(Int64) column chunks and grab the `values()` member, but for some reason, for that specific Parquet file, the values member of every chunk points to the same memory location.
Therefore, if I do something like the below, I end up with duplicated data:

    static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
    {
        arrow::ArrayVector array_vector;
        array_vector.reserve(arrow_column->num_chunks());

        for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
        {
            arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
            std::shared_ptr<arrow::Array> chunk = list_chunk.values();
            array_vector.emplace_back(std::move(chunk));
        }

        return std::make_shared<arrow::ChunkedArray>(array_vector);
    }

I can provide more info, but to keep the initial request short and simple, I'll leave it at that.

Thanks in advance,
Arthur

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
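
One possible reading of the duplication described in the original message, offered as an assumption rather than a confirmed diagnosis: if both chunks are slices over a single underlying list array, then each chunk's child values array covers the whole column, and appending it once per chunk repeats the data. As far as I know, Arrow's `ListArray::values()` does not account for a slice offset or length, while `ListArray::Flatten()` and the per-element `value_offset(i)` accessor do. The sketch below models that situation in plain C++ with no Arrow dependency. `ListChunk`, `naiveValues`, `slicedValues`, `rebasedOffsets` and `makeExampleChunks` are hypothetical names invented for illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for one chunk of a List<Int64> column (not an Arrow type).
// Several chunks may share the same offsets and values buffers and differ only
// in which range of lists they expose.
struct ListChunk {
    std::shared_ptr<const std::vector<int32_t>> offsets;  // size = total lists + 1
    std::shared_ptr<const std::vector<int64_t>> values;   // shared nested data
    std::size_t offset;  // index of the first list exposed by this chunk
    std::size_t length;  // number of lists exposed by this chunk
};

// Mirrors the loop in the original message: takes the whole child buffer of
// every chunk, so shared data is appended once per chunk.
std::vector<int64_t> naiveValues(const std::vector<ListChunk>& chunks) {
    std::vector<int64_t> out;
    for (const auto& c : chunks) {
        out.insert(out.end(), c.values->begin(), c.values->end());
    }
    return out;
}

// Slice-aware version: takes only the value range that the chunk's own lists
// cover, i.e. [offsets[offset], offsets[offset + length]).
std::vector<int64_t> slicedValues(const std::vector<ListChunk>& chunks) {
    std::vector<int64_t> out;
    for (const auto& c : chunks) {
        assert(c.offset + c.length < c.offsets->size());
        const int32_t first = (*c.offsets)[c.offset];
        const int32_t last = (*c.offsets)[c.offset + c.length];
        out.insert(out.end(), c.values->begin() + first, c.values->begin() + last);
    }
    return out;
}

// Builds one zero-based offsets array that matches the output of slicedValues.
std::vector<int64_t> rebasedOffsets(const std::vector<ListChunk>& chunks) {
    std::vector<int64_t> out{0};
    for (const auto& c : chunks) {
        for (std::size_t i = 0; i < c.length; ++i) {
            const int64_t len = (*c.offsets)[c.offset + i + 1] - (*c.offsets)[c.offset + i];
            out.push_back(out.back() + len);
        }
    }
    return out;
}

// Example column with four lists, [1,2] [3] [4,5,6] [7], split into two chunks
// of two lists each that share the same underlying buffers.
std::vector<ListChunk> makeExampleChunks() {
    std::shared_ptr<const std::vector<int32_t>> offsets =
        std::make_shared<std::vector<int32_t>>(std::vector<int32_t>{0, 2, 3, 6, 7});
    std::shared_ptr<const std::vector<int64_t>> values =
        std::make_shared<std::vector<int64_t>>(std::vector<int64_t>{1, 2, 3, 4, 5, 6, 7});
    return {ListChunk{offsets, values, 0, 2}, ListChunk{offsets, values, 2, 2}};
}
```

With the example chunks, `naiveValues` returns the seven values twice, while `slicedValues` returns each value once and `rebasedOffsets` gives the matching offsets. This sketch copies for clarity; with Arrow itself, slicing each chunk's child array by its first and last value offsets should avoid copying the nested data, since array slices share the underlying buffers.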
