Hi Alan,

In my case, arrow::ListArray::values seems to point to the same memory location for all chunks. It feels like I need to offset it by the chunk offset or something like that, but that would assume the arrow::ListArray::values method always points to the same memory location for all chunks, which doesn't seem to be the case for other files.
Thanks for the ArrowWriterProperties tip.

Best,
Arthur

________________________________
From: Alan Souza via user <[email protected]>
Sent: Wednesday, 16 November 2022 11:02
To: [email protected] <[email protected]>
Subject: Re: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks

Hello Arthur. I am using something like this:

    auto chunked_column = table->GetColumnByName(col_name);
    auto listArray = std::static_pointer_cast<arrow::LargeListArray>(chunked_column->chunk(0)); // I have only one chunk but this is not a problem
    auto array = std::static_pointer_cast<arrow::FloatArray>(listArray->values());

In this example I am using LargeListArray, but it is similar for ListArray.

Not related to your issue, but it is necessary to customize the options of the ArrowWriterProperties to save all the type information, for instance:

    parquet::ArrowWriterProperties::Builder builder;
    builder.store_schema();

Without this, the Parquet file created by the Arrow library has a ListArray instead of a LargeListArray on these columns.

On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos <[email protected]> wrote:

Hi Niranda

Yes, one of the columns (there are over 50 columns in this file) is of type List<Int64>. You can see that in the parquet-tools inspect output below:

    arthur@arthur:~/parquet-validation$ parquet-tools inspect ~/Downloads/test_file.parquet | grep test_array_column -A 10
    path: test_array_column.list.element
    max_definition_level: 2
    max_repetition_level: 1
    physical_type: INT64
    logical_type: None
    converted_type (legacy): NONE
    compression: GZIP (space_saved: 56%)

As far as I know, the arrow lib represents List columns with an array of offsets and one or more chunks of memory storing the nested column data. On my side, I have a very similar structure, so I would like to extract both the array of offsets and the nested column data with as little copying as possible.
Best,
Arthur

________________________________
From: Niranda Perera <[email protected]>
Sent: Wednesday, 16 November 2022 10:28
To: [email protected] <[email protected]>
Subject: Re: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks

Hi Arthur,

I'm not very clear about the use case here. Just to clarify, in your original parquet file, do you have List<int64> typed columns?

On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]> wrote:

Hi

I am reading a parquet file with arrow::RecordBatchReader and the arrow::Table returned contains columns with two chunks (column->num_chunks() == 2). The column in question (though the problem is not limited to it) is of type Array(Int64). I want to extract the data (nested column data) as well as the offsets from that column.

I have found only one example <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121> of Array columns, and it assumes the nested type is known at compile time AND the column has only one chunk.

I have tried to loop over the Array(Int64) column chunks and grab the `values()` member, but for some reason, for that specific Parquet file, the values member points to the same memory location for every chunk.
Therefore, if I do something like the below, I end up with duplicated data:

    static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
    {
        arrow::ArrayVector array_vector;
        array_vector.reserve(arrow_column->num_chunks());

        for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
        {
            arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
            std::shared_ptr<arrow::Array> chunk = list_chunk.values();
            array_vector.emplace_back(std::move(chunk));
        }

        return std::make_shared<arrow::ChunkedArray>(array_vector);
    }

I can provide more info, but to keep the initial request short and simple, I'll leave it at that.

Thanks in advance,
Arthur

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
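
Editor's sketch of the "offset it by the chunk offset" idea raised in this thread. It is one plausible reading of the symptom, not a confirmed diagnosis of the file in question. The Arrow C++ documentation notes that ListArray::values() does not account for a slice offset or length, so chunks that are slices of one parent array can all hand back the same child array. Under that assumption, the values that belong to a chunk are the ones between its first and last offset (in Arrow terms, value_offset(0) and value_offset(length())), and this holds whether or not the chunks share a child array. The code below models the layout with plain std::vector instead of Arrow types so it compiles on its own. ListChunk, ValueRange, chunk_values, rebased_offsets and example_chunks are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for one chunk of a List<Int64> column.
// `values` plays the role of the child array returned by values();
// `offsets` has (number of lists + 1) entries, each an absolute
// position into *values. Assumption: offsets is never empty.
struct ListChunk {
    std::shared_ptr<const std::vector<std::int64_t>> values;
    std::vector<std::int32_t> offsets;
};

// Non-owning view of the child values that belong to one chunk.
struct ValueRange {
    const std::int64_t* data;
    std::size_t length;
};

// Select only the values referenced by this chunk, without copying.
// Taking the whole child array instead is what duplicates data when
// several chunks share one child array.
ValueRange chunk_values(const ListChunk& chunk) {
    const std::int32_t begin = chunk.offsets.front();
    const std::int32_t end = chunk.offsets.back();
    return ValueRange{chunk.values->data() + begin,
                      static_cast<std::size_t>(end - begin)};
}

// Offsets rewritten so they index into the range from chunk_values().
std::vector<std::int32_t> rebased_offsets(const ListChunk& chunk) {
    std::vector<std::int32_t> out;
    out.reserve(chunk.offsets.size());
    for (std::int32_t off : chunk.offsets) {
        out.push_back(off - chunk.offsets.front());
    }
    return out;
}

// Two chunks that are slices of one column holding the lists
// [10, 20], [30], [40, 50]; both point at the same child values.
std::vector<ListChunk> example_chunks() {
    std::shared_ptr<const std::vector<std::int64_t>> values =
        std::make_shared<std::vector<std::int64_t>>(
            std::vector<std::int64_t>{10, 20, 30, 40, 50});
    return {ListChunk{values, {0, 2, 3}}, ListChunk{values, {3, 5}}};
}
```

With real Arrow arrays, the equivalent of chunk_values would presumably be list_chunk.values()->Slice(first, last - first), which is zero-copy, with first and last read from value_offset. ListArray::Flatten() is another option that accounts for the offsets, though per its documentation it may copy when null lists are backed by non-empty ranges.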
