Arthur Passos created ARROW-18307:
-------------------------------------

             Summary: [C++] Read list/array data from ChunkedArray with multiple chunks
                 Key: ARROW-18307
                 URL: https://issues.apache.org/jira/browse/ARROW-18307
             Project: Apache Arrow
          Issue Type: Test
          Components: C++
            Reporter: Arthur Passos
I am reading a Parquet file with arrow::RecordBatchReader, and the arrow::Table returned contains columns with multiple chunks (column->num_chunks() > 1). The column in question (although the problem is not limited to it) is of type Array(Int64). I want to convert this arrow column into an internal structure that contains a contiguous chunk of memory for the data and a vector of offsets, very similar to Arrow's own layout.

The code I have so far works in two "phases":
1. Get the nested arrow column data. In this case, get the Int64 data out of Array(Int64).
2. Get the offsets from Array(Int64).

To achieve #1, I loop over the chunks and store arrow::ListArray::values into a new arrow::ChunkedArray.

{code:java}
static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}
{code}

This does not work as expected, though. Even though there are multiple chunks, the arrow::ListArray::values method returns the very same buffer for all of them, which ends up duplicating the data on my side.

I then looked through more examples and came across the [ColumnarTableToVector example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121]. It looks like that example assumes there is only one chunk and ignores the possibility of there being multiple chunks. It's probably just a detail, and the example wasn't actually intended to cover multiple chunks.
I managed to get the expected output by doing something like the below:

{code:java}
auto & list_chunk1 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(0)));
auto & list_chunk2 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(1)));

auto l1_offset = *list_chunk1.raw_value_offsets();
auto l2_offset = *list_chunk2.raw_value_offsets();

auto l1_end_offset = list_chunk1.value_offset(list_chunk1.length());
auto l2_end_offset = list_chunk2.value_offset(list_chunk2.length());

auto lcv1 = list_chunk1.values()->SliceSafe(l1_offset, l1_end_offset - l1_offset).ValueOrDie();
auto lcv2 = list_chunk2.values()->SliceSafe(l2_offset, l2_end_offset - l2_offset).ValueOrDie();
{code}

This looks too hackish, and I feel like there must be a much better way. Hence my question: how do I properly extract the data and offsets out of such a column? A more generic version of the question: how do I extract the data out of a ChunkedArray with multiple chunks?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)