Arthur Passos created ARROW-18307:
-------------------------------------

             Summary: [C++] Read list/array data from ChunkedArray with multiple chunks
                 Key: ARROW-18307
                 URL: https://issues.apache.org/jira/browse/ARROW-18307
             Project: Apache Arrow
          Issue Type: Test
          Components: C++
            Reporter: Arthur Passos


I am reading a parquet file with arrow::RecordBatchReader, and the arrow::Table returned contains columns with multiple chunks (column->num_chunks() > 1). The column in question is of type Array(Int64), although the problem is not limited to that type.

I want to convert this arrow column into an internal structure that holds a single contiguous chunk of memory for the data plus a vector of offsets, very similar to arrow's own layout (a rough sketch of that structure follows the list below). The code I have so far works in two "phases":

1. Get the nested arrow column data. In this case, get the Int64 data out of Array(Int64).
2. Get the offsets from Array(Int64).
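
For reference, the internal structure I am targeting looks roughly like the following (the names are just illustrative, not an existing type):
{code:java}
#include <cstdint>
#include <vector>

// Illustrative only: the kind of structure I want to fill from the ChunkedArray.
struct InternalListColumn
{
    std::vector<int64_t> data;     // contiguous Int64 values gathered from all chunks
    std::vector<int64_t> offsets;  // row boundaries, analogous to Arrow's list offsets
};{code}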

To achieve #1, I am looping over the chunks and storing the result of arrow::ListArray::values() into a new arrow::ChunkedArray:
{code:java}
static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}{code}

This does not work as expected, though. Even though there are multiple chunks, arrow::ListArray::values() returns the very same underlying child buffer for every chunk, which ends up duplicating the data on my side.
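
For instance, a quick check along these lines (just an illustration, assuming an Int64 child array where buffers[1] is the values buffer) confirms that both chunks point at the same data:
{code:java}
// Illustration only: compare the underlying value buffers of the first two chunks.
auto & c0 = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(0)));
auto & c1 = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(1)));

// For an Int64 child array, buffers[1] holds the values; both chunks share it.
bool same_buffer = c0.values()->data()->buffers[1].get()
                == c1.values()->data()->buffers[1].get();{code}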

I then looked through more examples and came across the [ColumnarTableToVector example|https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121]. It looks like this example assumes there is only one chunk and ignores the possibility of there being multiple chunks. It's probably just a detail, and the example wasn't actually intended to cover multiple chunks.

I managed to get the expected output by doing something like the following:
{code:java}
auto & list_chunk1 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(0)));
auto & list_chunk2 = dynamic_cast<::arrow::ListArray &>(*(arrow_column->chunk(1)));

auto l1_offset = *list_chunk1.raw_value_offsets();
auto l2_offset = *list_chunk2.raw_value_offsets();

auto l1_end_offset = list_chunk1.value_offset(list_chunk1.length());
auto l2_end_offset = list_chunk2.value_offset(list_chunk2.length());

auto lcv1 = list_chunk1.values()->SliceSafe(l1_offset, l1_end_offset - l1_offset).ValueOrDie();
auto lcv2 = list_chunk2.values()->SliceSafe(l2_offset, l2_end_offset - l2_offset).ValueOrDie();{code}
This feels too hacky, though, and I suspect there is a much better way.
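
For the record, a generalized version of this workaround for N chunks would look roughly like the sketch below (untested, and it ignores null handling): slice each chunk's child values down to the range its offsets actually reference, and rebase the offsets onto the concatenated data.
{code:java}
// Untested sketch: generalize the two-chunk hack above to an arbitrary number of chunks.
arrow::ArrayVector sliced_values;
std::vector<int64_t> combined_offsets{0};
int64_t values_so_far = 0;

for (int chunk_i = 0; chunk_i < arrow_column->num_chunks(); ++chunk_i)
{
    auto & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
    const int64_t begin = list_chunk.value_offset(0);
    const int64_t end = list_chunk.value_offset(list_chunk.length());

    // Keep only the values this chunk actually references.
    sliced_values.emplace_back(list_chunk.values()->SliceSafe(begin, end - begin).ValueOrDie());

    // Rebase this chunk's offsets so they index into the concatenated data.
    for (int64_t row = 1; row <= list_chunk.length(); ++row)
        combined_offsets.push_back(values_so_far + (list_chunk.value_offset(row) - begin));

    values_so_far += end - begin;
}{code}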

Hence, my question: how do I properly extract the data and offsets out of such a column? A more generic version of the question: how do I extract the data out of a ChunkedArray with multiple chunks?
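
I wonder whether calling arrow::ListArray::Flatten() on each chunk and then arrow::Concatenate() on the results is the intended route for the data part; a rough, untested sketch of what I mean:
{code:java}
#include <arrow/api.h>
#include <arrow/array/concatenate.h>

// Rough, untested sketch: Flatten() each list chunk down to the values it actually
// references, then concatenate everything into one contiguous child array.
static arrow::Result<std::shared_ptr<arrow::Array>>
flattenListChunks(const std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector flattened;
    flattened.reserve(arrow_column->num_chunks());
    for (int chunk_i = 0; chunk_i < arrow_column->num_chunks(); ++chunk_i)
    {
        auto & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        // Unlike values(), Flatten() takes the chunk's offsets (and nulls) into account.
        ARROW_ASSIGN_OR_RAISE(auto values, list_chunk.Flatten());
        flattened.emplace_back(std::move(values));
    }
    return arrow::Concatenate(flattened);
}{code}
If that is indeed the recommended approach, the remaining piece would be rebuilding a single offsets vector across chunks, as in the workaround sketch above.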


