RE: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks

Arthur Passos Wed, 16 Nov 2022 11:31:02 -0800

Hi Niranda,

Yes, the offsets are properly set, and if I call arrow::ListArray::Flatten(), 
it slices based on those offsets and the data is "correct". The problem is 
that this is not always true: I have just tested against a much simpler test 
Parquet file and this logic doesn't apply. There, the arrow::ListArray::values 
member is not shared across chunks and each chunk's offsets start at zero. The 
file that triggers the former case contains confidential data, but the latter 
is generated with the Python script below:



import pyarrow as pa
import pyarrow.parquet as pq
arr = pa.array([[1, 2] for i in range(70000)])
table  = pa.table([arr], ["arr"])
pq.write_table(table, "a-test.parquet")

So it looks like arrow::ListArray::values might or might not be shared across 
chunks. If it's shared, then the offsets are not zero-based. If it's not 
shared, the offsets are zero-based. I get the feeling this is an 
implementation detail and that I am facing such problems because I am 
accessing "low level APIs". If that's so, what would be the proper/reliable 
way to extract the offsets and nested column data if the type is not known at 
compile time AND the column might contain multiple chunks?
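
To make the two cases concrete, here is a minimal sketch that uses plain std::vector stand-ins instead of Arrow types. ListChunk and normalizeChunk are hypothetical names, not Arrow APIs. The idea is that a chunk only "owns" the values between its first and last offset, so slicing on those and rebasing the offsets should give the same result whether or not the values array is shared:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one list chunk (not an Arrow type).
// `values` may be the whole shared child array or a chunk-local one;
// `offsets` holds (number of lists + 1) entries.
struct ListChunk {
    std::vector<int64_t> values;
    std::vector<int32_t> offsets;
};

// Returns an equivalent chunk whose offsets start at 0 and whose values
// contain only the elements this chunk actually references.
ListChunk normalizeChunk(const ListChunk & chunk) {
    ListChunk out;
    if (chunk.offsets.empty())
        return out;

    const int32_t first = chunk.offsets.front();
    const int32_t last = chunk.offsets.back();

    // Keep only the referenced range [first, last) of the values.
    out.values.assign(chunk.values.begin() + first, chunk.values.begin() + last);

    // Rebase every offset so that the first one becomes 0.
    out.offsets.reserve(chunk.offsets.size());
    for (int32_t offset : chunk.offsets)
        out.offsets.push_back(offset - first);
    return out;
}
```

With this, a second chunk that points into a shared values array {1, 2, 3, 4, 5, 6} with offsets {2, 4, 6} and a chunk that carries its own values {3, 4, 5, 6} with offsets {0, 2, 4} normalize to the same thing. As far as I can tell, this is roughly the slicing that Flatten() does for the values, minus null handling.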


I already shared above how I am extracting the arrow nested column from an 
arrow list column. For reference, the below method is the one used to extract 
the offsets. It starts at index 1 because I do not store 0 offsets.

auto readOffsetsFromArrowListColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column) {
    std::vector<uint64_t> offsets;

    offsets.reserve(arrow_column->length());

    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        auto arrow_offsets_array = list_chunk.offsets();
        auto & arrow_offsets = dynamic_cast<arrow::Int32Array &>(*arrow_offsets_array);
        for (int64_t i = 1; i < arrow_offsets.length(); ++i)
            offsets.emplace_back(arrow_offsets.Value(i));
    }
    return std::make_shared<NumericColumn<uint64_t>>(std::move(offsets));
}
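
For comparison, a sketch of how the per-chunk offsets could be stitched into one global, monotonically increasing offsets vector regardless of whether each chunk starts at zero. It uses plain vectors rather than Arrow arrays, concatOffsets is a hypothetical helper name, and it assumes the nested values are extracted so that only the elements each chunk references are kept. Like the method above, it does not store the leading 0:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each inner vector models the raw offsets of one chunk
// (number of lists + 1 entries, possibly not starting at 0).
std::vector<uint64_t> concatOffsets(const std::vector<std::vector<int32_t>> & chunks) {
    std::vector<uint64_t> result;
    uint64_t base = 0; // number of child elements contributed by earlier chunks

    for (const auto & offsets : chunks) {
        if (offsets.empty())
            continue;

        const int32_t first = offsets.front();

        // Skip index 0: the leading offset is implied and not stored.
        for (std::size_t i = 1; i < offsets.size(); ++i)
            result.push_back(base + static_cast<uint64_t>(offsets[i] - first));

        base += static_cast<uint64_t>(offsets.back() - first);
    }
    return result;
}
```

Subtracting each chunk's first offset and adding the running total gives {2, 4, 6, 8} for both {0, 2, 4}, {0, 2, 4} (zero-based chunks) and {0, 2, 4}, {4, 6, 8} (chunks pointing into a shared values array).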

Numeric column (Int64) data extraction is with the below method:

template <typename NumericType>
static auto readNumericColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    std::vector<NumericType> array;

    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        std::shared_ptr<arrow::Array> chunk = arrow_column->chunk(chunk_i);
        auto chunk_length = chunk->length();
        if (chunk_length == 0)
            continue;

        /// buffers[0] is a null bitmap and buffers[1] are actual values.
        /// A sliced chunk shares its parent's buffer, so skip chunk->offset() elements.
        std::shared_ptr<arrow::Buffer> buffer = chunk->data()->buffers[1];
        const auto * raw_data = reinterpret_cast<const NumericType *>(buffer->data()) + chunk->offset();
        array.insert(array.end(), raw_data, raw_data + chunk_length);
    }

    return std::make_shared<NumericColumn<NumericType>>(std::move(array));
}
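
One thing to watch when copying from the raw buffer: as far as I know, a sliced arrow::Array keeps its parent's buffers and records where the slice starts in offset(), so the start of buffers[1] is not necessarily the chunk's first element. A minimal sketch of that with stand-in types (ArrayView and copyValues are hypothetical, not Arrow APIs):

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a (possibly sliced) numeric array:
// several views may share one buffer and differ only in offset/length.
struct ArrayView {
    std::shared_ptr<const std::vector<int64_t>> buffer;
    int64_t offset; // index of the first element that belongs to this view
    int64_t length; // number of elements in this view
};

// Concatenates the elements of all views, honoring each view's offset.
std::vector<int64_t> copyValues(const std::vector<ArrayView> & chunks) {
    std::vector<int64_t> out;
    for (const auto & chunk : chunks) {
        if (chunk.length == 0)
            continue;
        // Start at the view's own first element, not at the buffer start.
        const int64_t * raw = chunk.buffer->data() + chunk.offset;
        out.insert(out.end(), raw, raw + chunk.length);
    }
    return out;
}
```

Without the `+ chunk.offset`, two views over the same buffer would both copy from the beginning of the buffer, which would look like duplicated data.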

Last but not least, these methods get called recursively by the below 
readArrowColumn:

std::shared_ptr<Column> readArrowColumn(auto arrow_column) {
    switch (arrow_column->type()->id()) {
        case arrow::Type::INT64:
        {
            return readNumericColumn<uint64_t>(arrow_column);
        }
        case arrow::Type::LIST:
        {
            auto arrow_nested_column = getNestedArrowColumn(arrow_column);
            auto nested_column = readArrowColumn(arrow_nested_column);
            auto offsets_column = readOffsetsFromArrowListColumn(arrow_column);
            return std::make_shared<ArrayColumn>(nested_column, offsets_column);
        }
    }
    return nullptr;

}

Thanks,
Arthur
________________________________
From: Niranda Perera <[email protected]>
Sent: Wednesday, November 16, 2022 12:55
To: [email protected] <[email protected]>
Cc: Alan Souza <[email protected]>
Subject: Re: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks

Did you check the offset array? AFAIU, one way of constructing chunks of list 
arrays is to duplicate a global value array and have monotonically increasing 
offsets in the offset arrays.
If the offsets are all zero-based, it would be a bug.

On Wed, Nov 16, 2022 at 9:10 AM Arthur Passos <[email protected]> wrote:
Hi Alan,

In my case, arrow::ListArray::values seems to point to the same memory location 
for all chunks. It feels like I need to offset it by the chunk offset or 
something like that, but that would assume the arrow::ListArray::values method 
always points to the same memory location for all chunks, which doesn't seem 
to be the case for other files.

Thanks for the ArrowWriteProperties tip.

Best,
Arthur
________________________________
From: Alan Souza via user <[email protected]>
Sent: Wednesday, November 16, 2022 11:02
To: [email protected] <[email protected]>
Subject: Re: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks

Hello Arthur. I am using something like this:


auto chunked_column = table->GetColumnByName(col_name);
// I have only one chunk but this is not a problem
auto listArray = std::static_pointer_cast<arrow::LargeListArray>(chunked_column->chunk(0));
auto array = std::static_pointer_cast<arrow::FloatArray>(listArray->values());

In this example I am using the LargeListArray, but it is similar for the 
ListArray.

Not related to your issue, but it is necessary to customize the options of the 
ArrowWriterProperties to save all the type information, for instance:

parquet::ArrowWriterProperties::Builder builder;
builder.store_schema();


Without this, the Parquet file created by the Arrow library has a ListArray 
instead of a LargeListArray in these columns.

On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos <[email protected]> wrote:


Hi Niranda

Yes, one of the columns (there are over 50 columns in this file) is of type 
List<Int64>. You can see that in the parquet-tools inspect output below:

arthur@arthur:~/parquet-validation$ parquet-tools inspect ~/Downloads/test_file.parquet | grep test_array_column -A 10
path: test_array_column.list.element
max_definition_level: 2
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 56%)

As far as I know, the Arrow lib represents List columns with an array of 
offsets and one or more chunks of memory storing the nested column data. On 
my side, I have a very similar structure, so I would like to extract both the 
array of offsets and the nested column data with the least amount of copying 
possible.

Best,
Arthur

________________________________
From: Niranda Perera <[email protected]>
Sent: Wednesday, November 16, 2022 10:28
To: [email protected] <[email protected]>
Subject: Re: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks

Hi Arthur,

I'm not very clear about the use case here. Just to clarify, in your original 
Parquet file, do you have List<int64> typed columns?

On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]> wrote:
Hi

I am reading a Parquet file with arrow::RecordBatchReader and the arrow::Table 
returned contains columns with two chunks (column->num_chunks() == 2). The 
column in question is of type Array(Int64), although the problem is not 
limited to that type.

I want to extract the data (nested column data) as well as the offsets from 
that column. I have found only one 
example<https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121>
 of Array columns and it assumes the nested type is known at compile time AND 
the column has only one chunk.

I have tried to loop over the Array(Int64) column chunks and grab the 
`values()` member, but for some reason, for that specific Parquet file, the 
values member points to the same memory location in every chunk. Therefore, if 
I do something like the below, I end up with duplicated data:


static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}

I can provide more info, but to keep the initial request short and simple, I'll 
leave it at that.

Thanks in advance,
Arthur


--
Niranda Perera
https://niranda.dev/
@n1r44<https://twitter.com/N1R44>

