Actually to answer my own question, this is likely expected behavior.  What
I bet is going on is the String column contains some pretty big strings and
ends up being chunked itself.  Then the TableBatchReader used for iteration
[1] slices the list array to even out the chunks [2].

If you wanted to avoid this specifically for Parquet with the current
implementation you could use ColumnReader [3] directly but I don't think we
want to guarantee the exact structure of Arrays.

Thanks,
Micah

[1]
https://github.com/apache/arrow/blob/147b5c922efe19d34ef7e7cda635b7d8a07be2eb/cpp/src/parquet/arrow/reader.cc#L1043
[2]
https://github.com/apache/arrow/blob/e1cf9041c482486c314a3d143f4b01d35baeaab4/cpp/src/arrow/table.cc#L661
[3]
https://github.com/apache/arrow/blob/147b5c922efe19d34ef7e7cda635b7d8a07be2eb/cpp/src/parquet/arrow/reader.h#L289

On Wed, Nov 23, 2022 at 12:13 AM Micah Kornfield <[email protected]>
wrote:

> Hi Arthur,
>
>> I am reading a parquet file with arrow::RecordBatchReader and the
>> arrow::Table returned contains columns with two chunks
>> (column->num_chunks() == 2). The column in question, although not limited
>> to, is of type Array(Int64).
>
>
> Would it be possible to provide the code you are using to instantiate the
> RecordBatchReader and then construct the Table?  With the sample python
> code, reading the data back in python (pq.read_table) produces a single
> chunked array for me.  Also my understanding of the read path of Parquet is
> that it should be creating a new offset of each array it returns (with
> values not shared [1]).  So I think something must be doing this chunking
> at a higher level but could very well be mistaken.
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/147b5c922efe19d34ef7e7cda635b7d8a07be2eb/cpp/src/parquet/arrow/reader.cc#L620
>
> On Tue, Nov 22, 2022 at 12:22 PM Arthur Passos <[email protected]>
> wrote:
>
>> Hello again,
>>
>> I talked to our customer and he was able to generate a randomized file
>> that triggers the discussed situation. Therefore, I have attached both
>> files so you can experiment with it.
>>
>> Files attached:
>>
>> *shared-data-monotonically-increasing-offset.parquet - *Contains a
>> List(Int64) and String column. Internally, it is represented with two
>> chunks. For the List(Int64) column, *arrow::ListArray::values *is shared
>> across chunks and offsets are monotonically increasing.
>>
>> *zero-based-offsets.parquet* - Contains a List(Int64) column. It is
>> represented with two chunks. *arrow::ListArray::values *is not shared
>> and offsets are zero based. This file was generated using the previously
>> shared python script.
>>
>> Thanks,
>> Arthur
>>
>> ------------------------------
>> *De:* Arthur Passos <[email protected]>
>> *Enviado:* terça-feira, 22 de novembro de 2022 08:02
>> *Para:* dl <[email protected]>
>> *Assunto:* RE: [C++] Need an example on how to extract data from a
>> column of type Array(int64) with multiple chunks
>>
>> Hi David,
>>
>> Thanks for the response. Just recapping, I have two files that trigger
>> two different cases: 1. array data is shared across chunks and 2. array
>> data is not shared across chunks. "Array data" being
>> *arrow::ListArray::values*. In the former, offsets are monotonically
>> increasing. In the latter, they are zero based.
>>
>> Unfortunately, the file that triggers the first case contains
>> confidential data from one of our customers. I have spent a fair amount of
>> time trying to generate one, but failed to do so. The latter, I can
>> certainly provide an example. Below is a python script that'll generate it.
>>
>>
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>> import random
>>
>>
>> def gen_array(offset):
>> array = []
>> array_length = random.randint(0, 9)
>> for i in range(array_length):
>> array.append(i + offset)
>>
>> return array
>>
>>
>> def gen_arrays(number_of_arrays):
>> list_of_arrays = []
>> for i in range(number_of_arrays):
>> list_of_arrays.append(gen_array(i))
>> return list_of_arrays
>>
>> arr = pa.array(gen_arrays(70000))
>> table  = pa.table([arr], ["arr"])
>> pq.write_table(table, "int-list-zero-based-chunked-array.parquet")
>>
>>
>> Thanks,
>> Arthur
>> ------------------------------
>> *De:* David Li <[email protected]>
>> *Enviado:* segunda-feira, 21 de novembro de 2022 19:05
>> *Para:* dl <[email protected]>
>> *Assunto:* Re: [C++] Need an example on how to extract data from a
>> column of type Array(int64) with multiple chunks
>>
>> Hi Arthur,
>>
>> Sorry for the late reply - is it possible to provide examples of each
>> kind of file? I can try to take a look, at least the behavior here seems
>> confusing.
>>
>> -David
>>
>> On Mon, Nov 21, 2022, at 15:19, Arthur Passos wrote:
>>
>> Hi guys.
>>
>> I could not find written evidence that both shared and non shared 
>> *arrow::ListArray::values
>> *can co-exist, but that seems to be the case since I have files that
>> trigger both cases. If any of you have evidence that supports this or that
>> shows this is not accurate, it'll be appreciated.
>>
>> In any case, what I ended up doing is checking whether the offsets are
>> zero-based or not. If the former, that means *arrow::ListArray::values*
>> is not shared across chunks. If the latter, it is shared. This leads to the
>> following logic for *getNested* and *getOffsets*:
>>
>> *getNested:*
>>
>> Loop over all chunks and call *arrow::ListArray::Flatten *to properly
>> slice based on offsets. This will avoid duplicated data in case 
>> *arrow::ListArray::values()
>> *is shared.
>>
>> *getOffsets:*
>>
>> Use a variable to control current offset. Loop through all chunks and
>> check if the chunk offset is zero. If it is, current_offset is updated to
>> the last offset collected. Then, offset is stored as follows: auto offset
>> = arrow_offsets.Value(i); offsets_data.emplace_back(start_offset +
>> offset);
>>
>>
>> Full code can be found in:
>> https://github.com/ClickHouse/ClickHouse/pull/43297
>> <https://github.com/ClickHouse/ClickHouse/pull/43297>
>> Flatten list type arrow chunks on parsing by arthurpassos · Pull Request
>> #43297 · ClickHouse/ClickHouse
>> <https://github.com/ClickHouse/ClickHouse/pull/43297>
>> Changelog category (leave one): Bug Fix (user-visible misbehavior in
>> official stable or prestable release) Changelog entry (a user-readable
>> short description of the changes that goes to CHANGELOG...
>> github.com
>> **
>>
>>
>> Best,
>> Arthur
>>
>>
>> ------------------------------
>>
>> *De:* Arthur Passos <[email protected]>
>> *Enviado:* quarta-feira, 16 de novembro de 2022 16:30
>> *Para:* [email protected] <[email protected]>
>> *Cc:* Alan Souza <[email protected]>
>> *Assunto:* RE: [C++] Need an example on how to extract data from a
>> column of type Array(int64) with multiple chunks
>>
>> Hi Niranda,
>>
>> Yes, the offsets are properly set and if call
>> *arrow::ListArray::Flatten()*, it'll slice based on those offsets and
>> data will be "correct". The problem is that this is not always true, I have
>> just tested against a much simpler test parquet file and this logic doesn't
>> apply. The *arrow::ListArray::values *member is not shared across all
>> chunks and offsets are all zero-based. The file that triggers the former
>> case contains confidential data, but the latter is generated with the below
>> python script:
>>
>> import pyarrow as pa
>> import pyarrow.parquet as pq
>> arr = pa.array([[1, 2] for i in range(70000)])
>> table  = pa.table([arr], ["arr"])
>> pq.write_table(table, "a-test.parquet")
>>
>>
>> So it looks like arrow::ListArray::values might or might not be shared
>> across chunks. If it's shared, then offsets are not zero based. If it's not
>> shared, offsets are zero based. I am under the feeling this is an
>> implementation detail and I am facing such problems because I am accessing
>> "low level APIs"? If that's so, what would be the proper/ reliable way to
>> extract the offsets and nested column data if type is not known at compile
>> time AND it might contain multiple chunks.
>>
>>
>> I already shared above how I am extracting the arrow nested column from
>> an arrow list column. For reference, the below method is the one used to
>> extract the offsets. It starts at index 1 because I do not store 0 offsets.
>>
>> auto readOffsetsFromArrowListColumn(std::shared_ptr<arrow::ChunkedArray> & 
>> arrow_column) {
>>     std::vector<uint64_t> offsets;
>>
>>     offsets.reserve(arrow_column->length());
>>
>>     for (size_t chunk_i = 0, num_chunks = 
>> static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; 
>> ++chunk_i)
>>     {
>>         arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray 
>> &>(*(arrow_column->chunk(chunk_i)));
>>         auto arrow_offsets_array = list_chunk.offsets();
>>         auto & arrow_offsets = dynamic_cast<arrow::Int32Array 
>> &>(*arrow_offsets_array);
>>         for (int64_t i = 1; i < arrow_offsets.length(); ++i)
>>             offsets.emplace_back(arrow_offsets.Value(i));
>>     }
>>     return std::make_shared<NumericColumn<uint64_t>>(std::move(offsets));
>> }
>>
>> Numeric column (Int64) data extraction is with the below method:
>>
>> template <typename NumericType>
>> static auto readNumericColumn(std::shared_ptr<arrow::ChunkedArray> & 
>> arrow_column)
>> {
>>     std::vector<NumericType> array;
>>
>>     for (size_t chunk_i = 0, num_chunks = 
>> static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; 
>> ++chunk_i)
>>     {
>>         std::shared_ptr<arrow::Array> chunk = arrow_column->chunk(chunk_i);
>>         auto chunk_length = chunk->length();
>>         if (chunk_length == 0)
>>             continue;
>>
>>         /// buffers[0] is a null bitmap and buffers[1] are actual values
>>         std::shared_ptr<arrow::Buffer> buffer = chunk->data()->buffers[1];
>>         const auto * raw_data = reinterpret_cast<const NumericType 
>> *>(buffer->data());
>>         array.insert(array.end(), raw_data, raw_data + chunk_length);
>>     }
>>
>>     return std::make_shared<NumericColumn<NumericType>>(std::move(array));
>> }
>>
>> Last but not least, these methods get called recursively by the below
>> readArrowColumn:
>>
>> std::shared_ptr<Column> readArrowColumn(auto arrow_column) {
>>     switch (arrow_column->type()->id()) {
>>         case arrow::Type::*INT64*:
>>         {
>>             return readNumericColumn<uint64_t>(arrow_column);
>>         }
>>         case arrow::Type::*LIST*:
>>         {
>>             auto arrow_nested_column = getNestedArrowColumn(arrow_column);
>>             auto nested_column = readArrowColumn(arrow_nested_column);
>>             auto offsets_column = 
>> readOffsetsFromArrowListColumn(arrow_column);
>>             return std::make_shared<ArrayColumn>(nested_column, 
>> offsets_column);
>>         }
>>     }
>>     return nullptr;
>>
>> }
>>
>>
>> Thanks,
>> Arthur
>>
>> ------------------------------
>>
>> *De:* Niranda Perera <[email protected]>
>> *Enviado:* quarta-feira, 16 de novembro de 2022 12:55
>> *Para:* [email protected] <[email protected]>
>> *Cc:* Alan Souza <[email protected]>
>> *Assunto:* Re: [C++] Need an example on how to extract data from a
>> column of type Array(int64) with multiple chunks
>>
>> Did you check the offset array? AFAIU one way of constructing chunks of
>> list arrays, is duplicating a global value array, and having monotonically
>> increasing offsets in the offset arrays.
>> If the offsets are all zero-based, it would be a bug.
>>
>> On Wed, Nov 16, 2022 at 9:10 AM Arthur Passos <[email protected]>
>> wrote:
>>
>> Hi Alan,
>>
>> In my case, *arrow::ListArray::values* seems to point to the same memory
>> location for all chunks. It feels like I need to offset it by the chunk
>> offset or something like that, but that would assume the
>> *arrow::ListArray::values* method always point to the same memory
>> location for all chunks, which doesn't seem to be the case for other files.
>>
>> Thanks for the ArrowWriteProperties tip.
>>
>> Best,
>> Arthur
>>
>> ------------------------------
>>
>> *De:* Alan Souza via user <[email protected]>
>> *Enviado:* quarta-feira, 16 de novembro de 2022 11:02
>> *Para:* [email protected] <[email protected]>
>> *Assunto:* Re: [C++] Need an example on how to extract data from a
>> column of type Array(int64) with multiple chunks
>>
>>
>> Hello Arthur. I am using something like this:
>>
>>
>> auto chunked_column = table->GetColumnByName(col_name);
>> auto listArray = std::static_pointer_cast<arrow::LargeListArray
>> >(chunked_column->chunk(0));* // I have only one chunk but this is not a
>> problem*
>> auto array = std::static_pointer_cast<arrow::FloatArray>(listArray->
>> values());
>>
>> In this example I am using the LargeListArray but it is similar to the
>> ListArray
>>
>> Not related to your issue. but is necessary to customize the options of
>> the ArrowWriterProperties to save all the type information, for instance:
>>
>> parquet::ArrowWriterProperties::Builder builder;
>> builder.store_schema();
>>
>>
>> Without this the parquet file is created by the arrow library has a
>> ListArray instead of using a LargeListArray on these columns.
>>
>> On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos <
>> [email protected]> wrote:
>>
>>
>> Hi Niranda
>>
>> Yes, one of the columns (there are over 50 columns in this file), is of
>> type List<Int64>. You can see that in the parquet-tools inspect output
>> below:
>>
>> arthur@arthur:~/parquet-validation$ parquet-tools inspect
>> ~/Downloads/test_file.parquet | grep test_array_column -A 10
>> path: test_array_column.list.element
>> max_definition_level: 2
>> max_repetition_level: 1
>> physical_type: INT64
>> logical_type: None
>> converted_type (legacy): NONE
>> compression: GZIP (space_saved: 56%)
>>
>>
>> As far as I know, the arrow lib represents List columns with an array of
>> offsets and one or more chunks of memory storing the nested column data ().
>> On my side, I have a very similar structure, so I would like to extract
>> both the array of offsets and the nested column data with the less amount
>> of copying possible.
>>
>> Best,
>> Arthur
>>
>>
>> ------------------------------
>>
>> *De:* Niranda Perera <[email protected]>
>> *Enviado:* quarta-feira, 16 de novembro de 2022 10:28
>> *Para:* [email protected] <[email protected]>
>> *Assunto:* Re: [C++] Need an example on how to extract data from a
>> column of type Array(int64) with multiple chunks
>>
>> Hi Arthur,
>>
>> I'm not very clear about the usecase here. Just to clarify, in your
>> original parquet file, do you have List<int64> typed columns?
>>
>> On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]>
>> wrote:
>>
>> Hi
>>
>> I am reading a parquet file with arrow::RecordBatchReader and the
>> arrow::Table returned contains columns with two chunks
>> (column->num_chunks() == 2). The column in question, although not limited
>> to, is of type Array(Int64).
>>
>> I want to extract the data (nested column data) as well as the offsets
>> from that column. I have found only one example
>> <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121>
>>  of Array columns and it assumes the nested type is known at compile
>> time AND the column has only one chunk.
>>
>> I have tried to loop over the Array(Int64) column chunks and grab the
>> `values()` member, but for some reason, for that specific Parquet file, the
>> values member point to the same memory location. Therefore, if I do
>> something like the below, I end up with duplicated data:
>>
>>
>> static std::shared_ptr<arrow::ChunkedArray> 
>> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
>> {    arrow::ArrayVector array_vector;    
>> array_vector.reserve(arrow_column->num_chunks());    for (size_t chunk_i = 
>> 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < 
>> num_chunks; ++chunk_i)      {          arrow::ListArray & list_chunk = 
>> dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));          
>> std::shared_ptr<arrow::Array> chunk = list_chunk.values();          
>> array_vector.emplace_back(std::move(chunk));      }    return 
>> std::make_shared<arrow::ChunkedArray>(array_vector);
>> }
>>
>>
>> I can provide more info, but to keep the initial request short and
>> simple, I'll leave it at that.
>>
>> Thanks in advance,
>> Arthur
>>
>>
>>
>> --
>>
>> Niranda Perera
>> https://niranda.dev/
>> @n1r44 <https://twitter.com/N1R44>
>>
>>
>>
>> --
>>
>> Niranda Perera
>> https://niranda.dev/
>> @n1r44 <https://twitter.com/N1R44>
>>
>>
>>

Reply via email to