>
> According to the references you shared, #2 seems to solve the mystery. I
> believe it shows that during *arrow::RecordBatch* construction the data
> is sliced or shared based on the offset being zero-based or monotonically
> increasing.


Right.  This is an implementation detail.  If you really want to do
concatenation in a properly optimized fashion, you need to check that the list
has no nulls, that the offsets are contiguous, and that the value pointers are
equal.  There is nothing in the spec that says the offsets in a list need to
start at zero.
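
Something along these lines, as a rough, untested sketch (CanViewAsContiguous
is just an illustrative name, not an Arrow API):

#include <arrow/api.h>

// True only if `left` and `right` could be treated as one contiguous
// ListArray: no nulls, same underlying values data, and offsets that
// continue without a gap.
bool CanViewAsContiguous(const arrow::ListArray& left,
                         const arrow::ListArray& right) {
  // Null list entries would make the offset arithmetic ambiguous.
  if (left.null_count() != 0 || right.null_count() != 0) return false;
  // "Value pointers are equal": both chunks must reference the same values data.
  if (left.values()->data() != right.values()->data()) return false;
  // Offsets are contiguous: `right` must start exactly where `left` ends.
  return right.value_offset(0) == left.value_offset(left.length());
}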


On Wed, Nov 23, 2022 at 3:02 AM Arthur Passos <[email protected]> wrote:

> Hi Micah,
>
> According to the references you shared, #2 seems to solve the mystery. I
> believe it shows that during *arrow::RecordBatch* construction the data
> is sliced or shared based on the offset being zero-based or monotonically
> increasing.
>
> If the above is correct, then my workaround/solution is probably the best
> answer, right? Recapping: check the chunk offsets. If they are zero-based,
> the data is sliced and not shared across chunks. If they are not zero-based,
> the data is shared.
>
> Thanks,
> Arthur
> ------------------------------
> *From:* Micah Kornfield <[email protected]>
> *Sent:* Wednesday, November 23, 2022 05:31
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [C++] Need an example on how to extract data from a column
> of type Array(int64) with multiple chunks
>
> Actually to answer my own question, this is likely expected behavior.
> What I bet is going on is the String column contains some pretty big
> strings and ends up being chunked itself.  Then the TableBatchReader used
> for iteration [1] slices the list array to even out the chunks [2].
>
> If you wanted to avoid this specifically for Parquet with the current
> implementation, you could use ColumnReader [3] directly, but I don't think we
> want to guarantee the exact structure of Arrays.
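>
> For illustration, here is a rough sketch (not code from this thread) of going
> through [3] directly; it assumes an already-opened parquet::arrow::FileReader,
> and ReadListColumnDirectly is just an illustrative name:
>
> #include <arrow/api.h>
> #include <parquet/arrow/reader.h>
>
> arrow::Status ReadListColumnDirectly(parquet::arrow::FileReader& file_reader,
>                                      int column_index,
>                                      std::shared_ptr<arrow::ChunkedArray>* out)
> {
>     std::unique_ptr<parquet::arrow::ColumnReader> column_reader;
>     ARROW_RETURN_NOT_OK(file_reader.GetColumn(column_index, &column_reader));
>     // Read every row in one batch; the chunking of the result reflects the
>     // column reader itself rather than TableBatchReader's re-slicing.
>     const int64_t num_rows = file_reader.parquet_reader()->metadata()->num_rows();
>     return column_reader->NextBatch(num_rows, out);
> }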
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/147b5c922efe19d34ef7e7cda635b7d8a07be2eb/cpp/src/parquet/arrow/reader.cc#L1043
> [2]
> https://github.com/apache/arrow/blob/e1cf9041c482486c314a3d143f4b01d35baeaab4/cpp/src/arrow/table.cc#L661
> [3]
> https://github.com/apache/arrow/blob/147b5c922efe19d34ef7e7cda635b7d8a07be2eb/cpp/src/parquet/arrow/reader.h#L289
>
> On Wed, Nov 23, 2022 at 12:13 AM Micah Kornfield <[email protected]>
> wrote:
>
> Hi Arthur,
>
> I am reading a parquet file with arrow::RecordBatchReader and the
> arrow::Table returned contains columns with two chunks
> (column->num_chunks() == 2). The column in question, although not limited
> to, is of type Array(Int64).
>
>
> Would it be possible to provide the code you are using to instantiate the
> RecordBatchReader and then construct the Table?  With the sample python
> code, reading the data back in python (pq.read_table) produces a single
> chunked array for me.  Also, my understanding of the Parquet read path is
> that it should be creating new offsets for each array it returns (with
> values not shared [1]).  So I think something must be doing this chunking
> at a higher level, but I could very well be mistaken.
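>
> For reference, a typical way to do that in C++ might look like the sketch
> below (a guess at the setup, not the code actually in use):
>
> #include <numeric>
> #include <arrow/api.h>
> #include <arrow/io/file.h>
> #include <parquet/arrow/reader.h>
>
> arrow::Result<std::shared_ptr<arrow::Table>> ReadParquet(const std::string& path)
> {
>     ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
>     std::unique_ptr<parquet::arrow::FileReader> reader;
>     ARROW_RETURN_NOT_OK(
>         parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));
>
>     // Iterate over all row groups through a RecordBatchReader, then collect
>     // the batches into a Table.
>     std::vector<int> row_groups(reader->num_row_groups());
>     std::iota(row_groups.begin(), row_groups.end(), 0);
>     std::unique_ptr<arrow::RecordBatchReader> batch_reader;
>     ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(row_groups, &batch_reader));
>     return arrow::Table::FromRecordBatchReader(batch_reader.get());
> }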
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/147b5c922efe19d34ef7e7cda635b7d8a07be2eb/cpp/src/parquet/arrow/reader.cc#L620
>
> On Tue, Nov 22, 2022 at 12:22 PM Arthur Passos <[email protected]>
> wrote:
>
> Hello again,
>
> I talked to our customer and he was able to generate a randomized file
> that triggers the discussed situation. Therefore, I have attached both
> files so you can experiment with it.
>
> Files attached:
>
> *shared-data-monotonically-increasing-offset.parquet* - Contains a
> List(Int64) column and a String column. Internally, it is represented with
> two chunks. For the List(Int64) column, *arrow::ListArray::values* is shared
> across chunks and the offsets are monotonically increasing.
>
> *zero-based-offsets.parquet* - Contains a List(Int64) column. It is
> represented with two chunks. *arrow::ListArray::values* is not shared and
> the offsets are zero-based. This file was generated using the previously
> shared python script.
>
> Thanks,
> Arthur
>
> ------------------------------
> *From:* Arthur Passos <[email protected]>
> *Sent:* Tuesday, November 22, 2022 08:02
> *To:* dl <[email protected]>
> *Subject:* RE: [C++] Need an example on how to extract data from a column
> of type Array(int64) with multiple chunks
>
> Hi David,
>
> Thanks for the response. Just recapping, I have two files that trigger two
> different cases: 1. array data is shared across chunks and 2. array data is
> not shared across chunks. "Array data" being *arrow::ListArray::values*.
> In the former, offsets are monotonically increasing. In the latter, they
> are zero based.
>
> Unfortunately, the file that triggers the first case contains confidential
> data from one of our customers. I have spent a fair amount of time trying
> to generate one, but failed to do so. For the latter, I can certainly provide
> an example. Below is a python script that'll generate it.
>
>
> import pyarrow as pa
> import pyarrow.parquet as pq
> import random
>
>
> def gen_array(offset):
>     array = []
>     array_length = random.randint(0, 9)
>     for i in range(array_length):
>         array.append(i + offset)
>
>     return array
>
>
> def gen_arrays(number_of_arrays):
>     list_of_arrays = []
>     for i in range(number_of_arrays):
>         list_of_arrays.append(gen_array(i))
>     return list_of_arrays
>
>
> arr = pa.array(gen_arrays(70000))
> table = pa.table([arr], ["arr"])
> pq.write_table(table, "int-list-zero-based-chunked-array.parquet")
>
>
> Thanks,
> Arthur
> ------------------------------
> *From:* David Li <[email protected]>
> *Sent:* Monday, November 21, 2022 19:05
> *To:* dl <[email protected]>
> *Subject:* Re: [C++] Need an example on how to extract data from a column
> of type Array(int64) with multiple chunks
>
> Hi Arthur,
>
> Sorry for the late reply - is it possible to provide examples of each kind
> of file? I can try to take a look; at the very least, the behavior here
> seems confusing.
>
> -David
>
> On Mon, Nov 21, 2022, at 15:19, Arthur Passos wrote:
>
> Hi guys.
>
> I could not find written evidence that both shared and non-shared
> *arrow::ListArray::values* can co-exist, but that seems to be the case since
> I have files that trigger both cases. If any of you have evidence that
> supports this, or that shows this is not accurate, it'll be appreciated.
>
> In any case, what I ended up doing is checking whether the offsets are
> zero-based or not. If the former, that means *arrow::ListArray::values*
> is not shared across chunks. If the latter, it is shared. This leads to the
> following logic for *getNested* and *getOffsets*:
>
> *getNested:*
>
> Loop over all chunks and call *arrow::ListArray::Flatten* to properly
> slice based on offsets. This will avoid duplicated data in case
> *arrow::ListArray::values()* is shared.
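>
> In sketch form (illustrative only, not a verbatim copy of the PR code linked
> below), the loop looks roughly like this:
>
> static std::shared_ptr<arrow::ChunkedArray> flattenListChunks(const std::shared_ptr<arrow::ChunkedArray> & arrow_column)
> {
>     arrow::ArrayVector array_vector;
>     array_vector.reserve(arrow_column->num_chunks());
>     for (int chunk_i = 0; chunk_i < arrow_column->num_chunks(); ++chunk_i)
>     {
>         auto & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
>         // Flatten() slices the (possibly shared) values array according to this
>         // chunk's offsets, so shared values are not duplicated across chunks.
>         auto flattened = list_chunk.Flatten().ValueOrDie();
>         array_vector.emplace_back(std::move(flattened));
>     }
>     return std::make_shared<arrow::ChunkedArray>(array_vector);
> }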
>
> *getOffsets:*
>
> Use a variable (start_offset) to track the current offset. Loop through all
> chunks and check whether the chunk's first offset is zero. If it is,
> start_offset is updated to the last offset collected so far. Then each
> offset is stored as follows: auto offset = arrow_offsets.Value(i);
> offsets_data.emplace_back(start_offset + offset);
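>
> Roughly, that logic looks like the sketch below (again illustrative; the
> actual implementation is in the PR linked below, and arrow_column stands for
> the list column's ChunkedArray):
>
> std::vector<uint64_t> offsets_data;
> uint64_t start_offset = 0;
> for (int chunk_i = 0; chunk_i < arrow_column->num_chunks(); ++chunk_i)
> {
>     auto & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
>     auto arrow_offsets_array = list_chunk.offsets();
>     auto & arrow_offsets = dynamic_cast<arrow::Int32Array &>(*arrow_offsets_array);
>     // A zero-based chunk restarts its numbering, so shift it by the last
>     // offset accumulated from the previous chunks.
>     if (arrow_offsets.Value(0) == 0 && !offsets_data.empty())
>         start_offset = offsets_data.back();
>     for (int64_t i = 1; i < arrow_offsets.length(); ++i)
>         offsets_data.emplace_back(start_offset + arrow_offsets.Value(i));
> }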
>
>
> Full code can be found in:
> https://github.com/ClickHouse/ClickHouse/pull/43297
>
>
> Best,
> Arthur
>
>
> ------------------------------
>
> *From:* Arthur Passos <[email protected]>
> *Sent:* Wednesday, November 16, 2022 16:30
> *To:* [email protected] <[email protected]>
> *Cc:* Alan Souza <[email protected]>
> *Subject:* RE: [C++] Need an example on how to extract data from a column
> of type Array(int64) with multiple chunks
>
> Hi Niranda,
>
> Yes, the offsets are properly set, and if I call
> *arrow::ListArray::Flatten()*, it'll slice based on those offsets and the
> data will be "correct". The problem is that this is not always the case: I
> have just tested against a much simpler test parquet file and this logic
> doesn't apply there. The *arrow::ListArray::values* member is not shared
> across all chunks and the offsets are all zero-based. The file that triggers
> the former case contains confidential data, but the latter can be generated
> with the python script below:
>
> import pyarrow as pa
> import pyarrow.parquet as pq
> arr = pa.array([[1, 2] for i in range(70000)])
> table  = pa.table([arr], ["arr"])
> pq.write_table(table, "a-test.parquet")
>
>
> So it looks like arrow::ListArray::values might or might not be shared
> across chunks. If it is shared, the offsets are not zero-based; if it is not
> shared, the offsets are zero-based. I am under the impression this is an
> implementation detail and that I am facing such problems because I am
> accessing "low level APIs". If that's so, what would be the proper/reliable
> way to extract the offsets and the nested column data when the type is not
> known at compile time AND the column might contain multiple chunks?
>
>
> I already shared above how I am extracting the arrow nested column from an
> arrow list column. For reference, the method below is the one used to extract
> the offsets. It starts at index 1 because I do not store the leading zero
> offset.
>
> auto readOffsetsFromArrowListColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column) {
>     std::vector<uint64_t> offsets;
>
>     offsets.reserve(arrow_column->length());
>
>     for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
>     {
>         arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
>         auto arrow_offsets_array = list_chunk.offsets();
>         auto & arrow_offsets = dynamic_cast<arrow::Int32Array &>(*arrow_offsets_array);
>         for (int64_t i = 1; i < arrow_offsets.length(); ++i)
>             offsets.emplace_back(arrow_offsets.Value(i));
>     }
>     return std::make_shared<NumericColumn<uint64_t>>(std::move(offsets));
> }
>
> Numeric column (Int64) data is extracted with the method below:
>
> template <typename NumericType>
> static auto readNumericColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
> {
>     std::vector<NumericType> array;
>
>     for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
>     {
>         std::shared_ptr<arrow::Array> chunk = arrow_column->chunk(chunk_i);
>         auto chunk_length = chunk->length();
>         if (chunk_length == 0)
>             continue;
>
>         /// buffers[0] is a null bitmap and buffers[1] are actual values
>         std::shared_ptr<arrow::Buffer> buffer = chunk->data()->buffers[1];
>         const auto * raw_data = reinterpret_cast<const NumericType *>(buffer->data());
>         array.insert(array.end(), raw_data, raw_data + chunk_length);
>     }
>
>     return std::make_shared<NumericColumn<NumericType>>(std::move(array));
> }
>
> Last but not least, these methods get called recursively by the below
> readArrowColumn:
>
> std::shared_ptr<Column> readArrowColumn(auto arrow_column) {
>     switch (arrow_column->type()->id()) {
>         case arrow::Type::INT64:
>         {
>             return readNumericColumn<uint64_t>(arrow_column);
>         }
>         case arrow::Type::LIST:
>         {
>             auto arrow_nested_column = getNestedArrowColumn(arrow_column);
>             auto nested_column = readArrowColumn(arrow_nested_column);
>             auto offsets_column = readOffsetsFromArrowListColumn(arrow_column);
>             return std::make_shared<ArrayColumn>(nested_column, offsets_column);
>         }
>     }
>     return nullptr;
> }
>
>
> Thanks,
> Arthur
>
> ------------------------------
>
> *From:* Niranda Perera <[email protected]>
> *Sent:* Wednesday, November 16, 2022 12:55
> *To:* [email protected] <[email protected]>
> *Cc:* Alan Souza <[email protected]>
> *Subject:* Re: [C++] Need an example on how to extract data from a column
> of type Array(int64) with multiple chunks
>
> Did you check the offset array? AFAIU, one way of constructing chunks of
> list arrays is to duplicate a global value array and have monotonically
> increasing offsets in the offset arrays.
> If the offsets are all zero-based, it would be a bug.
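>
> A toy illustration of that construction (my own sketch, not from this thread):
> build one ListArray and slice it into two "chunks". Both slices keep the same
> values array, and the second slice's offsets do not start at zero.
>
> #include <cassert>
> #include <arrow/api.h>
>
> arrow::Status Demo()
> {
>     arrow::ListBuilder list_builder(arrow::default_memory_pool(),
>                                     std::make_shared<arrow::Int64Builder>());
>     auto & value_builder = *static_cast<arrow::Int64Builder *>(list_builder.value_builder());
>     for (int64_t row = 0; row < 4; ++row) {
>         ARROW_RETURN_NOT_OK(list_builder.Append());   // start a new list entry
>         ARROW_RETURN_NOT_OK(value_builder.Append(row));
>         ARROW_RETURN_NOT_OK(value_builder.Append(row * 10));
>     }
>     std::shared_ptr<arrow::Array> list_array;
>     ARROW_RETURN_NOT_OK(list_builder.Finish(&list_array));
>
>     // Two "chunks" obtained by slicing share the same values array; only the
>     // offsets differ, and the second chunk's first offset is 4, not 0.
>     auto chunk0 = std::static_pointer_cast<arrow::ListArray>(list_array->Slice(0, 2));
>     auto chunk1 = std::static_pointer_cast<arrow::ListArray>(list_array->Slice(2, 2));
>     assert(chunk0->values()->data() == chunk1->values()->data());
>     assert(chunk1->value_offset(0) == 4);
>     return arrow::Status::OK();
> }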
>
> On Wed, Nov 16, 2022 at 9:10 AM Arthur Passos <[email protected]>
> wrote:
>
> Hi Alan,
>
> In my case, *arrow::ListArray::values* seems to point to the same memory
> location for all chunks. It feels like I need to offset it by the chunk
> offset or something like that, but that would assume the
> *arrow::ListArray::values* method always points to the same memory
> location for all chunks, which doesn't seem to be the case for other files.
>
> Thanks for the ArrowWriterProperties tip.
>
> Best,
> Arthur
>
> ------------------------------
>
> *From:* Alan Souza via user <[email protected]>
> *Sent:* Wednesday, November 16, 2022 11:02
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [C++] Need an example on how to extract data from a column
> of type Array(int64) with multiple chunks
>
>
> Hello Arthur. I am using something like this:
>
>
> auto chunked_column = table->GetColumnByName(col_name);
> // I have only one chunk but this is not a problem
> auto listArray = std::static_pointer_cast<arrow::LargeListArray>(chunked_column->chunk(0));
> auto array = std::static_pointer_cast<arrow::FloatArray>(listArray->values());
>
> In this example I am using the LargeListArray, but it is similar for the
> ListArray.
>
> Not related to your issue, but it is necessary to customize the options of
> the ArrowWriterProperties to save all the type information, for instance:
>
> parquet::ArrowWriterProperties::Builder builder;
> builder.store_schema();
>
>
> Without this, the parquet file created by the arrow library has a
> ListArray instead of a LargeListArray on these columns.
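>
> For context, a rough sketch (mine, not from this thread) of how those
> properties get passed to the writer; "table" and "outfile" are placeholders
> for an arrow::Table and an already-opened arrow::io::OutputStream:
>
> #include <parquet/arrow/writer.h>
> #include <parquet/exception.h>
>
> parquet::ArrowWriterProperties::Builder arrow_props_builder;
> arrow_props_builder.store_schema();
>
> PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
>     *table, arrow::default_memory_pool(), outfile,
>     /*chunk_size=*/64 * 1024,
>     parquet::default_writer_properties(), arrow_props_builder.build()));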
>
> On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos <
> [email protected]> wrote:
>
>
> Hi Niranda
>
> Yes, one of the columns (there are over 50 columns in this file) is of
> type List<Int64>. You can see that in the parquet-tools inspect output
> below:
>
> arthur@arthur:~/parquet-validation$ parquet-tools inspect
> ~/Downloads/test_file.parquet | grep test_array_column -A 10
> path: test_array_column.list.element
> max_definition_level: 2
> max_repetition_level: 1
> physical_type: INT64
> logical_type: None
> converted_type (legacy): NONE
> compression: GZIP (space_saved: 56%)
>
>
> As far as I know, the arrow lib represents List columns with an array of
> offsets and one or more chunks of memory storing the nested column data.
> On my side, I have a very similar structure, so I would like to extract
> both the array of offsets and the nested column data with the least amount
> of copying possible.
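>
> For example (just an illustration of that layout, not taken from the file in
> question), a List(Int64) column with rows [[1, 2], [3], [], [4, 5, 6]] is
> stored as offsets = [0, 2, 3, 3, 6] and values = [1, 2, 3, 4, 5, 6]; list i
> spans values[offsets[i] .. offsets[i+1]).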
>
> Best,
> Arthur
>
>
> ------------------------------
>
> *From:* Niranda Perera <[email protected]>
> *Sent:* Wednesday, November 16, 2022 10:28
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [C++] Need an example on how to extract data from a column
> of type Array(int64) with multiple chunks
>
> Hi Arthur,
>
> I'm not very clear about the use case here. Just to clarify, in your
> original parquet file, do you have List<int64> typed columns?
>
> On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]>
> wrote:
>
> Hi
>
> I am reading a parquet file with arrow::RecordBatchReader and the
> arrow::Table returned contains columns with two chunks
> (column->num_chunks() == 2). The column in question, although not limited
> to, is of type Array(Int64).
>
> I want to extract the data (nested column data) as well as the offsets
> from that column. I have found only one example
> <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121>
>  of Array columns and it assumes the nested type is known at compile time
> AND the column has only one chunk.
>
> I have tried to loop over the Array(Int64) column chunks and grab the
> `values()` member, but for some reason, for that specific Parquet file, the
> values member points to the same memory location. Therefore, if I do
> something like the below, I end up with duplicated data:
>
>
> static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
> {
>     arrow::ArrayVector array_vector;
>     array_vector.reserve(arrow_column->num_chunks());
>     for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
>     {
>         arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
>         std::shared_ptr<arrow::Array> chunk = list_chunk.values();
>         array_vector.emplace_back(std::move(chunk));
>     }
>     return std::make_shared<arrow::ChunkedArray>(array_vector);
> }
>
>
> I can provide more info, but to keep the initial request short and simple,
> I'll leave it at that.
>
> Thanks in advance,
> Arthur
>
>
>
> --
>
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>
>
>
> --
>
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>
>
>
