Did you check the offsets array? AFAIU, one way of constructing the chunks of
a list array is to share a single global values array across all chunks and
give each chunk monotonically increasing offsets into that array.
If the offsets of every chunk were zero-based in that case, it would be a bug.
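
To make that concrete, here is a minimal sketch of the idea using only the
standard library, not the Arrow API. `ListChunk`, `Extracted` and
`extractNested` are made-up names for illustration; in Arrow the matching
accessors would be `ListArray::values()` and `ListArray::value_offset(i)`.
It assumes non-null values and offsets that are monotonically increasing
within each chunk. Each chunk contributes only the slice of the values array
that its own offsets cover, so a values array shared by all chunks is not
duplicated:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for one list chunk: a child values buffer (possibly
// shared by every chunk) plus this chunk's offsets (row count + 1 entries).
struct ListChunk {
    std::shared_ptr<const std::vector<int64_t>> values;
    std::vector<int32_t> offsets;  // row i spans [offsets[i], offsets[i + 1])
};

// Flattened result: every nested element once, with zero-based offsets.
struct Extracted {
    std::vector<int64_t> values;
    std::vector<int64_t> offsets;
};

Extracted extractNested(const std::vector<ListChunk>& chunks) {
    Extracted out;
    out.offsets.push_back(0);
    for (const ListChunk& chunk : chunks) {
        if (chunk.offsets.empty()) continue;  // nothing to read in this chunk
        const int32_t base = chunk.offsets.front();
        const int32_t end = chunk.offsets.back();
        assert(base <= end);  // assumed: offsets never decrease
        // Amount to add to this chunk's offsets so they index into out.values.
        const int64_t shift = static_cast<int64_t>(out.values.size()) - base;
        // Copy only the range this chunk refers to, not the whole buffer.
        out.values.insert(out.values.end(),
                          chunk.values->begin() + base,
                          chunk.values->begin() + end);
        for (std::size_t i = 1; i < chunk.offsets.size(); ++i) {
            out.offsets.push_back(chunk.offsets[i] + shift);
        }
    }
    return out;
}
```

If I remember correctly, `ListArray::Flatten()` takes the offsets into account
in the same way, possibly at the cost of a copy, so it may be worth comparing
its result with `values()` for the file in question.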

On Wed, Nov 16, 2022 at 9:10 AM Arthur Passos <[email protected]> wrote:

> Hi Alan,
>
> In my case, *arrow::ListArray::values* seems to point to the same memory
> location for all chunks. It feels like I need to offset it by the chunk
> offset or something like that, but that would assume the
> *arrow::ListArray::values* method always points to the same memory
> location for all chunks, which doesn't seem to be the case for other files.
>
> Thanks for the ArrowWriterProperties tip.
>
> Best,
> Arthur
> ------------------------------
> *From:* Alan Souza via user <[email protected]>
> *Sent:* Wednesday, November 16, 2022 11:02
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [C++] Need an example on how to extract data from a column
> of type Array(int64) with multiple chunks
>
> Hello Arthur. I am using something like this:
>
>
> auto chunked_column = table->GetColumnByName(col_name);
> // I have only one chunk but this is not a problem
> auto listArray = std::static_pointer_cast<arrow::LargeListArray>(
>     chunked_column->chunk(0));
> auto array = std::static_pointer_cast<arrow::FloatArray>(listArray->values());
>
> In this example I am using the LargeListArray but it is similar to the
> ListArray
>
> Not related to your issue, but it is necessary to customize the options of
> the ArrowWriterProperties to save all the type information, for instance:
>
> parquet::ArrowWriterProperties::Builder builder;
> builder.store_schema();
>
>
> Without this, the Parquet file created by the Arrow library has a
> ListArray instead of a LargeListArray in these columns.
>
> On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos <
> [email protected]> wrote:
>
>
> Hi Niranda
>
> Yes, one of the columns (there are over 50 columns in this file), is of
> type List<Int64>. You can see that in the parquet-tools inspect output
> below:
>
> arthur@arthur:~/parquet-validation$ parquet-tools inspect
> ~/Downloads/test_file.parquet | grep test_array_column -A 10
> path: test_array_column.list.element
> max_definition_level: 2
> max_repetition_level: 1
> physical_type: INT64
> logical_type: None
> converted_type (legacy): NONE
> compression: GZIP (space_saved: 56%)
>
>
> As far as I know, the arrow lib represents List columns with an array of
> offsets and one or more chunks of memory storing the nested column data.
> On my side, I have a very similar structure, so I would like to extract
> both the array of offsets and the nested column data with the least amount
> of copying possible.
>
> Best,
> Arthur
>
> ------------------------------
> *From:* Niranda Perera <[email protected]>
> *Sent:* Wednesday, November 16, 2022 10:28
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [C++] Need an example on how to extract data from a column
> of type Array(int64) with multiple chunks
>
> Hi Arthur,
>
> I'm not very clear about the use case here. Just to clarify, in your
> original parquet file, do you have List<int64> typed columns?
>
> On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]>
> wrote:
>
> Hi
>
> I am reading a parquet file with arrow::RecordBatchReader and the
> arrow::Table returned contains columns with two chunks
> (column->num_chunks() == 2). The column in question is of type
> Array(Int64), although the problem is not limited to that type.
>
> I want to extract the data (nested column data) as well as the offsets
> from that column. I have found only one example
> <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121>
>  of Array columns and it assumes the nested type is known at compile time
> AND the column has only one chunk.
>
> I have tried to loop over the Array(Int64) column chunks and grab the
> `values()` member, but for some reason, for that specific Parquet file, the
> values member points to the same memory location in every chunk. Therefore,
> if I do something like the below, I end up with duplicated data:
>
> static std::shared_ptr<arrow::ChunkedArray>
> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
> {
>     arrow::ArrayVector array_vector;
>     array_vector.reserve(arrow_column->num_chunks());
>     for (size_t chunk_i = 0,
>          num_chunks = static_cast<size_t>(arrow_column->num_chunks());
>          chunk_i < num_chunks; ++chunk_i)
>     {
>         arrow::ListArray & list_chunk =
>             dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
>         std::shared_ptr<arrow::Array> chunk = list_chunk.values();
>         array_vector.emplace_back(std::move(chunk));
>     }
>     return std::make_shared<arrow::ChunkedArray>(array_vector);
> }
>
>
> I can provide more info, but to keep the initial request short and simple,
> I'll leave it at that.
>
> Thanks in advance,
> Arthur
>
>
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>
>

-- 
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
