Hi Niranda,
Yes, the offsets are properly set, and if I call arrow::ListArray::Flatten(),
it slices based on those offsets and the data is "correct". The problem is
that this is not always the case: I have just tested against a much simpler
Parquet file and this logic doesn't apply there. In that file, the
arrow::ListArray::values member is not shared across all chunks and the offsets
are all zero-based. The file that triggers the former case contains
confidential data, but the latter can be generated with the Python script below:
import pyarrow as pa
import pyarrow.parquet as pq
arr = pa.array([[1, 2] for i in range(70000)])
table = pa.table([arr], ["arr"])
pq.write_table(table, "a-test.parquet")
So it looks like arrow::ListArray::values might or might not be shared across
chunks. If it's shared, the offsets are not zero-based. If it's not shared, the
offsets are zero-based. I have the feeling this is an implementation detail
and that I am running into these problems because I am accessing "low-level
APIs". If that's so, what would be the proper, reliable way to extract the
offsets and the nested column data when the type is not known at compile time
AND the column might contain multiple chunks?
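A minimal sketch of one layout-independent approach, assuming nothing beyond what is described above: for each chunk, subtract the chunk's first offset from every offset, add the number of values already emitted, and copy only the value range between the chunk's first and last offset. `ListChunk`, `FlatList` and `flattenChunks` are hypothetical names that model the data with plain vectors; they are not Arrow types. In Arrow terms the first and last offsets would presumably come from `ListArray::value_offset()` and the value range from slicing `values()`, but that mapping is an assumption and not something confirmed in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one list chunk: `offsets` has one more entry than
// the chunk has lists and indexes into `values`.
struct ListChunk {
    std::vector<int32_t> offsets;
    const std::vector<int64_t>* values;  // may or may not be shared between chunks
};

struct FlatList {
    std::vector<uint64_t> end_offsets;  // end of each list; the leading 0 is not stored
    std::vector<int64_t> values;
};

// Rebase every chunk onto one global, zero-based offset sequence and copy only
// the value range that the chunk's own offsets reference.
FlatList flattenChunks(const std::vector<ListChunk>& chunks) {
    FlatList out;
    uint64_t base = 0;  // number of values emitted by previous chunks
    for (const ListChunk& chunk : chunks) {
        if (chunk.offsets.empty())
            continue;
        const int32_t first = chunk.offsets.front();
        const int32_t last = chunk.offsets.back();
        for (std::size_t i = 1; i < chunk.offsets.size(); ++i)
            out.end_offsets.push_back(base + static_cast<uint64_t>(chunk.offsets[i] - first));
        out.values.insert(out.values.end(),
                          chunk.values->begin() + first,
                          chunk.values->begin() + last);
        base += static_cast<uint64_t>(last - first);
    }
    return out;
}
```

With this shape, chunks that share one values array with increasing offsets and chunks that each own a values array with zero-based offsets produce the same result, so the caller does not need to know which layout the reader chose.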
I already shared above how I am extracting the arrow nested column from an
arrow list column. For reference, the below method is the one used to extract
the offsets. It starts at index 1 because I do not store 0 offsets.
auto readOffsetsFromArrowListColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column) {
    std::vector<uint64_t> offsets;
    offsets.reserve(arrow_column->length());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        auto arrow_offsets_array = list_chunk.offsets();
        auto & arrow_offsets = dynamic_cast<arrow::Int32Array &>(*arrow_offsets_array);
        for (int64_t i = 1; i < arrow_offsets.length(); ++i)
            offsets.emplace_back(arrow_offsets.Value(i));
    }
    return std::make_shared<NumericColumn<uint64_t>>(std::move(offsets));
}
Numeric column (Int64) data extraction is with the below method:
template <typename NumericType>
static auto readNumericColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    std::vector<NumericType> array;
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        std::shared_ptr<arrow::Array> chunk = arrow_column->chunk(chunk_i);
        auto chunk_length = chunk->length();
        if (chunk_length == 0)
            continue;
        /// buffers[0] is a null bitmap and buffers[1] are actual values
        std::shared_ptr<arrow::Buffer> buffer = chunk->data()->buffers[1];
        const auto * raw_data = reinterpret_cast<const NumericType *>(buffer->data());
        array.insert(array.end(), raw_data, raw_data + chunk_length);
    }
    return std::make_shared<NumericColumn<NumericType>>(std::move(array));
}
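One thing worth checking in a reader like this is whether a chunk is a slice of a larger buffer. An Arrow array carries an element offset (`chunk->data()->offset`) in addition to its length, and reading from the start of `buffers[1]` ignores it. The sketch below models that with plain pointers; `PrimitiveChunk` and `readChunks` are hypothetical names, not Arrow types, and whether the chunks in the files discussed here actually carry a non-zero offset is an assumption that has not been verified.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a primitive chunk: a window of `length` elements
// that starts at element `offset` inside a possibly larger buffer.
template <typename T>
struct PrimitiveChunk {
    const T* buffer;
    std::size_t offset;
    std::size_t length;
};

// Concatenate all chunks, starting each copy at buffer + offset rather than
// at the beginning of the buffer.
template <typename T>
std::vector<T> readChunks(const std::vector<PrimitiveChunk<T>>& chunks) {
    std::size_t total = 0;
    for (const auto& c : chunks)
        total += c.length;

    std::vector<T> out;
    out.reserve(total);
    for (const auto& c : chunks) {
        if (c.length == 0)
            continue;
        const T* begin = c.buffer + c.offset;
        out.insert(out.end(), begin, begin + c.length);
    }
    return out;
}
```

If every chunk has a zero offset this behaves exactly like copying from the start of the buffer, so honoring the offset costs nothing in the simple case.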
Last but not least, these methods get called recursively by the below
readArrowColumn:
std::shared_ptr<Column> readArrowColumn(auto arrow_column) {
    switch (arrow_column->type()->id()) {
        case arrow::Type::INT64:
        {
            return readNumericColumn<uint64_t>(arrow_column);
        }
        case arrow::Type::LIST:
        {
            auto arrow_nested_column = getNestedArrowColumn(arrow_column);
            auto nested_column = readArrowColumn(arrow_nested_column);
            auto offsets_column = readOffsetsFromArrowListColumn(arrow_column);
            return std::make_shared<ArrayColumn>(nested_column, offsets_column);
        }
    }
    return nullptr;
}
Thanks,
Arthur
________________________________
From: Niranda Perera <[email protected]>
Sent: Wednesday, November 16, 2022 12:55
To: [email protected] <[email protected]>
Cc: Alan Souza <[email protected]>
Subject: Re: [C++] Need an example on how to extract data from a column of type
Array(int64) with multiple chunks
Did you check the offset array? AFAIU, one way of constructing chunks of list
arrays is duplicating a global value array and having monotonically
increasing offsets in the offset arrays.
If the offsets are all zero-based in that case, it would be a bug.
On Wed, Nov 16, 2022 at 9:10 AM Arthur Passos
<[email protected]<mailto:[email protected]>> wrote:
Hi Alan,
In my case, arrow::ListArray::values seems to point to the same memory location
for all chunks. It feels like I need to offset it by the chunk offset or
something like that, but that would assume the arrow::ListArray::values method
always points to the same memory location for all chunks, which doesn't seem to
be the case for other files.
Thanks for the ArrowWriteProperties tip.
Best,
Arthur
________________________________
From: Alan Souza via user <[email protected]<mailto:[email protected]>>
Sent: Wednesday, November 16, 2022 11:02
To: [email protected]<mailto:[email protected]>
<[email protected]<mailto:[email protected]>>
Subject: Re: [C++] Need an example on how to extract data from a column of type
Array(int64) with multiple chunks
Hello Arthur. I am using something like this:
auto chunked_column = table->GetColumnByName(col_name);
// I have only one chunk but this is not a problem
auto listArray = std::static_pointer_cast<arrow::LargeListArray>(chunked_column->chunk(0));
auto array = std::static_pointer_cast<arrow::FloatArray>(listArray->values());
In this example I am using LargeListArray, but it is similar for ListArray.
Not related to your issue, but it is necessary to customize the
ArrowWriterProperties options to save all the type information, for instance:
parquet::ArrowWriterProperties::Builder builder;
builder.store_schema();
Without this, the Parquet file created by the Arrow library has a ListArray
instead of a LargeListArray for these columns.
On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos
<[email protected]<mailto:[email protected]>> wrote:
Hi Niranda
Yes, one of the columns (there are over 50 columns in this file) is of type
List<Int64>. You can see that in the parquet-tools inspect output below:
arthur@arthur:~/parquet-validation$ parquet-tools inspect
~/Downloads/test_file.parquet | grep test_array_column -A 10
path: test_array_column.list.element
max_definition_level: 2
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 56%)
As far as I know, the Arrow library represents List columns with an array of
offsets and one or more chunks of memory storing the nested column data. On
my side, I have a very similar structure, so I would like to extract both the
array of offsets and the nested column data with the least amount of copying
possible.
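To make that layout concrete, here is a small sketch of how a list is recovered from an offsets array and a flat values array. It uses plain vectors rather than Arrow types, and `listAt` is a hypothetical helper name used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// List i occupies values[offsets[i]] up to, but not including,
// values[offsets[i + 1]]. `offsets` has one more entry than there are lists.
std::vector<int64_t> listAt(const std::vector<int32_t>& offsets,
                            const std::vector<int64_t>& values,
                            std::size_t i) {
    return std::vector<int64_t>(values.begin() + offsets[i],
                                values.begin() + offsets[i + 1]);
}
```

For example, offsets {0, 2, 2, 5} over values {1, 2, 3, 4, 5} describe three lists: [1, 2], an empty list, and [3, 4, 5].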
Best,
Arthur
________________________________
From: Niranda Perera <[email protected]<mailto:[email protected]>>
Sent: Wednesday, November 16, 2022 10:28
To: [email protected]<mailto:[email protected]>
<[email protected]<mailto:[email protected]>>
Subject: Re: [C++] Need an example on how to extract data from a column of type
Array(int64) with multiple chunks
Hi Arthur,
I'm not very clear about the use case here. Just to clarify, in your original
Parquet file, do you have List<int64> typed columns?
On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos
<[email protected]<mailto:[email protected]>> wrote:
Hi
I am reading a Parquet file with arrow::RecordBatchReader, and the arrow::Table
returned contains columns with two chunks (column->num_chunks() == 2). The
column in question (though not the only one affected) is of type Array(Int64).
I want to extract the data (nested column data) as well as the offsets from
that column. I have found only one
example<https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121>
of Array columns and it assumes the nested type is known at compile time AND
the column has only one chunk.
I have tried to loop over the Array(Int64) column chunks and grab the
`values()` member, but for some reason, for that specific Parquet file, the
values member of every chunk points to the same memory location. Therefore, if
I do something like the below, I end up with duplicated data:
static std::shared_ptr<arrow::ChunkedArray>
getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}
I can provide more info, but to keep the initial request short and simple, I'll
leave it at that.
Thanks in advance,
Arthur
--
Niranda Perera
https://niranda.dev/
@n1r44<https://twitter.com/N1R44>