Hello,
I use the
parquet::arrow::FileReader::ReadRowGroups(const std::vector<int>& row_groups,
const std::vector<int>& column_indices,
std::shared_ptr<::arrow::Table>* out)
to import Parquet data. However, when reading the file
blogs.parquet<https://github.com/apache/arrow-testing/blob/master/data/parquet/generated_simple_numerics/blogs.parquet>
I ran into a problem: the table has 2 fields (according to the import
object), but when I tried to import both of them (passing column_indices as
{0, 1} in C++), only the first field was returned. The reason seems to be that
the first field is a struct with 2 sub-elements, and the Parquet reader counts
the sub-elements of each field when it selects the columns to output.
For reference, here is the structure of the table as reported by pyarrow:
pyarrow.Table
reply: struct<reply_id: int32 not null, next_id: int32>
  child 0, reply_id: int32 not null
  child 1, next_id: int32
blog_id: int64
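
To make my guess concrete, here is a minimal sketch in plain C++ (no Arrow dependency). It assumes, without my having confirmed it, that column_indices counts leaf columns in depth-first schema order. TopLevelField, BlogsSchema and TopLevelFieldOfLeaf are hypothetical names made up for this illustration, not Arrow types or functions.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a top-level schema field: its name and the
// number of primitive (leaf) columns it contains. Not an Arrow type.
struct TopLevelField {
  std::string name;
  int num_leaves;
};

// Schema of blogs.parquet as printed above: "reply" is a struct with two
// leaves (reply_id, next_id); "blog_id" is a single leaf.
inline std::vector<TopLevelField> BlogsSchema() {
  return {{"reply", 2}, {"blog_id", 1}};
}

// Maps a leaf column index to the index of the top-level field that owns it,
// ASSUMING leaves are numbered depth-first in schema order.
// Returns -1 if the leaf index is out of range.
inline int TopLevelFieldOfLeaf(const std::vector<TopLevelField>& schema,
                               int leaf_index) {
  if (leaf_index < 0) return -1;
  int first_leaf = 0;  // index of the first leaf of the current field
  for (int i = 0; i < static_cast<int>(schema.size()); ++i) {
    if (leaf_index < first_leaf + schema[i].num_leaves) return i;
    first_leaf += schema[i].num_leaves;
  }
  return -1;
}
```

Under that assumption, leaf indices 0 and 1 both map to field 0 ("reply") and only leaf index 2 maps to "blog_id", which would match what I observe with {0, 1}.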
So my question is:
Is this the intended behaviour (the Parquet reader treating column_indices as
referring to sub-fields)? If so, I think it is somewhat inconsistent with
what is done by
Result<std::shared_ptr<RecordBatch>> SelectColumns(
const std::vector<int>& indices) const;
from the RecordBatch class.
In the code we also see (parquet/arrow/reader.h line 208):
/// The indicated column indices are relative to the schema
which would suggest that this is not the intended behaviour.
So is this expected, and how can I import only certain top-level fields
(rather than sub-fields)?
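
If the indices do turn out to be leaf indices, one possible workaround would be to expand each top-level field into its range of leaves by hand before calling the reader. This is only a sketch under that assumption: LeafIndicesForFields is a hypothetical helper, not part of Arrow, and the per-field leaf counts would have to be derived from the file's schema.

```cpp
#include <cstddef>
#include <vector>

// Expands top-level field indices into the leaf column indices they cover.
// leaves_per_field[i] is the number of leaf columns under top-level field i
// (hypothetical input, to be derived from the schema by the caller).
// ASSUMES leaves are numbered depth-first in schema order.
// Out-of-range field indices are skipped.
inline std::vector<int> LeafIndicesForFields(
    const std::vector<int>& leaves_per_field,
    const std::vector<int>& field_indices) {
  // first_leaf[i] = index of the first leaf belonging to field i.
  std::vector<int> first_leaf(leaves_per_field.size(), 0);
  for (std::size_t i = 1; i < leaves_per_field.size(); ++i) {
    first_leaf[i] = first_leaf[i - 1] + leaves_per_field[i - 1];
  }

  std::vector<int> leaves;
  for (int f : field_indices) {
    if (f < 0 || f >= static_cast<int>(leaves_per_field.size())) continue;
    for (int k = 0; k < leaves_per_field[f]; ++k) {
      leaves.push_back(first_leaf[f] + k);
    }
  }
  return leaves;
}
```

For blogs.parquet the leaf counts would be {2, 1}, so asking for fields {0, 1} gives {0, 1, 2}, which could then be passed as column_indices to ReadRowGroups.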
Best regards,
Louis Calot