Hello Micah,

Thanks for your answer. I ended up going with method 1, and my code now runs correctly (it was not too hard to do using the schema_fields member). I agree that the documentation and parameter names could be clearer, because it is quite hard to tell which level the column indices refer to.
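For readers of the thread, method 1 (walking the root schema elements and collecting their leaf column indices) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code actually used: `SchemaNode`, `CollectLeafIndices`, `LeafIndicesForTopLevelFields` and `MakeBlogsSchema` are hypothetical names, and `SchemaNode` is a simplified stand-in for the entries of schema_fields, assumed here to expose their children and a leaf column index.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one entry of schema_fields: a node is either a
// leaf (column_index >= 0, no children) or a nested field with children.
struct SchemaNode {
  std::vector<SchemaNode> children;
  int column_index = -1;  // leaf Parquet column index, -1 for nested fields
};

// Depth-first walk that appends every leaf column index found under `node`.
void CollectLeafIndices(const SchemaNode& node, std::vector<int>* out) {
  assert(out != nullptr);
  if (node.children.empty()) {
    if (node.column_index >= 0) out->push_back(node.column_index);
    return;
  }
  for (const SchemaNode& child : node.children) {
    CollectLeafIndices(child, out);
  }
}

// Expands top-level field indices into the leaf column indices that an API
// taking leaf indices (such as column_indices in ReadRowGroups) expects.
std::vector<int> LeafIndicesForTopLevelFields(
    const std::vector<SchemaNode>& schema_fields,
    const std::vector<int>& top_level_indices) {
  std::vector<int> leaves;
  for (int i : top_level_indices) {
    CollectLeafIndices(schema_fields.at(i), &leaves);  // at() checks bounds
  }
  return leaves;
}

// Mirrors the layout of blogs.parquet described in this thread:
// reply: struct<reply_id, next_id> (leaves 0 and 1), blog_id (leaf 2).
std::vector<SchemaNode> MakeBlogsSchema() {
  SchemaNode reply_id;
  reply_id.column_index = 0;
  SchemaNode next_id;
  next_id.column_index = 1;
  SchemaNode reply;
  reply.children = {reply_id, next_id};
  SchemaNode blog_id;
  blog_id.column_index = 2;
  return {reply, blog_id};
}
```

With this layout, `LeafIndicesForTopLevelFields(MakeBlogsSchema(), {0, 1})` yields the leaf indices 0, 1, 2, which would select both top-level fields, while `{1}` yields only leaf index 2 (blog_id).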
Cheers,
Louis Calot

________________________________
From: Micah Kornfield <[email protected]>
Sent: Wednesday, December 7, 2022 04:59
To: [email protected] <[email protected]>
Subject: Re: [C++][Parquet] Field selection of complex field types

Hi Louis,

> In the code we also see (parquet/arrow/reader.h line 208):
> /// The indicated column indices are relative to the schema
> which would mean that this is not the intended behaviour.

I think the documentation and parameter names could be clearer, as the definitions of indices are not well defined and differ by method call. column_indices for ReadRowGroups takes leaf Parquet column indices as the columns it selects, which is why you are seeing that behavior. Ultimately, these get translated to top-level indices via Schema.GetFieldIndices [1].

> So is that normal, and how can I import only certain fields (at the higher level, not sub fields)?

Unfortunately, as far as I know this would be a DIY in one of the following ways:

1. Do a traversal of root elements in the schema [2] and retrieve all the leaf indices.
2. Use the GetColumn API calls, which I believe take top-level fields for reading, and piece together a Table in your code. [3]
3. Contribute a patch which can take top-level field indices. I think the main challenge here is naming/distinguishing this from existing APIs.

Given the proliferation of APIs, I'm not sure adding a new one is a great idea because it adds to the confusion (maybe contributing a utility method to do the traversal mentioned in 1 is more practical).
Cheers,
Micah

[1] https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.h#L158
[2] https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.h#L115
[3] https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/reader.h#L127

On Mon, Dec 5, 2022 at 6:11 AM Louis C <[email protected]> wrote:

Hello,

I use

    parquet::arrow::FileReader::ReadRowGroups(const std::vector<int>& row_groups,
                                              const std::vector<int>& column_indices,
                                              std::shared_ptr<::arrow::Table>* out)

for importing Parquet data. However, when dealing with the table blogs.parquet <https://github.com/apache/arrow-testing/blob/master/data/parquet/generated_simple_numerics/blogs.parquet> I came across a problem: the number of fields of the table (when querying the import object) was 2, but when I tried to import the 2 fields (passing column_indices as {0,1} in C++), it only returned the first field. The reason seems to be that the first field is a struct with 2 sub-elements, and the Parquet reader takes the sub-elements of the fields into account when it chooses the fields to output.

For reference, here is the structure of the table that pyarrow returns:

    pyarrow.Table
    reply: struct<reply_id: int32 not null, next_id: int32>
      child 0, reply_id: int32 not null
      child 1, next_id: int32
    blog_id: int64

So my question is: is that the intended behaviour (the Parquet reader treating column_indices as referring to sub-fields)? If so, I think it is somewhat inconsistent with what is done by

    Result<std::shared_ptr<RecordBatch>> SelectColumns(const std::vector<int>& indices) const;

from the RecordBatch class.

In the code we also see (parquet/arrow/reader.h line 208):

    /// The indicated column indices are relative to the schema

which would mean that this is not the intended behaviour.
So is that normal, and how can I import only certain fields (at the higher level, not sub fields)?

Best regards,
Louis Calot
