> I would say that indeed the documentation/parameters name could be
> clearer, because it is quite hard to know to which level the column indices
> refer to.

This would be a great contribution. If you are willing to take a stab at it,
I am happy to review.

Thanks,
Micah

On Thu, Dec 8, 2022 at 12:43 AM Louis C <[email protected]> wrote:

> Hello Micah,
> Thanks for your answer. I ended up doing method 1, and my code now runs
> correctly (that was not too hard to do using the schema_fields member).
> I would say that indeed the documentation/parameters name could be
> clearer, because it is quite hard to know to which level the column indices
> refer to.
>
> Cheers,
> Louis Calot
> ------------------------------
> *From:* Micah Kornfield <[email protected]>
> *Sent:* Wednesday, December 7, 2022 04:59
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [C++][Parquet] Field selection of complex field types
>
> Hi Louis,
>
> In the code we also see (parquet/arrow/reader.h line 208):
> /// The indicated column indices are relative to the schema
> which would mean that this is not the intended behaviour.
>
>
> I think the documentation and parameter names could be clearer, as the
> definitions of indices are not well defined and differ by method call.
> column_indices for ReadRowGroups takes leaf Parquet column indices as
> the columns it selects, which is why you are seeing that behavior.
> Ultimately, these get translated to top level indices via
> Schema.GetFieldIndices [1]
>
>
>
> So is that normal and how to import only certain fields (at the higher
> level, not sub fields) ?
>
> Unfortunately, as far as I know this would be a DIY in one of two ways:
> 1. Do a traversal of root elements in the schema [2] and retrieve all the
> leaf indices
> 2. Use the GetColumn API calls, which I believe take top level fields for
> reading, and piece together a Table in your code. [3]
> 3. Contribute a patch which can take top level field indices. I think
> the main challenge here is naming/distinguishing this from existing APIs.
> Given the proliferation of APIs I'm not sure adding a new one is a great
> idea because it adds to the confusion (maybe contributing a utility method
> to do the traversal mentioned in 1 is more practical).
>
> Cheers,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.h#L158
> [2]
> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.h#L115
> [3]
> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/reader.h#L127
>
>
> On Mon, Dec 5, 2022 at 6:11 AM Louis C <[email protected]> wrote:
>
> Hello,
> I use the
>
> parquet::arrow::FileReader::ReadRowGroups(const std::vector<int>& row_groups,
>                                           const std::vector<int>& column_indices,
>                                           std::shared_ptr<::arrow::Table>* out)
>
> for importing Parquet data. However, when dealing with the table
> blogs.parquet
> <https://github.com/apache/arrow-testing/blob/master/data/parquet/generated_simple_numerics/blogs.parquet>
> I came across a problem: the number of fields of the table (when querying
> the import object) was 2, but when I tried to import the 2 fields (putting
> column_indices as {0,1} in C++), it only returned the first field. The
> reason seems to be that the first field is a struct with 2 sub elements,
> and the parquet reader takes into account the sub elements of the fields
> when it chooses the fields to output.
> For reference, here is the structure of the table that pyarrow returns:
>
> pyarrow.Table
> reply: struct<reply_id: int32 not null, next_id: int32>
>   child 0, reply_id: int32 not null
>   child 1, next_id: int32
> blog_id: int64
>
> So my question will be:
> Is that the intended behaviour (parquet reader dealing with column_indices
> as referring to sub fields)?
> In this case I think it would be a bit
> inconsistent with what is done with
>
> Result<std::shared_ptr<RecordBatch>> SelectColumns(
>     const std::vector<int>& indices) const;
>
> from the RecordBatch class.
> In the code we also see (parquet/arrow/reader.h line 208):
>
> /// The indicated column indices are relative to the schema
>
> which would mean that this is not the intended behaviour.
> So is that normal and how to import only certain fields (at the higher
> level, not sub fields) ?
>
> Best regards,
> Louis Calot
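
For readers following the thread, method 1 (walk the top-level schema fields
and collect the leaf column indices beneath each one) can be sketched roughly
as below. This is only an illustration under stated assumptions: Node is a
hypothetical stand-in type that loosely mirrors the children / column_index
shape of the schema fields discussed above, and CollectLeaves,
LeafIndicesForTopLevel and BlogsSchema are made-up helper names, not Arrow
APIs. The sample tree reproduces the blogs.parquet layout printed by pyarrow.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a schema field node (NOT the Arrow type).
// Leaves carry a Parquet leaf column index; nested fields carry -1.
struct Node {
  std::string name;
  int column_index;
  std::vector<Node> children;
};

// Depth-first walk that appends every leaf column index found under `node`.
void CollectLeaves(const Node& node, std::vector<int>* out) {
  assert(out != nullptr);
  if (node.children.empty()) {
    if (node.column_index >= 0) out->push_back(node.column_index);
    return;
  }
  for (const Node& child : node.children) CollectLeaves(child, out);
}

// Translate top-level field indices into the leaf column indices they cover.
// Out-of-range top-level indices are silently skipped in this sketch.
std::vector<int> LeafIndicesForTopLevel(const std::vector<Node>& top_level,
                                        const std::vector<int>& wanted) {
  std::vector<int> leaves;
  for (int i : wanted) {
    if (i < 0 || i >= static_cast<int>(top_level.size())) continue;
    CollectLeaves(top_level[i], &leaves);
  }
  return leaves;
}

// Same layout as blogs.parquet: a struct field with two leaves, then a
// plain int64 field. Leaf indices are numbered in depth-first order.
std::vector<Node> BlogsSchema() {
  Node reply{"reply", -1, {Node{"reply_id", 0, {}}, Node{"next_id", 1, {}}}};
  Node blog_id{"blog_id", 2, {}};
  return {reply, blog_id};
}
```

With this layout, asking for top-level fields {0, 1} gives leaf indices
{0, 1, 2}, and asking for {1} alone gives {2}. That leaf list is the kind of
value ReadRowGroups expects as column_indices, which is why passing {0, 1}
directly returned only the reply struct.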
