Re: [C++][Parquet] Field selection of complex field types

Micah Kornfield Tue, 13 Dec 2022 22:13:40 -0800

>
> I would say that indeed the documentation/parameter names could be
> clearer, because it is quite hard to know which level the column indices
> refer to.



This would be a great contribution. If you are willing to take a stab at it,
I am happy to review.

Thanks,
Micah


On Thu, Dec 8, 2022 at 12:43 AM Louis C <[email protected]> wrote:

> Hello Micah,
> Thanks for your answer. I ended up doing method 1, and my code now runs
> correctly (that was not too hard to do using the schema_fields member).
> I would say that indeed the documentation/parameter names could be
> clearer, because it is quite hard to know which level the column indices
> refer to.
>
> Cheers,
> Louis Calot
> ------------------------------
> *From:* Micah Kornfield <[email protected]>
> *Sent:* Wednesday, December 7, 2022 04:59
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [C++][Parquet] Field selection of complex field types
>
> Hi Louis,
>
> In the code we also see (parquet/arrow/reader.h line 208):
> /// The indicated column indices are relative to the schema
> which would mean that this is not the intended behaviour.
>
>
> I think the documentation and parameter names could be clearer, as the
> meaning of the indices is not well defined and differs by method call.
> column_indices for ReadRowGroups takes leaf Parquet column indices to
> select columns, which is why you are seeing that behavior.
> Ultimately, these get translated to top-level indices via
> Schema.GetFieldIndices [1]
>
>
>
> So is that normal, and how can I import only certain fields (at the top
> level, not sub-fields)?
>
> Unfortunately, as far as I know this would be DIY, in one of the following ways:
> 1.  Do a traversal of root elements in the schema [2] and retrieve all the
> leaf indices
> 2.  Use the GetColumn API calls, which I believe take top level fields for
> reading, and piece together a Table in your code. [3]
> 3.  Contribute a patch which can take top level field indices.  I think
> the main challenge here is naming/distinguishing this from existing APIs.
> Given the proliferation of APIs, I'm not sure adding a new one is a great
> idea because it adds to the confusion (maybe contributing a utility method
> to do the traversal mentioned in 1 is more practical).
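The traversal in option 1 can be sketched roughly as follows. This is only an illustration, not Arrow code: FieldNode is a hypothetical, simplified stand-in for parquet::arrow::SchemaField (keeping just child nodes and a leaf column index), and CollectLeafIndices and TopLevelToLeafIndices are made-up helper names. With the real library, the tree would come from the schema manifest's schema_fields, and the resulting vector would be what gets passed as column_indices.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for parquet::arrow::SchemaField.
// Assumption: a node with no children is a leaf and carries its leaf
// column index; non-leaf nodes use -1.
struct FieldNode {
  std::string name;
  int column_index;
  std::vector<FieldNode> children;
};

// Depth-first walk that appends every leaf column index under `node`.
void CollectLeafIndices(const FieldNode& node, std::vector<int>* out) {
  if (node.children.empty()) {
    out->push_back(node.column_index);
    return;
  }
  for (const FieldNode& child : node.children) {
    CollectLeafIndices(child, out);
  }
}

// Expands top-level field indices into the leaf column indices they cover.
// Throws std::out_of_range (from vector::at) on an invalid top-level index.
std::vector<int> TopLevelToLeafIndices(
    const std::vector<FieldNode>& schema_fields,
    const std::vector<int>& top_level_indices) {
  std::vector<int> leaves;
  for (int i : top_level_indices) {
    CollectLeafIndices(schema_fields.at(i), &leaves);
  }
  return leaves;
}
```

For the blogs.parquet layout quoted in this thread (a struct field `reply` with two children, followed by `blog_id`), top-level indices {0, 1} would expand to leaf indices {0, 1, 2}.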
>
> Cheers,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.h#L158
> [2]
> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.h#L115
> [3]
> https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/reader.h#L127
>
>
> On Mon, Dec 5, 2022 at 6:11 AM Louis C <[email protected]> wrote:
>
> Hello,
> I use the
>
> parquet::arrow::FileReader::ReadRowGroups(const std::vector<int>& row_groups,
>                                           const std::vector<int>& column_indices,
>                                           std::shared_ptr<::arrow::Table>* out)
>
>
>  for importing Parquet data. However, when dealing with the table
> blogs.parquet
> <https://github.com/apache/arrow-testing/blob/master/data/parquet/generated_simple_numerics/blogs.parquet>
> I came across a problem: the number of fields of the table (when querying
> the import object) was 2, but when I tried to import the 2 fields (putting
> column_indices as {0,1} in C++), it only returned the first field. The
> reason seems to be that the first field is a struct with 2 sub elements,
> and the parquet reader takes into account the sub elements of the fields
> when it chooses the fields to output.
> For reference, here is the structure of the table that pyarrow returns:
> pyarrow.Table
> reply: struct<reply_id: int32 not null, next_id: int32>
>   child 0, reply_id: int32 not null
>   child 1, next_id: int32
> blog_id: int64
>
> So my question is:
> Is that the intended behaviour (the Parquet reader treating column_indices
> as referring to sub-fields)? If so, I think it is a bit
> inconsistent with what is done with
>
> Result<std::shared_ptr<RecordBatch>> SelectColumns(
>       const std::vector<int>& indices) const;
>
> from the RecordBatch class.
> In the code we also see (parquet/arrow/reader.h line 208):
>
> /// The indicated column indices are relative to the schema
>
> which would mean that this is not the intended behaviour.
> So is that normal, and how can I import only certain fields (at the top
> level, not sub-fields)?
>
> Best regards,
> Louis Calot
>
>
>
