+1

~Dev

On Mon, Mar 14, 2016 at 11:12 AM, Shubham Pathak <[email protected]>
wrote:

> Hello Community,
>
> I am developing a ParquetReaderOperator that will allow Apex users to
> read Parquet files.
>
> Apache Parquet is a columnar storage format available to any project in the
> Hadoop ecosystem, regardless of the choice of data processing framework,
> data model or programming language.
> For more information, see the Apache Parquet documentation:
> <https://parquet.apache.org/documentation/latest/>
>
> Proposed design:
>
>    1. Develop an AbstractParquetFileReaderOperator that extends
>    AbstractFileInputOperator.
>    2. Override the openFile() method to instantiate a ParquetReader (the
>    reader provided by the parquet-mr project
>    <https://github.com/Parquet/parquet-mr>, which reads Parquet records
>    from a file) with GroupReadSupport (records would be read as Group).
>    3. Override the readEntity() method to read the records and call the
>    convertGroup() method. Derived classes override convertGroup() to
>    convert a Group to whatever form downstream operators require.
>    4. Provide a concrete implementation, the ParquetFilePOJOReader
>    operator, that extends AbstractParquetFileReaderOperator and overrides
>    convertGroup() to convert a given Group to a POJO.
>
> The Parquet schema and directory path would be inputs to the base
> operator. For ParquetFilePOJOReader, the POJO class would also be required.
>
> Please feel free to let me know your thoughts on this.
>
> Thanks,
> Shubham
>
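
For anyone following the thread, the four design steps above could be sketched roughly as below. This is only an illustration of the class layout, not the proposed implementation: every Sketch* name is a hypothetical stand-in so that the example compiles without Apex or parquet-mr on the classpath. SketchGroup stands in for parquet-mr's Group, SketchParquetReader for AbstractParquetFileReaderOperator, and SketchPojoReader for ParquetFilePOJOReader. The method signatures are simplified and are not claimed to match AbstractFileInputOperator, and the reflection-based field copy is just one assumed way of mapping a Group to a POJO.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for parquet-mr's Group: one record of named values.
final class SketchGroup {
  private final Map<String, Object> values = new LinkedHashMap<>();

  SketchGroup with(String name, Object value) {
    values.put(name, value);
    return this;
  }

  Object get(String name) {
    return values.get(name);
  }
}

// Stand-in for the proposed AbstractParquetFileReaderOperator (steps 1 to 3).
abstract class SketchParquetReader<T> {
  private Iterator<SketchGroup> records;

  // Step 2: the real operator would create a ParquetReader with
  // GroupReadSupport here; this sketch just iterates an in-memory list.
  void openFile(List<SketchGroup> fileContents) {
    records = fileContents.iterator();
  }

  // Step 3: read the next record and hand it to convertGroup();
  // returns null once the "file" is exhausted.
  T readEntity() {
    if (records == null || !records.hasNext()) {
      return null;
    }
    return convertGroup(records.next());
  }

  // Derived classes decide what downstream operators receive.
  protected abstract T convertGroup(SketchGroup group);
}

// Stand-in for the proposed ParquetFilePOJOReader (step 4).
final class SketchPojoReader<T> extends SketchParquetReader<T> {
  private final Class<T> pojoClass;

  SketchPojoReader(Class<T> pojoClass) {
    this.pojoClass = pojoClass;
  }

  @Override
  protected T convertGroup(SketchGroup group) {
    try {
      Constructor<T> ctor = pojoClass.getDeclaredConstructor();
      ctor.setAccessible(true);
      T pojo = ctor.newInstance();
      // Assumption: POJO field names match the record's field names.
      for (Field field : pojoClass.getDeclaredFields()) {
        Object value = group.get(field.getName());
        if (value != null) {
          field.setAccessible(true);
          field.set(pojo, value);
        }
      }
      return pojo;
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(
          "Cannot convert Group to " + pojoClass.getName(), e);
    }
  }
}

// Example POJO used only for illustration.
final class SketchUser {
  String name;
  int age;
}
```

Keeping convertGroup() as the only abstract method means a new output format (a POJO, a Map, a delimited string) needs just one small subclass, while the file handling and record reading stay in the base class.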
