+1

~Dev
On Mon, Mar 14, 2016 at 11:12 AM, Shubham Pathak <[email protected]> wrote:

> Hello Community,
>
> I am working on developing a ParquetReaderOperator which will allow apex
> users to read parquet files.
>
> Apache Parquet is a columnar storage format available to any project in the
> Hadoop ecosystem, regardless of the choice of data processing framework,
> data model or programming language.
> For more information : Apache Parquet
> <https://parquet.apache.org/documentation/latest/>
>
> Proposed design :
>
>    1. Develop AbstractParquetFileReaderOperator that extends
>    from AbstractFileInputOperator.
>    2. Override openFile() method to instantiate a ParquetReader ( reader
>    provided by parquet-mr <https://github.com/Parquet/parquet-mr> project
>    that reads parquet records from a file ) with GroupReadSupport ( records
>    would be read as Group ) .
>    3. Override readEntity() method to read the records and call
>    convertGroup() method. Derived classes to override convertGroup()
>    method
>    to convert Group to any form required by downstream operators.
>    4. Provide a concrete implementation, ParquetFilePOJOReader operator
>    that extends from AbstractParquetFileReaderOperator and
>    overrides convertGroup() method to convert a given Group to POJO.
>
> Parquet schema and directory path would be inputs to the base operator. For
> ParquetFilePOJOReader, pojo class would also be required.
>
> Please feel free to let me know your thoughts on this.
>
> Thanks,
> Shubham
>
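For discussion, here is a rough sketch of how the four steps in the proposed design might fit together. It is only an illustration of the class shape, not the actual operator: it deliberately avoids the real Apex and parquet-mr classes so that it compiles on its own. Every name in it (SketchGroup, AbstractGroupReaderSketch, PojoGroupReaderSketch, UserSketch) is a hypothetical stand-in, and mapping record fields to POJO fields by name via reflection is my assumption; the proposal does not say how the conversion would be done.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Parquet Group: one record as field name -> value.
final class SketchGroup {
    private final Map<String, Object> values = new LinkedHashMap<>();

    SketchGroup with(String field, Object value) {
        values.put(field, value);
        return this;
    }

    Object get(String field) {
        return values.get(field);
    }
}

// Mirrors the shape of the proposed AbstractParquetFileReaderOperator,
// without any Apex or parquet-mr plumbing.
abstract class AbstractGroupReaderSketch<T> {
    private Iterator<SketchGroup> reader;

    // Stand-in for openFile(): the real operator would build a ParquetReader
    // with GroupReadSupport here instead of iterating an in-memory list.
    void openFile(List<SketchGroup> records) {
        reader = records.iterator();
    }

    // Stand-in for readEntity(): reads one record and converts it.
    // Returns null once the source is exhausted.
    T readEntity() {
        if (reader == null || !reader.hasNext()) {
            return null;
        }
        return convertGroup(reader.next());
    }

    // Derived classes decide what downstream operators receive.
    protected abstract T convertGroup(SketchGroup group);
}

// Mirrors the shape of the proposed ParquetFilePOJOReader.
// Assumption: record fields map to POJO fields with the same name.
final class PojoGroupReaderSketch<T> extends AbstractGroupReaderSketch<T> {
    private final Class<T> pojoClass;

    PojoGroupReaderSketch(Class<T> pojoClass) {
        this.pojoClass = pojoClass;
    }

    @Override
    protected T convertGroup(SketchGroup group) {
        try {
            Constructor<T> ctor = pojoClass.getDeclaredConstructor();
            ctor.setAccessible(true);
            T pojo = ctor.newInstance();
            for (Field field : pojoClass.getDeclaredFields()) {
                if (field.isSynthetic() || Modifier.isStatic(field.getModifiers())) {
                    continue; // only plain instance fields are populated
                }
                Object value = group.get(field.getName());
                if (value != null) {
                    field.setAccessible(true);
                    field.set(pojo, value); // boxed values are unwrapped for primitives
                }
            }
            return pojo;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "Cannot convert record to " + pojoClass.getName(), e);
        }
    }
}

// Hypothetical POJO used only to exercise the sketch.
final class UserSketch {
    String name;
    int age;
}
```

To try it, build a list of SketchGroup records, pass it to openFile(), and call readEntity() until it returns null. Fields missing from a record keep their Java defaults in the resulting POJO. Keeping convertGroup() as the only abstract method means other output forms (for example a Map or a delimited String) would need just one more small subclass.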
