+1

Does Parquet support partitioned reads from a single file? If yes, then maybe we can also add support in FileSplitterInput and BlockReader to read a single file in parallel.

- Tushar.

On Mon, Mar 14, 2016 at 11:23 AM, Devendra Tagare <[email protected]> wrote:

> +1
>
> ~Dev
>
> On Mon, Mar 14, 2016 at 11:12 AM, Shubham Pathak <[email protected]>
> wrote:
>
> > Hello Community,
> >
> > I am working on developing a ParquetReaderOperator which will allow apex
> > users to read parquet files.
> >
> > Apache Parquet is a columnar storage format available to any project in
> > the Hadoop ecosystem, regardless of the choice of data processing
> > framework, data model or programming language.
> > For more information : Apache Parquet
> > <https://parquet.apache.org/documentation/latest/>
> >
> > Proposed design :
> >
> >    1. Develop AbstractParquetFileReaderOperator that extends
> >    from AbstractFileInputOperator.
> >    2. Override openFile() method to instantiate a ParquetReader ( reader
> >    provided by parquet-mr <https://github.com/Parquet/parquet-mr> project
> >    that reads parquet records from a file ) with GroupReadSupport (
> >    records would be read as Group ) .
> >    3. Override readEntity() method to read the records and call
> >    convertGroup() method. Derived classes to override convertGroup()
> >    method to convert Group to any form required by downstream operators.
> >    4. Provide a concrete implementation, ParquetFilePOJOReader operator
> >    that extends from AbstractParquetFileReaderOperator and
> >    overrides convertGroup() method to convert a given Group to POJO.
> >
> > Parquet schema and directory path would be inputs to the base operator.
> > For ParquetFilePOJOReader, pojo class would also be required.
> >
> > Please feel free to let me know your thoughts on this.
> >
> > Thanks,
> > Shubham
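
For reference, the class layout in the quoted design (steps 1 to 4) could be sketched roughly as below. This is only an illustration in plain Java with no Apex or parquet-mr dependency: Record is a hypothetical stand-in for parquet-mr's Group, an in-memory list stands in for the ParquetReader that openFile() would create, and the names AbstractRecordReader, PojoRecordReader and Person are made up for the example. The reflection-based field mapping is an assumption about how the Group to POJO conversion might work, not the proposed implementation.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParquetReaderSketch {

  // Hypothetical stand-in for parquet-mr's Group: field name -> value.
  static class Record {
    final Map<String, Object> fields = new LinkedHashMap<>();

    Record with(String name, Object value) {
      fields.put(name, value);
      return this;
    }
  }

  // Mirrors the proposed AbstractParquetFileReaderOperator (steps 1 to 3).
  abstract static class AbstractRecordReader<T> {
    private Iterator<Record> reader;

    // Stand-in for openFile(): the real operator would build a ParquetReader here.
    void openFile(List<Record> source) {
      reader = source.iterator();
    }

    // Stand-in for readEntity(): read one record and convert it; null at end of input.
    T readEntity() {
      return reader.hasNext() ? convertGroup(reader.next()) : null;
    }

    // Derived classes decide what downstream operators receive.
    protected abstract T convertGroup(Record group);
  }

  // Mirrors the proposed ParquetFilePOJOReader (step 4).
  static class PojoRecordReader<T> extends AbstractRecordReader<T> {
    private final Class<T> pojoClass;

    PojoRecordReader(Class<T> pojoClass) {
      this.pojoClass = pojoClass;
    }

    @Override
    protected T convertGroup(Record group) {
      try {
        T pojo = pojoClass.getDeclaredConstructor().newInstance();
        // Assumption: record field names match POJO field names exactly.
        for (Map.Entry<String, Object> e : group.fields.entrySet()) {
          Field f = pojoClass.getDeclaredField(e.getKey());
          f.setAccessible(true);
          f.set(pojo, e.getValue());
        }
        return pojo;
      } catch (ReflectiveOperationException ex) {
        throw new IllegalStateException("Cannot map record to " + pojoClass.getName(), ex);
      }
    }
  }

  // Example POJO used only for this sketch.
  public static class Person {
    String name;
    int age;
  }

  public static void main(String[] args) {
    List<Record> rows = new ArrayList<>();
    rows.add(new Record().with("name", "Ada").with("age", 36));
    rows.add(new Record().with("name", "Linus").with("age", 28));

    PojoRecordReader<Person> reader = new PojoRecordReader<>(Person.class);
    reader.openFile(rows);
    for (Person p = reader.readEntity(); p != null; p = reader.readEntity()) {
      System.out.println(p.name + " " + p.age);
    }
  }
}
```

Keeping convertGroup() as the only abstract method means other output forms (for example a Map or a delimited String) would only need a new subclass, while the file handling stays in the base class.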
