Does it make sense to use InputSplit in FileSplitterInput to generate the file-split information, and InputFormat in BlockReader to read the records? That way we could read the variety of formats already supported by Hadoop in Apex. Parquet has an InputFormat and an InputSplit defined.

- Tushar.
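A minimal sketch of what this suggestion could look like, assuming parquet-mr's ParquetInputFormat and the Hadoop mapreduce API; the planSplits helper and its Job wiring are illustrative, not existing Apex code:

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class ParquetSplitSketch
{
  // Illustrative helper: plan splits for all Parquet files under inputDir.
  public static List<InputSplit> planSplits(String inputDir) throws Exception
  {
    Job job = Job.getInstance(new Configuration());
    FileInputFormat.addInputPath(job, new Path(inputDir));
    ParquetInputFormat.setReadSupportClass(job, GroupReadSupport.class);

    // FileSplitterInput could emit these splits as block metadata, and
    // each BlockReader partition could then open a RecordReader per split.
    ParquetInputFormat<Group> inputFormat = new ParquetInputFormat<>();
    return inputFormat.getSplits(job);
  }
}
```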
On Mon, Mar 14, 2016 at 5:54 PM, Pradeep Dalvi <[email protected]> wrote:

+1

--
Pradeep A. Dalvi
Software Engineer
DataTorrent (India)

On Mon, Mar 14, 2016 at 5:19 PM, Chinmay Kolhatkar <[email protected]> wrote:

+1.

On Mon, Mar 14, 2016 at 2:55 PM, Devendra Tagare <[email protected]> wrote:

Hi,

Using parquet.block.size = 128/256 MB on the writer side will ensure that the column chunks are not split across blocks for a large file.

The reader can then read the individual row groups iteratively.

The FileSplitter would then split the files at the given size into separate chunks that can be handled downstream.

Dev

On Mon, Mar 14, 2016 at 12:31 PM, Shubham Pathak <[email protected]> wrote:

@Tushar,

A Parquet file looks like this:

4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"

Parquet being a binary columnar storage format, readers are expected to first read the file metadata to find all the column chunks they are interested in. The column chunks should then be read sequentially.
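A minimal sketch of the metadata-first read pattern described above, assuming parquet-mr's footer API (ParquetFileReader.readFooter); the printLayout helper is illustrative. When files are written with parquet.block.size aligned to the HDFS block size, as Devendra suggests, each row group listed here maps onto a single block:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class ParquetFooterSketch
{
  public static void printLayout(Configuration conf, Path file) throws Exception
  {
    // The footer (file metadata) is read first; it locates every
    // column chunk of every row group in the file.
    ParquetMetadata footer =
        ParquetFileReader.readFooter(conf, file, ParquetMetadataConverter.NO_FILTER);
    for (BlockMetaData rowGroup : footer.getBlocks()) {
      System.out.println("row group: " + rowGroup.getRowCount() + " rows");
      for (ColumnChunkMetaData chunk : rowGroup.getColumns()) {
        // Column chunks within a row group are laid out contiguously
        // and are read sequentially.
        System.out.println("  column " + chunk.getPath()
            + " starts at byte " + chunk.getStartingPos());
      }
    }
  }
}
```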
On Mon, Mar 14, 2016 at 11:44 AM, Yogi Devendra <[email protected]> wrote:

+1 for Parquet reader.

~ Yogi

On 14 March 2016 at 11:41, Yogi Devendra <[email protected]> wrote:

Shubham,

I feel that instead of having an operator, it should be a plugin to the input operator.

That way, if someone has some other input operator for a particular file system (extending AbstractFileInputOperator), he should be able to read Parquet files from that file system using this plugin.

~ Yogi

On 14 March 2016 at 11:31, Tushar Gosavi <[email protected]> wrote:

+1

Does Parquet support partitioned reads from a single file? If yes, then maybe we can also add support in FileSplitterInput and BlockReader to read a single file in parallel.

- Tushar.

On Mon, Mar 14, 2016 at 11:23 AM, Devendra Tagare <[email protected]> wrote:

+1

~Dev

On Mon, Mar 14, 2016 at 11:12 AM, Shubham Pathak <[email protected]> wrote:

Hello Community,

I am working on developing a ParquetReaderOperator which will allow Apex users to read Parquet files.

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. For more information: Apache Parquet <https://parquet.apache.org/documentation/latest/>

Proposed design:

1. Develop AbstractParquetFileReaderOperator that extends from AbstractFileInputOperator.
2. Override the openFile() method to instantiate a ParquetReader (a reader provided by the parquet-mr <https://github.com/Parquet/parquet-mr> project that reads Parquet records from a file) with GroupReadSupport (records would be read as Group).
3. Override the readEntity() method to read the records and call the convertGroup() method. Derived classes override convertGroup() to convert a Group to any form required by downstream operators.
4. Provide a concrete implementation, a ParquetFilePOJOReader operator, that extends AbstractParquetFileReaderOperator and overrides convertGroup() to convert a given Group to a POJO.

The Parquet schema and directory path would be inputs to the base operator. For ParquetFilePOJOReader, the POJO class would also be required.

Please feel free to let me know your thoughts on this.

Thanks,
Shubham
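A minimal sketch of the proposed base operator, assuming Malhar's AbstractFileInputOperator contract (openFile()/readEntity()) and parquet-mr's ParquetReader with GroupReadSupport; signatures are approximate and error handling, schema wiring, and partitioning are omitted:

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

import com.datatorrent.lib.io.fs.AbstractFileInputOperator;

public abstract class AbstractParquetFileReaderOperator<T>
    extends AbstractFileInputOperator<T>
{
  protected transient ParquetReader<Group> reader;

  @Override
  protected InputStream openFile(Path path) throws IOException
  {
    InputStream is = super.openFile(path);
    // Step 2: records are read as Group via GroupReadSupport.
    reader = ParquetReader.builder(new GroupReadSupport(), path).build();
    return is;
  }

  @Override
  protected T readEntity() throws IOException
  {
    // Step 3: read the next record and hand it to convertGroup();
    // a null Group signals end of file.
    Group group = reader.read();
    return group == null ? null : convertGroup(group);
  }

  // Derived classes (e.g. a ParquetFilePOJOReader, per step 4) convert
  // each Group to the form required by downstream operators.
  protected abstract T convertGroup(Group group);
}
```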
