Re: Adding ParquetReaderOperator in Malhar

Shubham Pathak Mon, 14 Mar 2016 00:01:58 -0700

@Tushar,

A parquet file looks like this:


4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"

Because Parquet is a binary columnar storage format, readers are expected
to first read the file metadata to find all the column chunks they are
interested in. The column chunks should then be read sequentially.
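
To make the "read the file metadata first" step concrete, here is a rough
sketch of how a reader could locate the file metadata using only the
layout above. The class and method names are made up for illustration and
are not part of parquet-mr or Malhar. It assumes the 4-byte length is
stored little-endian, as the Parquet format specifies. It only finds the
metadata block; decoding it (it is Thrift-encoded) would still be left to
parquet-mr.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class ParquetFooterLocator {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
    // Trailing 4-byte metadata length followed by the trailing 4-byte magic number.
    private static final int TAIL_SIZE = 8;

    /** Returns {metadataStartOffset, metadataLengthInBytes}. */
    static long[] locateFooter(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long fileLength = file.length();
            if (fileLength < MAGIC.length + TAIL_SIZE) {
                throw new IOException("File too small to be a Parquet file");
            }

            // Check the leading magic number.
            byte[] head = new byte[MAGIC.length];
            file.seek(0);
            file.readFully(head);

            // Read the last 8 bytes: metadata length, then the trailing magic number.
            byte[] tail = new byte[TAIL_SIZE];
            file.seek(fileLength - TAIL_SIZE);
            file.readFully(tail);
            byte[] tailMagic = Arrays.copyOfRange(tail, 4, 8);

            if (!Arrays.equals(head, MAGIC) || !Arrays.equals(tailMagic, MAGIC)) {
                throw new IOException("Missing PAR1 magic number");
            }

            // Assumption: the length field is little-endian.
            int metadataLength = ByteBuffer.wrap(tail, 0, 4)
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .getInt();
            long metadataStart = fileLength - TAIL_SIZE - metadataLength;
            if (metadataLength < 0 || metadataStart < MAGIC.length) {
                throw new IOException("Corrupt file metadata length: " + metadataLength);
            }
            return new long[] {metadataStart, metadataLength};
        }
    }
}
```

Once the metadata has been decoded, the reader knows where every column
chunk lives and can seek to the ones it needs.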



On Mon, Mar 14, 2016 at 11:44 AM, Yogi Devendra <[email protected]>
wrote:

> +1 for Parquet reader.
>
> ~ Yogi
>
> On 14 March 2016 at 11:41, Yogi Devendra <[email protected]> wrote:
>
> > Shubham,
> >
> > I feel that instead of having an operator, it should be a plugin to the
> > input operator.
> >
> > That way, if someone has some other input operator for a particular file
> > system (extending AbstractFileInputOperator), they should be able to read
> > Parquet files from that file system using this plugin.
> >
> > ~ Yogi
> >
> > On 14 March 2016 at 11:31, Tushar Gosavi <[email protected]> wrote:
> >
> >> +1
> >>
> >> Does Parquet support partitioned read from a single file? If yes, then
> >> maybe we can also add support in FileSplitterInput and BlockReader to
> >> read a single file in parallel.
> >>
> >> - Tushar.
> >>
> >>
> >>
> >> On Mon, Mar 14, 2016 at 11:23 AM, Devendra Tagare <
> >> [email protected]
> >> > wrote:
> >>
> >> > + 1
> >> >
> >> > ~Dev
> >> >
> >> > On Mon, Mar 14, 2016 at 11:12 AM, Shubham Pathak <
> >> [email protected]>
> >> > wrote:
> >> >
> >> > > Hello Community,
> >> > >
> >> > > I am working on developing a ParquetReaderOperator which will allow
> >> > > Apex users to read Parquet files.
> >> > >
> >> > > Apache Parquet is a columnar storage format available to any project
> >> > > in the Hadoop ecosystem, regardless of the choice of data processing
> >> > > framework, data model or programming language.
> >> > > For more information: Apache Parquet
> >> > > <https://parquet.apache.org/documentation/latest/>
> >> > >
> >> > > Proposed design :
> >> > >
> >> > >    1. Develop AbstractParquetFileReaderOperator, which extends
> >> > >    AbstractFileInputOperator.
> >> > >    2. Override the openFile() method to instantiate a ParquetReader
> >> > >    (the reader provided by the parquet-mr
> >> > >    <https://github.com/Parquet/parquet-mr> project that reads Parquet
> >> > >    records from a file) with GroupReadSupport (records would be read
> >> > >    as Group).
> >> > >    3. Override the readEntity() method to read the records and call
> >> > >    the convertGroup() method. Derived classes override convertGroup()
> >> > >    to convert a Group to any form required by downstream operators.
> >> > >    4. Provide a concrete implementation, the ParquetFilePOJOReader
> >> > >    operator, which extends AbstractParquetFileReaderOperator and
> >> > >    overrides the convertGroup() method to convert a given Group to a
> >> > >    POJO.
> >> > >
> >> > > The Parquet schema and directory path would be inputs to the base
> >> > > operator. For ParquetFilePOJOReader, the POJO class would also be
> >> > > required.
> >> > >
> >> > > Please feel free to let me know your thoughts on this.
> >> > >
> >> > > Thanks,
> >> > > Shubham
> >> > >
> >> >
> >>
> >
> >
>
