+1.

On Mon, Mar 14, 2016 at 2:55 PM, Devendra Tagare <[email protected]> wrote:

> Hi,
>
> Using parquet.block.size = 128/256 MB on the writer side will ensure that
> the column chunks are not striped across blocks for a large file.
>
> The reader can then read the individual row groups iteratively.
>
> The FileSplitter would then split the files at the given size into
> separate chunks that can be handled downstream.
>
> Dev
>
> On Mon, Mar 14, 2016 at 12:31 PM, Shubham Pathak <[email protected]> wrote:
>
> > @Tushar,
> >
> > A parquet file looks like this:
> >
> > 4-byte magic number "PAR1"
> > <Column 1 Chunk 1 + Column Metadata>
> > <Column 2 Chunk 1 + Column Metadata>
> > ...
> > <Column N Chunk 1 + Column Metadata>
> > <Column 1 Chunk 2 + Column Metadata>
> > <Column 2 Chunk 2 + Column Metadata>
> > ...
> > <Column N Chunk 2 + Column Metadata>
> > ...
> > <Column 1 Chunk M + Column Metadata>
> > <Column 2 Chunk M + Column Metadata>
> > ...
> > <Column N Chunk M + Column Metadata>
> > File Metadata
> > 4-byte length in bytes of file metadata
> > 4-byte magic number "PAR1"
> >
> > Parquet being a binary columnar storage format, readers are expected
> > to first read the file metadata to find all the column chunks they are
> > interested in. The column chunks should then be read sequentially.
> >
> > On Mon, Mar 14, 2016 at 11:44 AM, Yogi Devendra <[email protected]> wrote:
> >
> > > +1 for Parquet reader.
> > >
> > > ~ Yogi
> > >
> > > On 14 March 2016 at 11:41, Yogi Devendra <[email protected]> wrote:
> > >
> > > > Shubham,
> > > >
> > > > I feel that instead of having an operator, it should be a plugin to
> > > > the input operator, so that if someone has some other input operator
> > > > for a particular file system (extending AbstractFileInputOperator)
> > > > he should be able to read Parquet files from that file system using
> > > > this plugin.
> > > >
> > > > ~ Yogi
> > > >
> > > > On 14 March 2016 at 11:31, Tushar Gosavi <[email protected]> wrote:
> > > >
> > > >> +1
> > > >>
> > > >> Does Parquet support partitioned reads from a single file? If yes,
> > > >> then maybe we can also add support in FileSplitterInput and
> > > >> BlockReader to read a single file in parallel.
> > > >>
> > > >> - Tushar.
> > > >>
> > > >> On Mon, Mar 14, 2016 at 11:23 AM, Devendra Tagare <[email protected]> wrote:
> > > >>
> > > >> > +1
> > > >> >
> > > >> > ~Dev
> > > >> >
> > > >> > On Mon, Mar 14, 2016 at 11:12 AM, Shubham Pathak <[email protected]> wrote:
> > > >> >
> > > >> > > Hello Community,
> > > >> > >
> > > >> > > I am working on developing a ParquetReaderOperator which will
> > > >> > > allow Apex users to read parquet files.
> > > >> > >
> > > >> > > Apache Parquet is a columnar storage format available to any
> > > >> > > project in the Hadoop ecosystem, regardless of the choice of
> > > >> > > data processing framework, data model or programming language.
> > > >> > > For more information: Apache Parquet
> > > >> > > <https://parquet.apache.org/documentation/latest/>
> > > >> > >
> > > >> > > Proposed design:
> > > >> > >
> > > >> > > 1. Develop AbstractParquetFileReaderOperator that extends
> > > >> > > AbstractFileInputOperator.
> > > >> > > 2. Override the openFile() method to instantiate a ParquetReader
> > > >> > > (the reader provided by the parquet-mr
> > > >> > > <https://github.com/Parquet/parquet-mr> project that reads
> > > >> > > parquet records from a file) with GroupReadSupport (records
> > > >> > > would be read as Group).
> > > >> > > 3. Override the readEntity() method to read the records and call
> > > >> > > the convertGroup() method. Derived classes override
> > > >> > > convertGroup() to convert a Group to any form required by
> > > >> > > downstream operators.
> > > >> > > 4. Provide a concrete implementation, a ParquetFilePOJOReader
> > > >> > > operator that extends AbstractParquetFileReaderOperator and
> > > >> > > overrides convertGroup() to convert a given Group to a POJO.
> > > >> > >
> > > >> > > The Parquet schema and directory path would be inputs to the
> > > >> > > base operator. For ParquetFilePOJOReader, the POJO class would
> > > >> > > also be required.
> > > >> > >
> > > >> > > Please feel free to let me know your thoughts on this.
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Shubham
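The footer-first read order described in the thread (trailing "PAR1" magic, then the 4-byte metadata length, then the file metadata just before it) can be sketched in plain Java over an in-memory byte array. This is an illustration of the layout only, not parquet-mr code; the class and method names here (ParquetFooter, footerStart) are made up for the sketch.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// A Parquet file ends with: [file metadata][4-byte little-endian metadata
// length]["PAR1"]. A reader therefore seeks to the tail, validates the magic,
// reads the length, and jumps back to the metadata before touching any column
// chunk. Hypothetical helper for illustration only.
class ParquetFooter {
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns the offset at which the file metadata starts. */
    static int footerStart(byte[] file) {
        int tail = file.length - MAGIC.length;          // trailing "PAR1"
        for (int i = 0; i < MAGIC.length; i++) {
            if (file[tail + i] != MAGIC[i]) {
                throw new IllegalArgumentException("not a parquet file");
            }
        }
        int lenPos = tail - 4;                          // 4-byte metadata length
        int metaLen = ByteBuffer.wrap(file, lenPos, 4)
                                .order(ByteOrder.LITTLE_ENDIAN)
                                .getInt();
        return lenPos - metaLen;                        // metadata precedes the length
    }
}
```

In parquet-mr itself this seek-to-tail logic lives inside the library; the sketch only makes explicit why a Parquet reader cannot stream the file front to back.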

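The split Shubham proposes between AbstractParquetFileReaderOperator and ParquetFilePOJOReader is a template-method design: the base class fixes the reading, and subclasses supply only convertGroup(). A minimal self-contained sketch of that shape, with the Apex and parquet-mr types stubbed out (this Group, UserPojo, and the reader classes are hypothetical stand-ins, not the real APIs):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for parquet-mr's Group record (field name -> value); illustration only.
class Group {
    private final Map<String, Object> fields = new HashMap<>();
    Group set(String name, Object value) { fields.put(name, value); return this; }
    Object get(String name) { return fields.get(name); }
}

// Mirrors the proposed AbstractParquetFileReaderOperator: the read loop is
// fixed here, while the output form is left to subclasses via convertGroup().
abstract class AbstractParquetReader<T> {
    List<T> readAll(List<Group> records) {           // stands in for readEntity() per record
        List<T> out = new ArrayList<>();
        for (Group g : records) {
            out.add(convertGroup(g));
        }
        return out;
    }
    protected abstract T convertGroup(Group g);      // subclasses decide the output type
}

// Example POJO a downstream operator might consume; purely illustrative.
class UserPojo {
    final String name;
    final int age;
    UserPojo(String name, int age) { this.name = name; this.age = age; }
}

// Mirrors ParquetFilePOJOReader: overrides only the conversion step.
class PojoReader extends AbstractParquetReader<UserPojo> {
    @Override
    protected UserPojo convertGroup(Group g) {
        return new UserPojo((String) g.get("name"), (Integer) g.get("age"));
    }
}
```

This also shows why Yogi's plugin suggestion is workable: the conversion hook is independent of where the Group records come from, so the same convertGroup() contract could sit behind any AbstractFileInputOperator subclass.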