Introducing Parquet: efficient columnar storage for Hadoop

Jarek Jarcec Cecho Wed, 13 Mar 2013 12:55:48 -0700

Fellow Hive users,
We'd like to introduce a joint project between Twitter and Cloudera engineers 
-- a new columnar storage format for Hadoop called Parquet [1]. The official
announcement is available on the Cloudera blog [2].

Parquet is designed to bring efficient columnar storage to Hadoop. Compared to,
and learning from, the initial work done toward this goal in Trevni, Parquet
includes the following enhancements:

* Efficiently encode nested structures and sparsely populated data based on the
Google Dremel definition/repetition levels
* Provide extensible support for per-column encodings (e.g. delta, run length,
etc.)
* Provide extensibility for storing multiple types of data in column data (e.g.
indexes, bloom filters, statistics)
* Offer better write performance by storing metadata at the end of the file
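
To illustrate the last point, here is a minimal Python sketch of a file that
is written front to back with its metadata placed at the end. It is a toy
layout, not the actual Parquet file format: the JSON payloads, the 4-byte
length trailer, and the names write_file and read_column are all assumptions
made for illustration. The writer never has to seek back to patch a header,
and the reader finds the footer by reading the trailer first.

```python
import io
import json
import struct


def write_file(columns):
    """Write column chunks first, then a footer recording where each one lives."""
    buf = io.BytesIO()
    footer = {}
    for name, values in columns.items():
        payload = json.dumps(values).encode("utf-8")  # toy encoding, not Parquet's
        footer[name] = {"offset": buf.tell(), "length": len(payload)}
        buf.write(payload)
    footer_bytes = json.dumps(footer).encode("utf-8")
    buf.write(footer_bytes)
    # Fixed-size trailer: footer length as a 4-byte little-endian integer.
    buf.write(struct.pack("<I", len(footer_bytes)))
    return buf.getvalue()


def read_column(data, name):
    """Locate the footer from the trailer, then read only the requested column."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    footer = json.loads(data[-4 - footer_len:-4].decode("utf-8"))
    entry = footer[name]
    start = entry["offset"]
    return json.loads(data[start:start + entry["length"]].decode("utf-8"))
```

For example, read_column(write_file({"id": [1, 2, 3]}), "id") reads back the
"id" column without touching any other column's bytes.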

Based on feedback from the Impala beta and after a joint evaluation with
Twitter, we determined that these further improvements to the Trevni design
were necessary to provide a more efficient format that we can evolve going
forward for production usage. Furthermore, we found it appropriate to host and
develop the columnar file format outside of the Avro project (unlike Trevni,
which is part of Avro) because Avro is just one of many input data formats that
can be used with Parquet.

We created Parquet to make the advantages of compressed, efficient columnar
data representation available to any project in the Hadoop ecosystem,
regardless of the choice of data processing framework, data model, or
programming language.

Parquet is built from the ground up with complex nested data structures in
mind. We adopted the repetition/definition level approach to encoding such data
structures, as described in Google's Dremel paper; we have found this to be a
very efficient method of encoding data in non-trivial object schemas.
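
As a rough illustration of the idea, the following Python sketch shreds the
simplest nested case, a single top-level repeated integer field, into
(repetition level, definition level, value) triples and reassembles it. It is
a simplified sketch, not Parquet code: the function names are hypothetical,
and it assumes a maximum level of 1 for both repetition and definition. A
repetition level of 0 marks the start of a new record, and a definition level
of 0 marks a record whose list is empty.

```python
def shred(records):
    """Flatten a list of int lists into (repetition, definition, value) triples."""
    triples = []
    for record in records:
        if not record:
            # Empty list: one placeholder entry, field not defined (d = 0).
            triples.append((0, 0, None))
            continue
        for i, value in enumerate(record):
            # r = 0 starts a new record; r = 1 continues the current list.
            triples.append((0 if i == 0 else 1, 1, value))
    return triples


def assemble(triples):
    """Rebuild the original records from the level-annotated column."""
    records = []
    for repetition, definition, value in triples:
        if repetition == 0:
            records.append([])
        if definition == 1:
            records[-1].append(value)
    return records
```

The levels alone carry the record boundaries and the empty-list case, so the
column needs no per-record length or null markers of its own.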

Parquet is built to support very efficient compression and encoding schemes.
Parquet allows compression schemes to be specified on a per-column level, and
is future-proofed to allow adding more encodings as they are invented and
implemented. We separate the concepts of encoding and compression, allowing
Parquet consumers to implement operators that work directly on encoded data
without paying a decompression and decoding penalty when possible.
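
A small Python sketch may make the separation concrete. It assumes a
run-length encoding as the encoding layer and zlib as the general-purpose
compression layer; both choices, and all function names, are illustrative
assumptions rather than Parquet's actual implementation. The point is that
count_equal answers a query from the runs themselves and never expands them
back into individual values, while compression is applied as an independent
step on top of the encoded form.

```python
import json
import zlib


def rle_encode(values):
    """Encoding layer: collapse consecutive equal values into [value, length] runs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def count_equal(runs, target):
    """Operate on encoded data: sum run lengths without expanding the runs."""
    return sum(length for value, length in runs if value == target)


def compress_runs(runs):
    """Compression layer: applied to the serialized runs, independent of encoding."""
    return zlib.compress(json.dumps(runs).encode("utf-8"))


def decompress_runs(blob):
    """Undo only the compression layer; the result is still run-length encoded."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Because the two layers are independent, either one can be swapped per column
without affecting operators written against the other.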

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data
processing frameworks, and we are not interested in playing favorites. We
believe that an efficient, well-implemented columnar storage substrate should
be useful to all frameworks without the cost of extensive,
difficult-to-set-up dependencies.

The initial code defines the file format, provides Java building blocks for
processing columnar data, and implements Hadoop Input/Output Formats, Pig
Storers/Loaders, and an example of a complex integration: Input/Output formats
that can convert Parquet-stored data directly to and from Thrift objects.

Twitter is starting to convert some of its major data sources to Parquet in
order to take advantage of the compression and deserialization savings.

Parquet is currently under heavy development. Parquet's near-term roadmap
includes:

1. Hive SerDes (Criteo)
2. Cascading Taps (Criteo)
3. Support for dictionary encoding, zigzag encoding, and RLE encoding of
data (Cloudera and Twitter)
4. Further improvements to Pig support (Twitter)

Company names in parentheses indicate whose engineers signed up to do the
work; others can feel free to jump in too, of course.
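
For readers unfamiliar with two of the encodings named in item 3, here is a
generic Python sketch of zigzag and dictionary encoding. It shows the
textbook techniques only and says nothing about how Parquet will implement
them; the function names are hypothetical, and the zigzag functions assume
values that fit in a signed 64-bit integer. Zigzag maps small negative and
positive integers to small unsigned ones, and dictionary encoding replaces
repeated values with small integer codes.

```python
def zigzag_encode(n):
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def dictionary_encode(values):
    """Replace each value with its index in a dictionary of distinct values."""
    dictionary = []
    index = {}
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Look each code back up in the dictionary."""
    return [dictionary[c] for c in codes]
```

Both transforms are lossless, and their outputs (small non-negative integers)
are the kind of data that run-length encoding then handles well.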

We've also heard requests to provide an Avro container layer, similar to what
we do with Thrift. Seeking volunteers!

We welcome all feedback, patches, and ideas; to foster community development,
we plan to contribute Parquet to the Apache Incubator when the development is
farther along.

Regards,

Nong Li (Cloudera)
Julien Le Dem (Twitter)
Marcel Kornacker (Cloudera)
Todd Lipcon (Cloudera)
Dmitriy Ryaboy (Twitter)
Jonathan Coveney (Twitter)
Justin Coffey (Criteo)
Mickael Lacour (Criteo)
and friends.

Jarcec

Links:
1: http://parquet.github.com
2: http://blog.cloudera.com/blog/2013/03/introducing-parquet-columnar-storage-for-apache-hadoop/
