Re: [PROPOSAL] Parquet

Henry Saputra Fri, 16 May 2014 11:50:40 -0700

As with most new projects coming to the Incubator, I like to check: how busy
is the existing user@ list for Parquet right now?


If most questions or concerns are related to development, I would
suggest removing the request for a user@ list for now, so the project can
focus on moving to the ASF development infrastructure.

Other than that, the proposal looks good, and I am looking forward to the
VOTE thread.

Thanks,

Henry

On Wed, May 14, 2014 at 10:40 AM, Jake Farrell <jfarr...@apache.org> wrote:
> Changed some of the mailing lists requested to match the format currently
> in use. One question I had: do we plan to merge parquet-mr and
> parquet-format into one parquet repo, as listed in the proposal, or keep
> them separate? Other than that, looks good.
>
> -Jake
>
>
> On Mon, May 12, 2014 at 1:02 PM, Chris Aniszczyk <caniszc...@gmail.com> wrote:
>
>> We would like to propose Parquet as an Apache Incubator project.
>> https://wiki.apache.org/incubator/ParquetProposal
>>
>> Feel free to comment; we'll go for a vote in a week or two, or whenever
>> consensus has been reached on the proposal.
>>
>> I've posted the text of the proposal below:
>>
>> == Abstract ==
>> Parquet is a columnar storage format for Hadoop.
>>
>> == Proposal ==
>>
>> We created Parquet to make the advantages of compressed, efficient columnar
>> data representation available to any project in the Hadoop ecosystem,
>> regardless of the choice of data processing framework, data model, or
>> programming language.
>>
>> == Background ==
>>
>> Parquet is built from the ground up with complex nested data structures in
>> mind, and uses the repetition/definition level approach to encoding such
>> data structures, as popularized by Google Dremel (
>> https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We believe
>> this approach is superior to simple flattening of nested name spaces.
>>
>> Parquet is built to support very efficient compression and encoding
>> schemes. Parquet allows compression schemes to be specified on a per-column
>> level, and is future-proofed to allow adding more encodings as they are
>> invented and implemented. We separate the concepts of encoding and
>> compression, allowing Parquet consumers to implement operators that work
>> directly on encoded data without paying a decompression and decoding
>> penalty when possible.
>>
>> == Rationale ==
>>
>> Parquet is built to be used by anyone. We believe that an efficient,
>> well-implemented columnar storage substrate should be useful to all
>> frameworks without the cost of extensive and difficult to set up
>> dependencies.
>>
>> Furthermore, the rapid growth of the Parquet community is empowered by open
>> source. We believe the Apache foundation is a great fit as the long-term
>> home for Parquet, as it provides an established process for
>> community-driven development and decision making by consensus. This is
>> exactly the model we want for future Parquet development.
>>
>> == Initial Goals ==
>>
>> * Move the existing codebase to Apache
>> * Integrate with the Apache development process
>> * Ensure all dependencies are compliant with Apache License version 2.0
>> * Incremental development and releases per Apache guidelines
>>
>> == Current Status ==
>>
>> Parquet has undergone 2 major releases of the core format
>> (https://github.com/Parquet/parquet-format/releases) and 22 releases of
>> the supporting set of Java libraries
>> (https://github.com/Parquet/parquet-mr/releases).
>>
>> The Parquet source is currently hosted at GitHub, which will seed the
>> Apache git repository.
>>
>> === Meritocracy ===
>>
>> We plan to invest in supporting a meritocracy. We will discuss the
>> requirements in an open forum. Several companies have already expressed
>> interest in this project, and we intend to invite additional developers to
>> participate. We will encourage and monitor community participation so that
>> privileges can be extended to those that contribute.
>>
>> === Community ===
>>
>> There is a large need for an advanced columnar storage format for Hadoop.
>> Parquet is being used in production by many organizations (see
>> https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md):
>>
>>  * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
>>  * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
>>  * Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
>>  * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
>>  * Twitter: https://twitter.com/J_/statuses/315844725611581441
>>
>> By bringing Parquet into Apache, we believe that the community will grow
>> even bigger.
>>
>> === Core Developers ===
>>
>> Parquet was initially developed as a collaboration between Twitter,
>> Cloudera and Criteo.
>>
>> See
>>
>> https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
>>
>> === Alignment ===
>>
>> We believe that having Parquet at Apache will help further the growth of
>> the big-data community, as it will encourage cooperation within the greater
>> ecosystem of projects spawned by Apache Hadoop. The alignment is also
>> beneficial to other Apache communities (such as Hadoop, Hive, Avro).
>>
>> == Known Risks ==
>>
>> === Orphaned Products ===
>>
>> The risk of the Parquet project being abandoned is minimal. There are many
>> organizations using Parquet in production, including Twitter, Cloudera,
>> Stripe, and Salesforce (
>> http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
>>
>> === Inexperience with Open Source ===
>>
>> Parquet has existed as a healthy open source project for one year. During that
>> time, we have curated an open-source community successfully, attracting
>> over 40 contributors (see
>> https://github.com/Parquet/parquet-mr/graphs/contributors) from a diverse
>> group of companies.
>> Several of the core contributors to the project are deeply familiar with
>> OSS and Apache specifically: Julien Le Dem is the current PMC Chair for
>> Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan Coveney are
>> also Apache Pig committers with contributions to several other Apache
>> projects. Todd Lipcon and Tom White are committers to Apache Hadoop and
>> multiple other related projects. Brock Noland is a Hive committer.
>>
>> === Homogenous Developers ===
>>
>> The initial committers come from a number of companies and countries.
>> Parquet has an active community of developers, and we are committed to
>> recruiting additional committers based on their contributions to the
>> project. The Java library component alone has contributions from 31
>> individual GitHub accounts, 14 of which contributed over 1000 lines of
>> code.
>>
>> === Reliance on Salaried Developers ===
>>
>> It is expected that Parquet development will occur on both salaried time
>> and on volunteer time, after hours. The majority of initial committers are
>> paid by their employers to contribute to this project. However, they are
>> all passionate about the project, and we are confident that the project
>> will continue even if no salaried developers contribute to the project. As
>> evidence of this statement, we present the GitHub punchcard (see
>> https://github.com/Parquet/parquet-mr/graphs/punch-card) showing that a lot
>> of activity happens on weekends. We are committed to recruiting additional
>> committers including non-salaried developers.
>>
>> === Relationships with Other Apache Products ===
>>
>> As mentioned in the Alignment section, Parquet is closely related to
>> Hadoop, Pig, Avro, Thrift, YARN and Mesos in numerous ways. We look
>> forward to collaborating with those communities, as well as other Apache
>> communities (including Apache S4 which focuses on stateful low-latency
>> processing).
>>
>> === An Excessive Fascination with the Apache Brand ===
>>
>> Parquet is an already healthy and well known open source project. This
>> proposal is not for the purpose of generating publicity. Rather, the
>> primary benefits to joining Apache are those outlined in the Rationale
>> section.
>>
>> == Documentation ==
>>
>> Documentation currently lives in README Markdown files:
>>
>> * https://github.com/Parquet/parquet-format
>> * https://github.com/Parquet/parquet-mr
>>
>> == Source and Intellectual Property Submission Plan ==
>>
>> The Parquet codebase is currently hosted on GitHub:
>> https://github.com/Parquet.
>>
>> This is the exact codebase that we would migrate to the Apache foundation.
>>
>> == External Dependencies ==
>>
>>  * JUnit: EPL
>>  * Apache Commons: ALv2
>>  * Apache Thrift: ALv2
>>  * Apache Maven: ALv2
>>  * Apache Avro: ALv2
>>  * Apache Hadoop: ALv2
>>  * Google Guava: ALv2
>>
>> == Cryptography ==
>>
>> We do not expect Parquet to be a controlled export item due to the use of
>> encryption.
>>
>> == Required Resources ==
>>
>> === Mailing lists ===
>>
>>  * parquet-dev
>>  * parquet-user
>>
>> == Subversion Directory ==
>>
>> Git is the preferred source control system: git://git.apache.org/parquet
>>
>> == Issue Tracking ==
>>
>> JIRA: Parquet (PARQUET)
>>
>> == Initial Committers ==
>>
>>  * Aniket Mokashi
>>  * Brock Noland
>>  * Chris Aniszczyk <z...@twitter.com>
>>  * Dmitriy Ryaboy <dmit...@twitter.com>
>>  * Jake Farrell
>>  * Julien Le Dem <jul...@apache.org>
>>  * Lukas Nalezenec
>>  * Marcel Kornacker
>>  * Mickael Lacour
>>  * Nong Li
>>  * Remy Pecqueur
>>  * Tianshuo Deng
>>  * Tom White
>>
>> == Affiliations ==
>>
>>  * Aniket Mokashi - Twitter
>>  * Brock Noland - Cloudera
>>  * Chris Aniszczyk - Twitter
>>  * Dmitriy Ryaboy - Twitter
>>  * Jake Farrell
>>  * Julien Le Dem - Twitter
>>  * Lukas Nalezenec
>>  * Marcel Kornacker - Cloudera
>>  * Mickael Lacour - Criteo
>>  * Nong Li - Cloudera
>>  * Remy Pecqueur - Criteo
>>  * Tianshuo Deng - Twitter
>>  * Tom White - Cloudera
>>
>> == Sponsors ==
>>
>> === Champion ===
>>
>>  * Todd Lipcon
>>
>> === Nominated Mentors ===
>>
>>  * Tom White
>>  * Chris Mattmann
>>  * Jake Farrell
>>
>> === Sponsoring Entity ===
>>
>> The Apache Incubator
>>
>> --
>> Cheers,
>>
>> Chris Aniszczyk
>> http://aniszczyk.org
>> +1 512 961 6719
>>

