Re: [PROPOSAL] Parquet

Henry Saputra Sat, 17 May 2014 10:20:32 -0700

Chris, could you please address my concern about the user@ list?

- Henry


On Fri, May 16, 2014 at 4:43 PM, Chris Aniszczyk <caniszc...@gmail.com> wrote:
> SGTM Roman, thanks for volunteering!
>
> I'll start the vote on Sunday barring any issues.
>
>
> On Fri, May 16, 2014 at 11:56 AM, Roman Shaposhnik <r...@apache.org> wrote:
>
>> Hi!
>>
>> The proposal looks good to me and I am very much looking
>> forward to a voting thread.
>>
>> One small request, since I plan to spend a fair amount
>> of time on Parquet anyway, would you guys be ok
>> with adding me as an extra mentor so I can help
>> with that aspect of the project as well?
>>
>> Thanks,
>> Roman.
>>
>> P.S. Plus it has an added benefit of increasing diversity
>> of affiliations from the get go.
>>
>> On Mon, May 12, 2014 at 10:02 AM, Chris Aniszczyk <caniszc...@gmail.com> wrote:
>> > We would like to propose Parquet as an Apache Incubator project.
>> > https://wiki.apache.org/incubator/ParquetProposal
>> >
>> > Feel free to comment, we'll go for a vote in a week or two or whenever
>> > consensus has been reached on the proposal.
>> >
>> > I've posted the text of the proposal below:
>> >
>> > == Abstract ==
>> > Parquet is a columnar storage format for Hadoop.
>> >
>> > == Proposal ==
>> >
>> > We created Parquet to make the advantages of compressed, efficient
>> > columnar data representation available to any project in the Hadoop
>> > ecosystem, regardless of the choice of data processing framework, data
>> > model, or programming language.
>> >
>> > == Background ==
>> >
>> > Parquet is built from the ground up with complex nested data structures
>> > in mind, and uses the repetition/definition level approach to encoding
>> > such data structures, as popularized by Google Dremel (
>> > https://blog.twitter.com/2013/dremel-made-simple-with-parquet). We
>> > believe this approach is superior to simple flattening of nested name
>> > spaces.
>> >
>> > Parquet is built to support very efficient compression and encoding
>> > schemes. Parquet allows compression schemes to be specified on a
>> > per-column level, and is future-proofed to allow adding more encodings
>> > as they are invented and implemented. We separate the concepts of
>> > encoding and compression, allowing Parquet consumers to implement
>> > operators that work directly on encoded data without paying a
>> > decompression and decoding penalty when possible.
>> >
>> > == Rationale ==
>> >
>> > Parquet is built to be used by anyone. We believe that an efficient,
>> > well-implemented columnar storage substrate should be useful to all
>> > frameworks without the cost of extensive and difficult to set up
>> > dependencies.
>> >
>> > Furthermore, the rapid growth of the Parquet community is empowered by
>> > open source. We believe the Apache Software Foundation is a great fit as
>> > the long-term home for Parquet, as it provides an established process
>> > for community-driven development and decision making by consensus. This
>> > is exactly the model we want for future Parquet development.
>> >
>> > == Initial Goals ==
>> >
>> > * Move the existing codebase to Apache
>> > * Integrate with the Apache development process
>> > * Ensure all dependencies are compliant with Apache License version 2.0
>> > * Incremental development and releases per Apache guidelines
>> >
>> > == Current Status ==
>> >
>> > Parquet has undergone 2 major releases of the core format
>> > (https://github.com/Parquet/parquet-format/releases) and 22 releases of
>> > the supporting set of Java libraries
>> > (https://github.com/Parquet/parquet-mr/releases).
>> >
>> > The Parquet source is currently hosted at GitHub, which will seed the
>> > Apache git repository.
>> >
>> > === Meritocracy ===
>> >
>> > We plan to invest in supporting a meritocracy. We will discuss the
>> > requirements in an open forum. Several companies have already expressed
>> > interest in this project, and we intend to invite additional developers
>> > to participate. We will encourage and monitor community participation so
>> > that privileges can be extended to those that contribute.
>> >
>> > === Community ===
>> >
>> > There is a strong need for an advanced columnar storage format for Hadoop.
>> > Parquet is being used in production by many organizations (see
>> > https://github.com/Parquet/parquet-mr/blob/master/PoweredBy.md):
>> >
>> >  * Cloudera: https://twitter.com/HenryR/statuses/324222874011451392
>> >  * Criteo: https://twitter.com/julsimon/statuses/312114074911666177
>> >  * Salesforce: https://twitter.com/TwitterOSS/statuses/392734610116726784
>> >  * Stripe: https://twitter.com/avibryant/statuses/391339949250715648
>> >  * Twitter: https://twitter.com/J_/statuses/315844725611581441
>> >
>> > By bringing Parquet into Apache, we believe that the community will grow
>> > even bigger.
>> >
>> > === Core Developers ===
>> >
>> > Parquet was initially developed as a collaboration between Twitter,
>> > Cloudera and Criteo.
>> >
>> > See
>> > https://blog.twitter.com/2013/announcing-parquet-10-columnar-storage-for-hadoop
>> >
>> > === Alignment ===
>> >
>> > We believe that having Parquet at Apache will help further the growth of
>> > the big-data community, as it will encourage cooperation within the
>> > greater ecosystem of projects spawned by Apache Hadoop. The alignment is
>> > also beneficial to other Apache communities (such as Hadoop, Hive, Avro).
>> >
>> > == Known Risks ==
>> >
>> > === Orphaned Products ===
>> >
>> > The risk of the Parquet project being abandoned is minimal. There are
>> > many organizations using Parquet in production, including Twitter,
>> > Cloudera, Stripe, and Salesforce (
>> > http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/).
>> >
>> > === Inexperience with Open Source ===
>> >
>> > Parquet has existed as a healthy open source project for one year. During
>> > that time, we have curated an open-source community successfully,
>> > attracting over 40 contributors (see
>> > https://github.com/Parquet/parquet-mr/graphs/contributors) from a
>> > diverse group of companies.
>> > Several of the core contributors to the project are deeply familiar with
>> > OSS and Apache specifically: Julien Le Dem is the current PMC Chair for
>> > Apache Pig, and Dmitriy Ryaboy, Aniket Mokashi, and Jonathan Coveney are
>> > also Apache Pig committers with contributions to several other Apache
>> > projects. Todd Lipcon and Tom White are committers to Apache Hadoop and
>> > multiple other related projects. Brock Noland is a Hive committer.
>> >
>> > === Homogenous Developers ===
>> >
>> > The initial committers come from a number of companies and countries.
>> > Parquet has an active community of developers, and we are committed to
>> > recruiting additional committers based on their contributions to the
>> > project. The Java library component alone has contributions from 31
>> > individual GitHub accounts, 14 of which contributed over 1000 lines of
>> > code.
>> >
>> > === Reliance on Salaried Developers ===
>> >
>> > It is expected that Parquet development will occur on both salaried time
>> > and on volunteer time, after hours. The majority of initial committers
>> > are paid by their employers to contribute to this project. However, they
>> > are all passionate about the project, and we are confident that the
>> > project will continue even if no salaried developers contribute to the
>> > project. As evidence of this statement, we present the GitHub punchcard
>> > (see https://github.com/Parquet/parquet-mr/graphs/punch-card) showing
>> > that a lot of activity happens on weekends. We are committed to
>> > recruiting additional committers including non-salaried developers.
>> >
>> > === Relationships with Other Apache Products ===
>> >
>> > As mentioned in the Alignment section, Parquet is closely related to
>> > Hadoop, Pig, Avro, Thrift, YARN and Mesos in numerous ways. We look
>> > forward to collaborating with those communities, as well as other Apache
>> > communities (including Apache S4 which focuses on stateful low-latency
>> > processing).
>> >
>> > === An Excessive Fascination with the Apache Brand ===
>> >
>> > Parquet is an already healthy and well-known open source project. This
>> > proposal is not for the purpose of generating publicity. Rather, the
>> > primary benefits to joining Apache are those outlined in the Rationale
>> > section.
>> >
>> > == Documentation ==
>> >
>> > Documentation is currently located as README markdown files:
>> >
>> > * https://github.com/Parquet/parquet-format
>> > * https://github.com/Parquet/parquet-mr
>> >
>> > == Source and Intellectual Property Submission Plan ==
>> >
>> > The Parquet codebase is currently hosted on GitHub:
>> > https://github.com/Parquet.
>> >
>> > This is the exact codebase that we would migrate to the Apache Software
>> > Foundation.
>> >
>> > == External Dependencies ==
>> >
>> >  * JUnit: EPL
>> >  * Apache Commons: ALv2
>> >  * Apache Thrift: ALv2
>> >  * Apache Maven: ALv2
>> >  * Apache Avro: ALv2
>> >  * Apache Hadoop: ALv2
>> >  * Google Guava: ALv2
>> >
>> > == Cryptography ==
>> >
>> > We do not expect Parquet to be a controlled export item due to the use of
>> > encryption.
>> >
>> > == Required Resources ==
>> >
>> > === Mailing lists ===
>> >
>> >  * parquet-dev
>> >  * parquet-user
>> >
>> > == Subversion Directory ==
>> >
>> > Git is the preferred source control system: git://git.apache.org/parquet
>> >
>> > == Issue Tracking ==
>> >
>> > JIRA: Parquet (PARQUET)
>> >
>> > == Initial Committers ==
>> >
>> >  * Aniket Mokashi
>> >  * Brock Noland
>> >  * Chris Aniszczyk <z...@twitter.com>
>> >  * Dmitriy Ryaboy <dmit...@twitter.com>
>> >  * Jake Farrell
>> >  * Julien Le Dem <jul...@apache.org>
>> >  * Lukas Nalezenec
>> >  * Marcel Kornacker
>> >  * Mickael Lacour
>> >  * Nong Li
>> >  * Remy Pecqueur
>> >  * Tianshuo Deng
>> >  * Tom White
>> >
>> > == Affiliations ==
>> >
>> >  * Aniket Mokashi - Twitter
>> >  * Brock Noland - Cloudera
>> >  * Chris Aniszczyk - Twitter
>> >  * Dmitriy Ryaboy - Twitter
>> >  * Jake Farrell
>> >  * Julien Le Dem - Twitter
>> >  * Lukas Nalezenec
>> >  * Marcel Kornacker - Cloudera
>> >  * Mickael Lacour - Criteo
>> >  * Nong Li - Cloudera
>> >  * Remy Pecqueur - Criteo
>> >  * Tianshuo Deng - Twitter
>> >  * Tom White - Cloudera
>> >
>> > == Sponsors ==
>> >
>> > === Champion ===
>> >
>> >  * Todd Lipcon
>> >
>> > === Nominated Mentors ===
>> >
>> >  * Tom White
>> >  * Chris Mattmann
>> >  * Jake Farrell
>> >
>> > === Sponsoring Entity ===
>> >
>> > The Apache Incubator
>> >
>> > --
>> > Cheers,
>> >
>> > Chris Aniszczyk
>> > http://aniszczyk.org
>> > +1 512 961 6719
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>>
>>
>
>
> --
> Cheers,
>
> Chris Aniszczyk
> http://aniszczyk.org
> +1 512 961 6719

