Re: [VOTE] Accept the Iceberg project for incubation

Kevin A. McGrail Tue, 13 Nov 2018 10:07:32 -0800

+1 (binding)

On 11/13/2018 12:40 PM, Julian Hyde wrote:
> +1 (binding)
>
> Julian
>
>
>> On Nov 13, 2018, at 9:28 AM, Arthur Wiedmer <art...@apache.org> wrote:
>>
>> +1
>>
>> (Non-binding)
>>
>> Best,
>> Arthur
>>
>> On Tue, Nov 13, 2018, 09:24 Hugo Louro <hmclo...@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>>> On Nov 13, 2018, at 9:19 AM, Owen O'Malley <owen.omal...@gmail.com>
>>> wrote:
>>>> +1 (binding)
>>>>
>>>>> On Tue, Nov 13, 2018 at 12:12 PM Dave Fisher <dave2w...@comcast.net>
>>> wrote:
>>>>> +1 (binding)
>>>>>
>>>>>> On Nov 13, 2018, at 9:10 AM, Matt Sicker <boa...@gmail.com> wrote:
>>>>>>
>>>>>> +1 binding
>>>>>>
>>>>>>> On Tue, 13 Nov 2018 at 11:09, Ryan Blue <b...@apache.org> wrote:
>>>>>>>
>>>>>>> +1 (binding)
>>>>>>>
>>>>>>>> On Tue, Nov 13, 2018 at 9:06 AM Ryan Blue <b...@apache.org> wrote:
>>>>>>>>
>>>>>>>> The discuss thread seems to have reached consensus, so I propose
>>>>>>> accepting
>>>>>>>> the Iceberg project for incubation.
>>>>>>>>
>>>>>>>> The proposal is copied below and in the wiki:
>>>>>>>> https://wiki.apache.org/incubator/IcebergProposal
>>>>>>>>
>>>>>>>> Please vote on whether to accept Iceberg in the next 72 hours:
>>>>>>>>
>>>>>>>> [ ] +1, accept Iceberg for incubation
>>>>>>>> [ ] -1, reject the Iceberg proposal because . . .
>>>>>>>>
>>>>>>>> Thank you for reviewing the proposal and voting,
>>>>>>>>
>>>>>>>> rb
>>>>>>>> ------------------------------
>>>>>>>> Iceberg Proposal Abstract
>>>>>>>>
>>>>>>>> Iceberg is a table format for large, slow-moving tabular data.
>>>>>>>>
>>>>>>>> It is designed to improve on the de-facto standard table layout built
>>>>>>> into
>>>>>>>> Apache Hive, Presto, and Apache Spark.
>>>>>>>> Proposal
>>>>>>>>
>>>>>>>> The purpose of Iceberg is to provide SQL-like tables that are backed
>>> by
>>>>>>>> large sets of data files. Iceberg is similar to the Hive table
>>> layout,
>>>>>>> the
>>>>>>>> de-facto standard structure used to track files in a table, but
>>>>> provides
>>>>>>>> additional guarantees and performance optimizations:
>>>>>>>>
>>>>>>>> - Atomicity - Each change to the table will be complete or will
>>>>>>>> fail. “Do or do not. There is no try.”
>>>>>>>> - Snapshot isolation - Reads use one and only one snapshot of a
>>> table
>>>>>>>> at some time without holding a lock.
>>>>>>>> - Safe schema evolution - A table’s schema can change in
>>> well-defined
>>>>>>>> ways, without breaking older data files.
>>>>>>>> - Column projection - An engine may request a subset of the
>>> available
>>>>>>>> columns, including nested fields.
>>>>>>>> - Predicate pushdown - An engine can push filters into read planning
>>>>>>>> to improve performance using partition data and file-level statistics.
>>>>>>>>
>>>>>>>> Iceberg does NOT define a new file format. All data is stored in
>>> Apache
>>>>>>>> Avro, Apache ORC, or Apache Parquet files.
>>>>>>>>
>>>>>>>> Additionally, Iceberg is designed to work well when data files are
>>>>> stored
>>>>>>>> in cloud blob stores, even when those systems provide weaker
>>> guarantees
>>>>>>>> than a file system, including:
>>>>>>>>
>>>>>>>> - Eventual consistency in the namespace
>>>>>>>> - High latency for directory listings
>>>>>>>> - No renames of objects
>>>>>>>> - No folder hierarchy
>>>>>>>>
>>>>>>>> Rationale
>>>>>>>>
>>>>>>>> Initial benchmarks show dramatic improvements in query planning. For
>>>>>>>> example, Netflix’s Atlas use case stores time-series metrics from
>>>>>>>> Netflix runtime systems, with 1 month of data stored across 2.7
>>>>>>>> million files in 2,688 partitions:
>>>>>>>>
>>>>>>>> - Hive table using Parquet:
>>>>>>>>    - 400k+ splits, not combined
>>>>>>>>    - Explain query: 9.6 minutes wall time (planning only)
>>>>>>>> - Iceberg table with partition filtering:
>>>>>>>>    - 15,218 splits, combined
>>>>>>>>    - Planning: 10 seconds
>>>>>>>>    - Query wall time: 13 minutes
>>>>>>>> - Iceberg table with partition and min/max filtering:
>>>>>>>>    - 412 splits
>>>>>>>>    - Planning: 25 seconds
>>>>>>>>    - Query wall time: 42 seconds
>>>>>>>>
>>>>>>>> These performance gains combined with the cross-engine compatibility
>>>>> are
>>>>>>> a
>>>>>>>> very compelling story.
>>>>>>>> Initial Goals
>>>>>>>>
>>>>>>>> The initial goal will be to move the existing codebase to Apache and
>>>>>>>> integrate with the Apache development process and infrastructure. A
>>>>>>> primary
>>>>>>>> goal of incubation will be to grow and diversify the Iceberg
>>> community.
>>>>>>> We
>>>>>>>> are well aware that the project community is largely comprised of
>>>>>>>> individuals from a single company. We aim to change that during
>>>>>>> incubation.
>>>>>>>> Current Status
>>>>>>>>
>>>>>>>> As previously mentioned, Iceberg is under active development at
>>>>> Netflix,
>>>>>>>> and is being used in processing large volumes of data in Amazon EC2.
>>>>>>>>
>>>>>>>> Iceberg license documentation is already based on Apache guidelines
>>> for
>>>>>>>> LICENSE and NOTICE content.
>>>>>>>> Meritocracy
>>>>>>>>
>>>>>>>> We value meritocracy and we understand that it is the basis for an
>>> open
>>>>>>>> community that encourages multiple companies and individuals to
>>>>>>> contribute
>>>>>>>> and be invested in the project’s future. We will encourage and
>>> monitor
>>>>>>>> participation and make sure to extend privileges and responsibilities
>>>>> to
>>>>>>>> all contributors.
>>>>>>>> Community
>>>>>>>>
>>>>>>>> Iceberg is currently being used by developers at Netflix and a
>>> growing
>>>>>>>> number of users are actively using it in production environments.
>>>>> Iceberg
>>>>>>>> has received contributions from developers working at Hortonworks,
>>>>>>> WeWork,
>>>>>>>> and Palantir. By bringing Iceberg to Apache we aim to assure current
>>>>> and
>>>>>>>> future contributors that the Iceberg community is meritocratic and
>>>>> open,
>>>>>>> in
>>>>>>>> order to broaden and diversify the user and developer community.
>>>>>>>> Core Developers
>>>>>>>>
>>>>>>>> Iceberg was initially developed at Netflix and is under active
>>>>>>>> development. We believe Iceberg will be of interest to a broad range of
>>>>>>>> users and developers and that incubating the project at the ASF will
>>>>> help
>>>>>>>> us build a diverse, sustainable community.
>>>>>>>> Alignment
>>>>>>>>
>>>>>>>> Iceberg utilizes other Apache projects such as Avro, Hadoop, Hive,
>>> ORC,
>>>>>>>> Parquet, Pig, and Spark. We anticipate integration with additional
>>>>> Apache
>>>>>>>> projects as the Iceberg community and interest in the project grows.
>>>>>>>> Known Risks
>>>>>>>> Orphaned Products
>>>>>>>>
>>>>>>>> Netflix is committed to the future development of Iceberg and
>>>>> understands
>>>>>>>> that graduation to a TLP, while preferable, is not the only positive
>>>>>>>> outcome of incubation.
>>>>>>>>
>>>>>>>> Should the Iceberg project be accepted by the Incubator, the
>>>>> prospective
>>>>>>>> PPMC would be willing to agree to a target incubation period of 2
>>> years
>>>>>>> or
>>>>>>>> less, knowing that every Incubator project incurs a certain cost in
>>>>> terms
>>>>>>>> of ASF infrastructure and volunteer time.
>>>>>>>> Inexperience with Open Source
>>>>>>>>
>>>>>>>> Three of the initial committers are Apache members and Incubator PMC
>>>>>>>> members. They will work with the other community members to teach
>>> them
>>>>>>> the
>>>>>>>> Apache Way.
>>>>>>>> Homogenous Developers
>>>>>>>>
>>>>>>>> The majority of the committers work at Netflix, though we are
>>> committed
>>>>>>> to
>>>>>>>> recruiting and developing additional committers from a wide spectrum
>>> of
>>>>>>>> industries and backgrounds.
>>>>>>>> Reliance on Salaried Developers
>>>>>>>>
>>>>>>>> It is expected that Iceberg development will occur on both salaried
>>>>> time
>>>>>>>> and on volunteer time, after hours. Most of the initial committers
>>> are
>>>>>>> paid
>>>>>>>> by Netflix to contribute to this project. However, they are all
>>>>>>> passionate
>>>>>>>> about the project, and we are both confident and hopeful that the
>>>>> project
>>>>>>>> will continue even if no salaried developers contribute to the
>>> project.
>>>>>>>> Relationships with Other Apache Products
>>>>>>>>
>>>>>>>> As mentioned in the Alignment section, Iceberg utilizes a number of
>>>>>>>> existing Apache projects (Avro, Hadoop, Hive, ORC, Parquet, Pig, &
>>>>>>> Spark),
>>>>>>>> and we expect that list to expand as the community grows and
>>>>> diversifies.
>>>>>>>> Any Apache project in the big data space that needs to store or
>>> process
>>>>>>>> tabular data would be potentially relevant.
>>>>>>>> An Excessive Fascination with the Apache Brand
>>>>>>>>
>>>>>>>> We are applying to the Incubator process because we think it is the
>>>>> next
>>>>>>>> logical step for the Iceberg project after open-sourcing the code.
>>> This
>>>>>>>> proposal is not for the purpose of generating publicity. Rather, we
>>>>> want
>>>>>>> to
>>>>>>>> make sure to create a very inclusive and meritocratic community,
>>>>> outside
>>>>>>>> the umbrella of a single company. Netflix has a long history of
>>>>>>>> contributing to Apache projects and the Iceberg developers and
>>>>>>> contributors
>>>>>>>> understand the implication of making it an Apache project.
>>>>>>>> Required Resources
>>>>>>>> Mailing lists
>>>>>>>>
>>>>>>>> - d...@iceberg.incubator.apache.org
>>>>>>>> - comm...@iceberg.incubator.apache.org
>>>>>>>> - priv...@iceberg.incubator.apache.org
>>>>>>>>
>>>>>>>> The podling may also create a user mailing list, if needed.
>>>>>>>> Source Control and Issue Tracking
>>>>>>>>
>>>>>>>> The Iceberg podling would use Apache’s GitBox integration to sync
>>>>>>>> between GitHub and Apache infrastructure. The podling would use GitHub
>>>>>>>> issues and pull requests for community engagement.
>>>>>>>> Current Resources
>>>>>>>>
>>>>>>>> - Initial source: https://github.com/Netflix/iceberg
>>>>>>>> - Java documentation: https://netflix.github.io/iceberg/current/javadoc/index.html?com/netflix/iceberg/package-summary.html
>>>>>>>> - Table specification: https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit
>>>>>>>> Source and Intellectual Property Submission Plan
>>>>>>>>
>>>>>>>> The Iceberg source code in GitHub is currently licensed under Apache
>>>>>>>> License v2.0 and the copyright is assigned to Netflix. If Iceberg
>>>>> becomes
>>>>>>>> an Incubator project at the ASF, Netflix will transfer the source
>>> code
>>>>>>> and
>>>>>>>> trademark ownership to the Apache Software Foundation via a Software
>>>>>>> Grant
>>>>>>>> Agreement.
>>>>>>>> External Dependencies
>>>>>>>>
>>>>>>>> External dependencies licensed under Apache License 2.0
>>>>>>>>
>>>>>>>> - Guava https://github.com/google/guava
>>>>>>>> - Jackson https://github.com/FasterXML/jackson-core
>>>>>>>> - Joda-Time http://www.joda.org/joda-time/
>>>>>>>>
>>>>>>>> External dependencies licensed under the MIT License
>>>>>>>>
>>>>>>>> - SLF4J https://www.slf4j.org/
>>>>>>>> - Mockito https://github.com/mockito/mockito
>>>>>>>>
>>>>>>>> ASF Projects
>>>>>>>>
>>>>>>>> - Apache Avro
>>>>>>>> - Apache Hadoop
>>>>>>>> - Apache Hive
>>>>>>>> - Apache ORC
>>>>>>>> - Apache Parquet
>>>>>>>> - Apache Pig
>>>>>>>> - Apache Spark
>>>>>>>>
>>>>>>>> Cryptography
>>>>>>>>
>>>>>>>> We do not expect Iceberg to be a controlled export item due to the
>>> use
>>>>> of
>>>>>>>> encryption.
>>>>>>>> Initial Committers and Affiliations
>>>>>>>>
>>>>>>>> - Ryan Blue b...@apache.org (Netflix)
>>>>>>>> - Parth Brahmbhatt pa...@apache.org (Netflix)
>>>>>>>> - Julien Le Dem jul...@apache.org (WeWork)
>>>>>>>> - Owen O’Malley omal...@apache.org (Hortonworks)
>>>>>>>> - Daniel Weeks dwe...@apache.org (Netflix)
>>>>>>>>
>>>>>>>> Sponsors and Nominated Mentors
>>>>>>>>
>>>>>>>> - Champion and mentor: Owen O’Malley omal...@apache.org
>>>>>>>> - Mentor: Ryan Blue b...@apache.org
>>>>>>>> - Mentor: Julien Le Dem jul...@apache.org
>>>>>>>>
>>>>>>>> Sponsoring Entity
>>>>>>>>
>>>>>>>> The Apache Incubator
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Matt Sicker <boa...@gmail.com>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>>>> For additional commands, e-mail: general-h...@incubator.apache.org
>>>>>
>>>>>
>>>
>>>
>
>


-- 
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


