Thank you for creating the draft proposal, Andrew. I have reviewed this and
I think it looks great.

Andy.

On Wed, Dec 27, 2023 at 3:19 PM Andrew Lamb <al...@influxdata.com> wrote:

> I have created a draft proposal [1] to break DataFusion out to its own top
> level project. Please provide your feedback and suggestions.
>
> The proposal is included at the end of this email and in this Google Doc:
>
> https://docs.google.com/document/d/11WTNYS8KWScOt3ySTX39WVS6krPhUvHsuJRY9PZQx4g
> .
>
> Feel free to respond to this email or comment / make suggestions directly
> on the document.
>
> I would be especially grateful if people could review and comment on the
> proposed list of committers and PMC members.
>
> I hope everyone is not getting sick of hearing about this, but I think in
> this case it is better to over communicate than risk surprises.
>
> Andrew
>
> [1] https://github.com/apache/arrow-datafusion/issues/8491
>
>
> ----------
>
> DataFusion Top Level Project Proposal
> Dec 27, 2023
>
> [Editor’s note: This document is based on the proposal to the ASF board to
> create the Arrow project. Once it has been reviewed, we plan to send it to
> the ASF board sometime in January or February 2024 for their consideration.]
>
> To: The ASF (bo...@apache.org)
>
> Summary:
>
> We propose creating a new top level project, Apache DataFusion, from an
> existing sub project of Apache Arrow to facilitate additional community and
> project growth.
>
> ----
> Apache DataFusion for Apache Top Level Project
>
> Abstract
>
> Apache Arrow DataFusion [1] is a very fast, extensible query engine for
> building high-quality data-centric systems in Rust, using the Apache Arrow
> in-memory format. DataFusion offers SQL and Dataframe APIs, excellent
> performance, built-in support for CSV, Parquet, JSON, and Avro, extensive
> customization, and a great community.
>
> [1] https://arrow.apache.org/datafusion/
>
>
> Proposal
>
> We propose creating a new top level ASF project, Apache DataFusion,
> governed initially by a subset of the Arrow project’s PMC and committers.
> The project’s code is in four existing git repositories, currently governed
> by Apache Arrow, which would transfer to the new top level project.
>
> Background
>
> When DataFusion was initially donated to the Arrow project, it did not have
> a strong enough community to stand on its own. It has since grown
> significantly, benefiting immensely from being part of Arrow and from the
> nurturing of the Apache Way. It now has a community that is strong enough
> to stand on its own and that would benefit from focused governance attention.
>
> The community has discussed this idea publicly for more than 6 months
> https://github.com/apache/arrow-datafusion/discussions/6475 and briefly on
> the Arrow PMC mailing list
> https://lists.apache.org/thread/thv2jdm6640l6gm88hy8jhk5prjww0cs. As of the
> time of this writing both had exclusively positive reactions.
>
> Several current members of the Arrow PMC are active contributors to
> DataFusion, understand and believe deeply in the Apache Way, and play
> active governance roles in the Arrow project as PMC members and PMC chairs,
> guiding the community and releasing software versions. With this existing
> governance experience and structure, the new top level project will be able
> to function well immediately and independently.
>
> Overview of DataFusion
>
> Current Status
>
> Meritocracy
>
> DataFusion has been developed as part of Apache Arrow and thus has been
> operating as a meritocracy. Many of the developers of DataFusion are Arrow
> PMC members or committers. The DataFusion project plans to continue adding
> new PMC members and committers as the project matures and grows.
>
> Community
>
> The DataFusion development team seeks to foster the development and user
> communities. We hope that becoming a separate project will help both the
> Arrow and DataFusion communities by making each more focused. Focused
> governance will make it easier to grow the community of committers and PMC
> members and make the organization clearer to others.
>
> Alignment
>
> The ASF is a natural host for DataFusion given that it is already the home
> of Arrow, Parquet, and other related distributed, storage, and query
> execution systems.
>
> Project Leadership
>
> Proposed Initial PMC
>
> We propose the following people as the initial DataFusion PMC members. This
> is a subset of the existing Arrow PMC members who contribute to DataFusion
> https://people.apache.org/phonebook.html?unix=arrow
>
> Andy Grove (agrove): Arrow PMC Chair
> Andrew Lamb (alamb): Arrow PMC, past Arrow PMC Chair
> Daniël Heres (dheres): Arrow PMC
> Jie Wen (jakevin): Arrow PMC, Doris Committer
> Kun Liu (liukun): Arrow PMC, IoTDB PMC, TSFile PMC
> Liang-Chi Hsieh (viirya): Arrow PMC, Spark PMC
> Qingping Hou (houqp): Arrow PMC, Doris Committer
> Will Jones (wjones127): Arrow PMC
>
> We’d like to propose Andrew Lamb as the initial Chair (and thus ASF VP)
> for the DataFusion project.
>
> Affiliations
>
> Andy Grove (agrove): NVidia
> Andrew Lamb (alamb): InfluxData
> Daniël Heres (dheres): Coralogix
> Jie Wen (jakevin): SelectDB
> Kun Liu (liukun): Ebay
> Liang-Chi Hsieh (viirya): Apple
> Qingping Hou (houqp): Scribd
> Will Jones (wjones127): VoltronData
>
> Proposed Initial Committers
>
> In addition to the PMC, we propose the following people as the initial
> DataFusion committers. This is a subset of the existing Arrow committers
> who contribute to DataFusion
> https://people.apache.org/phonebook.html?unix=arrow
>
> akurmustafa Mustafa Akur (Synnada)
> avantgardner Brent Gardner (Coralogix)
> comphead Oleks V. (Unaffiliated)
> jiayuliu Liu Jiayu (Airbnb)
> mete Metehan Yildirim (Synnada)
> mingmwang Wang Mingming (Ebay)
> mneumann Marco Neumann (InfluxData)
> nju_yaho Zhong Yanghong (Ebay)
> ozankabak Mehmet Ozan Kabak (Synnada)
> paddyhoran Paddy Horan (Assured Allies)
> rdettai Rémi Dettai (Cloudfuse)
> sunchao Sun Chao (Apple)
> thinkharderdev Daniel Harris (Coralogix)
> tustvold Raphael Taylor-Davies (InfluxData)
> viirya L. C. Hsieh (Apple)
> wayne Ruihang Xia (Greptime)
> xudong963 Xudong Wang (ByteDance)
> yjshen Yijie Shen (Space and Time)
>
>
> Risk Assessments
>
> Naming / Trademarks
>
> As a sub-project of Arrow, the DataFusion name has been used for over 4
> years without any known issues. A podling name search has thus far not
> turned up any concerns:
> https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-219
>
> Legal / IP Clearance
>
> All DataFusion code has either been donated to the Arrow project with
> appropriate IP clearance or has been developed directly under ASF
> processes and procedures. Thus, creating a new top level project poses no
> new Legal or IP risks.
>
> Code Extraction
>
> The relevant code is already in 4 separate repositories:
> https://github.com/apache/arrow-datafusion/
> https://github.com/apache/arrow-datafusion-python
> https://github.com/apache/arrow-ballista
> https://github.com/apache/arrow-ballista-python
>
> We foresee no issues with code extraction and propose that these
> repositories be renamed, respectively, to reflect the new top level project:
> https://github.com/apache/datafusion/
> https://github.com/apache/datafusion-python
> https://github.com/apache/datafusion-ballista
> https://github.com/apache/datafusion-ballista-python
>
> Note: https://github.com/apache/arrow-rs, the Rust implementation of
> Arrow, would remain part of the Arrow project.
>
> Orphaned Products
>
> DataFusion is known to be used in many open source and commercial projects
> (https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users),
> has had multiple commits daily for several years, and its adoption and
> number of contributors appear to be growing.
>
> Inexperience with Open Source
>
> The proposed PMC has extensive experience with Apache Arrow and other
> Apache projects, and includes PMC members and PMC chairs. The DataFusion
> PMC and more experienced committers will continue to coach new community
> members who may be less familiar with the Apache Way.
>
> Homogeneous Developers
>
> The 8 proposed PMC members are from 8 different employers and the proposed
> committers are similarly distributed across affiliations. No specific
> entity employs more than 3 total proposed developers.
>
> Reliance on Salaried Developers
>
> A substantial amount of work on DataFusion has been done by salaried
> developers, but it also has a long tradition of attracting contributions
> from students and hobbyists, and we plan no changes in contribution structure.
>
> Relationships with Other Apache Products
>
> DataFusion will obviously have a strong relationship with the Arrow project
> given the overlap in people. We don’t foresee close collaboration with
> other projects at this time.
>
> Cryptography
>
> DataFusion does not directly support encryption and there are no near-term
> plans to add support for encryption. Users who need this functionality can
> use the extension APIs.
>
> Required Resources
>
> Mailing Lists
>
> - private@datafusion for private PMC discussions (with moderated
> subscriptions)
> - dev@datafusion
> - commits@datafusion
>
> Version Control
>
> We propose to continue to use git for source control and GitHub for hosting
> and testing resources.
>
> Issue Tracking
>
> DataFusion would continue to use GitHub for its issue tracking and
> communications.
>
> Other Resources
>
> The existing repositories already make use of existing Apache
> infrastructure, and we expect no change in the initial resource usage. As
> the project continues to grow, we expect continued infrastructure demand
> growth.
>
>
> FAQ: Has a sub project been promoted to a top level project before?
>
> Yes, and it appears to happen commonly. The Arrow project itself was
> created as a top level project from work that started in Apache Drill, and
> there are many sub projects of Hadoop that spun out as their own top level
> projects such as Mahout, Avro and HBase:
>
> https://news.apache.org/foundation/entry/the_apache_software_foundation_announces4
>
>
> Related material:
> Name search request / research for DataFusion:
> https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-219
> Discussion about which repositories on the arrow mailing list:
> https://lists.apache.org/thread/ob3n0d9ky0bgrryl3xn39w9k566bq00q
> Discussion about initial PMC on the arrow mailing list:
> https://lists.apache.org/thread/pymrzcdw4qdptvby85f69rg3pcckl15b
> Discussion about creating a new DataFusion top level project:
> https://github.com/apache/arrow-datafusion/discussions/6475
> Discussion about graduating on incubator list:
> https://lists.apache.org/thread/r4n73pmms1lv0jbohyx1o1z13d615t99
> Original Proposal for the Arrow project:
> https://lists.apache.org/thread/x2qzdwglm8pkqp9gv03bbgw17khl7pq3
>
