Thank you for creating the draft proposal, Andrew. I have reviewed this and I think it looks great.
Andy.

On Wed, Dec 27, 2023 at 3:19 PM Andrew Lamb <al...@influxdata.com> wrote:

> I have created a draft proposal [1] to break DataFusion out to its own top
> level project. Please provide your feedback and suggestions.
>
> The proposal is included at the end of this email and in this Google Doc:
>
> https://docs.google.com/document/d/11WTNYS8KWScOt3ySTX39WVS6krPhUvHsuJRY9PZQx4g
>
> Feel free to respond to this email or comment / make suggestions directly
> on the document.
>
> I would be especially grateful if people could review and comment on the
> proposed list of committers and PMC members.
>
> I hope everyone is not getting sick of hearing about this, but I think in
> this case it is better to over-communicate than risk surprises.
>
> Andrew
>
> [1] https://github.com/apache/arrow-datafusion/issues/8491
>
> ----------
>
> DataFusion Top Level Project Proposal
> Dec 27, 2023
>
> [Editor’s note: This document is based on the proposal to the ASF board to
> create the Arrow project. Once it has been reviewed, we plan to send it to
> the ASF board sometime in January or February 2024 for their consideration]
>
> To: The ASF (bo...@apache.org)
>
> Summary:
>
> We propose creating a new top level project, Apache DataFusion, from an
> existing sub project of Apache Arrow to facilitate additional community and
> project growth.
>
> ----
> Apache DataFusion for Apache Top Level Project
>
> Abstract
>
> Apache Arrow DataFusion[1] is a very fast, extensible query engine for
> building high-quality data-centric systems in Rust, using the Apache Arrow
> in-memory format. DataFusion offers SQL and DataFrame APIs, excellent
> performance, built-in support for CSV, Parquet, JSON, and Avro, extensive
> customization, and a great community.
>
> [1] https://arrow.apache.org/datafusion/
>
> Proposal
>
> We propose creating a new top level ASF project, Apache DataFusion,
> governed initially by a subset of the Arrow project’s PMC and committers.
> The project’s code is in four existing Git repositories, currently governed
> by Apache Arrow, which would transfer to the new top level project.
>
> Background
>
> When DataFusion was initially donated to the Arrow project, it did not have
> a strong enough community to stand on its own. It has since grown
> significantly and benefited immensely from being part of Arrow and from the
> nurturing of the Apache Way. It now has a community strong enough to stand
> on its own, one that would benefit from focused governance attention.
>
> The community has discussed this idea publicly for more than 6 months
> https://github.com/apache/arrow-datafusion/discussions/6475 and briefly on
> the Arrow PMC mailing list
> https://lists.apache.org/thread/thv2jdm6640l6gm88hy8jhk5prjww0cs. As of
> this writing, both discussions had received exclusively positive reactions.
>
> Several current members of the Arrow PMC are active contributors to
> DataFusion, understand and believe deeply in the Apache Way, and play
> active governance roles in the Arrow project as PMC members and PMC chairs,
> guiding the community and releasing software versions. With this existing
> governance experience and structure, the new top level project will be able
> to function well immediately and independently.
>
> Overview of DataFusion
>
> Current Status
>
> Meritocracy
>
> DataFusion has been developed as part of Apache Arrow and thus has been
> operating as a meritocracy. Many of the developers of DataFusion are Arrow
> PMC members or committers. The DataFusion project plans to continue adding
> new PMC members and committers as the project matures and grows.
>
> Community
>
> The DataFusion development team seeks to foster the development and user
> communities. We hope that becoming a separate project will help both the
> Arrow and DataFusion communities by letting each be more focused.
> Focused governance will make it easier to grow the community of committers
> and PMC members and make the organization clearer to others.
>
> Alignment
>
> The ASF is a natural host for DataFusion given that it is already the home
> of Arrow, Parquet, and other related distributed, storage, and query
> execution systems.
>
> Project Leadership
>
> Proposed Initial PMC
>
> We propose the following people as the initial DataFusion PMC members. This
> is a subset of the existing Arrow PMC members who contribute to DataFusion
> https://people.apache.org/phonebook.html?unix=arrow
>
> Andy Grove (agrove): Arrow PMC Chair
> Andrew Lamb (alamb): Arrow PMC, past Arrow PMC Chair
> Daniël Heres (dheres): Arrow PMC
> Jie Wen (jakevin): Arrow PMC, Doris Committer
> Kun Liu (liukun): Arrow PMC, IoTDB PMC, TSFile PMC
> Liang-Chi Hsieh (viirya): Arrow PMC, Spark PMC
> Qingping Hou (houqp): Arrow PMC, Doris Committer
> Will Jones (wjones127): Arrow PMC
>
> We’d like to propose Andrew Lamb as the initial Chair (and thus ASF VP)
> for the DataFusion project.
>
> Affiliations
>
> Andy Grove (agrove): NVIDIA
> Andrew Lamb (alamb): InfluxData
> Daniël Heres (dheres): Coralogix
> Jie Wen (jakevin): SelectDB
> Kun Liu (liukun): eBay
> Liang-Chi Hsieh (viirya): Apple
> Qingping Hou (houqp): Scribd
> Will Jones (wjones127): Voltron Data
>
> Proposed Initial Committers
>
> In addition to the PMC, we propose the following people as the initial
> DataFusion committers. This is a subset of the existing Arrow committers
> who contribute to DataFusion
> https://people.apache.org/phonebook.html?unix=arrow
>
> akurmustafa Mustafa Akur (Synnada)
> avantgardner Brent Gardner (Coralogix)
> comphead Oleks V. (Unaffiliated)
> jiayuliu Liu Jiayu (Airbnb)
> mete Metehan Yildirim (Synnada)
> mingmwang Wang Mingming (eBay)
> mneumann Marco Neumann (InfluxData)
> nju_yaho Zhong Yanghong (eBay)
> ozankabak Mehmet Ozan Kabak (Synnada)
> paddyhoran Paddy Horan (Assured Allies)
> rdettai Rémi Dettai (Cloudfuse)
> sunchao Sun Chao (Apple)
> thinkharderdev Daniel Harris (Coralogix)
> tustvold Raphael Taylor-Davies (InfluxData)
> viirya L. C. Hsieh (Apple)
> wayne Ruihang Xia (Greptime)
> xudong963 Xudong Wang (ByteDance)
> yjshen Yijie Shen (Space and Time)
>
> Risk Assessments
>
> Naming / Trademarks
>
> As a sub-project of Arrow, the DataFusion name has been used for over 4
> years without any known issues. A podling name search has thus far not
> turned up any concerns:
> https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-219
>
> Legal / IP Clearance
>
> All DataFusion code has either been donated to the Arrow project with
> appropriate IP clearance or has been developed directly under ASF
> processes and procedures. Thus, creating a new top level project poses no
> new legal or IP risks.
>
> Code Extraction
>
> The relevant code is already in 4 separate repositories:
> https://github.com/apache/arrow-datafusion/
> https://github.com/apache/arrow-datafusion-python
> https://github.com/apache/arrow-ballista
> https://github.com/apache/arrow-ballista-python
>
> We foresee no issues with code extraction and propose that these
> repositories be renamed, respectively, to reflect the new top level project:
> https://github.com/apache/datafusion/
> https://github.com/apache/datafusion-python
> https://github.com/apache/datafusion-ballista
> https://github.com/apache/datafusion-ballista-python
>
> Note: https://github.com/apache/arrow-rs, the Rust implementation of
> Arrow, would remain part of the Arrow project.
>
> Orphaned Products
>
> DataFusion is known to be used in many open source and commercial projects
> (https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users),
> has had multiple commits daily for several years, and its adoption and
> number of contributors appear to be growing.
>
> Inexperience with Open Source
>
> The proposed PMC has extensive experience with Apache Arrow and other
> Apache projects, and includes PMC members and PMC chairs. The DataFusion
> PMC and more experienced committers will continue to coach new community
> members who may be less familiar with the Apache Way.
>
> Homogeneous Developers
>
> The 8 proposed PMC members are from 8 different employers and the proposed
> committers are similarly distributed across affiliations. No specific
> entity employs more than 3 total proposed developers.
>
> Reliance on Salaried Developers
>
> A substantial amount of work on DataFusion has been done by salaried
> developers, but the project also has a long tradition of attracting
> contributions from students and hobbyists, and we plan no changes in
> contribution structure.
>
> Relationships with Other Apache Products
>
> DataFusion will obviously have a strong relationship with the Arrow project
> given the overlap in people. We don’t foresee close collaboration with
> other projects at this time.
>
> Cryptography
>
> DataFusion does not directly support encryption and there are no near-term
> plans to add support for encryption. Users who need this functionality can
> use the extension APIs.
>
> Required Resources
>
> Mailing Lists
>
> - private@datafusion for private PMC discussions (with moderated
>   subscriptions)
> - dev@datafusion
> - commits@datafusion
>
> Version Control
>
> We propose to continue to use Git for source control and GitHub for hosting
> and testing resources.
>
> Issue Tracking
>
> DataFusion would continue to use GitHub for its issue tracking and
> communications.
>
> Other Resources
>
> The existing repositories already make use of existing Apache
> infrastructure, and we expect no change in the initial resource usage. As
> the project continues to grow, we expect infrastructure demand to grow
> with it.
>
> FAQ: Has a sub project been promoted to a top level project before?
>
> Yes, and it appears to happen commonly. The Arrow project itself was
> created as a top level project from work that started in Apache Drill, and
> there are many sub projects of Hadoop that spun out as their own top level
> projects, such as Mahout, Avro, and HBase:
>
> https://news.apache.org/foundation/entry/the_apache_software_foundation_announces4
>
> Related material:
> Name search request / research for DataFusion:
> https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-219
> Discussion on the Arrow mailing list about which repositories to transfer:
> https://lists.apache.org/thread/ob3n0d9ky0bgrryl3xn39w9k566bq00q
> Discussion about the initial PMC on the Arrow mailing list:
> https://lists.apache.org/thread/pymrzcdw4qdptvby85f69rg3pcckl15b
> Discussion about creating a new DataFusion top level project:
> https://github.com/apache/arrow-datafusion/discussions/6475
> Discussion about graduating on the Incubator list:
> https://lists.apache.org/thread/r4n73pmms1lv0jbohyx1o1z13d615t99
> Original proposal for the Arrow project:
> https://lists.apache.org/thread/x2qzdwglm8pkqp9gv03bbgw17khl7pq3