I have created a draft proposal [1] to break DataFusion out to its own top level project. Please provide your feedback and suggestions.
The proposal is included at the end of this email and in this Google Doc: https://docs.google.com/document/d/11WTNYS8KWScOt3ySTX39WVS6krPhUvHsuJRY9PZQx4g . Feel free to respond to this email or to comment or make suggestions directly on the document.

I would be especially grateful if people could review and comment on the proposed list of committers and PMC members.

I hope everyone is not getting sick of hearing about this, but I think in this case it is better to over communicate than risk surprises.

Andrew

[1] https://github.com/apache/arrow-datafusion/issues/8491

----------

DataFusion Top Level Project Proposal

Dec 27, 2023

[Editor’s note: This document is based on the proposal to the ASF board to create the Arrow project. Once it has been reviewed, we plan to send it to the ASF board sometime in January or February 2024 for their consideration]

To: The ASF (bo...@apache.org)

Summary: We propose creating a new top level project, Apache DataFusion, from an existing sub project of Apache Arrow to facilitate additional community and project growth.

----

Apache DataFusion for Apache Top Level Project

Abstract

Apache Arrow DataFusion[1] is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format. DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.

[1] https://arrow.apache.org/datafusion/

Proposal

We propose creating a new top level ASF project, Apache DataFusion, governed initially by a subset of the Arrow project’s PMC and committers. The project’s code is in four existing git repositories, currently governed by Apache Arrow, which would transfer to the new top level project.

Background

When DataFusion was initially donated to the Arrow project, it did not have a strong enough community to stand on its own.
It has since grown significantly, has benefited immensely from being part of Arrow and from the nurturing of the Apache Way, and now has a community that is strong enough to stand on its own and that would benefit from focused governance attention.

The community has discussed this idea publicly for more than 6 months https://github.com/apache/arrow-datafusion/discussions/6475 and briefly on the Arrow PMC mailing list https://lists.apache.org/thread/thv2jdm6640l6gm88hy8jhk5prjww0cs. As of the time of this writing both had exclusively positive reactions.

Several current members of the Arrow PMC are active contributors to DataFusion, understand and believe deeply in the Apache Way, and play active governance roles in the Arrow project as PMC members and PMC chairs, guiding the community and releasing software versions. With this existing governance experience and structure, the new top level project will be able to function well immediately and independently.

Overview of DataFusion

Current Status

Meritocracy

DataFusion has been developed as part of Apache Arrow and thus has been operating as a meritocracy. Many of the developers of DataFusion are Arrow PMC members or committers. The DataFusion project plans to continue adding new PMC members and committers as the project matures and grows.

Community

The DataFusion development team seeks to foster the development and user communities. We hope that becoming a separate project will help both the Arrow and DataFusion communities by making each more focused. Focused governance will make it easier to grow the community of committers and PMC members and make the organization clearer to others.

Alignment

The ASF is a natural host for DataFusion given that it is already the home of Arrow, Parquet, and other related distributed, storage, and query execution systems.

Project Leadership

Proposed Initial PMC

We propose the following people as the initial DataFusion PMC members.
This is a subset of the existing Arrow PMC members who contribute to DataFusion https://people.apache.org/phonebook.html?unix=arrow

Andy Grove (agrove): Arrow PMC Chair
Andrew Lamb (alamb): Arrow PMC, past Arrow PMC Chair
Daniël Heres (dheres): Arrow PMC
Jie Wen (jakevin): Arrow PMC, Doris Committer
Kun Liu (liukun): Arrow PMC, IoTDB PMC, TSFile PMC
Liang-Chi Hsieh (viirya): Arrow PMC, Spark PMC
Qingping Hou (houqp): Arrow PMC, Doris Committer
Will Jones (wjones127): Arrow PMC

We’d like to propose Andrew Lamb as the initial Chair (and thus ASF VP) for the DataFusion project.

Affiliations

Andy Grove (agrove): NVidia
Andrew Lamb (alamb): InfluxData
Daniël Heres (dheres): Coralogix
Jie Wen (jakevin): SelectDB
Kun Liu (liukun): Ebay
Liang-Chi Hsieh (viirya): Apple
Qingping Hou (houqp): Scribd
Will Jones (wjones127): VoltronData

Proposed Initial Committers

In addition to the PMC, we propose the following people as the initial DataFusion committers. This is a subset of the existing Arrow committers who contribute to DataFusion https://people.apache.org/phonebook.html?unix=arrow

akurmustafa Mustafa Akur (Synnada)
avantgardner Brent Gardner (Coralogix)
comphead Oleks V. (Unaffiliated)
jiayuliu Liu Jiayu (Airbnb)
mete Metehan Yildirim (Synnada)
mingmwang Wang Mingming (Ebay)
mneumann Marco Neumann (InfluxData)
nju_yaho Zhong Yanghong (Ebay)
ozankabak Mehmet Ozan Kabak (Synnada)
paddyhoran Paddy Horan (Assured Allies)
rdettai Rémi Dettai (Cloudfuse)
sunchao Sun Chao (Apple)
thinkharderdev Daniel Harris (Coralogix)
tustvold Raphael Taylor-Davies (InfluxData)
viirya L. C. Hsieh (Apple)
wayne Ruihang Xia (Greptime)
xudong963 Xudong Wang (ByteDance)
yjshen Yijie Shen (Space and Time)

Risk Assessments

Naming / Trademarks

As a sub-project of Arrow, the DataFusion name has been used for over 4 years without any known issues.
A podling name search has thus far not turned up any concerns: https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-219

Legal / IP Clearance

All DataFusion code has either been donated to the Arrow project with appropriate IP clearance or has been developed directly under ASF processes and procedures. Thus creating a new top level project poses no new Legal or IP risks.

Code Extraction

The relevant code is already in 4 separate repositories:

https://github.com/apache/arrow-datafusion/
https://github.com/apache/arrow-datafusion-python
https://github.com/apache/arrow-ballista
https://github.com/apache/arrow-ballista-python

We foresee no issues with code extraction and propose that these repositories be renamed, respectively, to reflect the new top level project:

https://github.com/apache/datafusion/
https://github.com/apache/datafusion-python
https://github.com/apache/datafusion-ballista
https://github.com/apache/datafusion-ballista-python

Note: https://github.com/apache/arrow-rs, the Rust implementation of Arrow, would remain part of the Arrow project.

Orphaned Products

DataFusion is known to be used in many open source and commercial projects https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users, has had multiple commits daily for several years, and its adoption and number of contributors appear to be growing.

Inexperience with Open Source

The proposed PMC has extensive experience with Apache Arrow and other Apache projects, and includes PMC members and PMC chairs. The DataFusion PMC and more experienced committers will continue to coach new community members who may be less familiar with the Apache Way.

Homogeneous Developers

The 8 proposed PMC members are from 8 different employers and the proposed committers are similarly distributed across affiliations. No specific entity employs more than 3 total proposed developers.
Reliance on Salaried Developers

A substantial amount of work on DataFusion has been done by salaried developers, but the project also has a long tradition of attracting contributions from students and hobbyists, and we plan no changes in contribution structure.

Relationships with Other Apache Products

DataFusion will obviously have a strong relationship with the Arrow project given the overlap in people. We don’t foresee close collaboration with other projects at this time.

Cryptography

DataFusion does not directly support encryption and there are no near-term plans to add support for encryption. Users who need this functionality can use the extension APIs.

Required Resources

Mailing Lists

- private@datafusion for private PMC discussions (with moderated subscriptions)
- dev@datafusion
- commits@datafusion

Version Control

We propose to continue to use git for source control and GitHub for hosting and testing resources.

Issue Tracking

DataFusion would continue to use GitHub for its issue tracking and communications.

Other Resources

The existing repositories already make use of existing Apache infrastructure, and we expect no change in the initial resource usage. As the project continues to grow, we expect continued infrastructure demand growth.

FAQ:

Has a sub project been promoted to a top level project before?

Yes, and it appears to happen commonly.
The Arrow project itself was created as a top level project from work that started in Apache Drill, and there are many sub projects of Hadoop that spun out as their own top level projects such as Mahout, Avro and HBase: https://news.apache.org/foundation/entry/the_apache_software_foundation_announces4

Related material:

Name search request / research for DataFusion: https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-219
Discussion about which repositories on the arrow mailing list: https://lists.apache.org/thread/ob3n0d9ky0bgrryl3xn39w9k566bq00q
Discussion about initial PMC on the arrow mailing list: https://lists.apache.org/thread/pymrzcdw4qdptvby85f69rg3pcckl15b
Discussion about creating a new DataFusion top level project: https://github.com/apache/arrow-datafusion/discussions/6475
Discussion about graduating on incubator list: https://lists.apache.org/thread/r4n73pmms1lv0jbohyx1o1z13d615t99
Original Proposal for the Arrow project: https://lists.apache.org/thread/x2qzdwglm8pkqp9gv03bbgw17khl7pq3