I have created a draft proposal [1] to break DataFusion out to its own top
level project. Please provide your feedback and suggestions.

The proposal is included at the end of this email and in this Google Doc:
https://docs.google.com/document/d/11WTNYS8KWScOt3ySTX39WVS6krPhUvHsuJRY9PZQx4g

Feel free to respond to this email or comment / make suggestions directly
on the document.

I would be especially grateful if people could review and comment on the
proposed list of committers and PMC members.

I hope no one is getting sick of hearing about this, but I think in
this case it is better to over-communicate than to risk surprises.

Andrew

[1] https://github.com/apache/arrow-datafusion/issues/8491


----------

DataFusion Top Level Project Proposal
Dec 27, 2023

[Editor’s note: This document is based on the proposal to the ASF board to
create the Arrow project. Once it has been reviewed, we plan to send it to
the ASF board sometime in January or February 2024 for their consideration.]

To: The ASF (bo...@apache.org)

Summary:

We propose creating a new top level project, Apache DataFusion, from an
existing sub project of Apache Arrow to facilitate additional community and
project growth.

----
Apache DataFusion for Apache Top Level Project

Abstract

Apache Arrow DataFusion[1] is a very fast, extensible query engine for
building high-quality data-centric systems in Rust, using the Apache Arrow
in-memory format. DataFusion offers SQL and DataFrame APIs, excellent
performance, built-in support for CSV, Parquet, JSON, and Avro, extensive
customization, and a great community.

[1] https://arrow.apache.org/datafusion/


Proposal

We propose creating a new top level ASF project, Apache DataFusion,
governed initially by a subset of the Arrow project’s PMC and committers.
The project’s code is in four existing git repositories, currently governed
by Apache Arrow, which would transfer to the new top level project.

Background

When DataFusion was initially donated to the Arrow project, it did not have
a strong enough community to stand on its own. It has since grown
significantly and benefited immensely from being part of Arrow and from the
nurturing of the Apache Way. It now has a community strong enough to stand
on its own, one that would benefit from focused governance attention.

The community has discussed this idea publicly for more than 6 months
https://github.com/apache/arrow-datafusion/discussions/6475 and briefly on
the Arrow PMC mailing list
https://lists.apache.org/thread/thv2jdm6640l6gm88hy8jhk5prjww0cs. As of
this writing, both have received exclusively positive reactions.

Several current members of the Arrow PMC are active contributors to
DataFusion, understand and believe deeply in the Apache Way, and play
active governance roles in the Arrow project as PMC members and PMC chairs,
guiding the community and releasing software versions. With this existing
governance experience and structure, the new top level project will be able
to function well immediately and independently.

Overview of DataFusion

Current Status

Meritocracy

DataFusion has been developed as part of Apache Arrow and thus has been
operating as a meritocracy. Many of the developers of DataFusion are Arrow
PMC members or committers. The DataFusion project plans to continue adding
new PMC members and committers as the project matures and grows.

Community

The DataFusion development team seeks to foster the development and user
communities. We hope that becoming a separate project will help both the
Arrow and DataFusion communities by giving each a clearer focus. Focused
governance will make it easier to grow the community of committers and PMC
members and make the organization clearer to others.

Alignment

The ASF is a natural host for DataFusion given that it is already the home
of Arrow, Parquet, and other related distributed, storage, and query
execution systems.

Project Leadership

Proposed Initial PMC

We propose the following people as the initial DataFusion PMC members. This
is a subset of the existing Arrow PMC members who contribute to DataFusion:
https://people.apache.org/phonebook.html?unix=arrow

Andy Grove (agrove): Arrow PMC Chair
Andrew Lamb (alamb): Arrow PMC, past Arrow PMC Chair
Daniël Heres (dheres): Arrow PMC
Jie Wen (jakevin): Arrow PMC, Doris Committer
Kun Liu (liukun): Arrow PMC, IoTDB PMC, TSFile PMC
Liang-Chi Hsieh (viirya): Arrow PMC, Spark PMC
Qingping Hou (houqp): Arrow PMC, Doris Committer
Will Jones (wjones127): Arrow PMC

We’d like to propose Andrew Lamb as the initial Chair (and thus ASF VP)
for the DataFusion project.

Affiliations

Andy Grove (agrove): NVIDIA
Andrew Lamb (alamb): InfluxData
Daniël Heres (dheres): Coralogix
Jie Wen (jakevin): SelectDB
Kun Liu (liukun): eBay
Liang-Chi Hsieh (viirya): Apple
Qingping Hou (houqp): Scribd
Will Jones (wjones127): Voltron Data

Proposed Initial Committers

In addition to the PMC, we propose the following people as the initial
DataFusion committers. This is a subset of the existing Arrow committers
who contribute to DataFusion:
https://people.apache.org/phonebook.html?unix=arrow

akurmustafa Mustafa Akur (Synnada)
avantgardner Brent Gardner (Coralogix)
comphead Oleks V. (Unaffiliated)
jiayuliu Liu Jiayu (Airbnb)
mete Metehan Yildirim (Synnada)
mingmwang Wang Mingming (eBay)
mneumann Marco Neumann (InfluxData)
nju_yaho Zhong Yanghong (eBay)
ozankabak Mehmet Ozan Kabak (Synnada)
paddyhoran Paddy Horan (Assured Allies)
rdettai Rémi Dettai (Cloudfuse)
sunchao Sun Chao (Apple)
thinkharderdev Daniel Harris (Coralogix)
tustvold Raphael Taylor-Davies (InfluxData)
viirya L. C. Hsieh (Apple)
wayne Ruihang Xia (Greptime)
xudong963 Xudong Wang (ByteDance)
yjshen Yijie Shen (Space and Time)


Risk Assessments

Naming / Trademarks

As a sub-project of Arrow, the DataFusion name has been used for over 4
years without any known issues. A podling name search has thus far not
turned up any concerns:
https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-219

Legal / IP Clearance

All DataFusion code has either been donated to the Arrow project with
appropriate IP clearance or has been developed directly under ASF
processes and procedures. Thus, creating a new top level project poses no
new legal or IP risks.

Code Extraction

The relevant code is already in 4 separate repositories:
https://github.com/apache/arrow-datafusion/
https://github.com/apache/arrow-datafusion-python
https://github.com/apache/arrow-ballista
https://github.com/apache/arrow-ballista-python

We foresee no issues with code extraction and propose that these
repositories be renamed, respectively, to reflect the new top level project:
https://github.com/apache/datafusion/
https://github.com/apache/datafusion-python
https://github.com/apache/datafusion-ballista
https://github.com/apache/datafusion-ballista-python

Note: https://github.com/apache/arrow-rs, the Rust implementation of
Arrow, would remain part of the Arrow project.

Orphaned Products

DataFusion is known to be used in many open source and commercial projects
https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users,
has had multiple commits daily for several years, and its adoption and
number of contributors appear to be growing.

Inexperience with Open Source

The proposed PMC has extensive experience with Apache Arrow and other
Apache projects, and includes PMC members and PMC chairs. The DataFusion
PMC and more experienced committers will continue to coach new community
members who may be less familiar with the Apache Way.

Homogeneous Developers

The 8 proposed PMC members are from 8 different employers, and the proposed
committers are similarly distributed across affiliations. No single entity
employs more than 3 of the proposed developers.

Reliance on Salaried Developers

A substantial amount of work on DataFusion has been done by salaried
developers, but the project also has a long tradition of attracting
contributions from students and hobbyists, and we plan no changes to the
contribution structure.

Relationships with Other Apache Products

DataFusion will obviously have a strong relationship with the Arrow project
given the overlap in people. We don’t foresee close collaboration with
other projects at this time.

Cryptography

DataFusion does not directly support encryption and there are no near-term
plans to add support for encryption. Users who need this functionality can
use the extension APIs.

Required Resources

Mailing Lists

- private@datafusion for private PMC discussions (with moderated
subscriptions)
- dev@datafusion
- commits@datafusion

Version Control

We propose to continue to use git for source control and GitHub for hosting
and testing resources.

Issue Tracking

DataFusion would continue to use GitHub for its issue tracking and
communications.

Other Resources

The existing repositories already make use of existing Apache
infrastructure, and we expect no change in the initial resource usage. As
the project continues to grow, we expect infrastructure demand to grow
with it.


FAQ: Has a sub project been promoted to a top level project before?

Yes, and it appears to happen commonly. The Arrow project itself was
created as a top level project from work that started in Apache Drill, and
there are many sub projects of Hadoop that spun out as their own top level
projects such as Mahout, Avro and HBase:
https://news.apache.org/foundation/entry/the_apache_software_foundation_announces4




Related material:
Name search request / research for DataFusion:
https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-219
Discussion on the Arrow mailing list about which repositories to include:
https://lists.apache.org/thread/ob3n0d9ky0bgrryl3xn39w9k566bq00q
Discussion about the initial PMC on the Arrow mailing list:
https://lists.apache.org/thread/pymrzcdw4qdptvby85f69rg3pcckl15b
Discussion about creating a new DataFusion top level project:
https://github.com/apache/arrow-datafusion/discussions/6475
Discussion about graduating on incubator list:
https://lists.apache.org/thread/r4n73pmms1lv0jbohyx1o1z13d615t99
Original Proposal for the Arrow project:
https://lists.apache.org/thread/x2qzdwglm8pkqp9gv03bbgw17khl7pq3
