ntjohnson1 commented on code in PR #21105:
URL: https://github.com/apache/datafusion/pull/21105#discussion_r2996981885
##########
dev/wiki/apache-datafusion.wikitext:
##########
@@ -0,0 +1,113 @@
+<!--
+Draft Wikipedia article.
+-->
+
+{{Short description|Open-source query engine}}
+{{Draft topics|technology|software}}
+{{Infobox software
+| name = Apache DataFusion
+| developer = [[Apache Software Foundation]]
+| programming language = [[Rust (programming language)|Rust]]
+| genre = Query engine
+| license = [[Apache License]]
+| website = {{URL|https://datafusion.apache.org/}}
+}}
+
+'''Apache DataFusion''' is an [[open-source software|open-source]], embeddable
analytical query engine written in [[Rust (programming language)|Rust]], built
on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite
journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres
|first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet
Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow
DataFusion: A Fast, Embeddable, Modular Analytic Query Engine
|journal=Proceedings of the 2024 International Conference on Management of Data
|year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite
web |title=Introduction
|url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache
DataFusion |publisher=Apache Software Foundation
|access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces
for analytical query execution and is designed to be used as a library by
developers building databases, query engines, and analytical tools, rather than as a
standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" />
The project originated in 2017, was donated to the [[Apache Arrow]] project in
2019, and became a top-level project of the [[Apache Software Foundation]] in
2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native
Query Engine for Apache Arrow
|url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/
|website=Apache DataFusion Blog |publisher=Apache Software Foundation
|date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web
|title=Apache Software Foundation Announces New Top-Level Project Apache
DataFusion
|url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion
|website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11
|access-date=2026-03-22}}</ref> As of March 2026, DataFusion exceeded one
million monthly downloads on crates.io.<ref name="crates-io">{{cite web
|title=datafusion |url=https://crates.io/crates/datafusion |website=crates.io
|access-date=2026-03-26}}</ref>
Review Comment:
There isn't a formal page on Dataframes but there is a stub that refers to
Spark, pandas, etc. After this page lands we should add a pointer to it from
there. https://en.wikipedia.org/wiki/Dataframe
##########
dev/wiki/apache-datafusion.wikitext:
##########
@@ -0,0 +1,113 @@
+<!--
+Draft Wikipedia article.
+-->
+
+{{Short description|Open-source query engine}}
+{{Draft topics|technology|software}}
+{{Infobox software
+| name = Apache DataFusion
+| developer = [[Apache Software Foundation]]
+| programming language = [[Rust (programming language)|Rust]]
+| genre = Query engine
+| license = [[Apache License]]
+| website = {{URL|https://datafusion.apache.org/}}
+}}
+
+'''Apache DataFusion''' is an [[open-source software|open-source]], embeddable
analytical query engine written in [[Rust (programming language)|Rust]], built
on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite
journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres
|first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet
Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow
DataFusion: A Fast, Embeddable, Modular Analytic Query Engine
|journal=Proceedings of the 2024 International Conference on Management of Data
|year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite
web |title=Introduction
|url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache
DataFusion |publisher=Apache Software Foundation
|access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces
for analytical query execution and is designed to be used as a library by
developers building databases, query engines, and analytical tools, rather than as a
standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" />
The project originated in 2017, was donated to the [[Apache Arrow]] project in
2019, and became a top-level project of the [[Apache Software Foundation]] in
2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native
Query Engine for Apache Arrow
|url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/
|website=Apache DataFusion Blog |publisher=Apache Software Foundation
|date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web
|title=Apache Software Foundation Announces New Top-Level Project Apache
DataFusion
|url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion
|website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11
|access-date=2026-03-22}}</ref> As of March 2026, DataFusion exceeded one
million monthly downloads on crates.io.<ref name="crates-io">{{cite web
|title=datafusion |url=https://crates.io/crates/datafusion |website=crates.io
|access-date=2026-03-26}}</ref>
Review Comment:
NIT: I think `extensible analytical query engine` is clearer than
`embeddable analytical query engine`. "Extensible" is the term used on the
DataFusion landing page on apache.org.
##########
dev/wiki/apache-datafusion.wikitext:
##########
@@ -0,0 +1,113 @@
+<!--
+Draft Wikipedia article.
+-->
+
+{{Short description|Open-source query engine}}
+{{Draft topics|technology|software}}
+{{Infobox software
+| name = Apache DataFusion
+| developer = [[Apache Software Foundation]]
+| programming language = [[Rust (programming language)|Rust]]
+| genre = Query engine
+| license = [[Apache License]]
+| website = {{URL|https://datafusion.apache.org/}}
+}}
+
+'''Apache DataFusion''' is an [[open-source software|open-source]], embeddable
analytical query engine written in [[Rust (programming language)|Rust]], built
on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite
journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres
|first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet
Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow
DataFusion: A Fast, Embeddable, Modular Analytic Query Engine
|journal=Proceedings of the 2024 International Conference on Management of Data
|year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite
web |title=Introduction
|url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache
DataFusion |publisher=Apache Software Foundation
|access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces
for analytical query execution and is designed to be used as a library by
developers building databases, query engines, and analytical tools, rather than as a
standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" />
The project originated in 2017, was donated to the [[Apache Arrow]] project in
2019, and became a top-level project of the [[Apache Software Foundation]] in
2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native
Query Engine for Apache Arrow
|url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/
|website=Apache DataFusion Blog |publisher=Apache Software Foundation
|date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web
|title=Apache Software Foundation Announces New Top-Level Project Apache
DataFusion
|url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion
|website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11
|access-date=2026-03-22}}</ref> As of March 2026, DataFusion exceeded one
million monthly downloads on crates.io.<ref name="crates-io">{{cite web
|title=datafusion |url=https://crates.io/crates/datafusion |website=crates.io
|access-date=2026-03-26}}</ref>
+
+== History ==
+
+DataFusion was originally authored by Andy Grove starting in 2017. It was donated
to the Apache Arrow project in February 2019.<ref name="donation-post" /> In
2024, a paper describing DataFusion was accepted to the industry track of the
[[ACM SIGMOD]] conference.<ref name="sigmod-accepted">{{cite web |title=SIGMOD
2024 Industrial Track: Accepted Papers
|url=https://2024.sigmod.org/industrial-list.shtml |website=SIGMOD 2024
|access-date=2026-03-22}}</ref><ref name="sigmod-paper" /> In April 2024, the
project graduated from Apache Arrow and became a top-level Apache project.<ref
name="asf-tlp" />
+
+== Features ==
+
+DataFusion is a fast, extensible query engine for building data systems. It
provides a SQL interface and a DataFrame API for constructing queries
programmatically, a [[query plan|query planner]] and rule-based [[query
optimization|optimizer]], and a multithreaded vectorized execution engine that
processes data in columnar batches rather than row by row.<ref
name="sigmod-paper" /><ref name="intro-docs" />
+
+The engine reads common analytical file formats natively, including [[Apache
Parquet]], [[comma-separated values|CSV]], [[JSON]], [[Apache Avro|Avro]], and
Arrow IPC, and uses [[Apache Arrow]]'s columnar memory format throughout
execution, avoiding [[serialization]] overhead between stages.<ref
name="sigmod-paper" />
+
+DataFusion is designed for in-process embedding: it runs within the host
application's process rather than as a separate server, using threads for
parallel query execution. Its extension points allow downstream systems to add
[[user-defined function|user-defined functions]], custom data sources, custom
query languages, and new optimizer rules, enabling developers to build
specialized database systems on top of DataFusion's planning and execution
components without reimplementing them.<ref name="sigmod-paper" /><ref
name="intro-docs" />
+
+== Comparison with related systems ==
+
+DataFusion is frequently compared with other columnar analytical systems
including [[DuckDB]], [[Polars (software)|Polars]], and Velox, but these
systems differ significantly in scope and intended use.<ref
name="composable-dbms">{{cite journal |last1=Pedreira |first1=Pedro
|last2=Erling |first2=Orri |last3=Mühleisen |first3=Hannes |last4=Muñoz
|first4=Ruben |last5=Khaled |first5=Wael |last6=Dürsch |first6=Peter |title=The
Composable Data Management System Manifesto |journal=Proceedings of the VLDB
Endowment |volume=16 |issue=10 |year=2023 |doi=10.14778/3603581.3603604}}</ref>
+
+=== [[DuckDB]] ===
+
+[[DuckDB]] is an in-process [[online analytical processing|OLAP]] database for
direct use by end users, with its own storage format and catalog.<ref
name="duckdb-official">{{cite web |title=DuckDB |url=https://duckdb.org/
|website=DuckDB |access-date=2026-03-22}}</ref> DataFusion is a library for
building such systems, providing query planning and execution components that
other software can embed without a bundled persistent storage format.<ref
name="bauplan">{{cite web |title=Duck Hunt: Moving Bauplan from DuckDB to
DataFusion
|url=https://www.bauplanlabs.com/post/duck-hunt-moving-bauplan-from-duckdb-to-datafusion
|website=Bauplan |date=2025-11-05 |access-date=2026-03-22}}</ref>
+
+=== [[Polars (software)|Polars]] ===
+
+[[Polars (software)|Polars]] is also written in [[Rust (programming
language)|Rust]] and uses the [[Apache Arrow]] memory model, but is designed as
a self-contained DataFrame library for data manipulation rather than an
embeddable query engine for building other systems.<ref
name="polars-official">{{cite web |title=Polars |url=https://pola.rs/
|website=Polars |access-date=2026-03-22}}</ref><ref name="faq">{{cite web
|title=Frequently Asked Questions
|url=https://datafusion.apache.org/user-guide/faq.html |website=Apache
DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref>
+
+=== [[Apache Spark]] ===
+
+[[Apache Spark]] is a distributed analytics framework for processing data at
cluster scale.<ref name="spark-sql">{{cite web |title=Spark SQL & DataFrames
|url=https://spark.apache.org/sql/ |website=Apache Spark
|access-date=2026-03-22}}</ref> DataFusion executes queries within a single
process and is aimed at building embedded analytics systems rather than
distributed workloads.<ref name="sigmod-paper" /> Apache projects that use
DataFusion to accelerate Spark include Apache DataFusion Comet, a native
execution plugin for Spark's [[Java virtual machine|JVM]]-based SQL execution
engine,<ref name="comet-donation">{{cite web |title=Announcing Apache Arrow
DataFusion Comet |url=https://arrow.apache.org/blog/2024/03/06/comet-donation/
|website=Apache Arrow Blog |publisher=Apache Software Foundation
|date=2024-03-06 |access-date=2026-03-22}}</ref> and [https://auron.apache.org/
Apache Auron], a Spark accelerator that combines the Apache Arrow-DataFusion
library with the Spark distributed
computing framework.<ref name="auron-intro">{{cite web |title=Introduction
|url=https://auron.apache.org/introduction.html |website=Apache Auron
|publisher=Apache Software Foundation |access-date=2026-03-23}}</ref>
+
+=== Velox ===
+
+[https://velox-lib.io/ Velox] is an execution engine library developed at
[[Meta Platforms|Meta]].<ref name="velox-vldb">{{cite journal |last1=Pedreira
|first1=Pedro |last2=Tan |first2=Wei |last3=Narayanan |first3=Deepak
|last4=Chattopadhyay |first4=Bikramjit |last5=Erling |first5=Orri |last6=Melnik
|first6=Sergey |last7=Bhagwan |first7=Ranjita |last8=Dumoulin |first8=Franck
|title=Velox: Meta's Unified Execution Engine |journal=Proceedings of the VLDB
Endowment |volume=15 |issue=12 |year=2022 |doi=10.14778/3554821.3554829}}</ref>
Unlike DataFusion, Velox does not include a SQL frontend or query planning
framework; it takes an already-optimized query plan as input and handles only
execution.<ref name="velox-docs">{{cite web |title=Velox in 10 Minutes
|url=https://facebookincubator.github.io/velox/velox-in-10-min.html
|website=Velox |access-date=2026-03-22}}</ref>
+
+== Adoption and reception ==
+
+DataFusion has been adopted across a range of analytics and database products.
[[Cloudflare]] used DataFusion in its Log Explorer product to execute SQL
queries over log data stored in Cloudflare R2.<ref name="cloudflare">{{cite web
|title=Cloudflare Log Explorer is now GA, providing native observability and
forensics |url=https://blog.cloudflare.com/logexplorer-ga/ |website=The
Cloudflare Blog |publisher=Cloudflare |date=2025-06-18
|access-date=2026-03-22}}</ref> [[Palantir Technologies|Palantir]] Lightweight
Pipelines are powered by DataFusion.<ref name="palantir-2025">{{cite web
|title=Announcements: July 2025
|url=https://www.palantir.com/docs/foundry/announcements/2025-07
|website=Palantir Foundry Documentation |publisher=Palantir Technologies
|date=2025-07-29 |access-date=2026-03-22}}</ref><ref
name="palantir-2024">{{cite web |title=Announcements: February 2024
|url=https://www.palantir.com/docs/foundry/announcements/2024-02
|website=Palantir Foundry Documentation |publisher=Palantir
Technologies |date=February 2024 |access-date=2026-03-22}}</ref>
[[InfluxDB]] 3.0 uses DataFusion as part of the FDAP stack: Apache Flight,
DataFusion, Arrow, and Parquet.<ref name="influx-fdap">{{cite web
|title=Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to
build InfluxDB 3.0
|url=https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/
|website=InfluxData |date=2023-10-25 |access-date=2026-03-22}}</ref> Other
users described in public sources include EDB Postgres AI,<ref
name="siliconangle-edb">{{cite web |title=Enterprise DB begins rolling AI
features into PostgreSQL
|url=https://siliconangle.com/2024/05/23/enterprise-db-begins-rolling-ai-features-postgresql/
|website=SiliconANGLE |date=2024-05-23 |access-date=2026-03-22}}</ref>
Cube,<ref name="cube-pushdown">{{cite web |title=Query pushdown in Cube's
semantic layer
|url=https://cube.dev/blog/query-push-down-in-cubes-semantic-layer
|website=Cube |date=2024-06-03
|access-date=2026-03-22}}</ref> Spice AI,<ref name="spice">{{cite web
|title=How we use Apache DataFusion at Spice AI
|url=https://spice.ai/blog/how-we-use-apache-datafusion-at-spice-ai
|website=Spice AI |date=2026-01-17 |access-date=2026-03-22}}</ref> Pydantic
Logfire,<ref name="logfire">{{cite web |title=We're changing database
|url=https://github.com/pydantic/logfire/issues/408 |website=GitHub
|date=2024-08-29 |access-date=2026-03-22}}</ref> and Kamu.<ref
name="kamu">{{cite web |title=100X faster ingestion, and FlightSQL support for
connecting BI tools
|url=https://www.kamu.dev/blog/2023-09-datafusion-flightsql/ |website=Kamu Data
|date=2023-09-26 |access-date=2026-03-22}}</ref>
Review Comment:
I'm biased toward including a link to Rerun, but we don't have a blog post
calling out DataFusion even though it is all over our repo. Will work on that.