Re: [PR] add first draft of wikipedia article [datafusion]

via GitHub Mon, 23 Mar 2026 07:27:12 -0700


alamb commented on code in PR #21105:
URL: https://github.com/apache/datafusion/pull/21105#discussion_r2975395755



##########
dev/wiki/apache-datafusion.wikitext:
##########
@@ -0,0 +1,117 @@
+<!--
+Draft Wikipedia article.
+-->
+
+{{Short description|Open-source query engine}}
+{{Draft topics|technology|software}}
+{{Infobox software
+| name = Apache DataFusion
+| developer = [[Apache Software Foundation]]
+| programming language = [[Rust (programming language)|Rust]]
+| genre = Query engine
+| license = [[Apache License]]
+| website = {{URL|https://datafusion.apache.org/}}
+}}
+
+'''Apache DataFusion''' is an [[open-source software|open-source]], embeddable 
analytical query engine written in [[Rust (programming language)|Rust]], built 
on [[Apache Arrow]]'s columnar memory format.<ref name="sigmod-paper">{{cite 
journal |last1=Lamb |first1=Andrew |last2=Shen |first2=Yijie |last3=Heres 
|first3=Daniel |last4=Chakraborty |first4=Jayjeet |last5=Kabak |first5=Mehmet 
Ozan |last6=Hsieh |first6=Liang-Chi |last7=Sun |first7=Chao |title=Apache Arrow 
DataFusion: A Fast, Embeddable, Modular Analytic Query Engine 
|journal=Proceedings of the 2024 International Conference on Management of Data 
|year=2024 |doi=10.1145/3626246.3653368}}</ref><ref name="intro-docs">{{cite 
web |title=Introduction 
|url=https://datafusion.apache.org/user-guide/introduction.html |website=Apache 
DataFusion |publisher=Apache Software Foundation 
|access-date=2026-03-22}}</ref> It provides [[SQL]] and DataFrame interfaces 
for analytical query execution and is designed to be used as a library by 
developers building databases, query engines, and analytical tools, rather than as a 
standalone database server.<ref name="sigmod-paper" /><ref name="intro-docs" /> 
The project originated in 2017, was donated to the [[Apache Arrow]] project in 
2019, and became a top-level project of the [[Apache Software Foundation]] in 
2024.<ref name="donation-post">{{cite web |title=DataFusion: A Rust-native 
Query Engine for Apache Arrow 
|url=https://datafusion.apache.org/blog/2019/02/04/datafusion-donation/ 
|website=Apache DataFusion Blog |publisher=Apache Software Foundation 
|date=2019-02-04 |access-date=2026-03-22}}</ref><ref name="asf-tlp">{{cite web 
|title=Apache Software Foundation Announces New Top-Level Project Apache 
DataFusion 
|url=https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-datafusion
 |website=The ASF Blog |publisher=Apache Software Foundation |date=2024-06-11 
|access-date=2026-03-22}}</ref>
+
+== History ==
+
+DataFusion was first announced as a Rust-native query engine for Apache Arrow 
in February 2019. That announcement said the project had started about two 
years earlier and had recently been reimplemented to be Arrow-native before its 
donation to Apache Arrow.<ref name="donation-post" />
+
+After its donation, DataFusion was developed within the Apache Arrow project. 
Its development during the early 2020s coincided with wider adoption of Rust 
and Arrow-based analytical systems.<ref name="sigmod-paper" />
+
+In 2024, a paper describing DataFusion was accepted to the industry track of 
the [[ACM SIGMOD]] conference.<ref name="sigmod-accepted">{{cite web 
|title=SIGMOD 2024 Industrial Track: Accepted Papers 
|url=https://2024.sigmod.org/industrial-list.shtml |website=SIGMOD 2024 
|access-date=2026-03-22}}</ref><ref name="sigmod-paper" /> On April 16, 2024, 
the project graduated from Apache Arrow and became a top-level Apache project; 
the Apache Software Foundation publicly announced the change on June 11, 
2024.<ref name="asf-tlp" />
+
+== Features ==
+
+DataFusion is a fast, extensible query engine for building data systems. It 
provides a SQL interface and a DataFrame API for constructing queries 
programmatically, a [[query plan|query planner]] and rule-based [[query 
optimization|optimizer]], and a multithreaded vectorized execution engine that 
processes data in columnar batches rather than row by row.<ref 
name="sigmod-paper" /><ref name="intro-docs" />
+
+The engine reads common analytical file formats natively, including [[Apache 
Parquet]], [[comma-separated values|CSV]], [[JSON]], [[Apache Avro|Avro]], and 
Arrow IPC, and uses [[Apache Arrow]]'s columnar memory format throughout 
execution, avoiding [[serialization]] overhead between stages.<ref 
name="sigmod-paper" />
+
+DataFusion is designed for in-process embedding: it runs within the host 
application's process rather than as a separate server, using threads for 
parallel query execution. Its extension points allow downstream systems to add 
[[user-defined function|user-defined functions]], custom data sources, custom 
query languages, and new optimizer rules, enabling developers to build 
specialized database systems on top of DataFusion's planning and execution 
components without reimplementing them.<ref name="sigmod-paper" /><ref 
name="intro-docs" />
+
+== Comparison with related systems ==
+
+DataFusion is frequently compared with other columnar analytical systems 
including [[DuckDB]], [[Polars (software)|Polars]], and Velox, but these 
systems differ significantly in scope and intended use.<ref 
name="composable-dbms">{{cite journal |last1=Pedreira |first1=Pedro 
|last2=Erling |first2=Orri |last3=Mühleisen |first3=Hannes |last4=Muñoz 
|first4=Ruben |last5=Khaled |first5=Wael |last6=Dürsch |first6=Peter |title=The 
Composable Data Management System Manifesto |journal=Proceedings of the VLDB 
Endowment |volume=16 |issue=10 |year=2023 |doi=10.14778/3603581.3603604}}</ref>
+
+=== [[DuckDB]] ===
+
+[[DuckDB]] is an in-process [[online analytical processing|OLAP]] database for 
direct use by end users, with its own storage format and catalog.<ref 
name="duckdb-official">{{cite web |title=DuckDB |url=https://duckdb.org/ 
|website=DuckDB |access-date=2026-03-22}}</ref> DataFusion is a library for 
building such systems, providing query planning and execution components that 
other software can embed without a bundled persistent storage format.<ref 
name="bauplan">{{cite web |title=Duck Hunt: Moving Bauplan from DuckDB to 
DataFusion 
|url=https://www.bauplanlabs.com/post/duck-hunt-moving-bauplan-from-duckdb-to-datafusion
 |website=Bauplan |date=2025-11-05 |access-date=2026-03-22}}</ref>
+
+=== [[Polars (software)|Polars]] ===
+
+[[Polars (software)|Polars]] is also written in [[Rust (programming 
language)|Rust]] and uses the [[Apache Arrow]] memory model, but is designed as 
a self-contained DataFrame library for data manipulation rather than an 
embeddable query engine for building other systems.<ref 
name="polars-official">{{cite web |title=Polars |url=https://pola.rs/ 
|website=Polars |access-date=2026-03-22}}</ref><ref name="faq">{{cite web 
|title=Frequently Asked Questions 
|url=https://datafusion.apache.org/user-guide/faq.html |website=Apache 
DataFusion |publisher=Apache Software Foundation |access-date=2026-03-22}}</ref>
+
+=== [[Apache Spark]] ===
+
+[[Apache Spark]] is a distributed analytics framework for processing data at 
cluster scale.<ref name="spark-sql">{{cite web |title=Spark SQL & DataFrames 
|url=https://spark.apache.org/sql/ |website=Apache Spark 
|access-date=2026-03-22}}</ref> DataFusion executes queries within a single 
process and is aimed at building embedded analytics systems rather than 
distributed workloads.<ref name="sigmod-paper" /> Apache DataFusion Comet, 
originally developed at [[Apple Inc.|Apple]] and donated to the Apache Software 
Foundation, is a native execution plugin that uses DataFusion to accelerate 
Spark's [[Java virtual machine|JVM]]-based SQL execution engine.<ref 
name="comet-donation" />
+
+=== Velox ===
+
+[https://velox-lib.io/ Velox] is an execution engine library developed at 
[[Meta Platforms|Meta]].<ref name="velox-vldb">{{cite journal |last1=Pedreira 
|first1=Pedro |last2=Tan |first2=Wei |last3=Narayanan |first3=Deepak 
|last4=Chattopadhyay |first4=Bikramjit |last5=Erling |first5=Orri |last6=Melnik 
|first6=Sergey |last7=Bhagwan |first7=Ranjita |last8=Dumoulin |first8=Franck 
|title=Velox: Meta's Unified Execution Engine |journal=Proceedings of the VLDB 
Endowment |volume=15 |issue=12 |year=2022 |doi=10.14778/3554821.3554829}}</ref> 
Unlike DataFusion, Velox does not include a SQL frontend or query planning 
framework; it takes an already-optimized query plan as input and handles only 
execution.<ref name="velox-docs">{{cite web |title=Velox in 10 Minutes 
|url=https://facebookincubator.github.io/velox/velox-in-10-min.html 
|website=Velox |access-date=2026-03-22}}</ref>
+
+== Adoption and reception ==
+
+DataFusion has been adopted across a range of analytics and database products. 
[[Palantir Technologies|Palantir]] Foundry's release notes state that its 
Lightweight Pipelines are powered by DataFusion for rapid, low-latency data 
processing.<ref name="palantir-2025">{{cite web |title=Announcements: July 2025 
|url=https://www.palantir.com/docs/foundry/announcements/2025-07 
|website=Palantir Foundry Documentation |publisher=Palantir Technologies 
|date=2025-07-29 |access-date=2026-03-22}}</ref><ref 
name="palantir-2024">{{cite web |title=Announcements: February 2024 
|url=https://www.palantir.com/docs/foundry/announcements/2024-02 
|website=Palantir Foundry Documentation |publisher=Palantir Technologies 
|date=February 2024 |access-date=2026-03-22}}</ref> [[Cloudflare]] used 
DataFusion to execute SQL queries over log data stored in Cloudflare R2 in its 
Log Explorer product.<ref name="cloudflare">{{cite web |title=Cloudflare Log 
Explorer is now GA, providing native observability and forensics 
|url=https://blog.cloudflare.com/logexplorer-ga/ |website=The Cloudflare 
Blog |publisher=Cloudflare |date=2025-06-18 |access-date=2026-03-22}}</ref> 
[[InfluxDB]] 3.0 was rebuilt on what InfluxData called the FDAP stack: Apache 
Flight, DataFusion, Arrow, and Parquet.<ref name="influx-fdap">{{cite web 
|title=Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to 
build InfluxDB 3.0 
|url=https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/
 |website=InfluxData |date=2023-10-25 |access-date=2026-03-22}}</ref> Other 
adopters described in public references include EDB Postgres AI,<ref 
name="siliconangle-edb">{{cite web |title=Enterprise DB begins rolling AI 
features into PostgreSQL 
|url=https://siliconangle.com/2024/05/23/enterprise-db-begins-rolling-ai-features-postgresql/
 |website=SiliconANGLE |date=2024-05-23 |access-date=2026-03-22}}</ref> 
Cube,<ref name="cube-pushdown">{{cite web |title=Query pushdown in Cube's 
semantic layer 
|url=https://cube.dev/blog/query-push-down-in-cubes-semantic-layer |website=Cube 
|date=2024-06-03 |access-date=2026-03-22}}</ref> Spice AI,<ref 
name="spice">{{cite web |title=How we use Apache DataFusion at Spice AI 
|url=https://spice.ai/blog/how-we-use-apache-datafusion-at-spice-ai 
|website=Spice AI |date=2026-01-17 |access-date=2026-03-22}}</ref> Pydantic 
Logfire,<ref name="logfire">{{cite web |title=We're changing database 
|url=https://github.com/pydantic/logfire/issues/408 |website=GitHub 
|date=2024-08-29 |access-date=2026-03-22}}</ref> and Kamu.<ref 
name="kamu">{{cite web |title=100X faster ingestion, and FlightSQL support for 
connecting BI tools 
|url=https://www.kamu.dev/blog/2023-09-datafusion-flightsql/ |website=Kamu Data 
|date=2023-09-26 |access-date=2026-03-22}}</ref>
+
+In 2024, ''CRN'' included Apache DataFusion in its list of "The 10 Coolest 
Open-Source Software Tools Of 2024".<ref name="crn">{{cite web |title=The 10 
Coolest Open-Source Software Tools Of 2024 
|url=https://www.crn.com/news/software/2024/the-10-coolest-open-source-software-tools-of-2024?page=3
 |website=CRN |date=2024-11-21 |access-date=2026-03-22}}</ref>
+

Review Comment:
   I would also probably choose to start with Cloudflare workers rather than 
Palantir, as I know that use case better (and they wrote a specific blog post, 
which I think is stronger).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

