This is an automated email from the ASF dual-hosted git repository. alamb pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/main by this push: new 57caa83 DataFusion 45 blog post (#57) 57caa83 is described below commit 57caa83ee52a11c1473ceea321e9b9490f42585d Author: Bruce Ritchie <bruce.ritc...@veeva.com> AuthorDate: Tue Feb 25 06:00:55 2025 -0500 DataFusion 45 blog post (#57) * DF 45 blog post * Update content/blog/2025-02-20-datafusion-45.0.0.md Co-authored-by: Yongting You <2010you...@gmail.com> * Update content/blog/2025-02-20-datafusion-45.0.0.md Co-authored-by: Yongting You <2010you...@gmail.com> * Set author to PMC. * Set author to PMC, incorporated feedback. * Update content/blog/2025-02-20-datafusion-45.0.0.md Co-authored-by: Andrew Lamb <and...@nerdnetworks.org> * expanded GSOC as it may not be obvious what it is and linked it up. * Grammar fix. * Typo fix * Typo fix * Adding spark functions to looking ahead section * minor change * Fixed Jonah Gao's handle. * Update content/blog/2025-02-20-datafusion-45.0.0.md Co-authored-by: Phillip LeBlanc <phil...@leblanc.tech> --------- Co-authored-by: Yongting You <2010you...@gmail.com> Co-authored-by: Andrew Lamb <and...@nerdnetworks.org> Co-authored-by: Phillip LeBlanc <phil...@leblanc.tech> --- content/blog/2025-02-20-datafusion-45.0.0.md | 315 +++++++++++++++++++++ .../datafusion-45.0.0/performance_over_time.png | Bin 0 -> 52136 bytes 2 files changed, 315 insertions(+) diff --git a/content/blog/2025-02-20-datafusion-45.0.0.md b/content/blog/2025-02-20-datafusion-45.0.0.md new file mode 100644 index 0000000..b9bd1df --- /dev/null +++ b/content/blog/2025-02-20-datafusion-45.0.0.md @@ -0,0 +1,315 @@ +--- +layout: post +title: Apache DataFusion 45.0.0 Released +date: 2025-02-20 +author: pmc +categories: [release] +--- + +<!-- +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. 
+The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +--> + +<!-- see https://github.com/apache/datafusion/issues/11631 for details --> + +## Introduction + +We are very proud to announce [DataFusion 45.0.0]. This blog highlights some of the +many major improvements since we released [DataFusion 40.0.0] and a preview of +what the community is thinking about in the next 6 months. It has been an exciting +period of development for DataFusion! + +[DataFusion 40.0.0]: https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0/ +[DataFusion 45.0.0]: https://crates.io/crates/datafusion/45.0.0 + +[Apache DataFusion] is an extensible query engine, written in [Rust], that +uses [Apache Arrow] as its in-memory format. DataFusion is used by developers to +create new, fast data centric systems such as databases, dataframe libraries, +machine learning and streaming applications. While [DataFusion’s primary design +goal] is to accelerate the creation of other data centric systems, it has a +reasonable experience directly out of the box as a [dataframe library], +[python library] and [command line SQL tool]. 
+ +[apache datafusion]: https://datafusion.apache.org/ +[rust]: https://www.rust-lang.org/ +[apache arrow]: https://arrow.apache.org +[DataFusion’s primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals +[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html +[python library]: https://datafusion.apache.org/python/ +[command line SQL tool]: https://datafusion.apache.org/user-guide/cli/ + +DataFusion's core thesis is that as a community, together we can build much more +advanced technology than any of us as individuals or companies could do alone. +Without DataFusion, highly performant vectorized query engines would remain +the domain of a few large companies and world-class research institutions. +With DataFusion, we can all build on top of a shared foundation, and focus on +what makes our projects unique. + + +## Community Growth 📈 + +In the last 6 months, between `40.0.0` and `45.0.0`, our community has continued to +grow in new and exciting ways. + +1. We added several PMC members and new committers: [@jayzhan211] and [@jonahgao] joined the PMC, + [@2010YOUY01], [@rachelint], [@findpi], [@iffyio], [@goldmedal], [@Weijun-H], [@Michael-J-Ward] and [@korowa] + joined as committers. See the [mailing list] for more details. +2. In the [core DataFusion repo] alone we reviewed and accepted almost 1600 PRs from 206 different + contributors, created over 1100 issues and closed 751 of them 🚀. All changes are listed in the detailed + [changelogs]. +3. DataFusion-focused meetups happened in multiple cities around the world: [Hangzhou], [Belgrade], [New York], + [Seattle], [Chicago], [Boston] and [Amsterdam] as well as a Rust NYC meetup focused on DataFusion. 
+ +[core DataFusion repo]: https://github.com/apache/arrow-datafusion +[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog +[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org +[Hangzhou]: https://github.com/apache/datafusion/discussions/10341#discussioncomment-10110273 +[Belgrade]: https://github.com/apache/datafusion/discussions/11431 +[New York]: https://github.com/apache/datafusion/discussions/11213 +[Seattle]: https://github.com/apache/datafusion/discussions/10348 +[Chicago]: https://github.com/apache/datafusion/discussions/12894 +[Boston]: https://github.com/apache/datafusion/discussions/13165 +[Amsterdam]: https://github.com/apache/datafusion/discussions/12988 + +<!-- +$ git log --pretty=oneline 40.0.0..45.0.0 . | wc -l + 1532 (up from 1453) + +$ git shortlog -sn 40.0.0..45.0.0 . | wc -l + 206 (up from 182) + +https://crates.io/crates/datafusion/45.0.0 +DataFusion 45 released Feb 7, 2025 + +https://crates.io/crates/datafusion/40.0.0 +DataFusion 40 released July 12, 2024 + +Issues created in this time: 375 open, 751 closed (from 321 open, 781 closed) +https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2024-07-12..2025-02-07 + +Issues closed: 956 (up from 911) +https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2024-07-12..2025-02-07 + +PRs merged in this time 1597 (up from 1490) +https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2024-07-12..2025-02-07 + +--> + +DataFusion has put in an application to be part of [Google Summer of Code] with a +[number of ideas] for projects with mentors already selected. Additionally, [some ideas] on +how to make DataFusion an ideal selection for university database projects such as the +[CMU database classes] have been put forward. 
+ +[Google Summer of Code]: https://summerofcode.withgoogle.com/ +[number of ideas]: https://github.com/apache/datafusion/issues/14478 +[some ideas]: https://github.com/apache/datafusion/issues/14373 +[CMU database classes]: https://15445.courses.cs.cmu.edu/spring2025/ + +In addition, DataFusion has been appearing publicly more and more, both online and offline. Here are some highlights: + +1. A [demonstration of how uwheel] is integrated into DataFusion +2. Integrating StringView into DataFusion - [part 1] and [part 2] +3. [Building streams] with DataFusion +4. [Caching in DataFusion]: Don't read twice +5. [Parquet pruning in DataFusion]: Read no more than you need +6. DataFusion is one of [The 10 coolest open source software tools] +7. [Building databases over a weekend] + +[demonstration of how uwheel]: https://uwheel.rs/post/datafusion_uwheel/ +[part 1]: https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/ +[part 2]: https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/ +[Building streams]: https://techontherocks.show/3 +[Caching in DataFusion]: https://blog.haoxp.xyz/posts/caching-datafusion +[Parquet pruning in DataFusion]: https://blog.haoxp.xyz/posts/parquet-to-arrow/ +[The 10 coolest open source software tools]: https://www.crn.com/news/software/2024/the-10-coolest-open-source-software-tools-of-2024?page=3 +[Building databases over a weekend]: https://www.denormalized.io/blog/building-databases + +## Improved Performance 🚀 + +DataFusion hit a milestone in its development by becoming [the fastest single node engine] +for querying Apache Parquet files in the [clickbench] benchmark with the 43.0.0 release. A [lot +of work] went into making this happen! While other engines have subsequently gotten faster, +displacing DataFusion from the top spot, DataFusion remains near the top and we [are planning +more improvements]. 
+ +<img +src="/blog/images/datafusion-45.0.0/performance_over_time.png" +width="100%" +class="img-responsive" +alt="ClickBench performance results over time for DataFusion" +/> + +**Figure 1**: ClickBench performance improved over 33% between DataFusion 33 +(released Nov. 2023) and DataFusion 45 (released Feb. 2025). + +The work of [integrating] the new [Arrow StringView], which significantly improves performance +for workloads that scan, filter and group by variable-length string and binary data, was completed +in the past 6 months, and StringView is now enabled by default. The improvement is especially pronounced for Parquet +files due to [upstream work in the parquet reader]. Kudos to [@XiangpengHong], [@AriesDevil], +[@PsiACE], [@Weijun-H], [@a10y], and [@RinChanNOWWW] for driving this project. + +[the fastest single node engine]: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/ +[clickbench]: https://benchmark.clickhouse.com/ +[lot of work]: https://github.com/apache/datafusion/issues/12821 +[are planning more improvements]: https://github.com/apache/datafusion/issues/14586 +[integrating]: https://github.com/apache/datafusion/issues/10918 +[Arrow StringView]: https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html +[multiple variable length columns in the `GROUP BY` clause]: https://github.com/apache/datafusion/issues/9403 +[upstream work in the parquet reader]: https://github.com/apache/arrow-rs/issues/5530 + +## Improved Quality 📋 + +DataFusion continues to improve overall in quality. In addition to ongoing bug +fixes, one of the most exciting improvements in the last 6 months was the addition of the +[SQLite sqllogictest suite] thanks to [@Omega359]. These tests run over 5 million +SQL statements on every push to the main branch. + +Support for [explicitly checking logical plan invariants] was added by [@wiedld] which +can help catch implicit changes that might cause problems during upgrades. 
+ +We have also started other quality initiatives to make it [easier to use DataFusion] +based on [GlareDB]'s experience along with more [extensive prerelease testing]. + +[SQLite sqllogictest suite]: https://github.com/apache/datafusion/pull/13936 +[explicitly checking logical plan invariants]: https://github.com/apache/datafusion/pull/13651 +[easier to use DataFusion]: https://github.com/apache/datafusion/issues/13525 +[GlareDB]: https://glaredb.com/ +[extensive prerelease testing]: https://github.com/apache/datafusion/issues/13661 + + +## Improved Documentation 📚 + +We continue to improve the documentation to make it easier to get started using DataFusion. +During the last 6 months two projects were initiated to migrate the function documentation +from strictly static markdown files. First, [@Omega359] [created a framework] to allow function +documentation to be generated from code and [@jonathanc-n] and others helped with the migration, +then [@comphead] led a project to [create a doc macro] to allow for an even easier way to write +function documentation. A special thanks to [@Chen-Yuan-Lai] for migrating many functions to +the new syntax. + +[created a framework]: https://github.com/apache/datafusion/pull/12668 +[create a doc macro]: https://github.com/apache/datafusion/pull/12822 + +Additionally, the [examples] were [refactored] and [cleaned up] to improve their usefulness. 
+ +[examples]: https://github.com/apache/datafusion/pull/13877 +[refactored]: https://github.com/apache/datafusion/pull/13905 +[cleaned up]: https://github.com/apache/datafusion/pull/13950 + +## New Features ✨ + +There are too many new features in the last 6 months to list them all, but here +are some highlights: + +### Functions +* Uniform Window Functions: `BuiltInWindowFunctions` was removed and all window functions now use UDFs ([@jcsherin]) +* Uniform Aggregate Functions: `BuiltInAggregateFunctions` was removed and all aggregate functions now use UDFs +* As mentioned above, function documentation was extracted from the markdown files +* Some new functions and SQL support were added, including '[show functions]', '[to_local_time]', + '[regexp_count]', '[map_extract]', '[array_distance]', '[array_any_value]', '[greatest]', + '[least]', '[arrays_overlap]' + +### FFI +* Foreign Function Interface work has started. This should allow for + [using table providers] across languages and versions of DataFusion. This + is especially pertinent for integration with [delta-rs] and other table formats. 
+ +[delta-rs]: https://delta-io.github.io/delta-rs/ + +### Materialized Views +* [@suremarc] has added a [materialized view implementation] in datafusion-contrib 🚀 + +### Substrait +* A lot of work was put into improving and enhancing Substrait support ([@Blizzara], [@westonpace], [@tokoko], [@vbarua], [@LatrecheYasser], [@notfilippo] and others) + +[show functions]: https://github.com/apache/datafusion/pull/13799 +[to_local_time]: https://github.com/apache/datafusion/pull/11347 +[regexp_count]: https://github.com/apache/datafusion/pull/12970 +[map_extract]: https://github.com/apache/datafusion/pull/11969 +[array_distance]: https://github.com/apache/datafusion/pull/12211 +[array_any_value]: https://github.com/apache/datafusion/pull/12329 +[greatest]: https://github.com/apache/datafusion/pull/12474 +[least]: https://github.com/apache/datafusion/pull/13786 +[arrays_overlap]: https://github.com/apache/datafusion/pull/14217 +[using table providers]: https://github.com/apache/datafusion/pull/12920 +[materialized view implementation]: https://github.com/datafusion-contrib/datafusion-materialized-views + +## Looking Ahead: The Next Six Months 🔭 + +One of the long term goals of [@alamb], DataFusion's PMC chair, has been to have +[1000 DataFusion based projects]. This may be the year that happens! + +The community has been [discussing what we will work on in the next six months]. +Some major initiatives are likely to be: + +1. *Performance*: A [number of items have been identified] as areas that could use additional work +2. *Memory usage*: Tracking and improving memory usage, statistics and spilling to disk +3. *[Google Summer of Code] (GSoC)*: We hope DataFusion is selected as a project so that we can start accepting and supporting student projects +4. *FFI*: Extending the FFI implementation to support all types of UDFs and SessionContext +5. 
*Spark Functions*: A [proposal has been made to add a crate] covering Spark-compatible built-in functions + +[1000 DataFusion based projects]: https://www.influxdata.com/blog/datafusion-2025-influxdb/ +[discussing what we will work on in the next six months]: https://github.com/apache/datafusion/issues/14580 +[number of items have been identified]: https://github.com/apache/datafusion/issues/14482 +[Google Summer of Code]: https://summerofcode.withgoogle.com/ +[proposal has been made to add a crate]: https://github.com/apache/datafusion/issues/5600 + +## How to Get Involved + +DataFusion is not a project built or driven by a single person, company, or +foundation. Rather, our community of users and contributors works together to +build a shared technology that none of us could have built alone. + +If you are interested in joining us, we would love to have you. You can try out +DataFusion on some of your own data and projects and let us know how it goes, +contribute suggestions, documentation, bug reports, or a PR with documentation, +tests or code. A list of open issues suitable for beginners is [here] and you +can find how to reach us on the [communication doc]. 
+ +[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 +[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html + +[@jayzhan211]: https://github.com/jayzhan211 +[@jonahgao]: https://github.com/jonahgao +[@2010YOUY01]: https://github.com/2010YOUY01 +[@rachelint]: https://github.com/rachelint +[@findpi]: https://github.com/findepi/ +[@iffyio]: https://github.com/iffyio +[@goldmedal]: https://github.com/goldmedal +[@Weijun-H]: https://github.com/Weijun-H +[@Michael-J-Ward]: https://github.com/Michael-J-Ward +[@korowa]: https://github.com/korowa +[@Omega359]: https://github.com/Omega359 +[@jonathanc-n]: https://github.com/jonathanc-n +[@comphead]: https://github.com/comphead +[@Chen-Yuan-Lai]: https://github.com/Chen-Yuan-Lai +[@alamb]: https://github.com/alamb +[@Blizzara]: https://github.com/Blizzara +[@westonpace]: https://github.com/westonpace +[@tokoko]: https://github.com/tokoko +[@vbarua]: https://github.com/vbarua +[@LatrecheYasser]: https://github.com/LatrecheYasser +[@notfilippo]: https://github.com/notfilippo +[@suremarc]: https://github.com/suremarc +[@XiangpengHong]: https://github.com/XiangpengHong +[@AriesDevil]: https://github.com/AriesDevil +[@PsiACE]: https://github.com/PsiACE +[@a10y]: https://github.com/a10y +[@RinChanNOWWW]: https://github.com/RinChanNOWWW +[@wiedld]: https://github.com/wiedld +[@jcsherin]: https://github.com/jcsherin \ No newline at end of file diff --git a/content/images/datafusion-45.0.0/performance_over_time.png b/content/images/datafusion-45.0.0/performance_over_time.png new file mode 100644 index 0000000..dc899fa Binary files /dev/null and b/content/images/datafusion-45.0.0/performance_over_time.png differ