This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git


The following commit(s) were added to refs/heads/main by this push:
     new 57caa83  DataFusion 45 blog post (#57)
57caa83 is described below

commit 57caa83ee52a11c1473ceea321e9b9490f42585d
Author: Bruce Ritchie <bruce.ritc...@veeva.com>
AuthorDate: Tue Feb 25 06:00:55 2025 -0500

    DataFusion 45 blog post (#57)
    
    * DF 45 blog post
    
    * Update content/blog/2025-02-20-datafusion-45.0.0.md
    
    Co-authored-by: Yongting You <2010you...@gmail.com>
    
    * Update content/blog/2025-02-20-datafusion-45.0.0.md
    
    Co-authored-by: Yongting You <2010you...@gmail.com>
    
    * Set author to PMC.
    
    * Set author to PMC, incorporated feedback.
    
    * Update content/blog/2025-02-20-datafusion-45.0.0.md
    
    Co-authored-by: Andrew Lamb <and...@nerdnetworks.org>
    
    * expanded GSOC as it may not be obvious what it is and linked it up.
    
    * Grammar fix.
    
    * Typo fix
    
    * Typo fix
    
    * Adding spark functions to looking ahead section
    
    * minor change
    
    * Fixed Jonah Gao's handle.
    
    * Update content/blog/2025-02-20-datafusion-45.0.0.md
    
    Co-authored-by: Phillip LeBlanc <phil...@leblanc.tech>
    
    ---------
    
    Co-authored-by: Yongting You <2010you...@gmail.com>
    Co-authored-by: Andrew Lamb <and...@nerdnetworks.org>
    Co-authored-by: Phillip LeBlanc <phil...@leblanc.tech>
---
 content/blog/2025-02-20-datafusion-45.0.0.md       | 315 +++++++++++++++++++++
 .../datafusion-45.0.0/performance_over_time.png    | Bin 0 -> 52136 bytes
 2 files changed, 315 insertions(+)

diff --git a/content/blog/2025-02-20-datafusion-45.0.0.md 
b/content/blog/2025-02-20-datafusion-45.0.0.md
new file mode 100644
index 0000000..b9bd1df
--- /dev/null
+++ b/content/blog/2025-02-20-datafusion-45.0.0.md
@@ -0,0 +1,315 @@
+---
+layout: post
+title: Apache DataFusion 45.0.0 Released
+date: 2025-02-20
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/11631 for details -->
+
+## Introduction
+
+We are very proud to announce [DataFusion 45.0.0]. This blog highlights some 
of the
+many major improvements since we released [DataFusion 40.0.0] and a preview of
+what the community is thinking about in the next 6 months. It has been an 
exciting
+period of development for DataFusion!
+
+[DataFusion 40.0.0]: 
https://datafusion.apache.org/blog/2024/07/24/datafusion-40.0.0/
+[DataFusion 45.0.0]: https://crates.io/crates/datafusion/45.0.0
+
+[Apache DataFusion] is an extensible query engine, written in [Rust], that
+uses [Apache Arrow] as its in-memory format. DataFusion is used by developers 
to
+create new, fast data centric systems such as databases, dataframe libraries,
+machine learning and streaming applications. While [DataFusion’s primary design
+goal] is to accelerate the creation of other data centric systems, it has a
+reasonable experience directly out of the box as a [dataframe library],
+[python library] and [command line SQL tool].
+
+[apache datafusion]: https://datafusion.apache.org/
+[rust]: https://www.rust-lang.org/
+[apache arrow]: https://arrow.apache.org
+[DataFusion’s primary design goal]: 
https://datafusion.apache.org/user-guide/introduction.html#project-goals
+[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html
+[python library]: https://datafusion.apache.org/python/
+[command line SQL tool]: https://datafusion.apache.org/user-guide/cli/
+
+DataFusion's core thesis is that as a community, together we can build much 
more
+advanced technology than any of us as individuals or companies could do alone. 
+Without DataFusion, highly performant vectorized query engines would remain
+the domain of a few large companies and world-class research institutions. 
+With DataFusion, we can all build on top of a shared foundation, and focus on
+what makes our projects unique.
+
+
+## Community Growth  📈 
+
+In the last 6 months, between `40.0.0` and `45.0.0`, our community continues to
+grow in new and exciting ways.
+
+1. We added several PMC members and new committers: [@jayzhan211] and 
[@jonahgao] joined the PMC,
+   [@2010YOUY01], [@rachelint], [@findpi], [@iffyio], [@goldmedal], 
[@Weijun-H], [@Michael-J-Ward] and [@korowa]
+   joined as committers. See the [mailing list] for more details.
+2. In the [core DataFusion repo] alone we reviewed and accepted almost 1600 
PRs from 206 different
+   committers, created over 1100 issues and closed 751 of them 🚀. All changes 
are listed in the detailed
+   [changelogs].
+3. DataFusion focused meetups happened in multiple cities around the world: 
[Hangzhou], [Belgrade], [New York], 
+   [Seattle], [Chicago], [Boston] and [Amsterdam] as well as a Rust NYC meetup 
in NYC focused on DataFusion.
+
+[core DataFusion repo]: https://github.com/apache/arrow-datafusion
+[changelogs]: https://github.com/apache/datafusion/tree/main/dev/changelog
+[mailing list]: https://lists.apache.org/list.html?d...@datafusion.apache.org
+[Hangzhou]: 
https://github.com/apache/datafusion/discussions/10341#discussioncomment-10110273
+[Belgrade]: https://github.com/apache/datafusion/discussions/11431
+[New York]: https://github.com/apache/datafusion/discussions/11213
+[Seattle]: https://github.com/apache/datafusion/discussions/10348
+[Chicago]: https://github.com/apache/datafusion/discussions/12894
+[Boston]: https://github.com/apache/datafusion/discussions/13165 
+[Amsterdam]: https://github.com/apache/datafusion/discussions/12988
+
+<!--
+$ git log --pretty=oneline 40.0.0..45.0.0 . | wc -l
+     1532 (up from 1453)
+
+$ git shortlog -sn 40.0.0..45.0.0 . | wc -l
+     206 (up from 182)
+
+https://crates.io/crates/datafusion/45.0.0
+DataFusion 45 released Feb 7, 2025
+
+https://crates.io/crates/datafusion/40.0.0
+DataFusion 40 released July 12, 2024
+
+Issues created in this time: 375 open, 751 closed (from 321 open, 781 closed)
+https://github.com/apache/datafusion/issues?q=is%3Aissue+created%3A2024-07-12..2025-02-07
+
+Issues closed: 956 (up from 911)
+https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2024-07-12..2025-02-07
+
+PRs merged in this time 1597 (up from 1490)
+https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2024-07-12..2025-02-07
+
+-->
+
+DataFusion has put in an application to be part of [Google Summer of Code] 
with a 
+[number of ideas] for projects with mentors already selected. Additionally, 
[some ideas] on
+how to make DataFusion an ideal selection for university database projects 
such as the 
+[CMU database classes] have been put forward.
+
+[Google Summer of Code]: https://summerofcode.withgoogle.com/
+[number of ideas]: https://github.com/apache/datafusion/issues/14478
+[some ideas]: https://github.com/apache/datafusion/issues/14373
+[CMU database classes]: https://15445.courses.cs.cmu.edu/spring2025/
+
+In addition, DataFusion has been appearing publicly more and more, both online 
and offline. Here are some highlights:
+
+1. A [demonstration of how uwheel] is integrated into DataFusion
+2. Integrating StringView into DataFusion - [part 1] and [part 2]
+3. [Building streams] with DataFusion
+4. [Caching in DataFusion]: Don't read twice
+5. [Parquet pruning in DataFusion]: Read no more than you need
+6. DataFusion is one of [The 10 coolest open source software tools]
+7. [Building databases over a weekend]
+
+[demonstration of how uwheel]: https://uwheel.rs/post/datafusion_uwheel/
+[part 1]: 
https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/
+[part 2]: 
https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/
+[Building streams]: https://techontherocks.show/3
+[Caching in DataFusion]: https://blog.haoxp.xyz/posts/caching-datafusion
+[Parquet pruning in DataFusion]: https://blog.haoxp.xyz/posts/parquet-to-arrow/
+[The 10 coolest open source software tools]: 
https://www.crn.com/news/software/2024/the-10-coolest-open-source-software-tools-of-2024?page=3
+[Building databases over a weekend]: 
https://www.denormalized.io/blog/building-databases
+
+## Improved Performance 🚀 
+
+DataFusion hit a milestone in its development by becoming [the fastest single 
node engine] 
+for querying Apache Parquet files in [clickbench] benchmark for the 43.0.0 
release. A [lot 
+of work] went into making this happen! While other engines have subsequently 
gotten faster,
+displacing DataFusion from the top spot, DataFusion still remains near the top 
and we [are planning
+more improvements].
+
+<img
+src="/blog/images/datafusion-45.0.0/performance_over_time.png"
+width="100%"
+class="img-responsive"
+alt="ClickBench performance results over time for DataFusion"
+/>
+
+**Figure 1**: ClickBench performance improved over 33% between DataFusion 33
+(released Nov. 2023) and DataFusion 45 (released Feb. 2025). 
+
+The task of [integrating] the new [Arrow StringView] which significantly 
improves performance 
+for workloads that scan, filter and group by variable length string and binary 
data was completed 
+and enabled by default in the past 6 months. The improvement is especially 
pronounced for Parquet 
+files due to [upstream work in the parquet reader]. Kudos to [@XiangpengHong], 
[@AriesDevil], 
+[@PsiACE], [@Weijun-H], [@a10y], and [@RinChanNOWWW] for driving this project.
+
+[the fastest single node engine]: 
https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
+[clickbench]: https://benchmark.clickhouse.com/
+[lot of work]: https://github.com/apache/datafusion/issues/12821
+[are planning more improvements]: 
https://github.com/apache/datafusion/issues/14586
+[integrating]: https://github.com/apache/datafusion/issues/10918
+[Arrow StringView]: 
https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html
+[multiple variable length columns in the `GROUP BY` clause]: 
https://github.com/apache/datafusion/issues/9403
+[upstream work in the parquet reader]: 
https://github.com/apache/arrow-rs/issues/5530
+
+## Improved Quality 📋
+
+DataFusion continues to improve overall in quality. In addition to ongoing bug
+fixes, one of the most exciting improvements in the last 6 months was the 
addition of the 
+[SQLite sqllogictest suite] thanks to [@Omega359]. These tests run over 5 
million 
+sql statements on every push to the main branch.
+
+Support for [explicitly checking logical plan invariants] was added by 
[@wiedld] which 
+can help catch implicit changes that might cause problems during upgrades.
+
+We have also started other quality initiatives to make it [easier to use 
DataFusion] 
+based on [GlareDB]'s experience along with more [extensive prerelease 
testing].  
+
+[SQLite sqllogictest suite]: https://github.com/apache/datafusion/pull/13936
+[explicitly checking logical plan invariants]: 
https://github.com/apache/datafusion/pull/13651
+[easier to use DataFusion]: https://github.com/apache/datafusion/issues/13525
+[GlareDB]: https://glaredb.com/
+[extensive prerelease testing]: 
https://github.com/apache/datafusion/issues/13661
+
+
+## Improved Documentation 📚
+
+We continue to improve the documentation to make it easier to get started 
using DataFusion. 
+During the last 6 months two projects were initiated to migrate the function 
documentation
+from strictly static markdown files. First, [@Omega359] [created a framework] 
to allow function
+documentation to be generated from code and [@jonathanc-n] and others helped 
with the migration,
+then [@comphead] lead a project to [create a doc macro] to allow for an even 
easier way to write 
+function documentation. A special thanks to [@Chen-Yuan-Lai] for migrating 
many functions to 
+the new syntax.
+
+[created a framework]: https://github.com/apache/datafusion/pull/12668
+[create a doc macro]: https://github.com/apache/datafusion/pull/12822
+
+Additionally, the [examples] were [refactored] and [cleaned up] to improve 
their usefulness.
+
+[examples]: https://github.com/apache/datafusion/pull/13877
+[refactored]: https://github.com/apache/datafusion/pull/13905
+[cleaned up]: https://github.com/apache/datafusion/pull/13950
+
+## New Features ✨
+
+There are too many new features in the last 6 months to list them all, but here
+are some highlights:
+
+### Functions
+* Uniform Window Functions:  `BuiltInWindowFunctions` was removed and all now 
use UDFs ([@jcsherin])
+* Uniform Aggregate Functions: `BuiltInAggregateFunctions` was removed and all 
now use UDFs
+* As mentioned above function documentation was extracted from the markdown 
files
+* Some new functions and sql support were added including '[show functions]', 
'[to_local_time]',
+  '[regexp_count]', '[map_extract]', '[array_distance]', '[array_any_value]', 
'[greatest]',
+  '[least]', '[arrays_overlap]'
+
+### FFI
+* Foreign Function Interface work has started. This should allow for 
+  [using table providers] across languages and versions of DataFusion. This 
+  is especially pertinent for integration with [delta-rs] and other table 
formats.
+
+[delta-rs]: https://delta-io.github.io/delta-rs/
+
+### Materialized Views
+* [@suremarc] has added a [materialized view implementation] in 
datafusion-contrib 🚀
+
+### Substrait
+* A lot of work was put into improving and enhancing substrait support 
([@Blizzara], [@westonpace], [@tokoko], [@vbarua], [@LatrecheYasser], 
[@notfilippo] and others)
+
+[show functions]: https://github.com/apache/datafusion/pull/13799
+[to_local_time]: https://github.com/apache/datafusion/pull/11347
+[regexp_count]: https://github.com/apache/datafusion/pull/12970
+[map_extract]: https://github.com/apache/datafusion/pull/11969
+[array_distance]: https://github.com/apache/datafusion/pull/12211
+[array_any_value]: https://github.com/apache/datafusion/pull/12329
+[greatest]: https://github.com/apache/datafusion/pull/12474
+[least]: https://github.com/apache/datafusion/pull/13786
+[arrays_overlap]: https://github.com/apache/datafusion/pull/14217
+[using table providers]: https://github.com/apache/datafusion/pull/12920
+[materialized view implementation]: 
https://github.com/datafusion-contrib/datafusion-materialized-views
+
+## Looking Ahead: The Next Six Months 🔭 
+
+One of the long term goals of [@alamb], DataFusion's PMC chair, has been to 
have 
+[1000 DataFusion based projects]. This may be the year that happens!
+
+The community has been [discussing what we will work on in the next six 
months].
+Some major initiatives are likely to be:
+
+1. *Performance*: A [number of items have been identified] as areas that could 
use additional work
+2. *Memory usage*: Tracking and improving memory usage, statistics and 
spilling to disk 
+3. *[Google Summer of Code] (GSOC)*: DataFusion is hopefully selected as a 
project and we start accepting and supporting student projects 
+4. *FFI*: Extending the FFI implementation to support to all types of UDF's 
and SessionContext
+5. *Spark Functions*: A [proposal has been made to add a crate] covering spark 
compatible builtin functions 
+
+[1000 DataFusion based projects]: 
https://www.influxdata.com/blog/datafusion-2025-influxdb/
+[discussing what we will work on in the next six months]: 
https://github.com/apache/datafusion/issues/14580
+[number of items have been identified]: 
https://github.com/apache/datafusion/issues/14482
+[Google Summer of Code]: https://summerofcode.withgoogle.com/
+[proposal has been made to add a crate]: 
https://github.com/apache/datafusion/issues/5600
+
+## How to Get Involved
+
+DataFusion is not a project built or driven by a single person, company, or
+foundation. Rather, our community of users and contributors work together to
+build a shared technology that none of us could have built alone.
+
+If you are interested in joining us we would love to have you. You can try out
+DataFusion on some of your own data and projects and let us know how it goes,
+contribute suggestions, documentation, bug reports, or a PR with documentation,
+tests or code. A list of open issues suitable for beginners is [here] and you
+can find how to reach us on the [communication doc].
+
+[here]: 
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
+[communication doc]: 
https://datafusion.apache.org/contributor-guide/communication.html
+
+[@jayzhan211]: https://github.com/jayzhan211
+[@jonahgao]: https://github.com/jonahgao
+[@2010YOUY01]: https://github.com/2010YOUY01
+[@rachelint]: https://github.com/rachelint
+[@findpi]: https://github.com/findepi/
+[@iffyio]: https://github.com/iffyio
+[@goldmedal]: https://github.com/goldmedal
+[@Weijun-H]: https://github.com/Weijun-H
+[@Michael-J-Ward]: https://github.com/Michael-J-Ward
+[@korowa]: https://github.com/korowa
+[@Omega359]: https://github.com/Omega359
+[@jonathanc-n]: https://github.com/jonathanc-n
+[@comphead]: https://github.com/comphead
+[@Chen-Yuan-Lai]: https://github.com/Chen-Yuan-Lai
+[@alamb]: https://github.com/alamb
+[@Blizzara]: https://github.com/Blizzara
+[@westonpace]: https://github.com/westonpace
+[@tokoko]: https://github.com/tokoko
+[@vbarua]: https://github.com/vbarua
+[@LatrecheYasser]: https://github.com/LatrecheYasser
+[@notfilippo]: https://github.com/notfilippo
+[@suremarc]: https://github.com/suremarc
+[@XiangpengHong]: https://github.com/XiangpengHong
+[@AriesDevil]: https://github.com/AriesDevil
+[@PsiACE]: https://github.com/PsiACE
+[@a10y]: https://github.com/a10y
+[@RinChanNOWWW]: https://github.com/RinChanNOWWW
+[@wiedld]: https://github.com/wiedld
+[@jcsherin]: https://github.com/jcsherin
\ No newline at end of file
diff --git a/content/images/datafusion-45.0.0/performance_over_time.png 
b/content/images/datafusion-45.0.0/performance_over_time.png
new file mode 100644
index 0000000..dc899fa
Binary files /dev/null and 
b/content/images/datafusion-45.0.0/performance_over_time.png differ


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@datafusion.apache.org
For additional commands, e-mail: commits-h...@datafusion.apache.org

Reply via email to