This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/main by this push:
new 5e70777 DataFusion 52 release post (#135)
5e70777 is described below
commit 5e707778d00733f23fce6f36334df5a1f9cba24e
Author: Andrew Lamb <[email protected]>
AuthorDate: Wed Jan 28 14:33:51 2026 -0500
DataFusion 52 release post (#135)
* Initial draft (coded with codex)
* updates
* updates
* Update sql planning
* Apply suggestions from code review
Co-authored-by: Matt Butrovich <[email protected]>
* Updates
* acknowledgments
* update
* updates
* update
* typos
* refine
* clean
* Update content/blog/2026-01-08-datafusion-52.0.0.md
Co-authored-by: Martin Grigorov <[email protected]>
* Clarify RelationPlanner syntax
* remove extra section
* Update content/blog/2026-01-08-datafusion-52.0.0.md
Co-authored-by: Adrian Garcia Badaracco <[email protected]>
* Refine wording
* reflow
* Metadata cache is general
* Add section on Min/max dynamic filters
* Update content/blog/2026-01-08-datafusion-52.0.0.md
Co-authored-by: Yongting You <[email protected]>
* whitespace
* Improve sort pushdown description
* capitalizaton
* wordsmith
---------
Co-authored-by: Matt Butrovich <[email protected]>
Co-authored-by: Martin Grigorov <[email protected]>
Co-authored-by: Adrian Garcia Badaracco <[email protected]>
Co-authored-by: Yongting You <[email protected]>
---
content/blog/2026-01-08-datafusion-52.0.0.md | 410 +++++++++++++++++++++++++++
1 file changed, 410 insertions(+)
diff --git a/content/blog/2026-01-08-datafusion-52.0.0.md b/content/blog/2026-01-08-datafusion-52.0.0.md
new file mode 100644
index 0000000..bb2d7d0
--- /dev/null
+++ b/content/blog/2026-01-08-datafusion-52.0.0.md
@@ -0,0 +1,410 @@
+---
+layout: post
+title: Apache DataFusion 52.0.0 Released
+date: 2026-01-08
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+We are proud to announce the release of [DataFusion 52.0.0]. This post highlights
+some of the major improvements since [DataFusion 51.0.0]. The complete list of
+changes is available in the [changelog]. Thanks to the [121 contributors] for
+making this release possible.
+
+
+[DataFusion 52.0.0]: https://crates.io/crates/datafusion/52.0.0
+[DataFusion 51.0.0]: https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md
+[121 contributors]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md#credits
+
+## Performance Improvements 🚀
+
+We continue to make significant performance improvements in DataFusion, as explained below.
+
+### Faster `CASE` Expressions
+
+DataFusion 52 adds lookup-table-based evaluation for certain `CASE` expressions,
+which avoids repeated evaluation and accelerates common ETL patterns such as:
+
+```sql
+CASE company
+ WHEN 1 THEN 'Apple'
+ WHEN 5 THEN 'Samsung'
+ WHEN 2 THEN 'Motorola'
+ WHEN 3 THEN 'LG'
+ ELSE 'Other'
+END
+```
+
+This is the final work in our `CASE` performance epic ([#18075]), which has
+improved `CASE` evaluation significantly. Related PRs: [#18183]. Thanks to
+[rluvaton] and [pepijnve] for the implementation.
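
To make the idea concrete, here is a minimal Python sketch (not DataFusion's Rust implementation) that contrasts branch-by-branch evaluation with a lookup table built once from the literal `WHEN` values. The function names `case_sequential` and `case_lookup` are hypothetical, and SQL `NULL` semantics are not modeled beyond treating `None` as "no match".

```python
def case_sequential(values, branches, default):
    """Evaluate CASE by testing each WHEN branch in order for every row."""
    out = []
    for v in values:
        for when, then in branches:
            if v == when:
                out.append(then)
                break
        else:
            # No branch matched: fall through to the ELSE value.
            out.append(default)
    return out


def case_lookup(values, branches, default):
    """Evaluate the same CASE with a table built once from the literal branches."""
    table = {}
    for when, then in branches:
        # Keep the first occurrence so duplicate WHEN literals behave like
        # sequential evaluation, where the earliest matching branch wins.
        table.setdefault(when, then)
    # One hash lookup per row instead of one comparison per branch per row.
    return [table.get(v, default) for v in values]
```

Both functions return the same result for the `company` example above; the lookup version does a single probe per row regardless of how many `WHEN` branches there are.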
+
+[rluvaton]: https://github.com/rluvaton
+[pepijnve]: https://github.com/pepijnve
+
+
+[#18075]: https://github.com/apache/datafusion/issues/18075
+[#18183]: https://github.com/apache/datafusion/pull/18183
+
+### `MIN`/`MAX` Aggregate Dynamic Filters
+
+DataFusion now creates dynamic filters for queries with `MIN`/`MAX` aggregates
+that have filters, but no `GROUP BY`. These dynamic filters are used during scan
+to prune files and rows as tighter bounds are discovered during execution, as
+explained in the [Dynamic Filtering Blog]. For example, the following query:
+
+```sql
+SELECT min(l_shipdate)
+FROM lineitem
+WHERE l_returnflag = 'R';
+```
+
+is now executed as if it were written like this:
+
+```sql
+SELECT min(l_shipdate)
+FROM lineitem
+-- '__current_min' is updated dynamically during execution
+WHERE l_returnflag = 'R' AND l_shipdate < __current_min;
+```
+
+Thanks to [2010YOUY01] for implementing this feature, with reviews from
+[martin-g], [adriangb], and [LiaCastaneda]. Related PRs: [#18644]
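
The pruning effect can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the DataFusion implementation: each "file" carries a precomputed minimum for the aggregated column, values are plain integers, the static `WHERE` predicate is ignored, and the names `FileStats` and `dynamic_min` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    name: str
    min_value: int  # smallest value of the aggregated column in this file
    rows: list      # the column values, standing in for the file contents


def dynamic_min(files):
    """Return (minimum value, names of the files that were actually scanned)."""
    current_min = None
    scanned = []
    for f in files:
        # Dynamic filter `col < current_min`: if even the file's smallest
        # value cannot beat the current bound, the whole file is skipped.
        if current_min is not None and f.min_value >= current_min:
            continue
        scanned.append(f.name)
        for v in f.rows:
            if current_min is None or v < current_min:
                current_min = v  # the bound tightens as execution proceeds
    return current_min, scanned
```

With three files whose minimums are 10, 15, and 3, the second file is never read because its statistics already show that it cannot lower the running minimum.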
+
+[#18644]: https://github.com/apache/datafusion/pull/18644
+[2010YOUY01]: https://github.com/2010YOUY01
+[martin-g]: https://github.com/martin-g
+[adriangb]: https://github.com/adriangb
+[LiaCastaneda]: https://github.com/LiaCastaneda
+
+### New Merge Join
+
+DataFusion 52 includes a rewrite of the sort-merge join (SMJ) operator, with
+speedups of three orders of magnitude in some pathological cases such as the
+case in [#18487], which also affected [Apache Comet] workloads. Benchmarks in
+[#18875] show dramatic gains for TPC-H Q21 (minutes to milliseconds) while
+leaving other queries unchanged or modestly faster. Thanks to [mbutrovich] for
+the implementation and reviews from [Dandandan].
+
+[#18487]: https://github.com/apache/datafusion/issues/18487
+[#18875]: https://github.com/apache/datafusion/pull/18875
+[Apache Comet]: https://datafusion.apache.org/comet/
+[mbutrovich]: https://github.com/mbutrovich
+
+
+### Caching Improvements
+
+This release also includes several additional caching improvements.
+
+A new statistics cache for file metadata avoids repeatedly recalculating
+statistics for the same files. This significantly improves planning time
+for certain queries. You can see the contents of the new cache using the
+[statistics_cache] function in the CLI:
+
+[statistics_cache]: https://datafusion.apache.org/user-guide/cli/functions.html#statistics-cache
+
+
+```sql
+select * from statistics_cache();
++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
+| path             | file_modified       | file_size_bytes | e_tag                  | version | num_rows        | num_columns | table_size_bytes   | statistics_size_bytes |
++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
+| .../hits.parquet | 2022-06-25T22:22:22 | 14779976446     | 0-5e24d1ee16380-370f48 | NULL    | Exact(99997497) | 105         | Exact(36445943240) | 0                     |
++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
+```
+Thanks to [bharath-techie] and [nuno-faria] for implementing the statistics cache,
+with reviews from [martin-g], [alamb], and [alchemist51].
+Related PRs: [#18971], [#19054]
+
+[#18971]: https://github.com/apache/datafusion/pull/18971
+[#19054]: https://github.com/apache/datafusion/pull/19054
+[bharath-techie]: https://github.com/bharath-techie
+[nuno-faria]: https://github.com/nuno-faria
+[martin-g]: https://github.com/martin-g
+[alchemist51]: https://github.com/alchemist51
+
+
+A prefix-aware list-files cache accelerates evaluating partition predicates for
+Hive partitioned tables.
+
+```sql
+-- Read the hive partitioned dataset from Overture Maps (100s of Parquet files)
+CREATE EXTERNAL TABLE overturemaps
+STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-12-17.0/';
+-- Find all files where the path contains `theme=base` without requiring another LIST call
+select count(*) from overturemaps where theme='base';
+```
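
As a rough illustration of what "prefix-aware" means here, the Python sketch below answers a listing for a sub-prefix from an already cached listing of a parent prefix. It is a conceptual model only: the class name `PrefixListCache` and the `list_fn` callback are hypothetical, prefixes are assumed to end in `/`, and details of the real cache such as expiry are left out.

```python
class PrefixListCache:
    """Cache object listings and serve sub-prefix requests from cached parents."""

    def __init__(self, list_fn):
        self._list_fn = list_fn  # performs the (expensive) remote LIST call
        self._cache = {}         # listed prefix -> list of object paths

    def list(self, prefix):
        for cached_prefix, paths in self._cache.items():
            # A cached listing of a parent prefix already contains every
            # object under the requested prefix, so filter it locally.
            if prefix.startswith(cached_prefix):
                return [p for p in paths if p.startswith(prefix)]
        paths = self._list_fn(prefix)
        self._cache[prefix] = paths
        return paths
```

After the table root has been listed once, a request for `.../theme=base/` is served by filtering the cached paths, so no second LIST call reaches the object store.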
+
+You can see the
+contents of the new cache using the [list_files_cache] function in the CLI:
+
+[list_files_cache]: https://datafusion.apache.org/user-guide/cli/functions.html#list-files-cache
+
+```sql
+create external table overturemaps
+stored as parquet
+location 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infrastructure';
+0 row(s) fetched.
+> select table, path, metadata_size_bytes, expires_in, unnest(metadata_list)['file_size_bytes'] as file_size_bytes, unnest(metadata_list)['e_tag'] as e_tag from list_files_cache() limit 10;
++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
+| table        | path                                                | metadata_size_bytes | expires_in                        | file_size_bytes | e_tag                                 |
++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 999055952       | "35fc8fbe8400960b54c66fbb408c48e8-60" |
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 975592768       | "8a16e10b722681cdc00242564b502965-59" |
+...
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 1016732378      | "6d70857a0473ed9ed3fc6e149814168b-61" |
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 991363784       | "c9cafb42fcbb413f851691c895dd7c2b-60" |
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 1032469715      | "7540252d0d67158297a67038a3365e0f-62" |
++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
+```
+
+Thanks to [BlakeOrth] and [Yuvraj-cyborg] for implementing the list-files cache work,
+with reviews from [gabotechs], [alamb], [alchemist51], [martin-g], and [BlakeOrth].
+Related PRs: [#18146], [#18855], [#19366], [#19298]
+
+[Epic #17214]: https://github.com/apache/datafusion/issues/17214
+[#18146]: https://github.com/apache/datafusion/pull/18146
+[#18855]: https://github.com/apache/datafusion/pull/18855
+[#19366]: https://github.com/apache/datafusion/pull/19366
+[#19298]: https://github.com/apache/datafusion/pull/19298
+[BlakeOrth]: https://github.com/BlakeOrth
+[Yuvraj-cyborg]: https://github.com/Yuvraj-cyborg
+
+
+### Improved Hash Join Filter Pushdown
+
+Starting in DataFusion 51, filtering information from `HashJoinExec` is passed
+dynamically to scans, as explained in the [Dynamic Filtering Blog], using a
+technique referred to as [Sideways Information Passing] in the database research
+literature. The initial implementation passed min/max values for the join keys.
+DataFusion 52 extends the optimization ([#17171] / [#18393]) to pass the
+contents of the build side hash map. These filters are evaluated on the probe
+side scan to prune files, row groups, and individual rows. When the build side
+contains `20` or fewer rows (configurable), the contents of the hash map are
+transformed to an `IN` expression and used for [statistics-based pruning], which
+can avoid reading entire files or row groups that contain no matching join keys.
+Thanks to [adriangb] for implementing this feature, with reviews from
+[LiaCastaneda], [asolimando], [comphead], and [mbutrovich].
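
The difference between bounds-only and `IN`-list pruning can be sketched in Python. This is a simplified model, not the `PruningPredicate` API: row groups are represented only by hypothetical `(name, min_key, max_key)` statistics tuples, keys are integers, and `max_in_list` stands in for the configurable threshold mentioned above.

```python
def prune_row_groups(build_keys, row_groups, max_in_list=20):
    """Return names of probe-side row groups that may contain a join key.

    `row_groups` is a list of (name, min_key, max_key) statistics tuples.
    """
    keys = sorted(build_keys)
    if not keys:
        return []  # an empty build side cannot match any probe row
    use_in_list = len(keys) <= max_in_list
    kept = []
    for name, lo, hi in row_groups:
        if use_in_list:
            # IN-list pruning: keep the row group only if some build-side
            # key falls inside its [lo, hi] statistics range.
            may_match = any(lo <= k <= hi for k in keys)
        else:
            # Bounds-only pruning: keep the row group if its range overlaps
            # the [min, max] range of the build-side keys.
            may_match = keys[0] <= hi and keys[-1] >= lo
        if may_match:
            kept.append(name)
    return kept
```

For build keys `5` and `250`, min/max bounds alone keep every row group between 5 and 250, whereas the `IN` list also rules out the row group covering 100 to 199.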
+
+
+[Sideways Information Passing]: https://dl.acm.org/doi/10.1109/ICDE.2008.4497486
+[Dynamic Filtering Blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters
+[statistics-based pruning]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html
+
+[#17171]: https://github.com/apache/datafusion/issues/17171
+[#18393]: https://github.com/apache/datafusion/pull/18393
+[adriangb]: https://github.com/adriangb
+[LiaCastaneda]: https://github.com/LiaCastaneda
+[asolimando]: https://github.com/asolimando
+[comphead]: https://github.com/comphead
+
+
+## Major Features ✨
+
+### Arrow IPC Stream File Support
+
+DataFusion can now read Arrow IPC stream files ([#18457]). This expands
+interoperability with systems that emit Arrow streams directly, making it
+simpler to ingest Arrow-native data without conversion. Thanks to [corasaurus-hex]
+for implementing this feature, with reviews from [martin-g], [Jefffrey],
+[jdcasale], [2010YOUY01], and [timsaucer].
+
+```sql
+CREATE EXTERNAL TABLE ipc_events
+STORED AS ARROW
+LOCATION 's3://bucket/events.arrow';
+```
+
+Related PRs: [#18457]
+
+[#18457]: https://github.com/apache/datafusion/pull/18457
+[corasaurus-hex]: https://github.com/corasaurus-hex
+[Jefffrey]: https://github.com/Jefffrey
+[jdcasale]: https://github.com/jdcasale
+[2010YOUY01]: https://github.com/2010YOUY01
+[timsaucer]: https://github.com/timsaucer
+
+### More Extensible SQL Planning with `RelationPlanner`
+
+DataFusion now has an API for extending the SQL planner for relations, as
+explained in the [Extending SQL in DataFusion Blog]. In addition to the existing
+expression and type extension points, this new API allows extending `FROM`
+clauses. Using these APIs it is straightforward to provide SQL support for
+almost any dialect, including vendor-specific syntax. Example use cases include:
+
+
+```sql
+-- Postgres-style JSON operators
+SELECT payload->'user'->>'id' FROM logs;
+-- MySQL-specific types
+SELECT DATETIME '2001-01-01 18:00:00';
+-- Statistical sampling
+SELECT * FROM sensor_data TABLESAMPLE BERNOULLI(10 PERCENT);
+```
+[Extending SQL in DataFusion Blog]: https://datafusion.apache.org/blog/2026/01/12/extending-sql/
+
+Thanks to [geoffreyclaude] for implementing relation planner extensions, and to
+[theirix], [alamb], [NGA-TRAN], and [gabotechs] for reviews and feedback on the
+design. Related PRs: [#17843]
+
+[#17843]: https://github.com/apache/datafusion/pull/17843
+[geoffreyclaude]: https://github.com/geoffreyclaude
+[theirix]: https://github.com/theirix
+[alamb]: https://github.com/alamb
+[NGA-TRAN]: https://github.com/NGA-TRAN
+[gabotechs]: https://github.com/gabotechs
+
+### Expression Evaluation Pushdown to Scans
+
+DataFusion now pushes down expression evaluation into TableProviders using
+[PhysicalExprAdapter], replacing the older SchemaAdapter approach ([#14993],
+[#16800]). Predicates and expressions can now be customized for each
+individual file schema, opening up additional optimizations such as support for
+[Variant shredding]. Thanks to [adriangb] for implementing PhysicalExprAdapter
+and reworking pushdown to use it. Related PRs: [#18998], [#19345]
+
+[#14993]: https://github.com/apache/datafusion/issues/14993
+[#16800]: https://github.com/apache/datafusion/issues/16800
+[#18998]: https://github.com/apache/datafusion/pull/18998
+[#19345]: https://github.com/apache/datafusion/pull/19345
+[kosiew]: https://github.com/kosiew
+[Variant shredding]: https://github.com/apache/datafusion/issues/16116
+[PhysicalExprAdapter]: https://docs.rs/datafusion/52.0.0/datafusion/physical_expr_adapter/trait.PhysicalExprAdapter.html
+
+### Sort Pushdown to Scans
+
+DataFusion can now push sorts into data sources ([#10433], [#19064]).
+This allows table provider implementations to optimize based on
+sort knowledge for certain query patterns. For example, the provided Parquet
+data source now reverses the scan order of row groups and files when queried
+for the opposite of the file's natural sort (e.g. `DESC` when the files are sorted `ASC`).
+This reversal, combined with dynamic filtering, allows top-K queries with `LIMIT`
+on pre-sorted data to find the requested rows very quickly, pruning more files and row groups
+without even scanning them. We have seen a ~30x performance improvement on
+benchmark queries with pre-sorted data.
+Thanks to [zhuqi-lucas] and [xudong963] for this feature, with reviews from
+[martin-g], [adriangb], and [alamb].
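
To show why the reversed scan order matters, here is a small Python model of `ORDER BY value DESC LIMIT k` over data stored in ascending order. It is a sketch under stated assumptions rather than the Parquet data source itself: row groups are non-empty ascending lists, the last element of each list plays the role of its max statistic, and `top_k_desc` and `reverse_scan` are hypothetical names.

```python
import heapq


def top_k_desc(row_groups, k, reverse_scan=True):
    """Return (k largest values in descending order, row groups actually read).

    `row_groups` is a list of non-empty ascending lists, ordered ascending.
    """
    if k <= 0:
        return [], 0
    heap = []  # min-heap holding the k largest values seen so far
    read = 0
    order = reversed(row_groups) if reverse_scan else row_groups
    for group in order:
        # Dynamic filter `value > heap[0]`: once k rows are held, a row group
        # whose maximum cannot beat the current threshold is skipped unread.
        if len(heap) == k and group[-1] <= heap[0]:
            continue
        read += 1
        for v in group:
            if len(heap) < k:
                heapq.heappush(heap, v)
            elif v > heap[0]:
                heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True), read
```

With three row groups and `k = 2`, the reversed scan reads one row group and prunes the other two, while the forward scan has to read all three because the threshold only catches up at the end.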
+
+[#10433]: https://github.com/apache/datafusion/issues/10433
+[#19064]: https://github.com/apache/datafusion/pull/19064
+[zhuqi-lucas]: https://github.com/zhuqi-lucas
+[xudong963]: https://github.com/xudong963
+
+### `TableProvider` Supports `DELETE` and `UPDATE` Statements
+
+The [TableProvider] trait now includes hooks for `DELETE` and `UPDATE`
+statements and the basic MemTable implements them ([#19142]). This lets
+downstream implementations and storage engines plug in their own mutation logic.
+See [TableProvider::delete_from] and [TableProvider::update] for more details.
+
+[TableProvider]: https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html
+[TableProvider::delete_from]: https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html#method.delete_from
+[TableProvider::update]: https://docs.rs/datafusion/52.0.0/datafusion/datasource/trait.TableProvider.html#method.update
+
+Example:
+
+```sql
+DELETE FROM mem_table WHERE status = 'obsolete';
+```
+
+Thanks to [ethan-tyler] for the implementation and [alamb] and [adriangb] for
+reviews.
+
+[#19142]: https://github.com/apache/datafusion/pull/19142
+[ethan-tyler]: https://github.com/ethan-tyler
+
+### `CoalesceBatchesExec` Removed
+
+The standalone `CoalesceBatchesExec` operator existed to ensure batches were
+large enough for subsequent vectorized execution, and was inserted after
+filter-like operators such as `FilterExec`, `HashJoinExec`, and
+`RepartitionExec`. However, using a separate operator also blocked other
+optimizations such as pushing `LIMIT` through joins and made optimizer rules
+more complex. In this release, we integrated the coalescing into the operators
+themselves ([#18779]) using Arrow's [coalesce kernel]. This reduces plan
+complexity while keeping batch sizes efficient, and allows additional focused
+optimization work in the Arrow kernel, such as [Dandandan]'s recent work with
+filtering in [arrow-rs/#8951].
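
The buffering idea behind coalescing can be sketched in Python as follows. This is not Arrow's coalesce kernel or its API: batches are modeled as plain lists, and `coalesce` and `target_size` are hypothetical names chosen for the illustration.

```python
def coalesce(batches, target_size):
    """Yield batches of exactly `target_size` rows, plus a final remainder."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    buffer = []
    for batch in batches:
        # Small (or empty) input batches accumulate in the buffer.
        buffer.extend(batch)
        # Emit as many full-sized batches as the buffer can supply.
        while len(buffer) >= target_size:
            yield buffer[:target_size]
            buffer = buffer[target_size:]
    if buffer:
        # Flush whatever is left once the input is exhausted.
        yield buffer
```

An operator that applies this logic to its own output, for example a filter emitting many tiny batches, hands full-sized batches to the next operator without a separate plan node in between.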
+
+Related PRs: [#18540], [#18604], [#18630], [#18972], [#19002], [#19342], [#19239].
+Thanks to [Tim-53], [Dandandan], [jizezhang], and [feniljain] for implementing
+this feature, with reviews from [Jefffrey], [alamb], [martin-g],
+[geoffreyclaude], [milenkovicm], and [jizezhang].
+
+[#18779]: https://github.com/apache/datafusion/issues/18779
+[#18540]: https://github.com/apache/datafusion/pull/18540
+[#18604]: https://github.com/apache/datafusion/pull/18604
+[#18630]: https://github.com/apache/datafusion/pull/18630
+[#18972]: https://github.com/apache/datafusion/pull/18972
+[#19002]: https://github.com/apache/datafusion/pull/19002
+[#19342]: https://github.com/apache/datafusion/pull/19342
+[#19239]: https://github.com/apache/datafusion/pull/19239
+[Tim-53]: https://github.com/Tim-53
+[Dandandan]: https://github.com/Dandandan
+[jizezhang]: https://github.com/jizezhang
+[feniljain]: https://github.com/feniljain
+[milenkovicm]: https://github.com/milenkovicm
+[coalesce kernel]: https://docs.rs/arrow/57.2.0/arrow/compute/kernels/coalesce/
+[arrow-rs/#8951]: https://github.com/apache/arrow-rs/pull/8951
+
+## Upgrade Guide and Changelog
+
+As always, upgrading to 52.0.0 should be straightforward for most users. Please
+review the [Upgrade Guide] for details on breaking changes and code snippets to
+help with the transition.
+For a comprehensive list of all changes, please refer to the [changelog].
+
+## About DataFusion
+
+[Apache DataFusion] is an extensible query engine, written in [Rust], that uses
+[Apache Arrow] as its in-memory format. DataFusion is used by developers to
+create new, fast, data-centric systems such as databases, dataframe libraries,
+and machine learning and streaming applications. While [DataFusion's primary
+design goal] is to accelerate the creation of other data-centric systems, it
+provides a reasonable experience directly out of the box as a [dataframe
+library], [Python library], and [command-line SQL tool].
+
+[apache datafusion]: https://datafusion.apache.org/
+[rust]: https://www.rust-lang.org/
+[apache arrow]: https://arrow.apache.org
+[DataFusion's primary design goal]: https://datafusion.apache.org/user-guide/introduction.html#project-goals
+[dataframe library]: https://datafusion.apache.org/user-guide/dataframe.html
+[python library]: https://datafusion.apache.org/python/
+[command-line SQL tool]: https://datafusion.apache.org/user-guide/cli/
+[Upgrade Guide]: https://datafusion.apache.org/library-user-guide/upgrading.html
+
+## How to Get Involved
+
+DataFusion is not a project built or driven by a single person, company, or
+foundation. Rather, our community of users and contributors works together to
+build a shared technology that none of us could have built alone.
+
+If you are interested in joining us, we would love to have you. You can try out
+DataFusion on some of your own data and projects and let us know how it goes,
+contribute suggestions, documentation, bug reports, or a PR with documentation,
+tests, or code. A list of open issues suitable for beginners is [here], and you
+can find out how to reach us on the [communication doc].
+
+[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
+[communication doc]: https://datafusion.apache.org/contributor-guide/communication.html