Re: [PR] DataFusion 52 release post [datafusion-site]

via GitHub Sat, 24 Jan 2026 06:33:46 -0800


zhuqi-lucas commented on code in PR #135:
URL: https://github.com/apache/datafusion-site/pull/135#discussion_r2724213539



##########
content/blog/2026-01-08-datafusion-52.0.0.md:
##########
@@ -0,0 +1,405 @@
+---
+layout: post
+title: Apache DataFusion 52.0.0 Released
+date: 2026-01-08
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+We are proud to announce the release of [DataFusion 52.0.0]. This post highlights
+some of the major improvements since [DataFusion 51.0.0]. The complete list of
+changes is available in the [changelog]. Thanks to the [121 contributors] for
+making this release possible.
+
+
+[DataFusion 52.0.0]: https://crates.io/crates/datafusion/52.0.0
+[DataFusion 51.0.0]: https://datafusion.apache.org/blog/2025/11/25/datafusion-51.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md
+[121 contributors]: https://github.com/apache/datafusion/blob/branch-52/dev/changelog/52.0.0.md#credits
+
+## Performance Improvements 🚀
+
+We continue to make significant performance improvements in DataFusion, as explained below.
+
+### Faster `CASE` Expressions
+
+DataFusion 52 adds lookup-table-based evaluation for certain `CASE` expressions,
+which avoids repeated evaluation and accelerates common ETL patterns such as:
+
+```sql
+CASE company
+    WHEN 1 THEN 'Apple'
+    WHEN 5 THEN 'Samsung'
+    WHEN 2 THEN 'Motorola'
+    WHEN 3 THEN 'LG'
+    ELSE 'Other'
+END
+```
+
+This is the final piece of work in our `CASE` performance epic ([#18075]), which
+has improved `CASE` evaluation significantly. Related PRs: [#18183]. Thanks to
+[rluvaton] and [pepijnve] for the implementation.
+
+[rluvaton]: https://github.com/rluvaton
+[pepijnve]: https://github.com/pepijnve
+
+
+[#18075]: https://github.com/apache/datafusion/issues/18075
+[#18183]: https://github.com/apache/datafusion/pull/18183
+
+### `MIN`/`MAX` Aggregate Dynamic Filters
+
+DataFusion now creates dynamic filters for queries with `MIN`/`MAX` aggregates
+that have filters, but no `GROUP BY`. These dynamic filters are used during scan
+to prune files and rows as tighter bounds are discovered during execution, as
+explained in the [Dynamic Filtering blog]. For example, the following query:
+
+```sql
+SELECT min(l_shipdate)
+FROM lineitem
+WHERE l_returnflag = 'R';
+```
+
+is now executed as if it were written:
+
+```sql
+SELECT min(l_shipdate)
+FROM lineitem
+--  '__current_min' is updated dynamically during execution
+WHERE l_returnflag = 'R' AND l_shipdate < __current_min;
+```
+
+
+Thanks to [2010YOUY01] for implementing this feature, with reviews from
+[martin-g], [adriangb], and [LiaCastaneda]. Related PRs: [#18644]
+
+[#18644]: https://github.com/apache/datafusion/pull/18644
+[2010YOUY01]: https://github.com/2010YOUY01
+[martin-g]: https://github.com/martin-g
+[adriangb]: https://github.com/adriangb
+[LiaCastaneda]: https://github.com/LiaCastaneda
+
+### New Merge Join
+
+DataFusion 52 includes a rewrite of the sort-merge join (SMJ) operator, with
+speedups of three orders of magnitude in some pathological cases such as the
+case in [#18487], which also affected [Apache Comet] workloads. Benchmarks in
+[#18875] show dramatic gains for TPC-H Q21 (minutes to milliseconds) while
+leaving other queries unchanged or modestly faster. Thanks to [mbutrovich] for
+the implementation and reviews from [Dandandan].
+
+[#18487]: https://github.com/apache/datafusion/issues/18487
+[#18875]: https://github.com/apache/datafusion/pull/18875
+[Apache Comet]: https://datafusion.apache.org/comet/
+[mbutrovich]: https://github.com/mbutrovich
+
+
+### Caching Improvements
+
+This release also includes several additional caching improvements.
+
+A new statistics cache for file metadata avoids repeatedly recalculating
+statistics for files. This significantly improves planning time
+for certain queries. You can see the contents of the new cache using the
+[statistics_cache] function in the CLI:
+
+[statistics_cache]: https://datafusion.apache.org/user-guide/cli/functions.html#statistics-cache
+
+
+```sql
+select * from statistics_cache();
++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
+| path             | file_modified       | file_size_bytes | e_tag                  | version | num_rows        | num_columns | table_size_bytes   | statistics_size_bytes |
++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
+| .../hits.parquet | 2022-06-25T22:22:22 | 14779976446     | 0-5e24d1ee16380-370f48 | NULL    | Exact(99997497) | 105         | Exact(36445943240) | 0                     |
++------------------+---------------------+-----------------+------------------------+---------+-----------------+-------------+--------------------+-----------------------+
+```
+Thanks to [bharath-techie] and [nuno-faria] for implementing the statistics cache,
+with reviews from [martin-g], [alamb], and [alchemist51].
+Related PRs: [#18971], [#19054]
+
+[#18971]: https://github.com/apache/datafusion/pull/18971
+[#19054]: https://github.com/apache/datafusion/pull/19054
+[bharath-techie]: https://github.com/bharath-techie
+[nuno-faria]: https://github.com/nuno-faria
+[martin-g]: https://github.com/martin-g
+[alchemist51]: https://github.com/alchemist51
+
+
+A prefix-aware list-files cache accelerates evaluating partition predicates for
+Hive partitioned tables.
+
+```sql
+-- Read the hive partitioned dataset from Overture Maps (100s of Parquet files)
+CREATE EXTERNAL TABLE overturemaps
+STORED AS PARQUET LOCATION 's3://overturemaps-us-west-2/release/2025-12-17.0/';
+-- Find all files where the path contains `theme=base` without requiring another LIST call
+select count(*) from overturemaps where theme='base';
+```
+
+You can see the
+contents of the new cache using the [list_files_cache] function in the CLI:
+
+[list_files_cache]: https://datafusion.apache.org/user-guide/cli/functions.html#list-files-cache
+
+```sql
+create external table overturemaps
+stored as parquet
+location 's3://overturemaps-us-west-2/release/2025-12-17.0/theme=base/type=infrastructure';
+0 row(s) fetched.
+> select table, path, metadata_size_bytes, expires_in, unnest(metadata_list)['file_size_bytes'] as file_size_bytes, unnest(metadata_list)['e_tag'] as e_tag from list_files_cache() limit 10;
++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
+| table        | path                                                | metadata_size_bytes | expires_in                        | file_size_bytes | e_tag                                 |
++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 999055952       | "35fc8fbe8400960b54c66fbb408c48e8-60" |
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 975592768       | "8a16e10b722681cdc00242564b502965-59" |
+...
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 1016732378      | "6d70857a0473ed9ed3fc6e149814168b-61" |
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 991363784       | "c9cafb42fcbb413f851691c895dd7c2b-60" |
+| overturemaps | release/2025-12-17.0/theme=base/type=infrastructure | 2750                | 0 days 0 hours 0 mins 25.264 secs | 1032469715      | "7540252d0d67158297a67038a3365e0f-62" |
++--------------+-----------------------------------------------------+---------------------+-----------------------------------+-----------------+---------------------------------------+
+```
+
+Thanks to [BlakeOrth] and [Yuvraj-cyborg] for implementing the list-files cache,
+with reviews from [gabotechs], [alamb], [alchemist51], [martin-g], and [BlakeOrth].
+Related PRs: [#18146], [#18855], [#19366], [#19298]
+
+[Epic #17214]: https://github.com/apache/datafusion/issues/17214
+[#18146]: https://github.com/apache/datafusion/pull/18146
+[#18855]: https://github.com/apache/datafusion/pull/18855
+[#19366]: https://github.com/apache/datafusion/pull/19366
+[#19298]: https://github.com/apache/datafusion/pull/19298
+[BlakeOrth]: https://github.com/BlakeOrth
+[Yuvraj-cyborg]: https://github.com/Yuvraj-cyborg
+
+
+### Improved Hash Join Filter Pushdown
+
+Starting in DataFusion 51, filtering information from `HashJoinExec` is passed
+dynamically to scans, as explained in the [Dynamic Filtering blog], using a
+technique referred to as [Sideways Information Passing] in the database research
+literature. The initial implementation passed min/max values for the join keys.
+DataFusion 52 extends the optimization ([#17171] / [#18393]) to pass the
+contents of the build side hash map. These filters are evaluated on the probe
+side scan to prune files, row groups, and individual rows. When the build side
+contains `20` or fewer rows (configurable), the contents of the hash map are
+transformed to an `IN` expression and used for [statistics-based pruning], which
+can avoid reading entire files or row groups that contain no matching join keys.
+Thanks to [adriangb] for implementing this feature, with reviews from
+[LiaCastaneda], [asolimando], [comphead], and [mbutrovich].
+
+
+[Sideways Information Passing]: https://dl.acm.org/doi/10.1109/ICDE.2008.4497486
+[Dynamic Filtering blog]: https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/#hash-join-dynamic-filters
+[statistics-based pruning]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html
+
+[#17171]: https://github.com/apache/datafusion/issues/17171
+[#18393]: https://github.com/apache/datafusion/pull/18393
+[adriangb]: https://github.com/adriangb
+[LiaCastaneda]: https://github.com/LiaCastaneda
+[asolimando]: https://github.com/asolimando
+[comphead]: https://github.com/comphead
+
+
+## Major Features ✨
+
+### Arrow IPC Stream file support
+
+DataFusion can now read Arrow IPC stream files ([#18457]). This expands
+interoperability with systems that emit Arrow streams directly, making it
+simpler to ingest Arrow-native data without conversion. Thanks to [corasaurus-hex]
+for implementing this feature, with reviews from [martin-g], [Jefffrey],
+[jdcasale], [2010YOUY01], and [timsaucer].
+
+```sql
+CREATE EXTERNAL TABLE ipc_events
+STORED AS ARROW
+LOCATION 's3://bucket/events.arrow';
+```
+
+Related PRs: [#18457]
+
+[#18457]: https://github.com/apache/datafusion/pull/18457
+[corasaurus-hex]: https://github.com/corasaurus-hex
+[Jefffrey]: https://github.com/Jefffrey
+[jdcasale]: https://github.com/jdcasale
+[2010YOUY01]: https://github.com/2010YOUY01
+[timsaucer]: https://github.com/timsaucer
+
+### More Extensible SQL Planning with `RelationPlanner`
+
+DataFusion now has an API for extending the SQL planner for relations, as
+explained in the [Extending SQL in DataFusion Blog]. In addition to the existing
+expression and type extension points, this new API allows extending `FROM`
+clauses. Using these APIs it is straightforward to provide SQL support for
+almost any dialect, including vendor-specific syntax. Example use cases include:
+
+
+```sql
+-- Postgres-style JSON operators
+SELECT payload->'user'->>'id' FROM logs;
+-- MySQL-specific types
+SELECT DATETIME '2001-01-01 18:00:00';
+-- Statistical sampling
+SELECT * FROM sensor_data TABLESAMPLE BERNOULLI(10 PERCENT);
+```
+[Extending SQL in DataFusion Blog]: https://datafusion.apache.org/blog/2026/01/12/extending-sql/
+
+Thanks to [geoffreyclaude] for implementing relation planner extensions, and to
+[theirix], [alamb], [NGA-TRAN], and [gabotechs] for reviews and feedback on the
+design. Related PRs: [#17843]
+
+[#17843]: https://github.com/apache/datafusion/pull/17843
+[geoffreyclaude]: https://github.com/geoffreyclaude
+[theirix]: https://github.com/theirix
+[alamb]: https://github.com/alamb
+[NGA-TRAN]: https://github.com/NGA-TRAN
+[gabotechs]: https://github.com/gabotechs
+
+### Expression Evaluation Pushdown to Scans
+
+DataFusion now pushes down expression evaluation into TableProviders using
+[PhysicalExprAdapter], replacing the older SchemaAdapter approach ([#14993],
+[#16800]). Predicates and expressions can now be customized for each
+individual file schema, opening up additional optimizations such as support for
+[Variant shredding]. Thanks to [adriangb] for implementing PhysicalExprAdapter
+and reworking pushdown to use it. Related PRs: [#18998], [#19345]
+
+[#14993]: https://github.com/apache/datafusion/issues/14993
+[#16800]: https://github.com/apache/datafusion/issues/16800
+[#18998]: https://github.com/apache/datafusion/pull/18998
+[#19345]: https://github.com/apache/datafusion/pull/19345
+[kosiew]: https://github.com/kosiew
+[Variant shredding]: https://github.com/apache/datafusion/issues/16116
+[PhysicalExprAdapter]: https://docs.rs/datafusion/52.0.0/datafusion/physical_expr_adapter/trait.PhysicalExprAdapter.html
+
+### Sort Pushdown to Scans
+
+DataFusion can now push sorts into data sources ([#10433], [#19064]).
+This allows table provider implementations to take better advantage of existing sort
+information based on the query pattern, for example by reordering files or row
+groups to satisfy `LIMIT` clauses more
+efficiently. Thanks to [zhuqi-lucas] and [xudong963] for this feature.

Review Comment:
   We could add something like the text below, since the main goal of phase 1 of
   the PR is to speed up TopK on files stored in reverse order, but LGTM even if
   we don't change it.
   
   DataFusion can now push sort requirements into Parquet sources, enabling 
   **reverse row group and file scanning** when queries request data in the 
   opposite order of the file's natural sort. This allows TopK queries with 
   `LIMIT` to scan from the end of sorted files and terminate early, achieving 
   **~30x performance improvement** for such queries on pre-sorted data. 
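
To illustrate the idea behind the suggested wording, here is a minimal sketch of a reverse scan with early termination. It is not DataFusion code: the function name `top_k_desc_reverse_scan` and the list-of-lists "row group" layout are hypothetical, and it assumes the data is already sorted ascending both within and across row groups.

```python
from typing import List, Tuple


def top_k_desc_reverse_scan(row_groups: List[List[int]], k: int) -> Tuple[List[int], int]:
    """Return the k largest values in descending order, plus the number of
    row groups that had to be read.

    Assumption (hypothetical layout): each row group is sorted ascending and
    every value in group i is <= every value in group i + 1, so the largest
    values sit at the end of the last group.
    """
    if k <= 0:
        return [], 0

    result: List[int] = []
    groups_read = 0
    # Walk the row groups from last to first, i.e. in the order the query wants.
    for group in reversed(row_groups):
        groups_read += 1
        # Within a group, walk the rows from last to first as well.
        for value in reversed(group):
            result.append(value)
            if len(result) == k:
                # Early termination: earlier groups cannot hold larger values.
                return result, groups_read
    return result, groups_read


if __name__ == "__main__":
    groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    values, read = top_k_desc_reverse_scan(groups, 2)
    print(values, read)
```

With three row groups and `k = 2`, only the last row group is read; a forward scan would have to read all three before it could be sure of the two largest values.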
   
   


