Re: [PR] Add blog post for DataFusion 51.0.0 [datafusion-site]

via GitHub Wed, 19 Nov 2025 16:38:11 -0800


2010YOUY01 commented on code in PR #124:
URL: https://github.com/apache/datafusion-site/pull/124#discussion_r2543966730



##########
content/blog/2025-11-25-datafusion-51.0.0.md:
##########
@@ -0,0 +1,321 @@
+---
+layout: post
+title: Apache DataFusion 51.0.0 Released
+date: 2025-11-25
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+## Introduction
+
+We are proud to announce the release of [DataFusion 51.0.0]. This post highlights
+some of the major improvements since [DataFusion 50.0.0]. The complete list of
+changes is available in the [changelog]. Thanks to the [128 contributors] for
+making this release possible.
+
+[DataFusion 51.0.0]: https://crates.io/crates/datafusion/51.0.0
+[DataFusion 50.0.0]: https://datafusion.apache.org/blog/2025/09/29/datafusion-50.0.0/
+[changelog]: https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md
+[128 contributors]: https://github.com/apache/datafusion/blob/branch-51/dev/changelog/51.0.0.md#credits
+
+## Performance Improvements 🚀
+
+<img
+src="/blog/images/datafusion-51.0.0/performance_over_time_clickbench.png"
+width="100%"
+class="img-responsive"
+alt="Performance over time"
+/>
+
+TODO: update this image
+
+
+### Faster `CASE` expression evaluation
+
+This release builds on the [CASE performance epic] with significant improvements.
+Expressions short‑circuit earlier, reuse partial results, and avoid unnecessary
+scattering, speeding up common ETL patterns. Thanks to [pepijnve], [chenkovsky]
+and [petern48] for leading this effort. We hope to share more details on our
+implementation in a future post.
+
+[pepijnve]: https://github.com/pepijnve
+[chenkovsky]: https://github.com/chenkovsky
+[petern48]: https://github.com/petern48
+
+**Fewer object store round-trips for Parquet by Default**
+
+DataFusion now sets a default `metadata_size_hint` for Parquet scans
+([#18118]), avoiding the extra
+“last 8‑byte” request many clouds require to read file footers. Remote scans
+typically drop from five requests to four per file, cutting latency and transfer
+costs without any application changes. Thanks to [zhuqi-lucas] for leading this
+effort.
+
+### Faster Parquet metadata parsing
+
+DataFusion 51 also includes the latest Parquet reader improvements from
+[Arrow Rust 57.0.0], delivering faster Parquet metadata parsing. This is
+especially beneficial for workloads with many small Parquet files and scenarios
+where startup time or low latency is important. Thanks to [etseidl] and
+[jhorstmann] for leading this upstream work.
+
+<img 
+  src="/blog/images/datafusion-51.0.0/arrow-57-metadata-parsing.png"
+  width="100%" 
+  class="img-responsive" 
+  alt="Metadata Parsing Performance Improvements in Arrow/Parquet 57" 
+/>
+
+
+[Arrow Rust 57.0.0]: https://arrow.apache.org/blog/2025/10/30/arrow-rs-57.0.0/
+
+### Better Defaults for Remote Parquet Reads
+
+DataFusion by default now fetches the last 512KB (configurable) of Parquet files
+so the first request usually includes the full footer ([#18118]). This will
+typically avoid two distinct I/O requests for each Parquet file. While this
+setting has existed in DataFusion for many years, it was not previously enabled
+by default. Users can tune the number of bytes fetched in the initial I/O
+request via the `datafusion.execution.parquet.metadata_size_hint` [config setting]. Thanks to
+[zhuqi-lucas] for leading this effort.
+
+[config setting]: https://datafusion.apache.org/user-guide/configs.html
+
+
+## New Features ✨
+
+### Decimal32/Decimal64 support
+
+The new Arrow types `Decimal32` and `Decimal64` are now supported in DataFusion
+([#17501]), including aggregations such as `SUM`, `AVG`, `MIN/MAX`, and window
+functions. Thanks to [AdamGS] for leading this effort.
+
+
+### SQL Pipe Operators
+
+DataFusion now supports the SQL pipe operator syntax
+([#17278]), enabling inline transforms such as:
+
+```sql
+SELECT * FROM t
+|> WHERE a > 10
+|> ORDER BY b
+|> LIMIT 5;
+```
+
+This syntax, [popularized by Google BigQuery], keeps multi-step transformations concise while preserving regular
+SQL semantics. Thanks to [simonvandel] for leading this effort.
+
+[popularized by Google BigQuery]: https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/pipe-syntax
+
+### I/O Profiling in `datafusion-cli`
+
+[datafusion-cli] now has built-in instrumentation to trace object store calls
+([#17207]). Toggle profiling
+with the [\object_store_profiling command] and inspect the exact `GET`/`LIST` requests issued during
+query execution:
+
+[datafusion-cli]: https://datafusion.apache.org/user-guide/cli/
+[\object_store_profiling command]: https://datafusion.apache.org/user-guide/cli/usage.html#commands
+
+```sql
+DataFusion CLI v51.0.0
+> \object_store_profiling trace
+ObjectStore Profile mode set to Trace
+> select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
++----------+
+| count(*) |
++----------+
+| 1000000  |
++----------+
+1 row(s) fetched.
+Elapsed 0.367 seconds.
+
+Object Store Profiling
+Instrumented Object Store: instrument_mode: Trace, inner: HttpStore
+2025-11-19T21:10:43.476121+00:00 operation=Head duration=0.069763s path=hits_compatible/athena_partitioned/hits_1.parquet
+2025-11-19T21:10:43.545903+00:00 operation=Head duration=0.025859s path=hits_compatible/athena_partitioned/hits_1.parquet
+2025-11-19T21:10:43.571768+00:00 operation=Head duration=0.025684s path=hits_compatible/athena_partitioned/hits_1.parquet
+2025-11-19T21:10:43.597463+00:00 operation=Get duration=0.034194s size=524288 range: bytes=174440756-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet
+2025-11-19T21:10:43.705821+00:00 operation=Head duration=0.022029s path=hits_compatible/athena_partitioned/hits_1.parquet
+
+Summaries:
++-----------+----------+-----------+-----------+-----------+-----------+-------+
+| Operation | Metric   | min       | max       | avg       | sum       | count |
++-----------+----------+-----------+-----------+-----------+-----------+-------+
+| Get       | duration | 0.034194s | 0.034194s | 0.034194s | 0.034194s | 1     |
+| Get       | size     | 524288 B  | 524288 B  | 524288 B  | 524288 B  | 1     |
+| Head      | duration | 0.022029s | 0.069763s | 0.035834s | 0.143335s | 4     |
+| Head      | size     |           |           |           |           | 4     |
++-----------+----------+-----------+-----------+-----------+-----------+-------+
+```
+
+This makes it far easier to diagnose slow remote scans and validate caching
+strategies. Thanks to [BlakeOrth] for leading this effort.
+
+### `DESCRIBE <query>`
+
+`DESCRIBE` now works on arbitrary queries, returning the schema instead
+of being an alias for `EXPLAIN` ([#18234](https://github.com/apache/datafusion/issues/18234)). This brings DataFusion in line with engines
+like DuckDB and makes it easy to inspect the output schema of queries
+without executing them.
+
+
+For example:
+
+```sql
+DataFusion CLI v51.0.0
+> create table t(a int, b varchar, c float) as values (1, 'a', 2.0);
+0 row(s) fetched.
+Elapsed 0.002 seconds.
+
+> DESCRIBE SELECT a, b, SUM(c) FROM t GROUP BY a, b;
+
++-------------+-----------+-------------+
+| column_name | data_type | is_nullable |
++-------------+-----------+-------------+
+| a           | Int32     | YES         |
+| b           | Utf8View  | YES         |
+| sum(t.c)    | Float64   | YES         |
++-------------+-----------+-------------+
+3 row(s) fetched.
+```
+
+
+### Named arguments in SQL functions
+
+DataFusion now understands [PostgreSQL-style named arguments] (`param => value`)
+for scalar, aggregate, and window functions ([#17379](https://github.com/apache/datafusion/issues/17379)). You can mix positional and named
+arguments in any order, and error messages now list parameter names to make
+diagnostics clearer. UDF authors can also expose parameter names so their
+functions benefit from the same syntax.
+
+[PostgreSQL-style named arguments]: https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html
+
+For example, you can pass arguments to functions like this:
+```sql
+SELECT power(exponent => 3.0, base => 2.0);
+```
+
+### Metrics improvement
+
+The output of [EXPLAIN ANALYZE] has been improved to include more metrics
+about execution time and memory usage of each operator in the query plan.
+Read about these new metrics in the [metrics user guide].
+

Review Comment:
   ```suggestion
   
   The `51.0.0` release adds:
   
   - **Configuration**: adds a new option `datafusion.explain.analyze_level`, which can be set to `summary` for a concise output or `dev` for the full set of metrics (the previous default).
   - **For all major operators**: adds `output_bytes`, reporting how many bytes of data each operator produces.
   - **FilterExec**: adds a `selectivity` metric (`output_rows / input_rows`) to show how effective the filter is.
   - **AggregateExec**: 
     - adds detailed timing metrics for group-ID computation, aggregate argument evaluation, aggregation work, and emitting final results.
     - adds a `reduction_factor` metric (`output_rows / input_rows`) to show how much grouping reduces the data.
   - **NestedLoopJoinExec**: adds a `selectivity` metric (`output_rows / (left_rows * right_rows)`) to show how many combinations actually pass the join condition.
   - Several display formatting improvements were added to make `EXPLAIN ANALYZE` output easier to read.
   
   ```
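
   As a rough illustration of the `datafusion.explain.analyze_level` option described in the suggestion, the session below is a minimal sketch. The table `t` and the query are hypothetical stand-ins chosen only for illustration, and the metrics printed depend on the plan, so no output is shown.

   ```sql
   -- Hypothetical table used only for illustration.
   CREATE TABLE t(a INT, b VARCHAR, c FLOAT) AS VALUES (1, 'a', 2.0);

   -- Concise output: request the summary level of metrics.
   SET datafusion.explain.analyze_level = 'summary';
   EXPLAIN ANALYZE SELECT a, SUM(c) FROM t WHERE a > 0 GROUP BY a;

   -- Full set of metrics (described above as the previous default).
   SET datafusion.explain.analyze_level = 'dev';
   EXPLAIN ANALYZE SELECT a, SUM(c) FROM t WHERE a > 0 GROUP BY a;
   ```

   Running the same query at both levels may make it easier to see which of the listed metrics are treated as summary-level and which only appear at the `dev` level.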


