kevinjqliu commented on code in PR #171:
URL: https://github.com/apache/datafusion-site/pull/171#discussion_r3213683674
##########
content/blog/2026-05-07-datafusion-comet-0.16.0.md:
##########
@@ -0,0 +1,247 @@
+---
+layout: post
+title: Apache DataFusion Comet 0.16.0 Release
+date: 2026-05-07
+author: pmc
+categories: [subprojects]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+[TOC]
+
+The Apache DataFusion PMC is pleased to announce version 0.16.0 of the
[Comet](https://datafusion.apache.org/comet/) subproject.
+
+Comet is an accelerator for Apache Spark that translates Spark physical plans
to DataFusion physical plans for
+improved performance and efficiency without requiring any code changes.
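Since no code changes are required, enabling Comet is purely a matter of Spark configuration. A minimal sketch of a `spark-defaults.conf` fragment, assuming the 0.16.0 artifact for Spark 4.1 / Scala 2.13 (the jar path is a placeholder; see the Comet installation guide for the authoritative settings):

```properties
# Illustrative configuration for enabling Comet; the jar path is a placeholder.
spark.jars=/path/to/comet-spark-spark4.1_2.13-0.16.0.jar
spark.plugins=org.apache.spark.CometPlugin
spark.comet.enabled=true
# Comet's native shuffle (optional, but recommended for full acceleration).
spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
spark.comet.exec.shuffle.enabled=true
```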
+
+This release covers approximately three weeks of development work and is the
result of merging 115 PRs from 17
+contributors. See the [change log] for more information.
+
+[change log]:
https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.16.0.md
+
+## Expanded Spark 4 Support
+
+Spark 4 is a major theme of this release. Comet now ships first-class support
for both Spark 4.0.2 and
+Spark 4.1.1, with dedicated Maven profiles, shim sources, and CI matrices for
each.
+
+- **Spark 4.1.1**: New `spark-4.1` Maven profile and shim sources, with
Comet's PR test matrix and Spark SQL
+ test suites enabled against Spark 4.1.1. The default Maven profile has been
updated to Spark 4.1 / Scala 2.13
+ to reflect that this is now the primary development target.
+- **Shared 4.x shims**: Identical pieces of the Spark 4.0 and 4.1 shims have
been consolidated into a shared
+ `spark-4.x` source tree, reducing duplication as more 4.x minor versions
land.
+- **Spark 4.0 / JDK 21**: Added a Spark 4.0 / JDK 21 CI profile to validate
Comet on the JDK most users are
+ expected to deploy with Spark 4.
+
+### Adapting to Spark 4 Behavior Changes
+
+Spark 4 introduced a number of type, planner, and on-disk format changes
relative to Spark 3.x. Several
+correctness fixes in this release bring Comet's behavior in line with these
changes:
+
+- **`Variant` type (new in Spark 4.0)**: Spark 4.0 added a new `Variant` data
type for semi-structured
+ data. Comet does not yet read the shredded Variant on-disk format natively,
and delegates these scans
+ to Spark.
+- **String collation (new in Spark 4.0)**: Spark 4.0 added collation support
for `StringType`. Comet's
+ native operators do not yet implement non-default collations, so hash join
and sort-merge join reject
+ collated string join keys, and shuffle, sort, and aggregate fall back to
Spark when keys carry a
+ non-default collation.
+- **Wider `TimestampNTZType` usage**: Spark 4 uses `TimestampNTZType`
(timestamp without time zone) in
+ more places than 3.x — for example, in expression return types and as the
inferred type for some
+ literal forms. Comet adds support this cycle for cast to and from
`timestamp_ntz`, cast from string to
+ `timestamp_ntz`, and `unix_timestamp` over `TimestampNTZType` inputs.
+- **`to_json` and `array_compact` (Spark 4.0)**: Spark 4.0 adjusted output
formatting and return-type
+ metadata for these expressions; Comet now matches the new behavior.
+- **BloomFilter V2 (new in Spark 4.1)**: Spark 4.1 introduced a new
BloomFilter binary format with
+ different bit-scattering. Comet now reads this format so that runtime
filters produced by Spark 4.1
+ remain usable in native execution.
+- **Spark 4.1.1 analyzer refinements**: Spark 4.1.1 changed how struct
projections handle the case where
+ every requested child field is missing from the Parquet file, and how
`allowDecimalPrecisionLoss`
+ flows through the `DecimalPrecision` rule. Comet now preserves parent-struct
nullness in the first
+ case and the stored `allowDecimalPrecisionLoss` flag in the second.
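To make the `TimestampNTZType` item concrete, here is an illustrative Spark SQL snippet (not taken from the release notes) exercising the casts and `unix_timestamp` path that now stay on Comet's native execution path instead of falling back:

```sql
-- Casts to and from TIMESTAMP_NTZ, cast from string, and unix_timestamp
-- over TIMESTAMP_NTZ inputs are now supported natively.
SELECT CAST('2026-05-07 12:34:56' AS TIMESTAMP_NTZ)            AS ts_ntz,
       CAST(TIMESTAMP_NTZ '2026-05-07 12:34:56' AS STRING)     AS ts_str,
       unix_timestamp(TIMESTAMP_NTZ '2026-05-07 12:34:56')     AS epoch_s;
```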
+
+Most of these behavior differences were caught because **Comet runs the full
Apache Spark SQL test suite
+against each supported Spark version** — 3.4.3, 3.5.8, 4.0.2, and 4.1.1 — as
part of CI. Running Spark's
+own correctness tests through Comet's native execution path is what surfaces
semantic shifts like
+`TimestampNTZType` propagation, ANSI-driven cast and overflow changes,
BloomFilter V2 encoding, and the
+4.1.1 analyzer rule changes, often before they show up in user workloads. As
more Spark 4.x minor releases
+land, this same harness is what gives us confidence that Comet keeps up.
+
+### ANSI SQL Semantics
+
+Spark 4 enables ANSI SQL semantics by default. ANSI mode changes how
arithmetic overflow, invalid casts,
+division by zero, and similar error conditions are handled, and Spark itself
now treats this as the standard
+configuration rather than an opt-in.
+
+This is a critical area for any Spark accelerator: an engine that falls back
to vanilla Spark whenever ANSI is
+enabled effectively does not run on Spark 4 by default. **Comet implements
ANSI semantics for the expressions
+it supports natively**, including arithmetic overflow checks, ANSI cast
behavior, and `try_*` variants.
+Queries running with `spark.sql.ansi.enabled=true` continue to be accelerated
rather than falling back.
+
+See the [Comet Compatibility Guide] for details on which expressions have full
ANSI coverage.
+
+[Comet Compatibility Guide]:
https://datafusion.apache.org/comet/user-guide/latest/compatibility/index.html
+
+## Expanded Adaptive Execution Support
+
+Modern Spark plans are adaptive: AQE re-plans stages at runtime, Dynamic
Partition Pruning (DPP) prunes
+fact-table partitions based on broadcast dimension filters, and
`ReuseExchange` and `ReuseSubquery` ensure
+that a broadcast or subquery referenced in multiple places executes only once.
For star-schema workloads,
+these mechanisms are not optional. They are often the difference between a
query that reads 1% of the fact
+table and one that reads all of it.
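A star-schema query of the shape DPP targets, sketched against TPC-DS-style tables (illustrative; table and column names follow the TPC-DS schema):

```sql
-- store_sales is partitioned by ss_sold_date_sk; date_dim is a small dimension.
-- DPP derives the surviving ss_sold_date_sk values from the broadcast of the
-- filtered date_dim rows and skips every other fact-table partition at runtime.
SELECT d.d_year, SUM(s.ss_net_paid) AS total_paid
FROM store_sales s
JOIN date_dim d ON s.ss_sold_date_sk = d.d_date_sk
WHERE d.d_year = 2002
GROUP BY d.d_year;
```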
+
+Prior to 0.16.0, Comet's native scans only partially participated in this
machinery. `CometNativeScanExec`
+(the DataFusion-based native Parquet scan) fell back to Spark entirely
whenever a DPP filter was present.
+`CometIcebergNativeScanExec` supported non-AQE DPP as of 0.15.0
+([#3349](https://github.com/apache/datafusion-comet/pull/3349)), but without
broadcast exchange reuse, so
+the DPP subquery re-executed the dimension broadcast.
+
+Comet 0.16.0 closes both gaps and aligns the native Parquet and native Iceberg
scans on a single DPP and
+subquery-resolution path:
+
+- **Non-AQE DPP for native Parquet, with broadcast exchange reuse**
+ ([#4011](https://github.com/apache/datafusion-comet/pull/4011),
+ [#4037](https://github.com/apache/datafusion-comet/pull/4037)): A new
`CometSubqueryBroadcastExec` replaces
+ Spark's `SubqueryBroadcastExec` in DPP expressions and wraps a
`CometBroadcastExchangeExec`, so
+ `ReuseExchangeAndSubquery` matches the join side and the DPP subquery and
broadcasts the dimension exactly
+ once.
+- **AQE DPP for native Parquet**
([#4112](https://github.com/apache/datafusion-comet/pull/4112)): Under AQE,
+ Spark's `PlanAdaptiveDynamicPruningFilters` cannot match Comet's broadcast
hash join and would otherwise
+ rewrite DPP to `TrueLiteral`, disabling pruning. 0.16.0 intercepts
`SubqueryAdaptiveBroadcastExec` before
+ Spark's rule runs, and applies Spark's decision tree in a Comet-aware rule
that searches both the current
+ stage and the root plan for a reusable broadcast. DPP subqueries are
registered in AQE's shared
+ `subqueryCache` so cross-plan DPP (for example, a main query and a scalar
subquery referencing the same
+ dimension) deduplicates correctly. A narrower tagging-based fallback covers
Spark 3.4, which lacks the
+ `injectQueryStageOptimizerRule` extension point.
+- **AQE DPP broadcast reuse for native Iceberg**
+ ([#4215](https://github.com/apache/datafusion-comet/pull/4215)): Lifts
`runtimeFilters` to a top-level
+ constructor field on `CometIcebergNativeScanExec` (mirroring
`BatchScanExec`), so Spark's
+ expression-rewrite passes can see and convert the DPP subquery. The same
`CometSubqueryBroadcastExec`
+ machinery from the Parquet path now handles the Iceberg case.
+- **Scalar subquery pushdown and AQE subquery reuse**
+ ([#4053](https://github.com/apache/datafusion-comet/pull/4053),
+ [SPARK-43402](https://issues.apache.org/jira/browse/SPARK-43402)):
`CometNativeScanExec` now participates
+ in scalar subquery pushdown into Parquet data filters, and in AQE-time
subquery deduplication via a new
+ `CometReuseSubquery` rule that re-applies Spark's `ReuseAdaptiveSubquery`
algorithm after Comet's node
+ replacements.
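For readers who want to try the DataFusion-based native Parquet scan discussed above, it corresponds to Comet's `native_datafusion` scan implementation. A hedged sketch of the relevant setting (the option name and values are taken from the Comet user guide; consult it for the current defaults):

```properties
# Select the DataFusion-based native Parquet scan (CometNativeScanExec).
spark.comet.scan.impl=native_datafusion
```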
+
+**Measured impact on TPC-DS:** 78 queries plan DPP filters and previously fell
back to Spark for those scans, running only 30–50% of the plan natively. With
native DPP in 0.16.0, the same queries run 80–97% natively. Representative
examples:
+
+| Query | Before | After |
+|-------|--------|-------|
+| q1 | 36% | 96% |
+| q4 | 31% | 95% |
+| q31 | 31% | 95% |
+| q74 | 32% | 95% |
+| q92 | 36% | 95% |
+
+Several Spark SQL DPP tests that Comet previously skipped have been re-enabled
to verify Spark compatibility
+and prevent regressions.
+
+## Improved TPC-DS Benchmark Results
+
+TODO: side-by-side TPC-DS benchmark results comparing 0.15.0 and 0.16.0.
Review Comment:
flagging the TODO here
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]