Re: [PR] [SPARK-56944] [SQL] Trim aliases from ListAgg expression subtree [spark]

via GitHub Wed, 20 May 2026 08:45:48 -0700


cloud-fan commented on code in PR #55984:
URL: https://github.com/apache/spark/pull/55984#discussion_r3275268710



##########
sql/core/src/test/resources/sql-tests/inputs/listagg.sql:
##########
@@ -61,3 +61,15 @@ SELECT listagg(DISTINCT col1) WITHIN GROUP (ORDER BY col1, 
col2) FROM df;
 SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES 
(cast(1.1 as double)), (cast(2.2 as double)), (cast(2.2 as double)), (cast(3.3 
as double)) AS t(col);
 SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES 
(cast(1.0 as float)), (cast(2.0 as float)), (cast(2.0 as float)) AS t(col);
 SELECT listagg(DISTINCT col, ',') WITHIN GROUP (ORDER BY col) FROM VALUES 
(TIMESTAMP'2024-01-01 10:00:00'), (TIMESTAMP'2024-01-02 12:00:00'), 
(TIMESTAMP'2024-01-01 10:00:00') AS t(col);
+
+-- LISTAGG with semi-structured extract (parser wraps v:a in Alias with fresh 
ExprId)
+-- Tests that isOrderCompatible strips Alias wrappers before comparing via 
semanticEquals
+SELECT listagg(DISTINCT v:a::string, ',') WITHIN GROUP (ORDER BY v:a::string) 
FROM (SELECT parse_json('{"a": "x"}') v UNION ALL SELECT parse_json('{"a": 
"y"}') UNION ALL SELECT parse_json('{"a": "x"}'));

Review Comment:
   `SQLQueryTestSuite` only runs the legacy analyzer here, where 
`CleanupAliases` has already stripped the parser-introduced `Alias` before 
`CheckAnalysis` calls `isOrderCompatible`. So these golden files pass on master 
without the fix too — they don't actually exercise the single-pass path the PR 
is fixing.
   
   Consider adding a targeted single-pass unit test (e.g. in 
`AggregateExpressionResolverSuite` or alongside `ResolverRunner`) that 
constructs `LISTAGG(DISTINCT v:a::string) WITHIN GROUP (ORDER BY v:a::string)` 
and goes through `Resolver` directly, so the regression is pinned at the layer 
where it actually manifested.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/collect.scala:
##########
@@ -588,7 +589,8 @@ case class ListAgg(
     if (someOrder.isEmpty) {
       return true
     }
-    if (someOrder.size == 1 && someOrder.head.child.semanticEquals(child)) {
+    if (someOrder.size == 1 &&
+        trimAliases(someOrder.head.child).semanticEquals(trimAliases(child))) {

Review Comment:
   The PR description says "trim aliases from `ListAgg` expression subtree", 
but `checkOrderValueDeterminism` (line 684) still does 
`orderExpressions.head.child.semanticEquals(castChild)` without trimming. With 
`spark.sql.listagg.allowDistinctCastWithOrder.enabled=true` (default), a query 
like
   
   ```sql
   SELECT listagg(DISTINCT v:a::string, ',') WITHIN GROUP (ORDER BY v:a) FROM 
...
   ```
   
   reaches that path, and the two analyzers still produce different errors:
   - single-pass (aliases still present): the `Alias`-vs-`Alias` 
`semanticEquals` is false because the ExprIds differ, so we return 
`NonDeterministicMismatch` and throw `functionAndOrderExpressionMismatchError`.
   - fixed-point (post-`CleanupAliases`): `semanticEquals` is true, then 
`isCastEqualityPreserving(VariantType)` is false, so we throw 
`functionAndOrderExpressionUnsafeCastError`.
   
   Should this comparison also wrap both sides in `trimAliases`? Otherwise the 
single-pass/fixed-point divergence is only partly closed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56944] [SQL] Trim aliases from ListAgg expression subtree [spark]

Reply via email to