[ https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-31519: ---------------------------------- Affects Version/s: 2.3.4 > Cast in having aggregate expressions returns the wrong result > ------------------------------------------------------------- > > Key: SPARK-31519 > URL: https://issues.apache.org/jira/browse/SPARK-31519 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.4, 2.4.5, 3.0.0 > Reporter: Yuanjian Li > Assignee: Yuanjian Li > Priority: Major > Labels: correctness > Fix For: 2.4.6, 3.0.0 > > > Cast in having aggregate expressions returns the wrong result. > See the below tests: > {code:java} > scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)") > res0: org.apache.spark.sql.DataFrame = [] > scala> val query = """ > | select sum(a) as b, '2020-01-01' as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----------+ > | b| fake| > +---+----------+ > | 2|2020-01-01| > +---+----------+ > scala> val query = """ > | select sum(a) as b, cast('2020-01-01' as date) as fake > | from t > | group by b > | having b > 10;""" > scala> spark.sql(query).show() > +---+----+ > | b|fake| > +---+----+ > +---+----+ > {code} > The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING > query, and Spark has a special analyzer rule ResolveAggregateFunctions to > resolve the aggregate functions and grouping columns in the Filter operator. > > It works for simple cases in a very tricky way as it relies on rule execution > order: > 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes > inside aggregate functions, but the function itself is still unresolved as > it's an UnresolvedFunction. This stops resolving the Filter operator as the > child Aggrege operator is still unresolved. > 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggrege > operator resolved. > 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child > is a resolved Aggregate. This rule can correctly resolve the grouping columns. > > In the example query, I put a CAST, which needs to be resolved by rule > ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step > 3 as the Aggregate operator is unresolved at that time. Then the analyzer > starts next round and the Filter operator is resolved by ResolveReferences, > which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org