Josh Rosen created SPARK-20686:
----------------------------------

             Summary: PropagateEmptyRelation incorrectly handles aggregate without grouping expressions
                 Key: SPARK-20686
                 URL: https://issues.apache.org/jira/browse/SPARK-20686
             Project: Spark
          Issue Type: Bug
          Components: Optimizer, SQL
    Affects Versions: 2.1.0
            Reporter: Josh Rosen
            Assignee: Josh Rosen


The query

{code}
SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
{code}

should return a single row of output: the subquery is an aggregate without a 
GROUP BY, so it always produces exactly one row, even over empty input. 
However, Spark incorrectly returns zero rows.

This is caused by SPARK-16208, a patch that added an optimizer rule to 
propagate EmptyRelation through operators. The logic for handling aggregates is 
wrong: to decide whether the output should be empty, it checks whether the 
aggregate expressions are non-empty, whereas it should check the grouping 
expressions instead:

An aggregate with non-empty grouping expressions will return one output row per 
group. If the input to the grouped aggregate is empty then all groups will be 
empty and thus the output will be empty. It doesn't matter whether the SELECT 
statement includes aggregate expressions since that won't affect the number of 
output rows.

If the grouping expressions are empty, however, then the aggregate will always 
produce a single output row and thus we cannot propagate the EmptyRelation.
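
The two cases above follow from standard SQL aggregate semantics and can be 
checked outside Spark. The snippet below is only an illustrative sketch using 
Python's built-in sqlite3 module, not Spark itself; the table t and column k 
are hypothetical names introduced for the example, and a FROM clause is added 
to each query so that it reads from that table.

```python
import sqlite3

# In-memory database with a hypothetical, empty table t(k).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER)")

# Global aggregate (no GROUP BY) over empty input: exactly one row.
global_rows = conn.execute(
    "SELECT COUNT(*) FROM t WHERE 0"
).fetchall()

# Grouped aggregate over empty input: there are no groups, so no rows.
grouped_rows = conn.execute(
    "SELECT k, COUNT(*) FROM t WHERE 0 GROUP BY k"
).fetchall()

# Same shape as the query in this report: the outer SELECT sees the single
# row produced by the global aggregate and therefore emits one row.
outer_rows = conn.execute(
    "SELECT 1 FROM (SELECT COUNT(*) FROM t WHERE 0) t1"
).fetchall()

print(global_rows, grouped_rows, outer_rows)
```

The third query is the one for which Spark returns zero rows instead of one.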

The current implementation is incorrect (since it returns a wrong answer) and 
also misses an optimization opportunity by not propagating EmptyRelation in the 
case where a grouped aggregate has aggregate expressions.
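
The corrected decision can be sketched as a toy model. This is not the 
Catalyst API: the Aggregate class, its fields, and can_propagate_empty are 
hypothetical names made up for illustration, and the only assumption encoded 
is the rule described above (look at the grouping expressions, not the 
aggregate expressions).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Aggregate:
    """Hypothetical stand-in for an aggregate plan node (not Spark's class)."""
    grouping_exprs: Tuple[str, ...]
    aggregate_exprs: Tuple[str, ...]
    child_is_empty: bool


def can_propagate_empty(agg: Aggregate) -> bool:
    # Only a grouped aggregate over empty input yields zero rows.
    # The aggregate expressions do not affect the number of output rows.
    return agg.child_is_empty and len(agg.grouping_exprs) > 0


# Global aggregate over empty input: still one row, must not be replaced.
global_agg = Aggregate((), ("count(*)",), child_is_empty=True)

# Grouped aggregate over empty input: safe to replace with an empty relation,
# whether or not it also computes aggregate expressions.
grouped_with_aggs = Aggregate(("k",), ("count(*)",), child_is_empty=True)
grouped_no_aggs = Aggregate(("k",), (), child_is_empty=True)

print(can_propagate_empty(global_agg))
print(can_propagate_empty(grouped_with_aggs))
print(can_propagate_empty(grouped_no_aggs))
```

Under this check, the first case stays in the plan, and the other two become 
candidates for propagation, including the grouped case with aggregate 
expressions that the current rule skips.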


