Wenchen Fan created SPARK-46536:
-----------------------------------

             Summary: Support GROUP BY calendar_interval_type
                 Key: SPARK-46536
                 URL: https://issues.apache.org/jira/browse/SPARK-46536
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Wenchen Fan


Currently, Spark GROUP BY only allows orderable data types; otherwise plan 
analysis fails: 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L197-L203]

However, this is too strict, as GROUP BY only cares about equality, not 
ordering. The CalendarInterval type is not orderable (given 1 month and 30 
days, we don't know which one is larger), but it has well-defined equality. In 
fact, we already support `SELECT DISTINCT calendar_interval_type` in some 
cases (when hash aggregate is picked by the planner), as the illustration 
below shows.
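
For illustration, the pair of queries below shows the current inconsistency. 
This is a sketch assuming a hypothetical table `t`; `make_interval` returns a 
CalendarIntervalType value:

```scala
// Fails analysis today: grouping expressions must be orderable.
spark.sql("SELECT make_interval(0, 1) AS i, count(*) FROM t GROUP BY i")

// May already work today, when the planner picks hash aggregate.
spark.sql("SELECT DISTINCT make_interval(0, 1) FROM t")
```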

The proposal here is to officially support the calendar interval type in GROUP 
BY. We should relax the check inside `CheckAnalysis`, make `CalendarInterval` 
implement `Comparable` using a natural ordering (compare months first, then 
days, then microseconds), and test with both hash aggregate and sort 
aggregate. A sketch of the ordering follows.
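
A minimal sketch of the proposed natural ordering, assuming the existing 
fields of `org.apache.spark.unsafe.types.CalendarInterval` (`months`, `days`, 
`microseconds`); the actual change would go into that class itself:

```scala
// Sketch only: mirrors the fields of org.apache.spark.unsafe.types.CalendarInterval.
class CalendarInterval(val months: Int, val days: Int, val microseconds: Long)
  extends Comparable[CalendarInterval] {

  // Natural ordering: months first, then days, then microseconds.
  // This is a structural ordering for sort aggregate, not a duration
  // comparison (1 month vs 30 days has no well-defined answer).
  override def compareTo(that: CalendarInterval): Int = {
    if (this.months != that.months) {
      Integer.compare(this.months, that.months)
    } else if (this.days != that.days) {
      Integer.compare(this.days, that.days)
    } else {
      java.lang.Long.compare(this.microseconds, that.microseconds)
    }
  }
}
```

With an ordering like this in place, sort aggregate can group interval values 
deterministically even though the ordering carries no semantic meaning.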


