Wenchen Fan created SPARK-46536:
-----------------------------------

             Summary: Support GROUP BY calendar_interval_type
                 Key: SPARK-46536
                 URL: https://issues.apache.org/jira/browse/SPARK-46536
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Wenchen Fan
Currently, Spark GROUP BY only allows orderable data types; otherwise plan analysis fails: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExprUtils.scala#L197-L203

However, this is too strict, because GROUP BY only cares about equality, not ordering. The CalendarInterval type is not orderable (given 1 month and 30 days, we don't know which is larger), but it has well-defined equality. In fact, we already support `SELECT DISTINCT calendar_interval_type` in some cases (when the planner picks hash aggregate).

The proposal is to officially support the calendar interval type in GROUP BY. We should relax the check inside `CheckAnalysis`, make `CalendarInterval` implement `Comparable` using a natural ordering (compare months first, then days, then microseconds), and test with both hash aggregate and sort aggregate.
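For illustration, a minimal sketch of the natural ordering described above, relying on the existing public fields of `org.apache.spark.unsafe.types.CalendarInterval` (`months`, `days`, `microseconds`); the object name `CalendarIntervalOrdering` is just for this sketch, while the actual change would make `CalendarInterval` itself implement `Comparable`:

```scala
import org.apache.spark.unsafe.types.CalendarInterval

// Natural ordering: compare months first, then days, then microseconds.
object CalendarIntervalOrdering extends Ordering[CalendarInterval] {
  override def compare(a: CalendarInterval, b: CalendarInterval): Int = {
    if (a.months != b.months) Integer.compare(a.months, b.months)
    else if (a.days != b.days) Integer.compare(a.days, b.days)
    else java.lang.Long.compare(a.microseconds, b.microseconds)
  }
}
```

Note this ordering is only needed internally so that sort aggregate can group equal values together; it is not meant to make the type orderable in user-facing comparisons.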
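A hypothetical end-to-end check of what the relaxed analysis would allow, assuming a `SparkSession` named `spark` (`make_interval` already returns `CalendarIntervalType`; the names `t`, `i`, and `cnt` are illustrative):

```scala
// Group calendar intervals: should produce two groups once the
// CheckAnalysis restriction is relaxed as proposed.
val df = spark.sql("""
  SELECT i, COUNT(*) AS cnt
  FROM VALUES
    (make_interval(0, 1, 0, 0, 0, 0, 0)),  -- 1 month
    (make_interval(0, 0, 0, 30, 0, 0, 0)), -- 30 days
    (make_interval(0, 1, 0, 0, 0, 0, 0))   -- 1 month again
  AS t(i)
  GROUP BY i
""")
df.show()
// Expected: two rows, because 1 month and 30 days are distinct values
// under interval equality, even though neither is "larger" than the other.
```

The same query should return identical results whether the planner picks hash aggregate or sort aggregate.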