[ https://issues.apache.org/jira/browse/SPARK-30212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kernel Force updated SPARK-30212:
---------------------------------
    Summary: COUNT(DISTINCT) window function should be supported  (was: Could not use COUNT(DISTINCT) window function in SparkSQL)

> COUNT(DISTINCT) window function should be supported
> ---------------------------------------------------
>
>                 Key: SPARK-30212
>                 URL: https://issues.apache.org/jira/browse/SPARK-30212
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>         Environment: Spark 2.4.4
> Scala 2.11.12
> Hive 2.3.6
>            Reporter: Kernel Force
>            Priority: Major
>              Labels: SQL, distinct, window_function
>
> Suppose we have a typical table in Hive like the one below:
> {code:sql}
> CREATE TABLE DEMO_COUNT_DISTINCT (
>   demo_date string,
>   demo_id string
> );
> {code}
> {noformat}
> +--------------------------------+------------------------------+
> | demo_count_distinct.demo_date  | demo_count_distinct.demo_id  |
> +--------------------------------+------------------------------+
> | 20180301                       | 101                          |
> | 20180301                       | 102                          |
> | 20180301                       | 103                          |
> | 20180401                       | 201                          |
> | 20180401                       | 202                          |
> +--------------------------------+------------------------------+
> {noformat}
> Now I want to count the distinct number of DEMO_DATE values while preserving every column's data in each row.
> So I used the COUNT(DISTINCT) window function as below in Hive beeline, and it works:
> {code:sql}
> SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
>   FROM DEMO_COUNT_DISTINCT T;
> {code}
> {noformat}
> +--------------+------------+-------------+
> | t.demo_date  | t.demo_id  | uniq_dates  |
> +--------------+------------+-------------+
> | 20180401     | 202        | 2           |
> | 20180401     | 201        | 2           |
> | 20180301     | 103        | 2           |
> | 20180301     | 102        | 2           |
> | 20180301     | 101        | 2           |
> +--------------+------------+-------------+
> {noformat}
> But when I ran the same SQL in SparkSQL, it threw an exception:
> {code:sql}
> spark.sql("""
> SELECT T.*, COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
>   FROM DEMO_COUNT_DISTINCT T
> """).show
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Distinct window functions are not supported: count(distinct DEMO_DATE#1) windowspecdefinition(null, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$()));;
> Project [demo_date#1, demo_id#2, UNIQ_DATES#0L]
> +- Project [demo_date#1, demo_id#2, UNIQ_DATES#0L, UNIQ_DATES#0L]
>    +- Window [count(distinct DEMO_DATE#1) windowspecdefinition(null, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS UNIQ_DATES#0L], [null]
>       +- Project [demo_date#1, demo_id#2]
>          +- SubqueryAlias `T`
>             +- SubqueryAlias `default`.`demo_count_distinct`
>                +- HiveTableRelation `default`.`demo_count_distinct`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [demo_date#1, demo_id#2]
> {noformat}
> Then I tried the countDistinct function, but it also threw an exception:
> {code:sql}
> spark.sql("""
> SELECT T.*, countDistinct(T.DEMO_DATE) OVER(PARTITION BY NULL) UNIQ_DATES
>   FROM DEMO_COUNT_DISTINCT T
> """).show
> {code}
> {noformat}
> org.apache.spark.sql.AnalysisException: Undefined function: 'countDistinct'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 12
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions$$anonfun$apply$15$$anonfun$applyOrElse$49.apply(Analyzer.scala:1279)
>   at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
> ......
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
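
Editorial addendum (not part of the original report): a workaround commonly suggested for this limitation is to replace COUNT(DISTINCT ...) with SIZE(COLLECT_SET(...)). COLLECT_SET is a window aggregate that Spark's analyzer accepts, and since it deduplicates values within the window, taking its SIZE yields the same number as a distinct count. A sketch against the reporter's table, untested here:

{code:sql}
-- Equivalent to COUNT(DISTINCT T.DEMO_DATE) OVER(PARTITION BY NULL):
-- COLLECT_SET gathers the distinct DEMO_DATE values in the window,
-- and SIZE counts the elements of the resulting array.
SELECT T.*,
       SIZE(COLLECT_SET(T.DEMO_DATE) OVER(PARTITION BY NULL)) UNIQ_DATES
  FROM DEMO_COUNT_DISTINCT T;
{code}

On the second failure: countDistinct is a function of the DataFrame API (org.apache.spark.sql.functions.countDistinct), not a SQL built-in, which is why Spark SQL reports "Undefined function"; in SQL the spelling is COUNT(DISTINCT ...).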