Mithun Radhakrishnan created SPARK-32038: --------------------------------------------
Summary: Regression in handling NaN values in COUNT(DISTINCT) Key: SPARK-32038 URL: https://issues.apache.org/jira/browse/SPARK-32038 Project: Spark Issue Type: Bug Components: Optimizer, SQL Affects Versions: 3.0.0 Reporter: Mithun Radhakrishnan There seems to be a regression in Spark 3.0.0, with regard to how {{NaN}} values are normalized/handled in {{COUNT(DISTINCT ...)}}. Here is an illustration: {code:scala} case class Test( uid:String, score:Float) val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001) val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff) val rows = Seq( Test("mithunr", Float.NaN), Test("mithunr", POS_NAN_1), Test("mithunr", POS_NAN_2), Test("abellina", 1.0f), Test("abellina", 2.0f) ).toDF.createOrReplaceTempView("mytable") spark.sql(" select uid, count(distinct score) from mytable group by 1 order by 1 asc ").show {code} Here are the results under Spark 3.0.0: {code:title=Spark 3.0.0 (single aggregation)} +--------+---------------------+ | uid|count(DISTINCT score)| +--------+---------------------+ |abellina| 2| | mithunr| 3| +--------+---------------------+ {code} Note that the count against {{mithunr}} is {{3}}, accounting for each distinct value for {{NaN}}. The right results are returned when another aggregation is added to the GBY: {code:scala|title=Spark 3.0.0 (multiple aggregations)} scala> spark.sql(" select uid, count(distinct score), max(score) from mytable group by 1 order by 1 asc ").show +--------+---------------------+----------+ | uid|count(DISTINCT score)|max(score)| +--------+---------------------+----------+ |abellina| 2| 2.0| | mithunr| 1| NaN| +--------+---------------------+----------+ {code} Also, note that Spark 2.4.6 normalizes the `DISTINCT` expression correctly: {code:scala|title=Spark 2.4.6} scala> spark.sql(" select uid, count(distinct score) from mytable group by 1 order by 1 asc ").show +--------+---------------------+ | uid|count(DISTINCT score)| +--------+---------------------+ |abellina| 2| | mithunr| 1| +--------+---------------------+ {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org