[ https://issues.apache.org/jira/browse/SPARK-32038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mithun Radhakrishnan updated SPARK-32038:
-----------------------------------------
Description:

There seems to be a regression in Spark 3.0.0 in how {{NaN}} values are normalized/handled in {{COUNT(DISTINCT ...)}}. Here is an illustration:

{code:scala}
case class Test(uid: String, score: Float)

val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)

val rows = Seq(
  Test("mithunr", Float.NaN),
  Test("mithunr", POS_NAN_1),
  Test("mithunr", POS_NAN_2),
  Test("abellina", 1.0f),
  Test("abellina", 2.0f)
).toDF.createOrReplaceTempView("mytable")

spark.sql("select uid, count(distinct score) from mytable group by 1 order by 1 asc").show
{code}

Here are the results under Spark 3.0.0:

{code:java|title=Spark 3.0.0 (single aggregation)}
+--------+---------------------+
|     uid|count(DISTINCT score)|
+--------+---------------------+
|abellina|                    2|
| mithunr|                    3|
+--------+---------------------+
{code}

Note that the count against {{mithunr}} is {{3}}: each distinct bit pattern of {{NaN}} is counted as a separate value. The correct results are returned when another aggregate (e.g. {{max(score)}}) is added to the GROUP BY query:

{code:scala|title=Spark 3.0.0 (multiple aggregations)}
scala> spark.sql("select uid, count(distinct score), max(score) from mytable group by 1 order by 1 asc").show
+--------+---------------------+----------+
|     uid|count(DISTINCT score)|max(score)|
+--------+---------------------+----------+
|abellina|                    2|       2.0|
| mithunr|                    1|       NaN|
+--------+---------------------+----------+
{code}

Also note that Spark 2.4.6 normalizes the {{DISTINCT}} expression correctly:

{code:scala|title=Spark 2.4.6}
scala> spark.sql("select uid, count(distinct score) from mytable group by 1 order by 1 asc").show
+--------+---------------------+
|     uid|count(DISTINCT score)|
+--------+---------------------+
|abellina|                    2|
| mithunr|                    1|
+--------+---------------------+
{code}
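For context: all three scores in the {{mithunr}} group are {{NaN}}, but they carry three different IEEE-754 bit patterns, so a bitwise comparison (as used on Spark's internal binary row format) sees three distinct values unless every {{NaN}} is first rewritten to a single canonical pattern. Below is a minimal plain-Scala sketch of that idea, outside Spark; the {{normalize}} helper is hypothetical, illustrating the kind of canonicalization that Spark's {{NormalizeFloatingNumbers}} optimizer rule is presumably meant to apply before the distinct comparison:

{code:scala}
// Why three NaNs can look "distinct" under a bitwise comparison, and how
// canonicalizing the bit pattern collapses them back to a single value.
val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001)
val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff)
val scores = Seq(Float.NaN, POS_NAN_1, POS_NAN_2)

// Hypothetical helper: map any NaN to the canonical quiet-NaN bit pattern.
def normalize(f: Float): Float =
  if (java.lang.Float.isNaN(f)) java.lang.Float.intBitsToFloat(0x7fc00000) else f

// Compared by raw bits, the three NaNs all differ, matching the buggy count of 3.
println(scores.map(f => java.lang.Float.floatToRawIntBits(f)).distinct.size)  // 3

// After normalization they share one bit pattern, matching the expected count of 1.
println(scores.map(normalize).map(f => java.lang.Float.floatToRawIntBits(f)).distinct.size)  // 1
{code}

This suggests the single-distinct-aggregate plan in 3.0.0 skips that normalization step, while the multi-aggregation plan (and Spark 2.4.6) applies it.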
> Regression in handling NaN values in COUNT(DISTINCT)
> ----------------------------------------------------
>
>                 Key: SPARK-32038
>                 URL: https://issues.apache.org/jira/browse/SPARK-32038
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 3.0.0
>            Reporter: Mithun Radhakrishnan
>            Priority: Major