[ https://issues.apache.org/jira/browse/SPARK-32038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-32038: ---------------------------------- Component/s: (was: Optimizer) > Regression in handling NaN values in COUNT(DISTINCT) > ---------------------------------------------------- > > Key: SPARK-32038 > URL: https://issues.apache.org/jira/browse/SPARK-32038 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: Mithun Radhakrishnan > Priority: Major > > There seems to be a regression in Spark 3.0.0, with regard to how {{NaN}} > values are normalized/handled in {{COUNT(DISTINCT ...)}}. Here is an > illustration: > {code:scala} > case class Test( uid:String, score:Float) > val POS_NAN_1 = java.lang.Float.intBitsToFloat(0x7f800001) > val POS_NAN_2 = java.lang.Float.intBitsToFloat(0x7fffffff) > val rows = Seq( > Test("mithunr", Float.NaN), > Test("mithunr", POS_NAN_1), > Test("mithunr", POS_NAN_2), > Test("abellina", 1.0f), > Test("abellina", 2.0f) > ).toDF.createOrReplaceTempView("mytable") > spark.sql(" select uid, count(distinct score) from mytable group by 1 order > by 1 asc ").show > {code} > Here are the results under Spark 3.0.0: > {code:java|title=Spark 3.0.0 (single aggregation)} > +--------+---------------------+ > | uid|count(DISTINCT score)| > +--------+---------------------+ > |abellina| 2| > | mithunr| 3| > +--------+---------------------+ > {code} > Note that the count against {{mithunr}} is {{3}}, accounting for each > distinct value for {{NaN}}. > The right results are returned when another aggregation is added to the GBY: > {code:scala|title=Spark 3.0.0 (multiple aggregations)} > scala> spark.sql(" select uid, count(distinct score), max(score) from mytable > group by 1 order by 1 asc ").show > +--------+---------------------+----------+ > | uid|count(DISTINCT score)|max(score)| > +--------+---------------------+----------+ > |abellina| 2| 2.0| > | mithunr| 1| NaN| > +--------+---------------------+----------+ > {code} > Also, note that Spark 2.4.6 normalizes the {{DISTINCT}} expression correctly: > {code:scala|title=Spark 2.4.6} > scala> spark.sql(" select uid, count(distinct score) from mytable group by 1 > order by 1 asc ").show > +--------+---------------------+ > | uid|count(DISTINCT score)| > +--------+---------------------+ > |abellina| 2| > | mithunr| 1| > +--------+---------------------+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org