Michael Taranov created SPARK-39467: ---------------------------------------
Summary: Count on distinct asterisk not equal to the count with column names provided
Key: SPARK-39467
URL: https://issues.apache.org/jira/browse/SPARK-39467
Project: Spark
Issue Type: Question
Components: Spark Core
Affects Versions: 3.1.3
Environment: Spark 3.1.3 vanilla
Reporter: Michael Taranov

Hi everyone,

We came across a case where count distinct with an asterisk produces an unexpected result. An example is provided below:

{noformat}
scala> val df = Seq(
     |   (1655172,1463032,"PHON","US",null,1),
     |   (1655172,1061329,"DESK","AU",null,3),
     |   (1655172,1334977,"MOBILE","US",null,23),
     |   (1655172,1165470,"PHON","CR",null,12),
     |   (1655172,1021215,"PHON","CA","USD",11)).toDF
df: org.apache.spark.sql.DataFrame = [_1: int, _2: int ... 4 more fields]

scala> df.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: integer (nullable = false)
 |-- _3: string (nullable = true)
 |-- _4: string (nullable = true)
 |-- _5: string (nullable = true)
 |-- _6: integer (nullable = false)

scala> df.createOrReplaceTempView("a_table")

scala> spark.sql("select count(1), count(distinct(*)), count(distinct(_1, _2, _3, _4, _5, _6)) from a_table").show(false)
+--------+--------------------------------------+-----------------------------------------------------------------------------+
|count(1)|count(DISTINCT _1, _2, _3, _4, _5, _6)|count(DISTINCT named_struct(_1, _1, _2, _2, _3, _3, _4, _4, _5, _5, _6, _6)) |
+--------+--------------------------------------+-----------------------------------------------------------------------------+
|5       |1                                     |5                                                                            |
+--------+--------------------------------------+-----------------------------------------------------------------------------+
{noformat}

We understand that this is somehow related to null values, but in our understanding the asterisk should mimic the same behavior as when all columns are provided explicitly. If there is documentation about this, it would be nice to read it. Any help would be appreciated.
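For what it's worth, the gap between 1 and 5 looks consistent with the usual SQL rule that a multi-argument count(DISTINCT e1, ..., en) skips any row in which one of the expressions is NULL, whereas the named_struct form treats NULL as an ordinary field value inside the struct. Below is a plain-Python sketch of those two semantics using the same data as above; this is only an illustration of the null-handling rules, not Spark code or Spark internals:

```python
# Same five rows as the Spark example; None stands in for SQL NULL.
rows = [
    (1655172, 1463032, "PHON",   "US", None,  1),
    (1655172, 1061329, "DESK",   "AU", None,  3),
    (1655172, 1334977, "MOBILE", "US", None,  23),
    (1655172, 1165470, "PHON",   "CR", None,  12),
    (1655172, 1021215, "PHON",   "CA", "USD", 11),
]

def count_distinct_columns(rows):
    """SQL-style count(DISTINCT c1, ..., cn): a row is excluded
    entirely if ANY of the expressions evaluates to NULL."""
    return len({r for r in rows if None not in r})

def count_distinct_struct(rows):
    """Struct-style distinct: NULL is just another field value,
    so rows containing nulls still participate."""
    return len(set(rows))

print(count_distinct_columns(rows))  # 1 -- only the fully non-null row survives
print(count_distinct_struct(rows))   # 5 -- all five tuples are distinct
```

Under that reading, count(distinct(*)) is expanded to the multi-column form (which drops the four rows with a null _5), while count(distinct(_1, ..., _6)) is parsed as a distinct over a single struct value (which keeps them).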
Michael

--
This message was sent by Atlassian Jira
(v8.20.7#820007)