[ https://issues.apache.org/jira/browse/SPARK-39467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Taranov updated SPARK-39467:
------------------------------------

Description:

Hi everyone,

We came across a case where count distinct with an asterisk produces an incorrect result compared to count distinct with all columns listed explicitly. An example is provided below:

{noformat}
scala> val df = Seq(
     |   (1655172,1463032,"PHON","US",null,1),
     |   (1655172,1061329,"DESK","AU",null,3),
     |   (1655172,1334977,"MOBILE","US",null,23),
     |   (1655172,1165470,"PHON","CR",null,12),
     |   (1655172,1021215,"PHON","CA","USD",11)).toDF
df: org.apache.spark.sql.DataFrame = [_1: int, _2: int ... 4 more fields]

scala> df.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: integer (nullable = false)
 |-- _3: string (nullable = true)
 |-- _4: string (nullable = true)
 |-- _5: string (nullable = true)
 |-- _6: integer (nullable = false)

scala> df.createOrReplaceTempView("a_table")

scala> spark.sql("select count(1), count(distinct(*)), count(distinct(_1, _2, _3, _4, _5, _6)) from a_table").show(false)
+--------+--------------------------------------+----------------------------------------------------------------------------+
|count(1)|count(DISTINCT _1, _2, _3, _4, _5, _6)|count(DISTINCT named_struct(_1, _1, _2, _2, _3, _3, _4, _4, _5, _5, _6, _6))|
+--------+--------------------------------------+----------------------------------------------------------------------------+
|5       |1                                     |5                                                                           |
+--------+--------------------------------------+----------------------------------------------------------------------------+
{noformat}

We understand that this is somehow related to null values, but in our understanding the asterisk should mimic the same behavior as providing all columns. If there is documentation about this behavior, it would be nice to read it. Any help would be appreciated.
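As an aside, here is a minimal sketch of the null-related hypothesis (not part of the original report; the table name {{small_table}} and columns {{a}}, {{b}} are purely illustrative): if count(DISTINCT a, b) drops every row where any of its arguments is NULL, while counting distinct structs keeps such rows, that would explain the 1 vs 5 discrepancy above. It can be pasted into spark-shell to check:

{noformat}
// Hypothetical minimal reproduction -- names are made up for illustration.
// Runs as-is in spark-shell, where toDF and spark are already in scope.
val small = Seq((1, "x"), (2, null), (3, null)).toDF("a", "b")   // b is NULL in 2 of 3 rows
small.createOrReplaceTempView("small_table")

// If the null-exclusion hypothesis is right, we would expect:
//   count(1)                     -> 3
//   count(distinct a, b)         -> 1  (rows with a NULL argument are skipped)
//   count(distinct struct(a, b)) -> 3  (a NULL field inside a struct still counts)
spark.sql(
  "select count(1), count(distinct a, b), count(distinct struct(a, b)) from small_table"
).show(false)
{noformat}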
Michael

> Count on distinct asterisk not equals to the count with column names provided
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-39467
>                 URL: https://issues.apache.org/jira/browse/SPARK-39467
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 3.1.3
>        Environment: Spark 3.1.3 vanilla
>            Reporter: Michael Taranov
>            Priority: Minor