[ https://issues.apache.org/jira/browse/SPARK-39467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556033#comment-17556033 ]
Igor Berman commented on SPARK-39467:
-------------------------------------

[~hyukjin.kwon] Hi, please take a look at the difference between count(distinct(*)) and count(distinct(_1, _2, _3, _4, _5, _6)). I don't believe it is related to what you posted from the Spark SQL [documentation|https://spark.apache.org/docs/3.3.0/sql-ref-functions-builtin.html#aggregate-functions]; it is not well defined how (if at all) count(distinct(*)) should behave:
{code}
count(DISTINCT expr[, expr...])    Returns the number of rows for which the supplied expression(s) are unique and non-null.
{code}
Just for reference: MSSQL, Vertica and MySQL simply prohibit this usage. IMO it should be prohibited in Spark SQL as well. A short spark-shell sketch comparing the analyzed plans of the two expressions follows the quoted description below.

> Count on distinct asterisk not equal to the count with column names provided
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-39467
>                 URL: https://issues.apache.org/jira/browse/SPARK-39467
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core, SQL
>    Affects Versions: 3.1.3
>         Environment: Spark 3.1.3 vanilla
>            Reporter: Michael Taranov
>            Priority: Minor
>
> Hi everyone,
> We came across a case where count distinct with an asterisk produces an incorrect result compared to count distinct with all columns provided.
> Example provided below:
> {noformat}
> scala> val df = Seq(
>      |   (1655172,1463032,"PHON","US",null,1),
>      |   (1655172,1061329,"DESK","AU",null,3),
>      |   (1655172,1334977,"MOBILE","US",null,23),
>      |   (1655172,1165470,"PHON","CR",null,12),
>      |   (1655172,1021215,"PHON","CA","USD",11)).toDF
> df: org.apache.spark.sql.DataFrame = [_1: int, _2: int ... 4 more fields]
>
> scala> df.printSchema
> root
>  |-- _1: integer (nullable = false)
>  |-- _2: integer (nullable = false)
>  |-- _3: string (nullable = true)
>  |-- _4: string (nullable = true)
>  |-- _5: string (nullable = true)
>  |-- _6: integer (nullable = false)
>
> scala> df.createOrReplaceTempView("a_table")
>
> scala> spark.sql("select count(1), count(distinct(*)), count(distinct(_1, _2, _3, _4, _5, _6)) from a_table").show(false)
> +--------+--------------------------------------+----------------------------------------------------------------------------+
> |count(1)|count(DISTINCT _1, _2, _3, _4, _5, _6)|count(DISTINCT named_struct(_1, _1, _2, _2, _3, _3, _4, _4, _5, _5, _6, _6))|
> +--------+--------------------------------------+----------------------------------------------------------------------------+
> |5       |1                                     |5                                                                            |
> +--------+--------------------------------------+----------------------------------------------------------------------------+
> {noformat}
> We understand that this is somehow related to null values, but in our understanding the asterisk should mimic the same behavior as providing all columns.
> If there is any documentation about this, it would be nice to read it.
> Any help would be appreciated.
> Michael
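For illustration only (not a proposed fix), here is a minimal spark-shell sketch, assuming the a_table temp view from the quoted reproduction above is already registered. explain(true) prints the analyzed plan, where count(distinct(*)) shows up as count(DISTINCT _1, _2, _3, _4, _5, _6) while count(distinct(_1, _2, _3, _4, _5, _6)) shows up as count(DISTINCT named_struct(...)), matching the column names in the show(false) output above:
{code:scala}
// Assumes a spark-shell session with the `a_table` temp view
// from the quoted reproduction already registered.

// count(distinct(*)) is expanded to count(DISTINCT _1, _2, _3, _4, _5, _6):
// rows where any of the expressions is NULL are skipped by the aggregate,
// which is why only 1 row is counted in the example above.
spark.sql("select count(distinct(*)) from a_table").explain(true)

// count(distinct(_1, _2, _3, _4, _5, _6)) is parsed as a struct, i.e.
// count(DISTINCT named_struct(...)); the NULLs stay inside the struct,
// so all 5 distinct rows are counted.
spark.sql("select count(distinct(_1, _2, _3, _4, _5, _6)) from a_table").explain(true)
{code}
The same rewrites are also visible directly in the result column names of the show(false) output quoted in the description.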