[ https://issues.apache.org/jira/browse/SPARK-39467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Taranov updated SPARK-39467:
------------------------------------

Description:

Hi everyone,

We came across a case where count distinct with an asterisk produces an incorrect result compared to count distinct with all columns listed explicitly. An example is provided below:

{noformat}
scala> val df = Seq(
     |   (1655172,1463032,"PHON","US",null,1),
     |   (1655172,1061329,"DESK","AU",null,3),
     |   (1655172,1334977,"MOBILE","US",null,23),
     |   (1655172,1165470,"PHON","CR",null,12),
     |   (1655172,1021215,"PHON","CA","USD",11)).toDF
df: org.apache.spark.sql.DataFrame = [_1: int, _2: int ... 4 more fields]

scala> df.printSchema
root
 |-- _1: integer (nullable = false)
 |-- _2: integer (nullable = false)
 |-- _3: string (nullable = true)
 |-- _4: string (nullable = true)
 |-- _5: string (nullable = true)
 |-- _6: integer (nullable = false)

scala> df.createOrReplaceTempView("a_table")

scala> spark.sql("select count(1), count(distinct(*)), count(distinct(_1, _2, _3, _4, _5, _6)) from a_table").show(false)
+--------+--------------------------------------+----------------------------------------------------------------------------+
|count(1)|count(DISTINCT _1, _2, _3, _4, _5, _6)|count(DISTINCT named_struct(_1, _1, _2, _2, _3, _3, _4, _4, _5, _5, _6, _6))|
+--------+--------------------------------------+----------------------------------------------------------------------------+
|5       |1                                     |5                                                                           |
+--------+--------------------------------------+----------------------------------------------------------------------------+
{noformat}

We understand that this is somehow related to null values, but in our understanding the asterisk should mimic the same behavior as providing all columns. If there is documentation about this behavior, it would be nice to read it. Any help would be appreciated.
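As an aside, here is a minimal sketch of the null-related hypothesis (not part of the original report; the table name {{small_table}} and columns {{a}}, {{b}} are purely illustrative): if count(DISTINCT a, b) drops every row where any of its arguments is NULL, while counting distinct structs keeps such rows, that would explain the 1 vs 5 discrepancy above. It can be pasted into spark-shell to check:

{noformat}
// Hypothetical minimal reproduction -- names are made up for illustration.
// Runs as-is in spark-shell, where toDF and spark are already in scope.
val small = Seq((1, "x"), (2, null), (3, null)).toDF("a", "b")   // b is NULL in 2 of 3 rows
small.createOrReplaceTempView("small_table")

// If the null-exclusion hypothesis is right, we would expect:
//   count(1)                     -> 3
//   count(distinct a, b)         -> 1  (rows with a NULL argument are skipped)
//   count(distinct struct(a, b)) -> 3  (a NULL field inside a struct still counts)
spark.sql(
  "select count(1), count(distinct a, b), count(distinct struct(a, b)) from small_table"
).show(false)
{noformat}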
Michael

> Count on distinct asterisk not equals to the count with column names provided
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-39467
>                 URL: https://issues.apache.org/jira/browse/SPARK-39467
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 3.1.3
>        Environment: Spark 3.1.3 vanilla
>            Reporter: Michael Taranov
>            Priority: Minor