[ 
https://issues.apache.org/jira/browse/SPARK-35676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362700#comment-17362700
 ] 

Hyukjin Kwon commented on SPARK-35676:
--------------------------------------

{{distinct().count()}} includes nulls but {{countDistinct}} doesn't, I guess?
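
A minimal sketch of that NULL-handling difference, using Python's built-in sqlite3 as a stand-in for Spark so it runs without a cluster. The table name {{t}} and the sample ids are made up for illustration; the assumption is that {{F.countDistinct('id')}} behaves like SQL {{COUNT(DISTINCT id)}} and {{select('id').distinct().count()}} like counting the rows of {{SELECT DISTINCT id}}.

```python
import sqlite3

# Hypothetical sample data: five rows, two of them with a NULL id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany(
    "INSERT INTO t (id) VALUES (?)",
    [(1,), (2,), (2,), (None,), (None,)],
)

# Analogue of F.countDistinct('id'): NULL ids are skipped by the aggregate.
count_distinct = conn.execute(
    "SELECT COUNT(DISTINCT id) FROM t"
).fetchone()[0]

# Analogue of select('id').distinct().count(): NULL survives as one distinct row.
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT id FROM t)"
).fetchone()[0]

print(count_distinct, distinct_rows)  # prints: 2 3
```

Under this rule a NULL id adds at most one row to the distinct count for a single column, so on its own it would probably not account for a gap as large as the one reported below.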

> pyspark.sql.functions GroupBy agg CountDistinct() return bad value
> ------------------------------------------------------------------
>
>                 Key: SPARK-35676
>                 URL: https://issues.apache.org/jira/browse/SPARK-35676
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.1
>            Reporter: carlosgv
>            Priority: Minor
>
> from pyspark.sql import functions as F
> gr_month = (
>     list.groupBy(F.year('date').alias('year'), F.month('date').alias('month'))
>     .agg(F.countDistinct('id').alias("n_id"))
>     .orderBy(F.col("year").desc(), F.col("month").desc())
>     .persist()
> )
> gr_month.show()
>  
> |year|month|n_id|
> |2021|6|58801|
> |2021|5|93699|
> list.filter(F.year('date') == "2021").filter(F.month('date') == "6").select('id').distinct().count()
> 98916
>  
> 98916 != 58801



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

