[ https://issues.apache.org/jira/browse/SPARK-35676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362700#comment-17362700 ]
Hyukjin Kwon commented on SPARK-35676:
--------------------------------------

{{distinct().count()}} includes nulls but {{countDistinct}} doesn't, I guess?

> pyspark.sql.functions GroupBy agg CountDistinct() return bad value
> ------------------------------------------------------------------
>
>                 Key: SPARK-35676
>                 URL: https://issues.apache.org/jira/browse/SPARK-35676
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.0.1
>            Reporter: carlosgv
>            Priority: Minor
>
> from pyspark.sql import functions as F
> gr_month = list.groupBy(F.year('date').alias('year'),
> F.month('date').alias('month')).agg(F.countDistinct('id').alias("n_id")).orderBy(F.col("year").desc(),
> F.col("month").desc()).persist()
> gr_month.show()
>
> |year|month|n_id|
> |2021|6|58801|
> |2021|5|93699|
> list.filter(F.year('date') == "2021").filter(F.month('date') ==
> "6").select('id').distinct().count()
> 98916
>
> 98916 != 58801
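
The null-handling difference suggested in the comment can be sketched in plain Python, with None standing in for a SQL NULL. This is only a model of the two semantics, not Spark code: the sample ids and the helper names count_distinct_skip_nulls and distinct_then_count are made up for illustration and do not come from the issue.

```python
# Plain-Python model of the two counting semantics; None stands in for NULL.
# The ids below are hypothetical sample data, not taken from the issue.
ids = ["a", "b", None, "a", None, "c"]


def count_distinct_skip_nulls(values):
    """Model of countDistinct(col): NULLs are dropped before counting."""
    return len({v for v in values if v is not None})


def distinct_then_count(values):
    """Model of select(col).distinct().count(): NULL survives as one row."""
    return len(set(values))


print(count_distinct_skip_nulls(ids))  # 3
print(distinct_then_count(ids))        # 4
```

In this model the two results differ by at most one for a single column, since all NULLs collapse into one distinct row.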