[ https://issues.apache.org/jira/browse/SPARK-47397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833076#comment-17833076 ]

Martin Rueckl commented on SPARK-47397:
---------------------------------------

A colleague told me that this is in line with the SQL standard :) I am not 
necessarily suggesting changing the behavior.

However, looking at the respective documentation again, the first part

[https://spark.apache.org/docs/latest/sql-ref-null-semantics.html#built-in-aggregate]

explains what you wrote above (except for your side note that 'distinct' 
does not change that).

The second relevant part of the docs then leads the unaware user (like me) 
further in the wrong direction:
[https://spark.apache.org/docs/latest/sql-ref-null-semantics.html#aggregate-operator-group-by-distinct-]

After reading both of those parts, I would 100% expect the null to be counted.
Maybe this exact example should be added to the docs? I can prepare a PR for 
that - if I can figure out where the docs are...
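To make the surprising combination concrete: since Spark follows the SQL standard here, the same behavior can be reproduced with any standard-compliant engine. A minimal sketch using Python's stdlib sqlite3 (illustrative only, not Spark itself) shows GROUP BY giving NULL its own group while COUNT(DISTINCT ...) skips it:

```python
import sqlite3

# In-memory table with a duplicate, a second value, and a NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (2,), (None,)])

# COUNT(DISTINCT x) ignores the NULL, per the SQL standard.
distinct_count = conn.execute("SELECT COUNT(DISTINCT x) FROM t").fetchone()[0]

# GROUP BY x, by contrast, places NULL in its own group.
groups = conn.execute("SELECT x FROM t GROUP BY x").fetchall()

print(distinct_count)  # 2  -- the NULL is not counted
print(len(groups))     # 3  -- groups: NULL, 1, 2
```

This is exactly the pair of facts the two doc sections describe separately; shown side by side, the discrepancy a reader might expect to be a contradiction becomes obvious.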

> count_distinct ignores null values
> ----------------------------------
>
>                 Key: SPARK-47397
>                 URL: https://issues.apache.org/jira/browse/SPARK-47397
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, Spark Core
>    Affects Versions: 3.4.1
>            Reporter: Martin Rueckl
>            Priority: Critical
>         Attachments: image-2024-03-14-16-12-35-267.png, 
> image-2024-03-14-16-13-03-107.png, image-2024-04-02-10-32-44-461.png
>
>
> The documentation states that in group by and count statements, null values 
> are not ignored / form their own groups.
> !image-2024-03-14-16-13-03-107.png|width=491,height=373!
> However, count_distinct does not count null values. 
> Either the documentation or the implementation is wrong here...
> !image-2024-03-14-16-12-35-267.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
