[ https://issues.apache.org/jira/browse/SPARK-24605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan reassigned SPARK-24605:
-----------------------------------

    Assignee: Maxim Gekk

> size(null) should return null
> -----------------------------
>
>                  Key: SPARK-24605
>                  URL: https://issues.apache.org/jira/browse/SPARK-24605
>              Project: Spark
>           Issue Type: Improvement
>           Components: SQL
>     Affects Versions: 2.3.1
>             Reporter: Maxim Gekk
>             Assignee: Maxim Gekk
>             Priority: Minor
>
> The default behavior of size(null) == -1 is a big problem for several reasons:
> # It is inconsistent with how SQL functions handle nulls.
> # It is an extreme violation of [the Principle of Least Astonishment|https://en.wikipedia.org/wiki/Principle_of_least_astonishment] (POLA).
> # It is not called out anywhere in the Spark docs or even [the Hive docs|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF].
> # It can lead to subtle bugs in analytics.
> For example, our client discovered this behavior while investigating post-click user engagement in their AdTech system. The schema was per ad placement, and post-click user engagements were stored in an array of structs. The culprit was df.groupBy('placementId).agg(sum(size('engagements)).as("engagement_count"), ...), which subtracted 1 for every click without post-click engagement. Luckily, the behavior produced negative engagement counts in some periods, which alerted them to the problem and to this bizarre behavior.
> Spark inherited the current behavior from Hive. The most consistent behavior, setting aside the insanity that Hive created in the first place, is for size(null) to behave like length(null) and return null. That also handles the aggregation case with sum/avg, etc., since those aggregates simply ignore nulls.
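> A minimal spark-shell sketch of the failure mode and a possible workaround (the data is made up; the column names placementId and engagements follow the example above):
> {code:scala}
> // Spark 2.3 spark-shell; spark.implicits._ is already in scope there.
> import org.apache.spark.sql.functions._
>
> val df = Seq(
>   (1, Seq("view", "signup")),  // click with two post-click engagements
>   (1, null),                   // click with no post-click engagements: null array
>   (2, Seq("view"))
> ).toDF("placementId", "engagements")
>
> // Current behavior: size(null) == -1, so the null row subtracts one.
> // placementId 1 yields 2 + (-1) = 1 instead of the expected 2.
> df.groupBy('placementId)
>   .agg(sum(size('engagements)).as("engagement_count"))
>   .show()
>
> // Workaround that mimics the proposed semantics: map null arrays to null,
> // which sum() simply ignores, just like length(null) and other SQL functions.
> // placementId 1 now yields 2.
> df.groupBy('placementId)
>   .agg(sum(when('engagements.isNotNull, size('engagements))).as("engagement_count"))
>   .show()
> {code}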