[ 
https://issues.apache.org/jira/browse/SPARK-28481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28481:
-----------------------------------

    Assignee: Yuming Wang

> More expressions should extend NullIntolerant
> ---------------------------------------------
>
>                 Key: SPARK-28481
>                 URL: https://issues.apache.org/jira/browse/SPARK-28481
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Josh Rosen
>            Assignee: Yuming Wang
>            Priority: Major
>             Fix For: 3.1.0
>
>
> SPARK-13995 introduced the {{NullIntolerant}} trait to generalize the logic 
> for inferring {{IsNotNull}} constraints from expressions. An expression is 
> _null-intolerant_ if it returns {{null}} when any of its input expressions 
> are {{null}}.
> I've noticed that _most_ expressions are null-intolerant: anything which 
> extends UnaryExpression / BinaryExpression and keeps the default {{eval}} 
> method will be null-intolerant. However, only a subset of these expressions 
> mix in the {{NullIntolerant}} trait. As a result, we're missing out on the 
> opportunity to infer certain types of non-null constraints: for example, if 
> we see a {{WHERE length\(x\) > 10}} condition then we know that the column 
> {{x}} must be non-null and can push this non-null filter down to our 
> datasource scan.
> I can think of a few ways to fix this:
>  # Modify every relevant expression to mix in the {{NullIntolerant}} trait. 
> We can use IDEs or other code-analysis tools (e.g. {{ClassUtil}} plus 
> reflection) to help automate the process of identifying expressions which do 
> not override the default {{eval}}.
>  # Make a backwards-incompatible change to our abstract base class hierarchy 
> to add {{NullSafe*aryExpression}} abstract base classes which define the 
> {{nullSafeEval}} method and implement a {{final eval}} method, then leave 
> {{eval}} unimplemented in the regular {{*aryExpression}} base classes.
>  ** This would fix the somewhat weird code smell that we have today where 
> {{nullSafeEval}} has a default implementation which calls {{sys.error}}.
>  ** This would negatively impact users who have implemented custom Catalyst 
> expressions.
>  # Use runtime reflection to determine whether expressions are 
> null-intolerant by virtue of using one of the default null-intolerant 
> {{eval}} implementations. We can then use this in an {{isNullIntolerant}} 
> helper method which checks that classes either (a) extend {{NullIntolerant}} 
> or (b) are null-intolerant according to the reflective check (which is 
> basically just figuring out which concrete implementation the {{eval}} method 
> resolves to).
>  ** We only need to perform the reflection once _per-class_ and can cache the 
> result for the lifetime of the JVM, so the performance overheads would be 
> pretty small (especially compared to other non-cacheable reflection / 
> traversal costs in Catalyst).
>  ** The downside is additional complexity in the code which pattern-matches / 
> checks for null-intolerance.
> Of these approaches, I'm currently leaning towards option 1 (semi-automated 
> identification and manual update of hundreds of expressions): if we go with 
> that approach then we can perform a one-time catch-up to fix all existing 
> expressions. To handle ongoing maintenance (as we add new expressions), I'd 
> propose to add "is this null-intolerant?" to a checklist to use when 
> reviewing PRs which add new Catalyst expressions. 
> /cc [~maropu] [~viirya]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to