Josh Rosen created SPARK-28481:
----------------------------------

             Summary: More expressions should extend NullIntolerant
                 Key: SPARK-28481
                 URL: https://issues.apache.org/jira/browse/SPARK-28481
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Josh Rosen


SPARK-13995 introduced the {{NullIntolerant}} trait to generalize the logic for 
inferring {{IsNotNull}} constraints from expressions. An expression is 
_null-intolerant_ if it returns {{null}} when any of its input expressions is 
{{null}}.
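
For illustration, here's a minimal, self-contained Scala sketch of the pattern 
(toy stand-ins for Catalyst's classes, not the actual source; rows are modeled 
as a plain {{Map}} for brevity):

{code:scala}
// Marker trait: mixing this in asserts "null in any input => null result".
trait NullIntolerant

abstract class Expr {
  def children: Seq[Expr]
  def eval(row: Map[String, Any]): Any
}

case class Attr(name: String) extends Expr {
  def children: Seq[Expr] = Nil
  def eval(row: Map[String, Any]): Any = row(name)
}

// Toy analogue of UnaryExpression: the default eval short-circuits to null
// on null input, so any subclass that keeps it is null-intolerant by
// construction.
abstract class UnaryExpr extends Expr {
  def child: Expr
  def children: Seq[Expr] = Seq(child)
  def eval(row: Map[String, Any]): Any = {
    val v = child.eval(row)
    if (v == null) null else nullSafeEval(v)
  }
  // the sys.error default discussed under option 2 below
  protected def nullSafeEval(input: Any): Any =
    sys.error(s"${getClass.getName} must override eval or nullSafeEval")
}

// length(x): keeps the default eval, so it should also declare the trait.
case class Length(child: Expr) extends UnaryExpr with NullIntolerant {
  override protected def nullSafeEval(input: Any): Any =
    input.asInstanceOf[String].length
}
{code}

For example, {{Length(Attr("x")).eval(Map("x" -> null))}} returns {{null}} 
without ever invoking {{nullSafeEval}}.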

I've noticed that _most_ expressions are null-intolerant: anything which 
extends {{UnaryExpression}} / {{BinaryExpression}} and keeps the default 
{{eval}} method will be null-intolerant. However, only a subset of these expressions mix 
in the {{NullIntolerant}} trait. As a result, we're missing out on the 
opportunity to infer certain types of non-null constraints: for example, if we 
see a {{WHERE length(x) > 10}} condition then we know that the column {{x}} 
must be non-null and can push this non-null filter down to our datasource scan.
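
To make the inference concrete, here's an illustrative helper built on the toy 
classes from the sketch above (the name {{referencedNonNullAttrs}} is made up; 
Spark's real constraint-inference logic lives elsewhere):

{code:scala}
// Walk a predicate and collect the attributes that must be non-null: as long
// as every node on the path down to an attribute is null-intolerant, a null
// in that attribute bubbles up and makes the whole predicate null (hence not
// "true"), so IsNotNull(attr) can be pushed into the scan.
def referencedNonNullAttrs(e: Expr): Set[Attr] = e match {
  case a: Attr => Set(a)
  case ni: Expr with NullIntolerant =>
    ni.children.flatMap(referencedNonNullAttrs).toSet
  case _ => Set.empty // null-tolerant node: nothing can be concluded below it
}
{code}

For a condition like {{length(x) > 10}}, assuming the comparison and {{length}} 
expressions are both marked null-intolerant, the walk reaches {{x}} and yields 
{{IsNotNull(x)}}.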

I can think of a few ways to fix this:
 # Modify every relevant expression to mix in the {{NullIntolerant}} trait. We 
can use IDEs or other code-analysis tools (e.g. {{ClassUtil}} plus reflection) 
to help automate the process of identifying expressions which do not override 
the default {{eval}}.
 # Make a backwards-incompatible change to our abstract base class hierarchy to 
add {{NullSafe*aryExpression}} abstract base classes which define the 
{{nullSafeEval}} method and implement a {{final eval}} method, then leave 
{{eval}} unimplemented in the regular {{*aryExpression}} base classes (sketched 
below, after this list).
 ** This would fix the somewhat weird code smell that we have today where 
{{nullSafeEval}} has a default implementation which calls {{sys.error}}.
 ** This would negatively impact users who have implemented custom Catalyst 
expressions.
 # Use runtime reflection to determine whether expressions are null-intolerant 
by virtue of using one of the default null-intolerant {{eval}} implementations. 
We can then use this in an {{isNullIntolerant}} helper method which checks that 
classes either (a) extend {{NullIntolerant}} or (b) are null-intolerant 
according to the reflective check (which is basically just figuring out which 
concrete implementation the {{eval}} method resolves to); see the second 
sketch after this list.
 ** We only need to perform the reflection once _per-class_ and can cache the 
result for the lifetime of the JVM, so the performance overheads would be 
pretty small (especially compared to other non-cacheable reflection / traversal 
costs in Catalyst).
 ** The downside is additional complexity in the code which pattern-matches / 
checks for null-intolerance.
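
To make options 2 and 3 concrete, two rough sketches against the same toy 
classes as above. First, option 2's base-class shape (the 
{{NullSafeUnaryExpr}} name is made up): the null check lives in a {{final}} 
{{eval}} and only {{nullSafeEval}} is abstract, so the {{sys.error}} default 
disappears.

{code:scala}
// Option 2 sketch: subclasses can only implement nullSafeEval; the
// null-short-circuiting eval is final, so every subclass is
// null-intolerant by construction.
abstract class NullSafeUnaryExpr extends Expr {
  def child: Expr
  def children: Seq[Expr] = Seq(child)
  protected def nullSafeEval(input: Any): Any // abstract: must be implemented
  final def eval(row: Map[String, Any]): Any = {
    val v = child.eval(row)
    if (v == null) null else nullSafeEval(v)
  }
}
{code}

Second, a hypothetical sketch of option 3's {{isNullIntolerant}} helper, again 
in terms of the toy classes:

{code:scala}
import scala.collection.concurrent.TrieMap

object NullIntolerance {
  // Base classes whose default eval is known to short-circuit on null
  // (in Spark: UnaryExpression, BinaryExpression, etc.).
  private val defaultEvalOwners: Set[Class[_]] = Set(classOf[UnaryExpr])

  // One reflective lookup per concrete class, cached for the JVM's lifetime.
  private val cache = TrieMap.empty[Class[_], Boolean]

  def isNullIntolerant(e: Expr): Boolean =
    cache.getOrElseUpdate(e.getClass, {
      // (a) explicitly marked, or (b) eval resolves to a default
      // null-short-circuiting implementation that was never overridden.
      classOf[NullIntolerant].isAssignableFrom(e.getClass) ||
        defaultEvalOwners.contains(
          e.getClass
            .getMethod("eval", classOf[Map[String, Any]])
            .getDeclaringClass)
    })
}
{code}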

Of these approaches, I'm currently leaning towards option 1 (semi-automated 
identification and manual update of hundreds of expressions): if we go with 
that approach then we can perform a one-time catch-up to fix all existing 
expressions. To handle ongoing maintenance (as we add new expressions), I'd 
propose to add "is this null-intolerant?" to a checklist to use when reviewing 
PRs which add new Catalyst expressions. 

/cc [~maropu] [~viirya]


