[ https://issues.apache.org/jira/browse/SPARK-28481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan reassigned SPARK-28481: ----------------------------------- Assignee: Yuming Wang > More expressions should extend NullIntolerant > --------------------------------------------- > > Key: SPARK-28481 > URL: https://issues.apache.org/jira/browse/SPARK-28481 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: Josh Rosen > Assignee: Yuming Wang > Priority: Major > Fix For: 3.1.0 > > > SPARK-13995 introduced the {{NullIntolerant}} trait to generalize the logic > for inferring {{IsNotNull}} constraints from expressions. An expression is > _null-intolerant_ if it returns {{null}} when any of its input expressions > are {{null}}. > I've noticed that _most_ expressions are null-intolerant: anything which > extends UnaryExpression / BinaryExpression and keeps the default {{eval}} > method will be null-intolerant. However, only a subset of these expressions > mix in the {{NullIntolerant}} trait. As a result, we're missing out on the > opportunity to infer certain types of non-null constraints: for example, if > we see a {{WHERE length\(x\) > 10}} condition then we know that the column > {{x}} must be non-null and can push this non-null filter down to our > datasource scan. > I can think of a few ways to fix this: > # Modify every relevant expression to mix in the {{NullIntolerant}} trait. > We can use IDEs or other code-analysis tools (e.g. {{ClassUtil}} plus > reflection) to help automate the process of identifying expressions which do > not override the default {{eval}}. > # Make a backwards-incompatible change to our abstract base class hierarchy > to add {{NullSafe*aryExpression}} abstract base classes which define the > {{nullSafeEval}} method and implement a {{final eval}} method, then leave > {{eval}} unimplemented in the regular {{*aryExpression}} base classes. > ** This would fix the somewhat weird code smell that we have today where > {{nullSafeEval}} has a default implementation which calls {{sys.error}}. > ** This would negatively impact users who have implemented custom Catalyst > expressions. > # Use runtime reflection to determine whether expressions are > null-intolerant by virtue of using one of the default null-intolerant > {{eval}} implementations. We can then use this in an {{isNullIntolerant}} > helper method which checks that classes either (a) extend {{NullIntolerant}} > or (b) are null-intolerant according to the reflective check (which is > basically just figuring out which concrete implementation the {{eval}} method > resolves to). > ** We only need to perform the reflection once _per-class_ and can cache the > result for the lifetime of the JVM, so the performance overheads would be > pretty small (especially compared to other non-cacheable reflection / > traversal costs in Catalyst). > ** The downside is additional complexity in the code which pattern-matches / > checks for null-intolerance. > Of these approaches, I'm currently leaning towards option 1 (semi-automated > identification and manual update of hundreds of expressions): if we go with > that approach then we can perform a one-time catch-up to fix all existing > expressions. To handle ongoing maintenance (as we add new expressions), I'd > propose to add "is this null-intolerant?" to a checklist to use when > reviewing PRs which add new Catalyst expressions. > /cc [~maropu] [~viirya] -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org