Adam Binford created SPARK-42492:
------------------------------------

             Summary: Add new function filter_value
                 Key: SPARK-42492
                 URL: https://issues.apache.org/jira/browse/SPARK-42492
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.3.2
            Reporter: Adam Binford


Doing data validation in Spark can lead to a lot of extra evaluations of 
expressions. This is because conditionally evaluated expressions aren't 
candidates for subexpression elimination. For example a simple expression such 
asĀ 

{{when(validate(col), col)}}

to only keep col if it matches some condition, will lead to col being evaluated 
twice. And if call itself is made up of a series of expensive expressions 
itself, like regular expression checks, this can lead to a lot of wasted 
computation time.

The initial attempt to resolve this was 
https://issues.apache.org/jira/browse/SPARK-35564, adding support for 
subexpression elimination to conditionally evaluated expressions. However I 
have not been able to get that merged, so this is an alternative (though I 
believe that is still useful on top of this).

We can add a new lambda function "filter_value" that takes the column you want 
to validate as an argument, and then a function that runs a lambda expression 
returning a boolean on whether to keep that column or not. It would have the 
same semantics as the above when expression, except it would guarantee to only 
evaluate the initial column once.

An alternative would be to implement a real definition for the NullIf 
expression, but that would only support exact equals checks and not any generic 
condition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to