[ 
https://issues.apache.org/jira/browse/SPARK-42492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690894#comment-17690894
 ] 

Apache Spark commented on SPARK-42492:
--------------------------------------

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/40085

> Add new function filter_value
> -----------------------------
>
>                 Key: SPARK-42492
>                 URL: https://issues.apache.org/jira/browse/SPARK-42492
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: Adam Binford
>            Priority: Major
>
> Doing data validation in Spark can lead to a lot of extra evaluations of 
> expressions. This is because conditionally evaluated expressions aren't 
> candidates for subexpression elimination. For example a simple expression 
> such asĀ 
> {{when(validate(col), col)}}
> to only keep col if it matches some condition, will lead to col being 
> evaluated twice. And if call itself is made up of a series of expensive 
> expressions itself, like regular expression checks, this can lead to a lot of 
> wasted computation time.
> The initial attempt to resolve this was 
> https://issues.apache.org/jira/browse/SPARK-35564, adding support for 
> subexpression elimination to conditionally evaluated expressions. However I 
> have not been able to get that merged, so this is an alternative (though I 
> believe that is still useful on top of this).
> We can add a new higher order function "filter_value" that takes the column 
> you want to validate as an argument, and then a function that runs a lambda 
> expression returning a boolean on whether to keep that column or not. It 
> would have the same semantics as the above when expression, except it would 
> guarantee to only evaluate the initial column once.
> An alternative would be to implement a real definition for the NullIf 
> expression, but that would only support exact equals checks and not any 
> generic condition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to