[ https://issues.apache.org/jira/browse/SPARK-42492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690894#comment-17690894 ]
Apache Spark commented on SPARK-42492:
--------------------------------------

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/40085

> Add new function filter_value
> -----------------------------
>
>                 Key: SPARK-42492
>                 URL: https://issues.apache.org/jira/browse/SPARK-42492
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: Adam Binford
>            Priority: Major
>
> Doing data validation in Spark can lead to a lot of extra evaluations of
> expressions, because conditionally evaluated expressions aren't candidates
> for subexpression elimination. For example, a simple expression such as
> {{when(validate(col), col)}}
> to keep col only if it matches some condition will cause col to be evaluated
> twice. And if col is itself made up of a series of expensive expressions,
> such as regular expression checks, this can lead to a lot of wasted
> computation time.
> The initial attempt to resolve this was
> https://issues.apache.org/jira/browse/SPARK-35564, which adds support for
> subexpression elimination in conditionally evaluated expressions. However, I
> have not been able to get that merged, so this is an alternative (though I
> believe that change is still useful on top of this one).
> We can add a new higher-order function "filter_value" that takes the column
> you want to validate as an argument, along with a function that runs a
> lambda expression returning a boolean deciding whether to keep that column
> or not. It would have the same semantics as the when expression above,
> except it would guarantee that the initial column is evaluated only once.
> An alternative would be to implement a real definition for the NullIf
> expression, but that would only support exact equality checks and not any
> generic condition.
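
To illustrate the problem and the proposal (not part of the original ticket): a minimal Scala sketch of the double evaluation that {{when}} causes today, plus the proposed {{filter_value}} usage. The {{filter_value}} call shown is an assumption based on this ticket and the linked pull request and is not available in released Spark versions; the DataFrame, column names, and regex are made up for illustration.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("2023-01", "bad-value").toDF("raw")

// Today: the expensive expression appears in both the predicate and the
// value branch. Because conditionally evaluated branches are not candidates
// for subexpression elimination, it is evaluated twice per row.
val parsed = regexp_extract($"raw", "^([0-9]{4})-([0-9]{2})$", 1)
df.select(when(parsed =!= "", parsed).alias("year")).show()

// Proposed by this ticket (hypothetical API, per the linked PR):
// filter_value takes the column plus a boolean lambda, keeps the value when
// the lambda returns true and yields null otherwise, while guaranteeing the
// column expression is evaluated only once.
// df.select(filter_value(parsed, x => x =!= "").alias("year")).show()
{code}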
