[ https://issues.apache.org/jira/browse/SPARK-42492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Binford updated SPARK-42492: --------------------------------- Description: Doing data validation in Spark can lead to a lot of extra evaluations of expressions. This is because conditionally evaluated expressions aren't candidates for subexpression elimination. For example a simple expression such as {{when(validate(col), col)}} to only keep col if it matches some condition, will lead to col being evaluated twice. And if call itself is made up of a series of expensive expressions itself, like regular expression checks, this can lead to a lot of wasted computation time. The initial attempt to resolve this was https://issues.apache.org/jira/browse/SPARK-35564, adding support for subexpression elimination to conditionally evaluated expressions. However I have not been able to get that merged, so this is an alternative (though I believe that is still useful on top of this). We can add a new higher order function "filter_value" that takes the column you want to validate as an argument, and then a function that runs a lambda expression returning a boolean on whether to keep that column or not. It would have the same semantics as the above when expression, except it would guarantee to only evaluate the initial column once. An alternative would be to implement a real definition for the NullIf expression, but that would only support exact equals checks and not any generic condition. was: Doing data validation in Spark can lead to a lot of extra evaluations of expressions. This is because conditionally evaluated expressions aren't candidates for subexpression elimination. For example a simple expression such as {{when(validate(col), col)}} to only keep col if it matches some condition, will lead to col being evaluated twice. And if call itself is made up of a series of expensive expressions itself, like regular expression checks, this can lead to a lot of wasted computation time. The initial attempt to resolve this was https://issues.apache.org/jira/browse/SPARK-35564, adding support for subexpression elimination to conditionally evaluated expressions. However I have not been able to get that merged, so this is an alternative (though I believe that is still useful on top of this). We can add a new lambda function "filter_value" that takes the column you want to validate as an argument, and then a function that runs a lambda expression returning a boolean on whether to keep that column or not. It would have the same semantics as the above when expression, except it would guarantee to only evaluate the initial column once. An alternative would be to implement a real definition for the NullIf expression, but that would only support exact equals checks and not any generic condition. > Add new function filter_value > ----------------------------- > > Key: SPARK-42492 > URL: https://issues.apache.org/jira/browse/SPARK-42492 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.3.2 > Reporter: Adam Binford > Priority: Major > > Doing data validation in Spark can lead to a lot of extra evaluations of > expressions. This is because conditionally evaluated expressions aren't > candidates for subexpression elimination. For example a simple expression > such as > {{when(validate(col), col)}} > to only keep col if it matches some condition, will lead to col being > evaluated twice. And if call itself is made up of a series of expensive > expressions itself, like regular expression checks, this can lead to a lot of > wasted computation time. > The initial attempt to resolve this was > https://issues.apache.org/jira/browse/SPARK-35564, adding support for > subexpression elimination to conditionally evaluated expressions. However I > have not been able to get that merged, so this is an alternative (though I > believe that is still useful on top of this). > We can add a new higher order function "filter_value" that takes the column > you want to validate as an argument, and then a function that runs a lambda > expression returning a boolean on whether to keep that column or not. It > would have the same semantics as the above when expression, except it would > guarantee to only evaluate the initial column once. > An alternative would be to implement a real definition for the NullIf > expression, but that would only support exact equals checks and not any > generic condition. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org