[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358117#comment-17358117 ]
Adam Binford commented on SPARK-35564: -------------------------------------- Turns out this is already happening for certain when and coalesce expressions. For example: {code:java} spark.range(2).select(myUdf($"id"), coalesce($"id", myUdf($"id"))) {code} myUdf gets pulled out as a subexpression even though it might only be executed once per row. This can be a correctness issue for very specific edge cases similar to https://issues.apache.org/jira/browse/SPARK-35449 where myUdf could get executed for a row even though it doesn't pass certain conditional checks > Support subexpression elimination for non-common branches of conditional > expressions > ------------------------------------------------------------------------------------ > > Key: SPARK-35564 > URL: https://issues.apache.org/jira/browse/SPARK-35564 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.1 > Reporter: Adam Binford > Priority: Major > > https://issues.apache.org/jira/browse/SPARK-33337 added support for pulling > subexpressions out of branches of conditional expressions for expressions > present in all branches. We should be able to take this a step further and > pull out subexpressions for any branch, as long as that expression will > definitely be evaluated at least once. > Consider a common data validation example: > {code:java} > from pyspark.sql.functions import * > df = spark.createDataFrame([['word'], ['1234']]) > col = regexp_replace('_1', r'\d', '') > df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code} > We only want to keep the value if it's non-empty with numbers removed, > otherwise we want it to be null. > Because we have no otherwise value, `col` is not a candidate for > subexpression elimination (you can see two regular expression replacements in > the codegen). But whenever the length is greater than 0, we will have to > execute the regular expression replacement twice. Since we know we will > always calculate `col` at least once, it makes sense to consider that as a > subexpression since we might need it again in the branch value. So we can > update the logic from: > Create a subexpression if an expression will always be evaluated at least > twice > To: > Create a subexpression if an expression will always be evaluated at least > once AND will either always or conditionally be evaluated at least twice. > The trade off is potentially another subexpression function call (for split > subexpressions) if the second evaluation doesn't happen, but this seems like > it would be worth it for when it is evaluated the second time. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org