[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2023-06-29 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738670#comment-17738670
 ] 

Ignite TC Bot commented on SPARK-35564:
---

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/41677

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366181#comment-17366181
 ] 

Apache Spark commented on SPARK-35564:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/32987

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358286#comment-17358286
 ] 

Adam Binford commented on SPARK-35564:
--

Is that documented somewhere? I know Boolean expressions aren't guaranteed to 
short circuit, but I think most spark users would assume multiple when clauses 
would short circuit

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358285#comment-17358285
 ] 

L. C. Hsieh commented on SPARK-35564:
-

If you mean a common expr in tail conditions other than the first one, it is 
similar as coalesce example above as I think it supposes all conditions can be 
executed without problem. It is still performance consideration here.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358275#comment-17358275
 ] 

Adam Binford commented on SPARK-35564:
--

No the values are fine, it's the condition that cause the issue.
{code:java}
spark.range(2).select(when($"id" >= 0, lit(1)).when(myUdf($"id") > 0, lit(2)), 
when($"id" > -1, lit(1)).when(myUdf($"id") > 0, lit(2))).show(){code}
Here myUdf($"id") gets pulled out as a subexpression even though it never 
should be evaluated.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358215#comment-17358215
 ] 

L. C. Hsieh commented on SPARK-35564:
-

Do you mean {{CaseWhen(($"id", myUdf($"id") :: ($"id" + 1, myUdf($"id") :: Nil, 
Some(myUdf($"id")))}}?

{{myUdf($"id")}} always runs for all rows, no?


> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358210#comment-17358210
 ] 

Adam Binford commented on SPARK-35564:
--

You can construct a similar CaseWhen that could lead to a similar problem, the 
coalesce was just simpler to demonstrate

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358209#comment-17358209
 ] 

L. C. Hsieh commented on SPARK-35564:
-

For the case {{spark.range(2).select(coalesce($"id", myUdf($"id")), 
coalesce($"id" + 1, myUdf($"id"))).show()}}, looks like it can possibly be 
performance issue by pulling a subexpr that might not be executed for a row but 
not a bug. But different to elsevalue in when, coalesce is not a condition 
expression, it supposes all arguments can be executed without problem.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358206#comment-17358206
 ] 

Adam Binford commented on SPARK-35564:
--

Yes that was an example of "will run at least once and maybe more than once" 
that I'm proposing to add more support for in this issue.

An example of current behavior that would be considered a bug is:
{code:java}
spark.range(2).select(coalesce($"id", myUdf($"id")), coalesce($"id" + 1, 
myUdf($"id"))).show()
{code}
myUdf will be pulled out into a subexpression even though it is never executed.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358170#comment-17358170
 ] 

L. C. Hsieh commented on SPARK-35564:
-

{{select(myUdf($"id"), coalesce($"id", myUdf($"id")))}} => Doesn't 
{{myUdf($"id")}} always run at lease once?

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358117#comment-17358117
 ] 

Adam Binford commented on SPARK-35564:
--

Turns out this is already happening for certain when and coalesce expressions. 
For example:
{code:java}
spark.range(2).select(myUdf($"id"), coalesce($"id", myUdf($"id")))
{code}
myUdf gets pulled out as a subexpression even though it might only be executed 
once per row. This can be a correctness issue for very specific edge cases 
similar to https://issues.apache.org/jira/browse/SPARK-35449 where myUdf could 
get executed for a row even though it doesn't pass certain conditional checks

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-05-31 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354637#comment-17354637
 ] 

Adam Binford commented on SPARK-35564:
--

A 2x gain would be pretty significant to us, I don't know about others. I'm 
planning to implement this in our fork and if I get good results I'll put up a 
PR for further discussion. Could optionally add a config for this if it's 
workload dependent. Also, the only thing it could likely do to the generated 
code is reduce the overall size, albeit with more functional calls in worst 
cases. Whether smaller code size adds any value, I don't know enough about Java 
to know.

>Oh, this is another issue. I noticed it last time when I worked on another PR 
>recently, but don't have time to look at it yet.

I created https://issues.apache.org/jira/browse/SPARK-35580 to track what I've 
figured out so far. Not sure what the right fix is.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-05-31 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354590#comment-17354590
 ] 

L. C. Hsieh commented on SPARK-35564:
-

> I don't really think this is much of a corner case, but a common case of 
> using a when expression for data validation. Most of our ETL process comes 
> down to normalizing, cleaning, and validating strings, which at the end of 
> the day usually looks like:

This is a corner case because it simplifies other possible cases, although you 
might actually use this pattern in your ETL process.

For example, when we treat an always-evaluate-at-least-once and 
optionally-evaluate-at-least-once expression as subexpression, there are many 
expressions qualified for this. A child expression of the first predicate of 
when, if it is also part of any conditional predicate/value, might also be 
treated as subexpression. Finally we might end with tons of subexpressions like 
that to flood generated code.

On the other hand, how much gain we can get from this case? In the example, for 
the worst case we evaluate it twice, not 5 or 10 times. It may be just small 
piece of the entire ETL process. I feel it's not worth because we might pay a 
lot cost including making the code more complicated and creating tons of 
subexpressions, but in the end we only get a little bit from it and it is also 
only for a worst case.

> though currently higher order functions are always semantically different so 
> they don't get subexpressions regardless I think. That's something I plan to 
> look into as a follow up.

Oh, this is another issue. I noticed it last time when I worked on another PR 
recently, but don't have time to look at it yet.



> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-05-31 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354447#comment-17354447
 ] 

Adam Binford commented on SPARK-35564:
--

>Do you mean "Create a subexpression if an expression will always be evaluated 
>at least once AND will be evaluated at least once in conditional expression"?

Yeah you can think of it that way in terms of adding to existing functionality. 
I was trying to word it in a way that encompassed existing functionality as 
well.

>And this looks like a corner case, so I'm not sure if it is worth to do this.

I don't really think this is much of a corner case, but a common case of using 
a when expression for data validation. Most of our ETL process comes down to 
normalizing, cleaning, and validating strings, which at the end of the day 
usually looks like:
{code:java}
column = normalize_value(col('my_raw_value'))
result = when(column != '', column){code}
where "normalize_value" usually involves some combination of regexp_repace's, 
lower/upper, and trim.

And things get worse when you are dealing with arrays of strings and want to 
minimize your data:
{code:java}
column = filter(transform(col('my_raw_array_value'), lambda x: 
normalize_value(x)), lambda x: x != '')
result = when(size(column) > 0, column){code}
though currently higher order functions are always semantically different so 
they don't get subexpressions regardless I think. That's something I plan to 
look into as a follow up.

It's natural for users to think that these expressions only get evaluated once, 
and not that they are doubling their runtime trying to clean their data. To me 
the edge case is creating a subexpression in this case decreasing throughput. 
It would require a very large percentage of the rows to not pass the 
conditional check, since the additional calculation is much more expensive than 
the additional function call. I'm playing around with an implementation so 
we'll see how far I can get with it.

 

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-05-30 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354206#comment-17354206
 ] 

L. C. Hsieh commented on SPARK-35564:
-

Thanks [~hyukjin.kwon] for the ping.

> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.

Do you mean "Create a subexpression if an expression will always be evaluated 
at least once AND will be evaluated at least once in conditional expression"?

I.e., an expression will be evaluated at least 1 + (optional) 1?

> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.

Yea, if the second evaluation doesn't happen, there is redundant function call 
and redundant function added to the generated class (for the split case).

I think here there is no correct answer but a trade-off. I also worry about if 
we will too many optional subexpressions for worst case.










> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-05-30 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354181#comment-17354181
 ] 

Hyukjin Kwon commented on SPARK-35564:
--

cc [~viirya] FYI

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org