[
https://issues.apache.org/jira/browse/SPARK-35882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Bateman updated SPARK-35882:
---------------------------------
Description:
This issue cannot be seen in 3.0.1, but has been introduced since and observed
in 3.1.1 and 3.1.2. I have a reproduce for this issue here:
[https://github.com/johnbateman/spark-udf-slowdown] just change the sbt file
between 3.1.2 and 3.0.1 to observe the difference in performance. It is a
rather silly example but it demonstrates the issue.
Physical plan for 3.0.1, it executes on my machine in about 40 seconds.
{code:java}
== Physical Plan == Generate explode(fib#3), [id#1L, fib#3], false, [fib2#7]
+- *(1) Project [id#1L, UDF(cast(id#1L as int)) AS fib#3]
+- *(1) Range (1, 500000, step=1, splits=8)
{code}
Physical plan for 3.1.2, it executes on my machine in about 4.7 min.
{code:java}
== Physical Plan ==
Generate (4)
+- * Project (3)
+- * Filter (2)
+- * Range (1)
(1) Range [codegen id : 1]
Output [1]: [id#2L]
Arguments: Range (1, 500000, step=1, splits=Some(8))
(2) Filter [codegen id : 1]
Input [1]: [id#2L]
Condition : ((size(UDF(cast(id#2L as int)), true) > 0) AND
isnotnull(UDF(cast(id#2L as int))))
(3) Project [codegen id : 1]
Output [2]: [id#2L, UDF(cast(id#2L as int)) AS fib#4]
Input [1]: [id#2L]
(4) Generate
Input [2]: [id#2L, fib#4]
Arguments: explode(fib#4), [id#2L, fib#4], false, [fib2#11]{code}
You can see that there is an additional predicate generated in step 2, I can
also confirm that the UDF is now called multiple times instead of once. I am
aware that this is to be expected sometimes, but it is a change that has
resulted in performance degradation particularly for expensive UDFs. Obviously,
there is something specific to this query (ie the explode) that seems to be
responsible for this predicate and UDF issue occurring, but I am not sure what
that is.
For reference, this is the same issue
(https://issues.apache.org/jira/browse/SPARK-35787)
was:
This issue cannot be seen in 3.0.1, but has been introduced since and observed
in 3.1.1 and 3.1.2. I have a reproduce for this issue here:
[https://github.com/johnbateman/spark-udf-slowdown] just change the sbt file
between 3.1.2 and 3.0.1 to observe the difference in performance. It is a
rather silly example but it demonstrates the issue.
Physical plan for 3.0.1, it executes on my machine in about 40 seconds.
{code:java}
== Physical Plan == Generate explode(fib#3), [id#1L, fib#3], false, [fib2#7]
+- *(1) Project [id#1L, UDF(cast(id#1L as int)) AS fib#3]
+- *(1) Range (1, 500000, step=1, splits=8)
{code}
Physical plan for 3.1.2, it executes on my machine in about 4.7 min.
{code:java}
== Physical Plan ==
Generate (4)
+- * Project (3)
+- * Filter (2)
+- * Range (1)
(1) Range [codegen id : 1]
Output [1]: [id#2L]
Arguments: Range (1, 500000, step=1, splits=Some(8))
(2) Filter [codegen id : 1]
Input [1]: [id#2L]
Condition : ((size(UDF(cast(id#2L as int)), true) > 0) AND
isnotnull(UDF(cast(id#2L as int))))
(3) Project [codegen id : 1]
Output [2]: [id#2L, UDF(cast(id#2L as int)) AS fib#4]
Input [1]: [id#2L]
(4) Generate
Input [2]: [id#2L, fib#4]
Arguments: explode(fib#4), [id#2L, fib#4], false, [fib2#11]{code}
You can see that there is an additional predicate generated in step 2, I can
also confirm that the UDF is now called twice instead of once. I am aware that
this is to be expected sometimes, but it is a change that has resulted in
performance degradation particularly for expensive UDFs. Obviously, there is
something specific to this query (ie the explode) that seems to be responsible
for this predicate and UDF issue occurring, but I am not sure what that is.
For reference, this is the same issue
(https://issues.apache.org/jira/browse/SPARK-35787)
> Query performance degradation additional predicate and UDF call for explode
> ---------------------------------------------------------------------------
>
> Key: SPARK-35882
> URL: https://issues.apache.org/jira/browse/SPARK-35882
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.1, 3.1.2
> Environment: Present on local ubuntu machine. Also CentOS VMs.
> Reporter: John Bateman
> Priority: Major
>
> This issue cannot be seen in 3.0.1, but has been introduced since and
> observed in 3.1.1 and 3.1.2. I have a reproduce for this issue here:
> [https://github.com/johnbateman/spark-udf-slowdown] just change the sbt file
> between 3.1.2 and 3.0.1 to observe the difference in performance. It is a
> rather silly example but it demonstrates the issue.
> Physical plan for 3.0.1, it executes on my machine in about 40 seconds.
>
> {code:java}
> == Physical Plan == Generate explode(fib#3), [id#1L, fib#3], false, [fib2#7]
> +- *(1) Project [id#1L, UDF(cast(id#1L as int)) AS fib#3]
> +- *(1) Range (1, 500000, step=1, splits=8)
> {code}
>
> Physical plan for 3.1.2, it executes on my machine in about 4.7 min.
>
> {code:java}
> == Physical Plan ==
> Generate (4)
> +- * Project (3)
> +- * Filter (2)
> +- * Range (1)
> (1) Range [codegen id : 1]
> Output [1]: [id#2L]
> Arguments: Range (1, 500000, step=1, splits=Some(8))
> (2) Filter [codegen id : 1]
> Input [1]: [id#2L]
> Condition : ((size(UDF(cast(id#2L as int)), true) > 0) AND
> isnotnull(UDF(cast(id#2L as int))))
> (3) Project [codegen id : 1]
> Output [2]: [id#2L, UDF(cast(id#2L as int)) AS fib#4]
> Input [1]: [id#2L]
> (4) Generate
> Input [2]: [id#2L, fib#4]
> Arguments: explode(fib#4), [id#2L, fib#4], false, [fib2#11]{code}
>
> You can see that there is an additional predicate generated in step 2, I can
> also confirm that the UDF is now called multiple times instead of once. I am
> aware that this is to be expected sometimes, but it is a change that has
> resulted in performance degradation particularly for expensive UDFs.
> Obviously, there is something specific to this query (ie the explode) that
> seems to be responsible for this predicate and UDF issue occurring, but I am
> not sure what that is.
> For reference, this is the same issue
> (https://issues.apache.org/jira/browse/SPARK-35787)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]