John Bateman created SPARK-35882:
------------------------------------
Summary: Query performance degradation additional predicate and
UDF call for explode
Key: SPARK-35882
URL: https://issues.apache.org/jira/browse/SPARK-35882
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 3.1.2, 3.1.1
Environment: Present on local ubuntu machine. Also CentOS VMs.
Reporter: John Bateman
This issue cannot be seen in 3.0.1, but has been introduced since and observed
in 3.1.1 and 3.1.2. I have a reproduce for this issue here:
[https://github.com/johnbateman/spark-udf-slowdown] just change the sbt file
between 3.1.2 and 3.0.1 to observe the difference in performance. It is a
rather silly example but it demonstrates the issue.
Physical plan for 3.0.1, it executes on my machine in about 40 seconds.
{code:java}
== Physical Plan == Generate explode(fib#3), [id#1L, fib#3], false, [fib2#7]
+- *(1) Project [id#1L, UDF(cast(id#1L as int)) AS fib#3]
+- *(1) Range (1, 500000, step=1, splits=8)
{code}
Physical plan for 3.1.2, it executes on my machine in about 4.7 min.
{code:java}
== Physical Plan ==
Generate (4)
+- * Project (3)
+- * Filter (2)
+- * Range (1)
(1) Range [codegen id : 1]
Output [1]: [id#2L]
Arguments: Range (1, 500000, step=1, splits=Some(8))
(2) Filter [codegen id : 1]
Input [1]: [id#2L]
Condition : ((size(UDF(cast(id#2L as int)), true) > 0) AND
isnotnull(UDF(cast(id#2L as int))))
(3) Project [codegen id : 1]
Output [2]: [id#2L, UDF(cast(id#2L as int)) AS fib#4]
Input [1]: [id#2L]
(4) Generate
Input [2]: [id#2L, fib#4]
Arguments: explode(fib#4), [id#2L, fib#4], false, [fib2#11]{code}
You can see that there is an additional predicate generated in step 2, I can
also confirm that the UDF is now called twice instead of once. I am aware that
this is to be expected sometimes, but it is a change that has resulted in
performance degradation particularly for expensive UDFs. Obviously, there is
something specific to this query (ie the explode) that seems to be responsible
for this predicate and UDF issue occurring, but I am not sure what that is.
For reference, this is the same issue
(https://issues.apache.org/jira/browse/SPARK-35787)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]