John Bateman created SPARK-35882:
------------------------------------

             Summary: Query performance degradation: additional predicate and 
UDF call for explode
                 Key: SPARK-35882
                 URL: https://issues.apache.org/jira/browse/SPARK-35882
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.1.2, 3.1.1
         Environment: Present on local ubuntu machine. Also CentOS VMs.
            Reporter: John Bateman


This issue does not occur in 3.0.1; it was introduced later and is observed 
in 3.1.1 and 3.1.2. I have a reproduction for this issue here: 
[https://github.com/johnbateman/spark-udf-slowdown]. Switch the Spark version 
in the sbt file between 3.1.2 and 3.0.1 to observe the difference in 
performance. It is a rather silly example, but it demonstrates the issue.

Physical plan for 3.0.1, where the query executes on my machine in about 40 seconds:

 
{code:java}
== Physical Plan ==
Generate explode(fib#3), [id#1L, fib#3], false, [fib2#7]
+- *(1) Project [id#1L, UDF(cast(id#1L as int)) AS fib#3]
   +- *(1) Range (1, 500000, step=1, splits=8)
{code}
 

Physical plan for 3.1.2, where the query executes on my machine in about 4.7 minutes:

 
{code:java}
== Physical Plan ==
Generate (4)
+- * Project (3)
   +- * Filter (2)
      +- * Range (1)

(1) Range [codegen id : 1]
Output [1]: [id#2L]
Arguments: Range (1, 500000, step=1, splits=Some(8))

(2) Filter [codegen id : 1]
Input [1]: [id#2L]
Condition : ((size(UDF(cast(id#2L as int)), true) > 0) AND 
isnotnull(UDF(cast(id#2L as int))))

(3) Project [codegen id : 1]
Output [2]: [id#2L, UDF(cast(id#2L as int)) AS fib#4]
Input [1]: [id#2L]

(4) Generate
Input [2]: [id#2L, fib#4]
Arguments: explode(fib#4), [id#2L, fib#4], false, [fib2#11]
{code}
 

You can see that an additional predicate is generated in step 2 (the Filter). 
I can also confirm that the UDF is now called twice instead of once. I am aware 
that this is to be expected sometimes, but it is a change that has resulted in 
performance degradation, particularly for expensive UDFs. Something specific 
to this query (i.e. the explode) seems to be responsible for the extra 
predicate and the repeated UDF call, but I am not sure what that is.
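
One way to check the number of UDF calls is to count them with an accumulator. The sketch below is not the code from the linked repository; it is a guess at the query shape based on the physical plans above. The object name `UdfCallCount`, the helper `countUdfCalls`, and the stand-in UDF body are all hypothetical, and it assumes Spark 3.0 or later for the `noop` sink.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, udf}

object UdfCallCount {
  // Runs a query shaped like the one in the plans above and returns how
  // many times the UDF was invoked. `rows` is the exclusive upper bound
  // of the range, so the input has rows - 1 rows.
  def countUdfCalls(spark: SparkSession, rows: Long): Long = {
    // Incremented on every UDF invocation. Task retries can inflate the
    // count, so treat it as an approximate lower-bound check.
    val calls = spark.sparkContext.longAccumulator("udfCalls")

    // Hypothetical stand-in for the expensive UDF: returns a non-empty array.
    val fib = udf { (n: Int) =>
      calls.add(1)
      Seq(n, n + 1)
    }

    val df = spark.range(1, rows)
      .withColumn("fib", fib(col("id").cast("int")))
      .withColumn("fib2", explode(col("fib")))

    // Print the physical plan to see whether a Filter containing the UDF appears.
    df.explain("formatted")

    // Force full evaluation without collecting rows to the driver.
    df.write.format("noop").mode("overwrite").save()

    calls.value.longValue()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[8]")
      .appName("udf-call-count")
      .getOrCreate()
    val rows = 500000L
    val n = countUdfCalls(spark, rows)
    println(s"input rows: ${rows - 1}, UDF calls: $n")
    spark.stop()
  }
}
```

If the behaviour described above holds, the printed call count should be close to the number of input rows on 3.0.1 and noticeably higher on 3.1.1 and 3.1.2.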

For reference, this is the same issue as SPARK-35787 
(https://issues.apache.org/jira/browse/SPARK-35787).


