[jira] [Updated] (SPARK-35882) Query performance degradation additional predicate and UDF call for explode

John Bateman (Jira) Thu, 24 Jun 2021 10:52:08 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-35882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


John Bateman updated SPARK-35882:
---------------------------------
    Description: 
This issue cannot be seen in 3.0.1, but has been introduced since and observed 
in 3.1.1 and 3.1.2. I have a reproduce for this issue here: 
[https://github.com/johnbateman/spark-udf-slowdown] just change the sbt file 
between 3.1.2 and 3.0.1 to observe the difference in performance. It is a 
rather silly example but it demonstrates the issue.

Physical plan for 3.0.1, it executes on my machine in about 40 seconds.

 
{code:java}
== Physical Plan == Generate explode(fib#3), [id#1L, fib#3], false, [fib2#7] 
+- *(1) Project [id#1L, UDF(cast(id#1L as int)) AS fib#3] 
  +- *(1) Range (1, 500000, step=1, splits=8)
{code}
 

Physical plan for 3.1.2, it executes on my machine in about 4.7 min.

 
{code:java}
== Physical Plan ==
Generate (4)
+- * Project (3)
   +- * Filter (2)
      +- * Range (1)

(1) Range [codegen id : 1]
Output [1]: [id#2L]
Arguments: Range (1, 500000, step=1, splits=Some(8))

(2) Filter [codegen id : 1]
Input [1]: [id#2L]
Condition : ((size(UDF(cast(id#2L as int)), true) > 0) AND 
isnotnull(UDF(cast(id#2L as int))))

(3) Project [codegen id : 1]
Output [2]: [id#2L, UDF(cast(id#2L as int)) AS fib#4]
Input [1]: [id#2L]

(4) Generate
Input [2]: [id#2L, fib#4]
Arguments: explode(fib#4), [id#2L, fib#4], false, [fib2#11]{code}
 

You can see that there is an additional predicate generated in step 2, I can 
also confirm that the UDF is now called multiple times instead of once. I am 
aware that this is to be expected sometimes, but it is a change that has 
resulted in performance degradation particularly for expensive UDFs. Obviously, 
there is something specific to this query (ie the explode) that seems to be 
responsible for this predicate and UDF issue occurring, but I am not sure what 
that is.

For reference, this is the same issue 
(https://issues.apache.org/jira/browse/SPARK-35787)

  was:
This issue cannot be seen in 3.0.1, but has been introduced since and observed 
in 3.1.1 and 3.1.2. I have a reproduce for this issue here: 
[https://github.com/johnbateman/spark-udf-slowdown] just change the sbt file 
between 3.1.2 and 3.0.1 to observe the difference in performance. It is a 
rather silly example but it demonstrates the issue.

Physical plan for 3.0.1, it executes on my machine in about 40 seconds.

 
{code:java}
== Physical Plan == Generate explode(fib#3), [id#1L, fib#3], false, [fib2#7] 
+- *(1) Project [id#1L, UDF(cast(id#1L as int)) AS fib#3] 
  +- *(1) Range (1, 500000, step=1, splits=8)
{code}
 

Physical plan for 3.1.2, it executes on my machine in about 4.7 min.

 
{code:java}
== Physical Plan ==
Generate (4)
+- * Project (3)
   +- * Filter (2)
      +- * Range (1)

(1) Range [codegen id : 1]
Output [1]: [id#2L]
Arguments: Range (1, 500000, step=1, splits=Some(8))

(2) Filter [codegen id : 1]
Input [1]: [id#2L]
Condition : ((size(UDF(cast(id#2L as int)), true) > 0) AND 
isnotnull(UDF(cast(id#2L as int))))

(3) Project [codegen id : 1]
Output [2]: [id#2L, UDF(cast(id#2L as int)) AS fib#4]
Input [1]: [id#2L]

(4) Generate
Input [2]: [id#2L, fib#4]
Arguments: explode(fib#4), [id#2L, fib#4], false, [fib2#11]{code}
 

You can see that there is an additional predicate generated in step 2, I can 
also confirm that the UDF is now called twice instead of once. I am aware that 
this is to be expected sometimes, but it is a change that has resulted in 
performance degradation particularly for expensive UDFs. Obviously, there is 
something specific to this query (ie the explode) that seems to be responsible 
for this predicate and UDF issue occurring, but I am not sure what that is.

For reference, this is the same issue 
(https://issues.apache.org/jira/browse/SPARK-35787)


> Query performance degradation additional predicate and UDF call for explode
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-35882
>                 URL: https://issues.apache.org/jira/browse/SPARK-35882
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.1, 3.1.2
>         Environment: Present on local ubuntu machine. Also CentOS VMs.
>            Reporter: John Bateman
>            Priority: Major
>
> This issue cannot be seen in 3.0.1, but has been introduced since and 
> observed in 3.1.1 and 3.1.2. I have a reproduce for this issue here: 
> [https://github.com/johnbateman/spark-udf-slowdown] just change the sbt file 
> between 3.1.2 and 3.0.1 to observe the difference in performance. It is a 
> rather silly example but it demonstrates the issue.
> Physical plan for 3.0.1, it executes on my machine in about 40 seconds.
>  
> {code:java}
> == Physical Plan == Generate explode(fib#3), [id#1L, fib#3], false, [fib2#7] 
> +- *(1) Project [id#1L, UDF(cast(id#1L as int)) AS fib#3] 
>   +- *(1) Range (1, 500000, step=1, splits=8)
> {code}
>  
> Physical plan for 3.1.2, it executes on my machine in about 4.7 min.
>  
> {code:java}
> == Physical Plan ==
> Generate (4)
> +- * Project (3)
>    +- * Filter (2)
>       +- * Range (1)
> (1) Range [codegen id : 1]
> Output [1]: [id#2L]
> Arguments: Range (1, 500000, step=1, splits=Some(8))
> (2) Filter [codegen id : 1]
> Input [1]: [id#2L]
> Condition : ((size(UDF(cast(id#2L as int)), true) > 0) AND 
> isnotnull(UDF(cast(id#2L as int))))
> (3) Project [codegen id : 1]
> Output [2]: [id#2L, UDF(cast(id#2L as int)) AS fib#4]
> Input [1]: [id#2L]
> (4) Generate
> Input [2]: [id#2L, fib#4]
> Arguments: explode(fib#4), [id#2L, fib#4], false, [fib2#11]{code}
>  
> You can see that there is an additional predicate generated in step 2, I can 
> also confirm that the UDF is now called multiple times instead of once. I am 
> aware that this is to be expected sometimes, but it is a change that has 
> resulted in performance degradation particularly for expensive UDFs. 
> Obviously, there is something specific to this query (ie the explode) that 
> seems to be responsible for this predicate and UDF issue occurring, but I am 
> not sure what that is.
> For reference, this is the same issue 
> (https://issues.apache.org/jira/browse/SPARK-35787)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-35882) Query performance degradation additional predicate and UDF call for explode

Reply via email to