[jira] [Commented] (SPARK-41650) json expressions much slower in optimized mode

2022-12-21 Thread Yi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650993#comment-17650993
 ] 

Yi Zhang commented on SPARK-41650:
--

Just saw that SPARK-33078 added a config to turn it on or off, 
spark.sql.optimizer.enableJsonExpressionOptimization, so there is no need to 
change application code to turn it off.
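
A minimal sketch of how that config could be applied, assuming a Spark 3.1+ session. The app name and the local master are placeholders for illustration, not part of the original report:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: build a session with the JSON expression optimization disabled.
// "json-opt-off" and "local[*]" are placeholder values for illustration.
val spark = SparkSession.builder()
  .appName("json-opt-off")
  .master("local[*]")
  .config("spark.sql.optimizer.enableJsonExpressionOptimization", "false")
  .getOrCreate()

// As a SQL conf, it should also be changeable on an existing session.
spark.conf.set("spark.sql.optimizer.enableJsonExpressionOptimization", "false")
```

The same key can be passed at submit time with `--conf spark.sql.optimizer.enableJsonExpressionOptimization=false`, which leaves the application code untouched.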

> json expressions much slower in optimized mode
> --
>
> Key: SPARK-41650
> URL: https://issues.apache.org/jira/browse/SPARK-41650
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Structured Streaming
> Affects Versions: 3.1.3, 3.2.2, 3.3.1
> Reporter: Yi Zhang
> Priority: Major
>
> I noticed that Spark Structured Streaming reading JSON strings from Kafka 
> into a struct type is much slower in Spark 3.1+ than in Spark 3.0. Profiling 
> reveals that the JSON expressions in Spark 3.0 spend most of their time 
> evaluating subExpr, while Spark 3.1/3.2 spend a lot of time in writeField. 
> I suspect this may be related to SPARK-32948, so I tried adding a bogus 
> option: 
> from_json($"value", mySchema, Map("bogus_key" -> "bogus_value"))
> This turns off the optimization, and the performance is much better. For 
> reference, for the same number of records, it is 30 seconds vs. 3 minutes on 
> a task processing 500k records. This is a big difference for a streaming job.
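
The workaround described above can be sketched as follows. This is an illustration under stated assumptions, not the reporter's job: the schema, the sample row, and the static DataFrame standing in for the Kafka `value` column are all hypothetical, and the claim that a non-empty options map bypasses the optimization comes from the report itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Placeholder local session for illustration.
val spark = SparkSession.builder()
  .appName("from-json-workaround")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical schema and sample data standing in for the Kafka "value" column.
val mySchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))
val df = Seq("""{"id": 1, "name": "a"}""").toDF("value")

// Plain call with no options.
val plain = df.select(from_json($"value", mySchema).as("data"))

// Workaround from the report: pass a non-empty options map.
// "bogus_key" is not a known JSON option; it only serves to make the map non-empty.
val withBogusOption = df.select(
  from_json($"value", mySchema, Map("bogus_key" -> "bogus_value")).as("data"))
```

Both selects should parse the rows to the same struct; only the query plan is expected to differ, which is where the reported timing gap would come from.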



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41650) json expressions much slower in optimized mode

2022-12-20 Thread Yi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650114#comment-17650114
 ] 

Yi Zhang commented on SPARK-41650:
--

[~gurwls223], [~viirya] can you help look into this?

> json expressions much slower in optimized mode
> --
>
> Key: SPARK-41650
> URL: https://issues.apache.org/jira/browse/SPARK-41650
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Structured Streaming
> Affects Versions: 3.2.2
> Reporter: Yi Zhang
> Priority: Major
>
> I noticed that Spark Structured Streaming reading JSON strings from Kafka 
> into a struct type is much slower in Spark 3.1+ than in Spark 3.0. Profiling 
> reveals that the JSON expressions in Spark 3.0 spend most of their time 
> evaluating subExpr, while Spark 3.1/3.2 spend a lot of time in writeField. 
> I suspect this may be related to SPARK-32948, so I tried adding a bogus 
> option: 
> from_json($"value", mySchema, Map("bogus_key" -> "bogus_value"))
> This turns off the optimization, and the performance is much better. For 
> reference, for the same number of records, it is 30 seconds vs. 3 minutes on 
> a task processing 500k records. This is a big difference for a streaming job.


