[jira] [Commented] (SPARK-41650) json expressions much slower in optimized mode
    [ https://issues.apache.org/jira/browse/SPARK-41650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650993#comment-17650993 ]

Yi Zhang commented on SPARK-41650:
----------------------------------

Just saw that SPARK-33078 added a config, spark.sql.optimizer.enableJsonExpressionOptimization, to turn this optimization on or off, so there is no need to change application code to turn it off.

> json expressions much slower in optimized mode
> ----------------------------------------------
>
>                 Key: SPARK-41650
>                 URL: https://issues.apache.org/jira/browse/SPARK-41650
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Structured Streaming
>    Affects Versions: 3.1.3, 3.2.2, 3.3.1
>            Reporter: Yi Zhang
>            Priority: Major
>
> I noticed that Spark Structured Streaming reading JSON strings from Kafka into
> a struct type is much slower in spark-3.1+ than in spark-3.0. Profiling reveals
> that the JSON expressions in spark-3.0 spend most of their time evaluating
> subExpr, while spark-3.1/3.2 spend a lot of time in writeField.
> I suspect this may be related to SPARK-32948, so I tried adding a bogus
> option:
>   from_json($"value", mySchema, Map("bogus_key" -> "bogus_value"))
> This turns off the optimization, and the performance is much better. For
> reference, for the same number of records, it is 30 seconds vs. 3 minutes on a
> task processing 500k records. This is a big difference for a streaming job.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41650) json expressions much slower in optimized mode
    [ https://issues.apache.org/jira/browse/SPARK-41650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17650114#comment-17650114 ]

Yi Zhang commented on SPARK-41650:
----------------------------------

[~gurwls223], [~viirya] can you help look into this?

> json expressions much slower in optimized mode
> ----------------------------------------------
>
>                 Key: SPARK-41650
>                 URL: https://issues.apache.org/jira/browse/SPARK-41650
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Structured Streaming
>    Affects Versions: 3.2.2
>            Reporter: Yi Zhang
>            Priority: Major
>
> I noticed that Spark Structured Streaming reading JSON strings from Kafka into
> a struct type is much slower in spark-3.1+ than in spark-3.0. Profiling reveals
> that the JSON expressions in spark-3.0 spend most of their time evaluating
> subExpr, while spark-3.1/3.2 spend a lot of time in writeField.
> I suspect this may be related to SPARK-32948, so I tried adding a bogus
> option:
>   from_json($"value", mySchema, Map("bogus_key" -> "bogus_value"))
> This turns off the optimization, and the performance is much better. For
> reference, for the same number of records, it is 30 seconds vs. 3 minutes on a
> task processing 500k records. This is a big difference for a streaming job.
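
For readers who want to try the two ways of switching off the optimization that come up in this thread (the bogus option from the description, and the spark.sql.optimizer.enableJsonExpressionOptimization flag from SPARK-33078), a minimal sketch follows. It is only an illustration under stated assumptions: the object name, the helper names, the two-field schema, and the in-memory sample row are hypothetical stand-ins for the reporter's Kafka source and mySchema, and it assumes Spark 3.1 or later, where the flag exists. It does not reproduce the reported timings.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

object JsonOptimizationToggle {
  // Hypothetical schema standing in for `mySchema` from the report.
  val mySchema: StructType = new StructType()
    .add("a", StringType)
    .add("b", StringType)

  // Way 1: disable the optimization for the whole session via the flag
  // that the comment attributes to SPARK-33078. No call-site change needed.
  def parseWithFlagOff(df: DataFrame): DataFrame = {
    df.sparkSession.conf.set(
      "spark.sql.optimizer.enableJsonExpressionOptimization", "false")
    df.select(from_json(col("value"), mySchema).as("parsed"))
  }

  // Way 2: the workaround from the issue description. Passing any option
  // map (here a meaningless key) is reported to skip the optimization.
  def parseWithBogusOption(df: DataFrame): DataFrame =
    df.select(
      from_json(col("value"), mySchema, Map("bogus_key" -> "bogus_value"))
        .as("parsed"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-expression-optimization-toggle")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample row in place of the Kafka `value` column.
    val df = Seq("""{"a":"1","b":"2"}""").toDF("value")

    parseWithFlagOff(df).show(false)
    parseWithBogusOption(df).show(false)

    spark.stop()
  }
}
```

The flag can presumably also be passed at submit time (for example with --conf on spark-submit), which matches the comment's point that application code does not need to change; the bogus-option form remains a per-expression workaround.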