[ https://issues.apache.org/jira/browse/SPARK-26711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce Robbins updated SPARK-26711:
----------------------------------
Description:

I noticed that the first benchmark/case of JSONBenchmark ("JSON schema inferring", "No encoding") was taking an hour to run, when it used to run in 4-5 minutes.

The culprit seems to be this commit: [https://github.com/apache/spark/commit/d72571e51d]

A quick look with a profiler suggests it is spending 99% of its time doing some kind of exception handling in JsonInferSchema.scala.

You can reproduce in the spark-shell by recreating the data used by the benchmark:
{noformat}
scala> :paste
val rowsNum = 100 * 1000 * 1000
spark.sparkContext.range(0, rowsNum, 1)
  .map(_ => "a")
  .toDF("fieldA")
  .write
  .option("encoding", "UTF-8")
  .json("utf8.json")

// Entering paste mode (ctrl-D to finish)

// Exiting paste mode, now interpreting.

rowsNum: Int = 100000000

scala>
{noformat}
Then you can run the test by hand, starting spark-shell as follows (emulating SqlBasedBenchmark):
{noformat}
bin/spark-shell --driver-memory 8g \
  --conf "spark.sql.autoBroadcastJoinThreshold=1" \
  --conf "spark.sql.shuffle.partitions=1" --master "local[1]"
{noformat}
On commit d72571e51d:
{noformat}
scala> val start = System.currentTimeMillis; spark.read.json("utf8.json"); System.currentTimeMillis-start
start: Long = 1548297682225
res0: Long = 815978  <== 13.6 minutes

scala>
{noformat}
On the previous commit (86100df54b):
{noformat}
scala> val start = System.currentTimeMillis; spark.read.json("utf8.json"); System.currentTimeMillis-start
start: Long = 1548298927151
res0: Long = 50087  <== 50 seconds

scala>
{noformat}
I also tried {{spark.read.option("inferTimestamp", false).json("utf8.json")}}, but that option didn't seem to make a difference in run time. Edit: {{inferTimestamp}} does, in fact, have an impact: it halves the run time. However, even with {{inferTimestamp}}, the run time is still 7 times slower than before.
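The profiler finding (99% of time in exception handling in JsonInferSchema.scala) is consistent with exception-driven type inference: attempting a timestamp parse on every string value and catching the failure for non-timestamp data. The sketch below is a minimal, self-contained illustration of that cost pattern, not Spark's actual code; the object name, the pre-check heuristic, and the row count are all made up for the example.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object ExceptionCostSketch {
  val fmt: DateTimeFormatter = DateTimeFormatter.ISO_LOCAL_DATE_TIME

  // Exception-driven inference: try to parse every value as a timestamp
  // and fall back to StringType when the parse throws.
  def inferWithExceptions(v: String): String =
    try { LocalDateTime.parse(v, fmt); "TimestampType" }
    catch { case _: Exception => "StringType" }

  // Cheap shape check before attempting the expensive, throwing parse.
  def inferWithPrecheck(v: String): String =
    if (v.length >= 19 && v.charAt(4) == '-') inferWithExceptions(v)
    else "StringType"

  // Wall-clock timing helper, returns elapsed milliseconds.
  def time(f: => Unit): Long = {
    val start = System.nanoTime(); f; (System.nanoTime() - start) / 1000000
  }

  def main(args: Array[String]): Unit = {
    // Non-timestamp field values, like "fieldA" in the benchmark data.
    val values = Array.fill(100 * 1000)("a")
    val slow = time(values.foreach(inferWithExceptions))
    val fast = time(values.foreach(inferWithPrecheck))
    println(s"exception-driven: ${slow} ms, pre-checked: ${fast} ms")
  }
}
```

On a typical JVM the exception-driven loop can be dramatically slower, since constructing and throwing an exception (stack trace and all) for every non-timestamp value dominates the work; a cheap pre-check avoids almost all of it, which would match the 13.6-minute vs. 50-second gap above.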
> JSON Schema inference takes 15 times longer
> -------------------------------------------
>
>                 Key: SPARK-26711
>                 URL: https://issues.apache.org/jira/browse/SPARK-26711
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Bruce Robbins
>            Priority: Major
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)