[ https://issues.apache.org/jira/browse/SPARK-26711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752103#comment-16752103 ]
Hyukjin Kwon commented on SPARK-26711:
--------------------------------------

Just open a PR that replaces one line after manually testing it. I don't think we should update the benchmark again, since you're going to update it in https://github.com/apache/spark/pull/23336

> JSON Schema inference takes 15 times longer
> -------------------------------------------
>
>                 Key: SPARK-26711
>                 URL: https://issues.apache.org/jira/browse/SPARK-26711
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> I noticed that the first benchmark/case of JSONBenchmark ("JSON schema
> inferring", "No encoding") was taking an hour to run, when it used to run in
> 4-5 minutes.
> The culprit seems to be this commit:
> [https://github.com/apache/spark/commit/d72571e51d]
> A quick look using a profiler shows it spending 99% of its time
> doing some kind of exception handling in JsonInferSchema.scala.
> You can reproduce in the spark-shell by recreating the data used by the
> benchmark:
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>
> val rowsNum = 100 * 1000 * 1000
> spark.sparkContext.range(0, rowsNum, 1)
>   .map(_ => "a")
>   .toDF("fieldA")
>   .write
>   .option("encoding", "UTF-8")
>   .json("utf8.json")
>
> // Exiting paste mode, now interpreting.
> rowsNum: Int = 100000000
>
> scala>
> {noformat}
> Then you can run the test by hand, starting spark-shell as so (emulating
> SqlBasedBenchmark):
> {noformat}
> bin/spark-shell --driver-memory 8g \
>   --conf "spark.sql.autoBroadcastJoinThreshold=1" \
>   --conf "spark.sql.shuffle.partitions=1" --master "local[1]"
> {noformat}
> On commit d72571e51d:
> {noformat}
> scala> val start = System.currentTimeMillis; spark.read.json("utf8.json"); System.currentTimeMillis - start
> start: Long = 1548297682225
> res0: Long = 815978 <== 13.6 minutes
>
> scala>
> {noformat}
> On the previous commit (86100df54b):
> {noformat}
> scala> val start = System.currentTimeMillis; spark.read.json("utf8.json"); System.currentTimeMillis - start
> start: Long = 1548298927151
> res0: Long = 50087 <== 50 seconds
>
> scala>
> {noformat}
> I also tried {{spark.read.option("inferTimestamp", false).json("utf8.json")}}, but that option didn't seem to make a difference in run time. Edit: {{inferTimestamp}} does, in fact, have an impact: it halves the run time. However, that means even with {{inferTimestamp}}, the run time is still 7 times slower than before.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
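The profiler observation above (99% of the time in exception handling) is consistent with exception-driven type inference: attempting a timestamp parse on every string field and catching the failure is far more expensive than rejecting obvious non-timestamps with a cheap pre-check. A rough, standalone Python sketch of the effect follows; it is illustrative only (the timestamp format and helper names are assumptions, not Spark's actual JsonInferSchema code):

```python
import time
from datetime import datetime

# Strings that are never timestamps, like the "a" values in the benchmark data.
values = ["a"] * 200_000

def infer_with_exceptions(vals):
    # Try a full timestamp parse on every value; rely on the
    # raised ValueError to fall back to treating it as a string.
    fallbacks = 0
    for v in vals:
        try:
            datetime.strptime(v, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            fallbacks += 1  # fall back to string type
    return fallbacks

def infer_with_precheck(vals):
    # Cheap shape check first; only attempt the expensive parse
    # (and possible exception) for plausible-looking values.
    fallbacks = 0
    for v in vals:
        if len(v) >= 10 and v[:4].isdigit():
            try:
                datetime.strptime(v, "%Y-%m-%d %H:%M:%S")
                continue  # parsed as a timestamp
            except ValueError:
                pass
        fallbacks += 1
    return fallbacks

t0 = time.perf_counter()
infer_with_exceptions(values)
t_exc = time.perf_counter() - t0

t0 = time.perf_counter()
infer_with_precheck(values)
t_pre = time.perf_counter() - t0

print(f"exception-driven: {t_exc:.3f}s, pre-check: {t_pre:.3f}s")
```

On non-timestamp-heavy data like the benchmark's, the exception-driven loop pays the cost of raising and catching an exception per value, which is the kind of overhead a profiler reports as time spent in exception handling.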