[ https://issues.apache.org/jira/browse/SPARK-26711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750741#comment-16750741 ]
Bruce Robbins commented on SPARK-26711:
---------------------------------------

[~hyukjin.kwon]

inferTimestamp=<default>: ~13 min
inferTimestamp=false: ~7 min

7 minutes is a lot better than 13 minutes, but still not as good as 50 seconds.

A quick look in the profiler shows that in the case where inferTimestamp is _disabled_, Spark is spending 96% of its time here:
{code:java}
val bigDecimal = decimalParser(field)
{code}
That line did change in the original commit.

> JSON Schema inference takes 15 times longer
> -------------------------------------------
>
> Key: SPARK-26711
> URL: https://issues.apache.org/jira/browse/SPARK-26711
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Bruce Robbins
> Priority: Major
>
> I noticed that the first benchmark/case of JSONBenchmark ("JSON schema
> inferring", "No encoding") was taking an hour to run, when it used to run in
> 4-5 minutes.
> The culprit seems to be this commit:
> [https://github.com/apache/spark/commit/d72571e51d]
> A quick look using a profiler, and it seems to be spending 99% of its time
> doing some kind of exception handling in JsonInferSchema.scala.
> You can reproduce in the spark-shell by recreating the data used by the
> benchmark:
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
>
> val rowsNum = 100 * 1000 * 1000
> spark.sparkContext.range(0, rowsNum, 1)
>   .map(_ => "a")
>   .toDF("fieldA")
>   .write
>   .option("encoding", "UTF-8")
>   .json("utf8.json")
>
> // Exiting paste mode, now interpreting.
> rowsNum: Int = 100000000
>
> scala>
> {noformat}
> Then you can run the test by hand, starting spark-shell like so (emulating
> SqlBasedBenchmark):
> {noformat}
> bin/spark-shell --driver-memory 8g \
>   --conf "spark.sql.autoBroadcastJoinThreshold=1" \
>   --conf "spark.sql.shuffle.partitions=1" --master "local[1]"
> {noformat}
> On commit d72571e51d:
> {noformat}
> scala> val start = System.currentTimeMillis; spark.read.json("utf8.json"); System.currentTimeMillis-start
> start: Long = 1548297682225
> res0: Long = 815978 <== 13.6 minutes
> scala>
> {noformat}
> On the previous commit (86100df54b):
> {noformat}
> scala> val start = System.currentTimeMillis; spark.read.json("utf8.json"); System.currentTimeMillis-start
> start: Long = 1548298927151
> res0: Long = 50087 <== 50 seconds
> scala>
> {noformat}
> I also tried {{spark.read.option("inferTimestamp", false).json("utf8.json")}},
> but that option didn't seem to make a difference in run time.
> Edit: {{inferTimestamp}} does, in fact, have an impact: it halves the run
> time. However, that means even with {{inferTimestamp}}, the run time is
> still 7 times slower than before.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
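As an editor's note on the profiler finding above: the pattern being measured is try-to-parse-and-catch, where every non-numeric field pays for a thrown `NumberFormatException`. The sketch below is a minimal, hypothetical Java illustration of that cost (the names `parsesAsDecimal` and `looksNumeric` are invented here; this is not Spark's actual `decimalParser` or its fix):

```java
import java.math.BigDecimal;

public class DecimalParseCost {
    // Exception-driven parse: attempt BigDecimal on every field and treat
    // NumberFormatException as "not a decimal". For a 100M-row file of
    // fields like "a", this constructs and throws 100M exceptions.
    public static boolean parsesAsDecimal(String field) {
        try {
            new BigDecimal(field);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Cheap pre-check that skips the exception for fields that obviously
    // cannot be numbers (illustrative guard only).
    public static boolean looksNumeric(String field) {
        if (field.isEmpty()) return false;
        char c = field.charAt(0);
        return c == '-' || c == '+' || c == '.' || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        final int rows = 1_000_000; // far fewer than the benchmark's 100M rows

        long t0 = System.nanoTime();
        for (int i = 0; i < rows; i++) parsesAsDecimal("a");
        long exceptionPathMs = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        for (int i = 0; i < rows; i++) {
            if (looksNumeric("a")) parsesAsDecimal("a");
        }
        long guardedPathMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.println("exception path: " + exceptionPathMs + " ms");
        System.out.println("guarded path:   " + guardedPathMs + " ms");
    }
}
```

Run in isolation, the guarded loop should be dramatically cheaper than the exception-throwing loop, which is consistent with the issue's observation that nearly all of the inference time is spent in exception handling.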