[jira] [Commented] (SPARK-26711) JSON Schema inference takes 15 times longer
[ https://issues.apache.org/jira/browse/SPARK-26711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752304#comment-16752304 ] Bruce Robbins commented on SPARK-26711:
---
[~hyukjin.kwon] Ok, that worked. I had a more verbose fix in mind. I will open a PR with the one-line change.

> JSON Schema inference takes 15 times longer
> -------------------------------------------
>
> Key: SPARK-26711
> URL: https://issues.apache.org/jira/browse/SPARK-26711
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Bruce Robbins
> Priority: Major
>
> I noticed that the first benchmark/case of JSONBenchmark ("JSON schema
> inferring", "No encoding") was taking an hour to run, when it used to run in
> 4-5 minutes.
> The culprit seems to be this commit:
> [https://github.com/apache/spark/commit/d72571e51d]
> A quick look using a profiler suggests it is spending 99% of its time
> doing some kind of exception handling in JsonInferSchema.scala.
> You can reproduce this in the spark-shell by recreating the data used by the
> benchmark:
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val rowsNum = 100 * 1000 * 1000
> spark.sparkContext.range(0, rowsNum, 1)
>   .map(_ => "a")
>   .toDF("fieldA")
>   .write
>   .option("encoding", "UTF-8")
>   .json("utf8.json")
> // Exiting paste mode, now interpreting.
> rowsNum: Int = 100000000
> scala>
> {noformat}
> Then you can run the test by hand, starting spark-shell like so (emulating
> SqlBasedBenchmark):
> {noformat}
> bin/spark-shell --driver-memory 8g \
>   --conf "spark.sql.autoBroadcastJoinThreshold=1" \
>   --conf "spark.sql.shuffle.partitions=1" --master "local[1]"
> {noformat}
> On commit d72571e51d:
> {noformat}
> scala> val start = System.currentTimeMillis; spark.read.json("utf8.json"); System.currentTimeMillis-start
> start: Long = 1548297682225
> res0: Long = 815978  <== 13.6 minutes
> scala>
> {noformat}
> On the previous commit (86100df54b):
> {noformat}
> scala> val start = System.currentTimeMillis; spark.read.json("utf8.json"); System.currentTimeMillis-start
> start: Long = 1548298927151
> res0: Long = 50087  <== 50 seconds
> scala>
> {noformat}
> I also tried {{spark.read.option("inferTimestamp", false).json("utf8.json")}},
> but at first that option didn't seem to make a difference in run time.
> Edit: {{inferTimestamp}} does, in fact, have an impact: disabling it halves
> the run time. However, that means even with {{inferTimestamp}} disabled, the
> run time is still 7 times slower than before.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
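The by-hand timing in the report can be wrapped in a small helper for repeated runs in the spark-shell. This is just a convenience sketch; the `timeMs` name is ours, not Spark's:

```scala
// Convenience sketch for the by-hand timing above: runs a block and
// returns its result together with the elapsed wall-clock milliseconds.
def timeMs[T](body: => T): (T, Long) = {
  val start = System.currentTimeMillis
  val result = body
  (result, System.currentTimeMillis - start)
}
```

Usage in the shell would then be, e.g., `val (df, elapsed) = timeMs(spark.read.json("utf8.json"))`.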
[jira] [Commented] (SPARK-26711) JSON Schema inference takes 15 times longer
[ https://issues.apache.org/jira/browse/SPARK-26711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752103#comment-16752103 ] Hyukjin Kwon commented on SPARK-26711:
--
Just open a PR that replaces the one line after manually testing it. I don't think we should update the benchmark again, since you're going to update it in https://github.com/apache/spark/pull/23336.
[jira] [Commented] (SPARK-26711) JSON Schema inference takes 15 times longer
[ https://issues.apache.org/jira/browse/SPARK-26711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752102#comment-16752102 ] Hyukjin Kwon commented on SPARK-26711:
--
Oh, right. That sounds like a good lead to follow. In that case, we can just make that val a `lazy val decimalTry`. Can you try it and open a PR?
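The `lazy val` suggestion can be sketched as follows. This is a simplified stand-in for the real JsonInferSchema code, not the actual implementation; the names are illustrative, and a counter is added to show the effect of laziness:

```scala
import scala.util.Try

object LazyDecimalSketch {
  var parseCount = 0 // counts how often the expensive parse actually runs

  // Simplified stand-in for the type-inference step: with `lazy val`,
  // the BigDecimal parse (and its potential NumberFormatException,
  // swallowed by Try) only happens when a branch forces decimalTry.
  def inferType(field: String, prefersDecimal: Boolean): String = {
    lazy val decimalTry = {
      parseCount += 1
      Try(new java.math.BigDecimal(field))
    }
    if (prefersDecimal && decimalTry.isSuccess) "DecimalType" else "StringType"
  }
}
```

With prefersDecimal disabled, `decimalTry` is never forced, so a non-numeric field like "a" no longer pays for a thrown-and-caught exception per value.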
[jira] [Commented] (SPARK-26711) JSON Schema inference takes 15 times longer
[ https://issues.apache.org/jira/browse/SPARK-26711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751655#comment-16751655 ] Bruce Robbins commented on SPARK-26711:
---
Re: 7 minutes vs. 50 seconds: looking at the code, the difference appears to be this:

Before the timestamp inference change, options.prefersDecimal was checked before attempting to convert the String to a BigDecimal. If options.prefersDecimal was disabled, we would not bother with the conversion.

After the timestamp inference change, we always attempt to convert the String to a BigDecimal, regardless of the setting of options.prefersDecimal (we still use options.prefersDecimal to determine what type to return).

My guess is that attempting to convert every string to a BigDecimal is very expensive.
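The before/after behaviour described in the comment can be sketched like this (a hypothetical simplification with made-up function names, not the actual JsonInferSchema.scala code):

```scala
import scala.util.Try

// Before the timestamp-inference commit (sketch): the option gates the
// parse, so with prefersDecimal disabled the expensive conversion never runs.
def inferBefore(field: String, prefersDecimal: Boolean): String =
  if (prefersDecimal && Try(new java.math.BigDecimal(field)).isSuccess) "DecimalType"
  else "StringType"

// After the commit (sketch): the parse always runs; prefersDecimal only
// picks the returned type. For a non-numeric field this means one thrown
// NumberFormatException per value, which matches what the profiler shows.
def inferAfter(field: String, prefersDecimal: Boolean): String = {
  val decimalTry = Try(new java.math.BigDecimal(field)) // unconditional
  if (decimalTry.isSuccess && prefersDecimal) "DecimalType" else "StringType"
}
```

Both versions return the same types for the same inputs; only where (and whether) the costly parse happens differs.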
[jira] [Commented] (SPARK-26711) JSON Schema inference takes 15 times longer
[ https://issues.apache.org/jira/browse/SPARK-26711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750748#comment-16750748 ] Hyukjin Kwon commented on SPARK-26711:
--
Hm, the results say something is wrong. 50 seconds vs. 7 minutes sounds serious.
[jira] [Commented] (SPARK-26711) JSON Schema inference takes 15 times longer
[ https://issues.apache.org/jira/browse/SPARK-26711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750741#comment-16750741 ] Bruce Robbins commented on SPARK-26711:
---
[~hyukjin.kwon]
inferTimestamp=true: ~13 min
inferTimestamp=false: ~7 min

7 minutes is a lot better than 13 minutes, but still not as good as 50 seconds. A quick look in the profiler shows that when inferTimestamp is _disabled_, Spark spends 96% of its time here:
{code:java}
val bigDecimal = decimalParser(field)
{code}
That line did change in the original commit.
[jira] [Commented] (SPARK-26711) JSON Schema inference takes 15 times longer
[ https://issues.apache.org/jira/browse/SPARK-26711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750713#comment-16750713 ] Hyukjin Kwon commented on SPARK-26711:
--
So what was the run time with {{inferTimestamp}} enabled vs. disabled? It would be odd if there were a regression even with {{inferTimestamp}} disabled; that path just adds one if-else comparison.
[jira] [Commented] (SPARK-26711) JSON Schema inference takes 15 times longer
[ https://issues.apache.org/jira/browse/SPARK-26711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750704#comment-16750704 ] Bruce Robbins commented on SPARK-26711:
---
ping [~maxgekk] [~hyukjin.kwon]