GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/22152

[SPARK-25159][SQL] json schema inference should only trigger one job

## What changes were proposed in this pull request?

This fixes a perf regression caused by https://github.com/apache/spark/pull/21376 .

We should not use `RDD#toLocalIterator`, which triggers one Spark job per RDD partition. This is very bad for RDDs with a lot of small partitions.

To fix it, this PR introduces a way to access SQLConf in the scheduler event loop thread, so that we don't need to use `RDD#toLocalIterator` anymore in `JsonInferSchema`.

## How was this patch tested?

a new test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark conf

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22152.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22152

----
commit cf13d71cb1b23ad6e5ad4644df8c591bfb7a00f9
Author: Wenchen Fan <wenchen@...>
Date:   2018-08-17T04:30:31Z

    allow accessing SQLConf in the scheduler event loop thread

----
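
To illustrate why one job per partition hurts, here is a toy model in plain Python. It is not Spark code and does not reflect the actual `JsonInferSchema` implementation: `ToyScheduler`, `infer_partition`, `infer_one_job_per_partition`, and `infer_single_job` are hypothetical names, and schema inference is reduced to collecting field names. The sketch only counts how many jobs each strategy would submit.

```python
from functools import reduce


class ToyScheduler:
    """Hypothetical stand-in for a job scheduler; it only counts submitted jobs."""

    def __init__(self):
        self.jobs_run = 0

    def run_job(self, partitions, func):
        # A single job may process any number of partitions.
        self.jobs_run += 1
        return [func(part) for part in partitions]


def infer_partition(rows):
    """Toy 'schema inference': the set of field names seen in one partition."""
    fields = set()
    for row in rows:
        fields |= set(row)  # iterating a dict yields its keys
    return fields


def infer_one_job_per_partition(scheduler, partitions):
    # Models the pattern the PR description attributes to RDD#toLocalIterator:
    # each partition is fetched by its own job.
    schema = set()
    for part in partitions:
        (partial,) = scheduler.run_job([part], infer_partition)
        schema |= partial
    return schema


def infer_single_job(scheduler, partitions):
    # Models the intended behaviour: one job covers every partition,
    # and the partial results are merged afterwards.
    partials = scheduler.run_job(partitions, infer_partition)
    return reduce(set.union, partials, set())


# Four small partitions of JSON-like records (made-up sample data).
partitions = [[{"a": 1}], [{"b": 2}], [{"a": 3, "c": 4}], []]

slow = ToyScheduler()
slow_schema = infer_one_job_per_partition(slow, partitions)

fast = ToyScheduler()
fast_schema = infer_single_job(fast, partitions)

print(slow.jobs_run, fast.jobs_run, slow_schema == fast_schema)
```

With the four sample partitions, the first strategy submits four jobs and the second submits one, while both produce the same set of field names. In this model the job count of the first strategy grows linearly with the number of partitions, which is the cost the PR description calls out for RDDs with many small partitions.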