GitHub user habren opened a pull request: https://github.com/apache/spark/pull/22018
[SPARK-25038][SQL] Accelerate Spark Plan generation when Spark SQL re…

https://issues.apache.org/jira/browse/SPARK-25038

When Spark SQL reads a large amount of data, it takes a long time (more than 10 minutes) to generate the physical plan and then the ActiveJob.

Example: there is a table which is partitioned by date and hour, with more than 13 TB of data per hour and 185 TB per day. Even when we issue a very simple SQL query, it takes a long time to generate the ActiveJob. The SQL statement is:

    select count(device_id) from test_tbl where date=20180731 and hour='21';

Before the optimization, it takes 2 minutes and 9 seconds to generate the job: the SQL is issued at 2018-08-07 09:07:41, but the job is not submitted until 2018-08-07 09:09:53, 2 minutes and 9 seconds after the SQL issue time.

After the optimization, it takes only 4 seconds to generate the job: the SQL is issued at 2018-08-07 09:20:15 and the job is submitted at 2018-08-07 09:20:19, 4 seconds after the SQL issue time.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/habren/spark SPARK-25038

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22018.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22018

----

commit 2bb5924e04eba5accfe58a4fbae094d46cc36488
Author: Jason Guo <jason.guo.vip@...>
Date: 2018-08-07T03:13:03Z

    [SPARK-25038][SQL] Accelerate Spark Plan generation when Spark SQL read large amount of data

----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org