[ https://issues.apache.org/jira/browse/SPARK-25038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Guo updated SPARK-25038:
------------------------------
    Description:
When Spark SQL reads a large amount of data, it takes a long time (more than 10 minutes) to generate the physical plan and then the ActiveJob.

Example:
There is a table that is partitioned by date and hour. There is more than 13 TB of data each hour and 185 TB per day. Even when we issue a very simple SQL statement, it takes a long time to generate the ActiveJob.

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}

The SQL is issued at 2018-08-07 08:43:48
!image-2018-08-07-08-52-00-558.png!

However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 17 seconds later than the SQL issue time
!image-2018-08-07-08-52-09-648.png!

  was:
When Spark SQL reads a large amount of data, it takes a long time (more than 10 minutes) to generate the physical plan and then the ActiveJob.

Example:
There is a table that is partitioned by date and hour. There is more than 13 TB of data each hour and 185 TB per day. Even when we issue a very simple SQL statement, it takes a long time to generate the ActiveJob.

The SQL statement is
{code:java}
select count(device_id) from test_tbl where date=20180731 and hour='21';
{code}

The SQL is issued at 2018-08-07 08:43:48
!image-2018-08-07-08-48-28-753.png!

However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and 17 seconds later than the SQL issue time
!image-2018-08-07-08-47-06-321.png!

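As a quick check of the delay quoted in the description, the gap between the SQL issue time and the job submission time can be computed directly from the two timestamps. This is only a minimal sketch and not part of the original report; the helper name `submission_delay` and the format string `FMT` are assumptions introduced for illustration.

```python
from datetime import datetime

# Assumed timestamp layout, matching the way the times are written in the report.
FMT = "%Y-%m-%d %H:%M:%S"


def submission_delay(issued, submitted):
    """Return (minutes, seconds) elapsed between SQL issue and job submission."""
    delta = datetime.strptime(submitted, FMT) - datetime.strptime(issued, FMT)
    # Split the whole number of elapsed seconds into minutes and leftover seconds.
    return divmod(int(delta.total_seconds()), 60)


# Timestamps taken from the description above.
minutes, seconds = submission_delay("2018-08-07 08:43:48", "2018-08-07 08:46:05")
print(f"{minutes} min {seconds} s")  # prints "2 min 17 s"
```

The result agrees with the 2 minutes and 17 seconds stated in the description.
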
> Accelerate Spark Plan generation when Spark SQL read large amount of data
> -------------------------------------------------------------------------
>
>                 Key: SPARK-25038
>                 URL: https://issues.apache.org/jira/browse/SPARK-25038
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Jason Guo
>            Priority: Critical
>
> When Spark SQL reads a large amount of data, it takes a long time (more than
> 10 minutes) to generate the physical plan and then the ActiveJob.
>
> Example:
> There is a table that is partitioned by date and hour. There is more than
> 13 TB of data each hour and 185 TB per day. Even when we issue a very simple
> SQL statement, it takes a long time to generate the ActiveJob.
>
> The SQL statement is
> {code:java}
> select count(device_id) from test_tbl where date=20180731 and hour='21';
> {code}
>
> The SQL is issued at 2018-08-07 08:43:48
> !image-2018-08-07-08-52-00-558.png!
>
> However, the job is submitted at 2018-08-07 08:46:05, which is 2 minutes and
> 17 seconds later than the SQL issue time
> !image-2018-08-07-08-52-09-648.png!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)