GitHub user habren opened a pull request: https://github.com/apache/spark/pull/22018
[SPARK-25038][SQL] Accelerate Spark Plan generation when Spark SQL re…

https://issues.apache.org/jira/browse/SPARK-25038

When Spark SQL reads a large amount of data, it takes a long time (more than 10 minutes) to generate the physical plan and then the ActiveJob.

Example: there is a table which is partitioned by date and hour, with more than 13 TB of data per hour and 185 TB per day. Even when we issue a very simple SQL query, it takes a long time to generate the ActiveJob. The SQL statement is:

    select count(device_id) from test_tbl where date=20180731 and hour='21';

Before the optimization, it takes 2 minutes and 9 seconds to generate the job: the SQL is issued at 2018-08-07 09:07:41, but the job is not submitted until 2018-08-07 09:09:53, 2 minutes and 9 seconds after the SQL issue time.

After the optimization, it takes only 4 seconds to generate the job: the SQL is issued at 2018-08-07 09:20:15 and the job is submitted at 2018-08-07 09:20:19, 4 seconds after the SQL issue time.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/habren/spark SPARK-25038

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22018.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22018

----

commit 2bb5924e04eba5accfe58a4fbae094d46cc36488
Author: Jason Guo <jason.guo.vip@...>
Date: 2018-08-07T03:13:03Z

    [SPARK-25038][SQL] Accelerate Spark Plan generation when Spark SQL read large amount of data

----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org