[GitHub] spark pull request #15667: [SPARK-18107][SQL] Insert overwrite statement run...

viirya Thu, 27 Oct 2016 19:08:29 -0700

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/15667


    [SPARK-18107][SQL] Insert overwrite statement runs much slower in spark-sql 
than it does in hive-client

    ## What changes were proposed in this pull request?
    
    As reported on the JIRA, an insert overwrite statement runs much slower in 
Spark than in hive-client.
    
    There appears to be a patch, 
[HIVE-11940](https://github.com/apache/hive/commit/ba21806b77287e237e1aa68fa169d2a81e07346d),
 which largely improves insert overwrite performance in Hive. HIVE-11940 was 
applied after Hive 2.0.0.
    
    Because Spark SQL uses an older Hive library, we cannot benefit from this 
improvement.
    
    The reporter verified that there is also a big performance gap between Hive 
1.2.1 and Hive 2.0.1 on insert overwrite execution.
    
    Instead of upgrading to Hive 2.0 in Spark SQL, which might not be a trivial 
task, this patch provides an approach to delete the partition before asking 
Hive to load data files into the partition.
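    
    As a rough illustration of the delete-then-load idea, here is a minimal sketch on a local file system using only `java.nio.file`. It is not the code in this patch and does not call Hive or Hadoop APIs; the class name `OverwritePartitionSketch`, the methods `deleteRecursively` and `overwritePartition`, and the directory names are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-filesystem analogy; the real patch works through Hive.
public class OverwritePartitionSketch {

    /** Recursively deletes dir if it exists, children before parents. */
    public static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts nested entries ahead of their parent directory.
            List<Path> paths = walk.sorted(Comparator.reverseOrder())
                                   .collect(Collectors.toList());
            for (Path p : paths) {
                Files.delete(p);
            }
        }
    }

    /**
     * Overwrites partitionDir with the files staged in stagingDir:
     * first drop the old partition directory, then move the new files in.
     */
    public static void overwritePartition(Path stagingDir, Path partitionDir)
            throws IOException {
        deleteRecursively(partitionDir);        // step 1: delete the partition
        Files.createDirectories(partitionDir);  // step 2: recreate it empty
        try (Stream<Path> staged = Files.list(stagingDir)) {
            for (Path src : staged.collect(Collectors.toList())) {
                // step 3: load the new data files into the partition
                Files.move(src, partitionDir.resolve(src.getFileName()));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("overwrite-sketch");
        Path partition = Files.createDirectories(root.resolve("table/ds=2016-10-27"));
        Path staging = Files.createDirectories(root.resolve("staging"));
        Files.write(partition.resolve("old-part-00000"), "old".getBytes());
        Files.write(staging.resolve("part-00000"), "new".getBytes());

        overwritePartition(staging, partition);

        // List what the partition contains after the overwrite.
        try (Stream<Path> files = Files.list(partition)) {
            files.forEach(p -> System.out.println(p.getFileName()));
        }
        deleteRecursively(root);
    }
}
```

    In this sketch the old contents are gone before any new file is moved in, so the load step never has to replace existing files.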
    
    Note: since `Hive.loadTable` also uses the same function to replace files, it 
should have the same issue. We can take the same approach and delete the table 
first. I will update this PR to include that.
    
    ## How was this patch tested?
    
    Jenkins tests.
    
    There are existing tests that use the insert overwrite statement. Those tests 
should pass. I added a new test specifically for insert overwrite into a 
partition.
    
    As for the performance issue, I don't have a Hive 2.0 environment, so the 
reporter needs to verify this patch. Please refer to the JIRA.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 improve-hive-insertoverwrite

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15667.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15667
    
----
commit 81dbeb19e61a67a287a5762e391517eb55a20721
Author: Liang-Chi Hsieh <vii...@gmail.com>
Date:   2016-10-27T09:29:16Z

    Drop partition before insert overwrite to Hive table.

----


