[GitHub] spark pull request #16399: [SPARK-18237][SPARK-18703] [SPARK-18675] [SQL] [B...

gatorsmile Sun, 25 Dec 2016 17:17:40 -0800

GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/16399


    [SPARK-18237][SPARK-18703] [SPARK-18675] [SQL] [BACKPORT-2.1] CTAS for hive 
serde table should work for all hive versions AND Drop Staging Directories and 
Data Files

    ### What changes were proposed in this pull request?
    
    This PR is to backport https://github.com/apache/spark/pull/15744, 
https://github.com/apache/spark/pull/16104 and 
https://github.com/apache/spark/pull/16134. 
    
    ----------
    [[SPARK-18237][HIVE] hive.exec.stagingdir have no effect](https://github.com/apache/spark/pull/15744)
    
    hive.exec.stagingdir has no effect in Spark 2.0.1.
    Hive confs in hive-site.xml are loaded into hadoopConf, so we should use 
hadoopConf in InsertIntoHiveTable instead of SessionState.conf.
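    
    As a rough illustration of why the lookup source matters, here is a minimal Python sketch (not Spark's code). The two dictionaries and the `staging_dir` helper are hypothetical stand-ins for hadoopConf, SessionState.conf and the lookup in InsertIntoHiveTable. It assumes, as the description above implies, that a value set in hive-site.xml is visible only through hadoopConf. The `.hive-staging` default matches the directory names listed later in this description.

```python
# Hypothetical stand-ins: plain dicts instead of real configuration objects.
DEFAULT_STAGING_DIR = ".hive-staging"


def staging_dir(conf):
    """Return hive.exec.stagingdir from the given conf, or the default."""
    return conf.get("hive.exec.stagingdir", DEFAULT_STAGING_DIR)


# Assumption: values from hive-site.xml end up in the Hadoop configuration...
hadoop_conf = {"hive.exec.stagingdir": "/tmp/custom-staging"}
# ...but not in the session-level conf, so a lookup there misses the setting.
session_conf = {}

from_session = staging_dir(session_conf)  # falls back to the default
from_hadoop = staging_dir(hadoop_conf)    # picks up the user's setting
```

    Reading from the session-level dictionary silently falls back to the default, which is the "no effect" symptom described above.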
    
    ----------
    [[SPARK-18675][SQL] CTAS for hive serde table should work for all hive 
versions](https://github.com/apache/spark/pull/16104)
    
    
    Before hive 1.1, when inserting into a table, hive will create the staging 
directory under a common scratch directory. After the writing is finished, hive 
will simply empty the table directory and move the staging directory to it.
    
    From hive 1.1 onward, hive creates the staging directory under the table 
directory, and when moving the staging directory to the table directory, hive 
still empties the table directory, but excludes the staging directory there.
    
    In `InsertIntoHiveTable`, we simply copy the code from hive 1.2, which 
means we will always create the staging directory under the table directory, no 
matter what the hive version is. This causes problems if the hive version is 
prior to 1.1, because the staging directory will be removed by hive when hive 
is trying to empty the table directory.
    
    This PR copies the code from hive 0.13, so that we have two branches for 
creating the staging directory. If the hive version is prior to 1.1, we take the 
old-style branch (i.e. create the staging directory under a common scratch 
directory); otherwise, we take the new-style branch (i.e. create the staging 
directory under the table directory).
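    
    The version-dependent choice described above can be sketched as follows. This is a simplified Python illustration, not the Scala code in the PR. The function names, the example directories and the staging directory name are all hypothetical.

```python
import posixpath


def parse_version(version):
    """Turn a string such as '0.13.1' into (0, 13, 1) for tuple comparison."""
    return tuple(int(part) for part in version.split("."))


def staging_parent(hive_version, table_dir, scratch_dir):
    """Pick the directory under which the staging directory is created."""
    if parse_version(hive_version) < (1, 1):
        # Old style: staging lives under a common scratch directory, so it
        # survives when hive empties the table directory.
        return scratch_dir
    # New style: staging lives under the table directory, which newer hive
    # versions exclude when emptying that directory.
    return table_dir


def staging_path(hive_version, table_dir, scratch_dir,
                 name=".hive-staging_example"):
    """Join the chosen parent with a (hypothetical) staging directory name."""
    parent = staging_parent(hive_version, table_dir, scratch_dir)
    return posixpath.join(parent, name)


old_style = staging_path("0.13.1", "/warehouse/t", "/tmp/hive")
new_style = staging_path("1.2.1", "/warehouse/t", "/tmp/hive")
```

    With these example inputs, the 0.13.1 case resolves under the scratch directory and the 1.2.1 case under the table directory.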
    
    ----------
    [[SPARK-18703] [SQL] Drop Staging Directories and Data Files After each 
Insertion/CTAS of Hive serde Tables](https://github.com/apache/spark/pull/16134)
    
    Below are the files/directories generated for three inserts against a Hive 
table:
    ```
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/._SUCCESS.crc
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/.part-00000.crc
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/_SUCCESS
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/part-00000
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/._SUCCESS.crc
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/.part-00000.crc
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/_SUCCESS
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/part-00000
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/._SUCCESS.crc
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/.part-00000.crc
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/_SUCCESS
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/part-00000
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-00000.crc
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-00000
    ```
    
    The first 18 files/directories are temporary. They are not dropped until 
the JVM terminates. If the JVM does not terminate normally, these temporary 
files/directories will not be dropped.
    
    Only the last two files are needed, as shown below.
    ```
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-00000.crc
    
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-00000
    ```
    The temporary files/directories can accumulate quickly when we issue many 
inserts, since each insert generates at least six files. This can consume a lot 
of space and slow down JVM termination. When the JVM does not terminate 
appropriately, the files might not be dropped.
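    
    To make the count concrete, the following Python sketch separates staging entries from the data files that are actually needed. It is an illustration only. `is_staging_entry`, the shortened `table_dir` and the generated staging directory names are hypothetical, while the per-directory layout mirrors the listing above.

```python
import posixpath


def is_staging_entry(path, table_dir, prefix=".hive-staging"):
    """True if the path is a staging directory or lies inside one."""
    relative = posixpath.relpath(path, table_dir)
    first_component = relative.split("/")[0]
    return first_component.startswith(prefix)


table_dir = "/tmp/spark-example"  # hypothetical, shortened table directory

# Six entries per insert: the staging directory, -ext-10000, and four files.
STAGING_SUFFIXES = [
    "",
    "/-ext-10000",
    "/-ext-10000/._SUCCESS.crc",
    "/-ext-10000/.part-00000.crc",
    "/-ext-10000/_SUCCESS",
    "/-ext-10000/part-00000",
]

listing = [
    f"{table_dir}/.hive-staging_hive_{i}-1{suffix}"
    for i in range(3)
    for suffix in STAGING_SUFFIXES
]
listing += [f"{table_dir}/.part-00000.crc", f"{table_dir}/part-00000"]

temporary = [p for p in listing if is_staging_entry(p, table_dir)]
needed = [p for p in listing if not is_staging_entry(p, table_dir)]
```

    Three inserts at six entries each give the 18 temporary entries, leaving two needed files.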
    
    This PR drops the created staging files and temporary data files after 
each insert/CTAS.
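    
    The general idea of cleaning up right after each insert, rather than waiting for process exit, can be sketched in Python as below. This is not the PR's implementation. `insert_rows` is a hypothetical helper, and for brevity it overwrites a single `part-00000` file instead of reproducing real insert semantics.

```python
import os
import shutil
import tempfile


def insert_rows(table_dir, rows):
    """Write rows through a staging directory and remove it when done."""
    staging = tempfile.mkdtemp(prefix=".hive-staging_", dir=table_dir)
    try:
        staged_file = os.path.join(staging, "part-00000")
        with open(staged_file, "w") as out:
            out.writelines(row + "\n" for row in rows)
        # Move the finished data file into the table directory.
        os.replace(staged_file, os.path.join(table_dir, "part-00000"))
    finally:
        # Drop the staging directory immediately, even if the write failed,
        # instead of deferring cleanup to process exit.
        shutil.rmtree(staging, ignore_errors=True)


table_dir = tempfile.mkdtemp()
for batch in (["a"], ["b"], ["c"]):
    insert_rows(table_dir, batch)

remaining = sorted(os.listdir(table_dir))
```

    After three inserts, only the data file is left in the table directory; no staging directories accumulate.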
    
    
    ### How was this patch tested?
    Added test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark 'backport18703&18675ToSpark2.0'

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16399.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16399
    
----
commit b2abb8ad78b02fa8a7e623864c4b9e32fb1a8b6b
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-12-26T00:40:55Z

    backport SPARK-18237

commit 027b2655b560b0379482bbc66c3b871e9ad841a3
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-12-26T01:05:34Z

    backport SPARK-18675

commit 2482cdce5680ca5c9754fc759d18e4fefa3d8cd5
Author: gatorsmile <gatorsm...@gmail.com>
Date:   2016-12-26T01:13:02Z

    backport SPARK-18703

----


