[ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15649397#comment-15649397
 ] 

mingjie tang commented on SPARK-18372:
--------------------------------------

Solution: 
This bug was reported by customers.
The cause is that org.apache.spark.sql.hive.execution.InsertIntoHiveTable calls 
a Hive class (org.apache.hadoop.hive.) to create the staging directory. By 
default, on the Hive side, this staging directory is removed when the Hive 
session expires. However, Spark fails to notify Hive to remove the staging 
files.
Thus, following the Spark 2.0.x code, I wrote a function inside 
InsertIntoHiveTable that creates the .staging directory itself, so that the 
directory is removed when the Spark session expires.
This update was tested on Spark 1.5.2 and Spark 1.6.3, and the pull request 
is : 

For the test, I manually checked for .staging files in the table's 
directory after closing the spark shell. 
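
To illustrate the idea, here is a minimal sketch of creating a staging 
directory and registering it for cleanup with the Hadoop FileSystem API. It is 
not the code from the patch: the object name StagingDirSketch, both helper 
names, and the executionId parameter are hypothetical, and the directory naming 
is simplified. It assumes the Hadoop client libraries are on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object StagingDirSketch {

  // Hypothetical helper (not the patch itself): create a staging directory
  // under the table location and ask Hadoop to delete it on exit.
  def createStagingDir(tableDir: Path, executionId: String, conf: Configuration): Path = {
    val fs: FileSystem = tableDir.getFileSystem(conf)
    val staging = fs.makeQualified(new Path(tableDir, s".hive-staging_$executionId"))
    if (!fs.mkdirs(staging)) {
      throw new IllegalStateException(s"Cannot create staging directory '$staging'")
    }
    // Marks the path for deletion when this FileSystem instance is closed,
    // which normally also happens when the JVM shuts down.
    fs.deleteOnExit(staging)
    staging
  }

  // Hypothetical helper for the manual check: list the staging directories
  // that are currently present under a table directory.
  def leftoverStagingDirs(tableDir: Path, conf: Configuration): Seq[Path] = {
    val fs = tableDir.getFileSystem(conf)
    fs.listStatus(tableDir).toSeq
      .map(_.getPath)
      .filter(_.getName.startsWith(".hive-staging"))
  }
}
```

Calling StagingDirSketch.leftoverStagingDirs on the table directory after the 
shell has exited should return an empty sequence if the cleanup worked.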

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18372
>                 URL: https://issues.apache.org/jira/browse/SPARK-18372
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.2, 1.6.3
>         Environment: spark standalone and spark yarn 
>            Reporter: mingjie tang
>             Fix For: 2.0.1
>
>
> Steps to reproduce:
> ================
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )""") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> scala> hivesampletablecopypydfdf.show
> 3. In HDFS (in our case, WASB), we can see the following folders:
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> The issue is that these folders are never cleaned up, so they accumulate.
> =====
> With the customer, we tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml, but it didn't make any 
> difference.
> .hive-staging folders are created under the <TableName> folder - 
> hive/warehouse/hivesampletablecopypy/
> We also tried adding this property to hive-site.xml and restarting the 
> components -
> <property> 
> <name>hive.exec.stagingdir</name> 
> <value>${hive.exec.scratchdir}/${user.name}/.staging</value> 
> </property>
> A new .hive-staging folder was created in the hive/warehouse/<tablename> 
> folder.
> Moreover, if we run the Hive query in pure Hive via the Hive CLI on the same 
> Spark cluster, we don't see this behavior, so it doesn't appear to be a Hive 
> issue in this case; this is Spark behavior.
> I checked in Ambari: spark.yarn.preserve.staging.files=false is already set 
> in the Spark configuration.
> The issue happens via spark-submit as well. The customer used the following 
> command to reproduce it -
> spark-submit test-hive-staging-cleanup.py


