[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15649386#comment-15649386 ]
mingjie tang commented on SPARK-18372:
--------------------------------------

The PR is https://github.com/apache/spark/pull/15819

> .Hive-staging folders created from Spark hiveContext are not getting cleaned
> up
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18372
>                 URL: https://issues.apache.org/jira/browse/SPARK-18372
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.2, 1.6.3
>         Environment: spark standalone and spark yarn
>            Reporter: mingjie tang
>             Fix For: 2.0.1
>
>
> Steps to reproduce:
> ================
> 1. Launch spark-shell.
> 2. Run the following Scala code in spark-shell:
> scala> val hivesampletabledf = sqlContext.table("hivesampletable")
> scala> import org.apache.spark.sql.DataFrameWriter
> scala> val dfw : DataFrameWriter = hivesampletabledf.write
> scala> sqlContext.sql("""CREATE TABLE IF NOT EXISTS hivesampletablecopypy (
> clientid string, querytime string, market string, deviceplatform string,
> devicemake string, devicemodel string, state string, country string,
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )""")
> scala> dfw.insertInto("hivesampletablecopypy")
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid,
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> scala> hivesampletablecopypydfdf.show
> 3. In HDFS (in our case, WASB), we can see the following folders:
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> The issue is that these folders are never cleaned up and keep accumulating.
> =====
> With the customer, we tried setting "SET hive.exec.stagingdir=/tmp/hive;"
> in hive-site.xml. It made no difference: the .hive-staging folders are
> still created under the <TableName> folder,
> hive/warehouse/hivesampletablecopypy/
> We also tried adding this property to hive-site.xml and restarting the
> components:
> <property>
> <name>hive.exec.stagingdir</name>
> <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> A new .hive-staging folder was still created in the
> hive/warehouse/<tablename> folder.
> Moreover, if we run the same query in pure Hive via the Hive CLI on the same
> Spark cluster, we do not see this behavior, so it does not appear to be a
> Hive issue in this case. It is Spark behavior.
> I checked in Ambari that spark.yarn.preserve.staging.files=false is already
> set in the Spark configuration.
> The issue happens via spark-submit as well. The customer used the following
> command to reproduce it:
> spark-submit test-hive-staging-cleanup.py
> Solution:
> This bug was reported by customers.
> The cause is that org.apache.spark.sql.hive.execution.InsertIntoHiveTable
> calls the Hive class (org.apache.hadoop.hive.) to create the staging
> directory. By default, on the Hive side, this staging directory is removed
> after the Hive session expires. However, Spark fails to notify Hive to
> remove the staging files.
> Thus, following the code of Spark 2.0.x, I wrote one function inside
> InsertIntoHiveTable that creates the .staging directory itself, so that the
> directory is removed once the Spark session ends.
> This update was tested on Spark 1.5.2 and Spark 1.6.3, and the pull
> request is:
> For the test, I manually checked for .staging files in the table's
> directory after closing the Spark shell. Meanwhile, please advise how to
> write a test case for this, because the directory of the related tables
> cannot be obtained.
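
The approach described in the Solution (create the staging directory from the Spark side and have it removed when the session ends) can be sketched roughly as follows. This is only an illustration and not the code from the pull request: the object StagingDirSketch, the helper createStagingDir, and the executionId parameter are hypothetical names. It assumes hadoop-common is on the classpath (as it is in spark-shell) and relies on Hadoop's FileSystem.deleteOnExit, which marks a path for recursive deletion when the FileSystem is closed or the JVM shuts down.

```scala
import java.io.IOException
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch, not the code from the pull request.
object StagingDirSketch {

  // Creates <tableDir>/.hive-staging_<executionId> and registers it for
  // deletion when `fs` is closed (or when the JVM exits).
  def createStagingDir(fs: FileSystem, tableDir: Path, executionId: String): Path = {
    val stagingDir = new Path(tableDir, s".hive-staging_$executionId")
    if (!fs.mkdirs(stagingDir)) {
      throw new IOException(s"Cannot create staging directory $stagingDir")
    }
    // Without this call the directory would be left behind, which is the
    // symptom reported in the issue.
    fs.deleteOnExit(stagingDir)
    stagingDir
  }

  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val localUri = URI.create("file:///") // assumption: demo runs on the local file system
    val fs = FileSystem.newInstance(localUri, conf)
    val tableDir = new Path(
      System.getProperty("java.io.tmpdir"),
      s"hivesampletablecopypy-${System.nanoTime()}")

    val staging = createStagingDir(fs, tableDir, "demo")
    println(s"exists while session is open: ${fs.exists(staging)}")

    fs.close() // closing the FileSystem processes the deleteOnExit list

    // Use a fresh FileSystem instance to check the result and tidy up.
    val check = FileSystem.newInstance(localUri, conf)
    println(s"exists after close: ${check.exists(staging)}")
    check.delete(tableDir, true)
    check.close()
  }
}
```

The main method above also hints at one possible shape for an automated test: point the helper at a temporary table directory, close the FileSystem, and then check that the staging directory is gone. Whether that fits Spark's own test suite is a separate question.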