[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
mingjie tang updated SPARK-18372:
---------------------------------
    Fix Version/s: 2.0.2

> .Hive-staging folders created from Spark hiveContext are not getting cleaned up
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-18372
>                 URL: https://issues.apache.org/jira/browse/SPARK-18372
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.2, 1.6.3
>         Environment: spark standalone and spark yarn
>            Reporter: mingjie tang
>             Fix For: 2.0.2
>
>         Attachments: _thumb_37664.png
>
>
> Steps to reproduce:
> ================
> 1. Launch spark-shell
> 2. Run the following scala code via Spark-Shell
> scala> val hivesampletabledf = sqlContext.table("hivesampletable")
> scala> import org.apache.spark.sql.DataFrameWriter
> scala> val dfw : DataFrameWriter = hivesampletabledf.write
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( clientid string, querytime string, market string, deviceplatform string, devicemake string, devicemodel string, state string, country string, querydwelltime double, sessionid bigint, sessionpagevieworder bigint )")
> scala> dfw.insertInto("hivesampletablecopypy")
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid,
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> scala> hivesampletablecopypydfdf.show
> 3. In HDFS (in our case, WASB), we can see the following folders:
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-10000
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> The issue is that these folders are never cleaned up, so they accumulate.
> =====
> With the customer, we tried setting "SET hive.exec.stagingdir=/tmp/hive;"
> in hive-site.xml. It didn't make any difference.
> .hive-staging folders are created under the <TableName> folder -
> hive/warehouse/hivesampletablecopypy/
> We also tried adding this property to hive-site.xml and restarting the
> components -
> <property>
>   <name>hive.exec.stagingdir</name>
>   <value>${hive.exec.scratchdir}/${user.name}/.staging</value>
> </property>
> A new .hive-staging folder was still created in the hive/warehouse/<tablename> folder.
> Moreover, if we run the same query in pure Hive via the Hive CLI on the same
> Spark cluster, we don't see this behavior, so it doesn't appear to be a Hive
> issue in this case. This is a Spark behavior.
> I checked in Ambari: spark.yarn.preserve.staging.files=false is already set
> in the Spark configuration.
> The issue happens via spark-submit as well. The customer used the following
> command to reproduce it -
> spark-submit test-hive-staging-cleanup.py
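
For anyone auditing a warehouse for these leftovers, the folder names listed in step 3 suggest a simple name check. The following is only a minimal sketch, not part of Spark or Hive: the object HiveStagingNames and its methods are hypothetical, and the ".hive-staging_hive_" prefix is an assumption taken from the paths in this report (a non-default hive.exec.stagingdir may produce different names). It only classifies path strings. Listing the table directory and deleting anything is left to whatever filesystem tooling is in use.

```scala
// Hypothetical helper for spotting leftover staging folders by name.
// Nothing here is a Spark or Hive API.
object HiveStagingNames {
  // Assumed prefix, copied from the folder names shown in the report.
  val StagingPrefix = ".hive-staging_hive_"

  /** Last segment of a slash-separated path, ignoring one trailing slash. */
  def leafName(path: String): String =
    path.stripSuffix("/").split('/').last

  /** True when the last path segment looks like a Hive staging folder. */
  def isStagingDir(path: String): Boolean =
    leafName(path).startsWith(StagingPrefix)

  /** Keeps only the direct children of a table folder that look like staging folders. */
  def stagingDirs(children: Seq[String]): Seq[String] =
    children.filter(isStagingDir)
}

// Usage: the first path is from the report, the second is a made-up data file.
val children = Seq(
  "hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666",
  "hive/warehouse/hivesampletablecopypy/part-00000"
)
HiveStagingNames.stagingDirs(children).foreach(println)
```

The check is applied to direct children of the table folder only, so nested entries such as -ext-10000 are reached through their parent staging folder rather than matched on their own.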