[ 
https://issues.apache.org/jira/browse/TEZ-4604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17935256#comment-17935256
 ] 

László Bodor commented on TEZ-4604:
-----------------------------------

Thanks a lot [~gaya] and [~okumin] for this discussion.
Be aware that upstream applications like Hive can take control of staging dir 
handling, most probably including the removal. This is what happens in a 
standard Hive on Tez session, around here:
https://github.com/apache/hive/blob/a7baee7eae69d3d375f0eecb904f6b6371507ebb/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java#L791
This can also explain why the same removal doesn't happen on the YarnRunner 
codepath, which is an "interceptor" that makes MR jobs run as Tez applications.
I think if the desirable default is to remove the whole directory while using 
YarnRunner (which it might be if, let's say, Hive does the same), we don't have 
to introduce a configuration for that and should instead clean up 
tezBaseStagingDir.
However, I'm not 99% sure that's the correct behavior, so it needs to be tested 
in at least 2 scenarios:

1. DAG recovery is on, the AM is killed and then restarted, the recovery files 
are successfully read, and the DAG is successfully recovered.
2. DAG recovery is off, a Tez session is started, a query runs successfully, 
then the AM is killed while the Hive Tez session is still open. A new query is 
then submitted, Hive tries to connect to the same session, so in the 
background a new AM is started, and the second query runs successfully there.
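
To make the "clean up the whole tezBaseStagingDir" idea concrete, here is a 
minimal sketch of removing a complete job staging directory rather than only 
parts of it. It is not the Tez or Hive code: it uses java.nio on the local 
filesystem as a stand-in for HDFS, and the class name StagingCleanupSketch, the 
helper deleteRecursively and the job directory name are hypothetical. The file 
names mirror the listing in the issue description. A real change would go 
through the Hadoop FileSystem API (for example a recursive 
FileSystem.delete(path, true)).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class StagingCleanupSketch {

    // Hypothetical helper: removes a directory tree, children before parents.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return; // nothing to clean up
        }
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order visits the deepest entries first, so every
            // directory is already empty when it is deleted.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        // Local stand-in for .../.staging/job_<id>; the id is made up.
        Path staging = Files.createTempDirectory("staging");
        Path jobDir = Files.createDirectories(
                staging.resolve("job_0000000000000_0001"));
        Files.createDirectories(jobDir.resolve(".tez"));
        Files.createFile(jobDir.resolve("job.jar"));
        Files.createFile(jobDir.resolve("job.split"));
        Files.createFile(jobDir.resolve("job.splitmetainfo"));
        Files.createFile(jobDir.resolve("job.xml"));

        // Remove the whole job directory, not just the .tez subdirectory.
        deleteRecursively(jobDir);
        System.out.println("job dir still exists: " + Files.exists(jobDir));

        deleteRecursively(staging); // remove the temporary parent as well
    }
}
```

Whether such a removal is safe still depends on the two scenarios above, since 
recovery data kept under the staging directory must survive an AM restart.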


> tez-mapreduce does not delete files under staging directory
> -----------------------------------------------------------
>
>                 Key: TEZ-4604
>                 URL: https://issues.apache.org/jira/browse/TEZ-4604
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Hiroyuki Nagaya
>            Assignee: Shohei Okumiya
>            Priority: Critical
>         Attachments: createTable.sql.txt, hive-changed.xml, 
> hive-default.xml.template, mapred-site.xml, tez-changed.xml, 
> tez-default-template.xml
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I am using a combination of Hadoop, Hive and Tez.
> When I run major compaction with Hive, files under the staging directory are 
> not deleted.
> With Mapreduce, files are deleted from the staging directory and files are 
> created in the history directory.
> Hadoop 3.3.6
> Hive 4.0.1
> Tez 0.10.4
> *1. When using Mapreduce*
> The following data will be deleted.
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620/job.jar
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620/job.split
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620/job.splitmetainfo
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620/job.xml
> Historical data will be created in the following directories
> /tmp/hadoop-yarn/staging/history/done
> *2. When using Tez*
> The following data will not be deleted
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/.tez
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/job.jar
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/job.split
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/job.splitmetainfo
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/job.xml
> No historical data will be created.
> *Is it a bug that the following directories are not deleted?*
> *Or is it a Tez configuration problem?*
> *I would like it to be deleted because the process has been completed 
> successfully and it is about 80MB in size.*
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
