[jira] [Commented] (TEZ-4604) tez-mapreduce does not delete files under staging directory

Shohei Okumiya (Jira) Sat, 05 Apr 2025 10:24:21 -0700


    [ 
https://issues.apache.org/jira/browse/TEZ-4604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937407#comment-17937407
 ]


Shohei Okumiya commented on TEZ-4604:
-------------------------------------

[~abstractdog] 

Thanks for giving me some tips.

> I think if the desirable default is to remove the whole directory while using 
> YarnRunner (which might be, if let's say hive does the same), we don't have 
> to introduce a configuration for that but instead should clean up 
> tezBaseStagingDir

> 2. DAG recovery is off, Tez session is started, a query runs successfully, 
> then AM is killed, but the hive tez session is still open, then new query is 
> submitted, hive tries to connect to the same session, so in the background, a 
> new AM is started, and the second query runs successfully there

Just to clarify: Hive on Tez submits a pure Tez application without using 
YARNRunner, and the directory shown below is allocated for each app master. The 
"/tmp/hive" part comes from "hive.exec.scratchdir". I understand your scenario 
as follows: the first AM in a Tez session stops, 
"/tmp/hive/{user}/_tez_session_dir/{session id, e.g., 
6811c3d6-aa13-41ef-844a-826f1af11bc1}" is cleaned up, and then the second AM is 
launched. Is that what you meant?

 
{code:java}
/tmp/hive/{user}/_tez_session_dir/{session id, e.g., 6811c3d6-aa13-41ef-844a-826f1af11bc1}/.tez/{app id}
{code}
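
To make the layout above concrete, here is a minimal sketch of how the per-AM staging directory nests under the session directory. The class name TezSessionPaths and both helper methods are hypothetical and written only for illustration; they are not Hive or Tez APIs, and local java.nio paths stand in for HDFS paths. The point is that the ".tez/{app id}" directory is a child of the session directory, so removing the session directory would also remove whatever a later AM in the same session expects to find there.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Hive or Tez.
class TezSessionPaths {

    // Session directory: {scratch dir}/{user}/_tez_session_dir/{session id}
    // The scratch dir is assumed to be the value of hive.exec.scratchdir.
    static Path sessionDir(String scratchDir, String user, String sessionId) {
        return Paths.get(scratchDir, user, "_tez_session_dir", sessionId);
    }

    // Per-AM staging directory: {session dir}/.tez/{app id}
    static Path amStagingDir(Path sessionDir, String appId) {
        return sessionDir.resolve(".tez").resolve(appId);
    }
}
```

For example, with the scratch dir "/tmp/hive" and the session id quoted above, amStagingDir(sessionDir(...), appId).startsWith(sessionDir(...)) holds for any app id, which is why the cleanup scope matters for the second-AM scenario.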
 

> 1. DAG recovery is on, AM is killed, then restarted, recovery files are 
> successfully read and DAG is successfully recovered

I added a test case that covers this scenario:

https://github.com/apache/tez/pull/395/commits/a1650eb3e7e35b228dcd2412f13f002de28d656b

 

> tez-mapreduce does not delete files under staging directory
> -----------------------------------------------------------
>
>                 Key: TEZ-4604
>                 URL: https://issues.apache.org/jira/browse/TEZ-4604
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Hiroyuki Nagaya
>            Assignee: Shohei Okumiya
>            Priority: Critical
>         Attachments: createTable.sql.txt, hive-changed.xml, 
> hive-default.xml.template, mapred-site.xml, tez-changed.xml, 
> tez-default-template.xml
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I am using a combination of Hadoop, Hive and Tez.
> When I run major compaction with Hive, files under the staging directory are 
> not deleted.
> With Mapreduce, files are deleted from the staging directory and files are 
> created in the history directory.
> Hadoop 3.3.6
> Hive 4.0.1
> Tez 0.10.4
> *1. When using Mapreduce*
> The following data will be deleted.
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620/job.jar
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620/job.split
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620/job.splitmetainfo
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620/job.xml
> Historical data will be created in the following directories
> /tmp/hadoop-yarn/staging/history/done
> *2. When using Tez*
> The following data will not be deleted
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/.tez
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/job.jar
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/job.split
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/job.splitmetainfo
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/job.xml
> No historical data will be created.
> *Is it a bug that the following directories are not deleted?*
> *Or is it a Tez configuration problem?*
> *I would like it to be deleted because the process has been completed 
> successfully and it is about 80MB in size.*
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
