[ https://issues.apache.org/jira/browse/TEZ-4604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931078#comment-17931078 ]

Shohei Okumiya commented on TEZ-4604:
-------------------------------------

Thanks. Now I understand that you ran the compaction as a MapReduce job on Tez. 
I first attempted to reproduce it.

 

Note that `yarn.app.mapreduce.am.staging-dir=/user` is configured in my 
environment, so the path prefix differs slightly from the report. I tested 
with `mapreduce.framework.name=yarn-tez` and reproduced the reported issue.
{code:java}
$ hdfs dfs -copyFromLocal /opt/hadoop/README.txt /tmp/README.txt
...
$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /tmp/README.txt /tmp/output
...
2025-02-27 09:48:08,967 INFO mapreduce.Job: Job job_1740648231188_0003 completed successfully
2025-02-27 09:48:08,977 INFO mapreduce.Job: Counters: 0
$ hdfs dfs -ls /user/hdfs/.staging/job_1740648231188_0003
Found 5 items
drwx------   - hdfs supergroup          0 2025-02-27 09:48 /user/hdfs/.staging/job_1740648231188_0003/.tez
-rw-r--r--  10 hdfs supergroup     281350 2025-02-27 09:48 /user/hdfs/.staging/job_1740648231188_0003/job.jar
-rw-r--r--  10 hdfs supergroup        101 2025-02-27 09:48 /user/hdfs/.staging/job_1740648231188_0003/job.split
-rw-r--r--   3 hdfs supergroup        182 2025-02-27 09:48 /user/hdfs/.staging/job_1740648231188_0003/job.splitmetainfo
-rw-r--r--   3 hdfs supergroup     238974 2025-02-27 09:48 /user/hdfs/.staging/job_1740648231188_0003/job.xml
$ hdfs dfs -ls /user/history/done
$ hdfs dfs -ls /user/history/done_intermediate
{code}
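For reference, the reproduction above assumes settings along these lines (an illustrative mapred-site.xml fragment; the property keys are the standard Hadoop ones, with the values described above):
{code:xml}
<!-- Illustrative fragment matching the environment described above -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn-tez</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/user</value>
</property>
{code}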
As reported, a MapReduce app with `mapreduce.framework.name=yarn` does clean 
up the staging dir and produces job history files.
{code:java}
$ hdfs dfs -ls /user/hdfs/.staging
$ hdfs dfs -ls /user/history/done
$ hdfs dfs -ls /user/history/done_intermediate
Found 1 items
drwxrwx---   - hdfs hadoop          0 2025-02-27 09:55 /user/history/done_intermediate/hdfs
$ hdfs dfs -ls /user/history/done_intermediate/hdfs
Found 3 items
-rwxrwx---   3 hdfs hadoop      23122 2025-02-27 09:55 /user/history/done_intermediate/hdfs/job_1740649999044_0001-1740650106069-hdfs-word+count-1740650121478-1-1-SUCCEEDED-default-1740650111299.jhist
-rwxrwx---   3 hdfs hadoop        438 2025-02-27 09:55 /user/history/done_intermediate/hdfs/job_1740649999044_0001.summary
-rwxrwx---   3 hdfs hadoop     276218 2025-02-27 09:55 /user/history/done_intermediate/hdfs/job_1740649999044_0001_conf.xml
{code}
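For completeness, the only configuration change between the two runs was switching the framework back (illustrative fragment, same file as above):
{code:xml}
<!-- Illustrative fragment; the yarn run that cleans up correctly -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
{code}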

> Hive compaction in Tez does not delete files under staging directory
> --------------------------------------------------------------------
>
>                 Key: TEZ-4604
>                 URL: https://issues.apache.org/jira/browse/TEZ-4604
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Hiroyuki Nagaya
>            Priority: Critical
>         Attachments: createTable.sql.txt, hive-changed.xml, 
> hive-default.xml.template, mapred-site.xml, tez-changed.xml, 
> tez-default-template.xml
>
>
> I am using a combination of Hadoop, Hive, and Tez.
> When I run a major compaction with Hive on Tez, files under the staging 
> directory are not deleted.
> With MapReduce, the files are deleted from the staging directory and history 
> files are created in the history directory.
> Hadoop 3.3.6
> Hive 4.0.1
> Tez 0.10.4
> *1. When using MapReduce*
> The following files are deleted:
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620/job.jar
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620/job.split
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620/job.splitmetainfo
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1705466455536_3620/job.xml
> History files are created under the following directory:
> /tmp/hadoop-yarn/staging/history/done
> *2. When using Tez*
> The following files are not deleted:
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/.tez
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/job.jar
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/job.split
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/job.splitmetainfo
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002/job.xml
> No history files are created.
> *Is it a bug that the following directory is not deleted?*
> *Or is it a Tez configuration problem?*
> *I would like it to be deleted, because the process has completed 
> successfully and the directory is about 80 MB in size.*
> /tmp/hadoop-yarn/staging/hadoop/.staging/job_1740026697751_0002



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
