[ https://issues.apache.org/jira/browse/YARN-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621339#comment-16621339 ]
Wangda Tan commented on YARN-8725: ---------------------------------- Thanks [~tangzhankun], {quote}But the job failed due to invalid path passed to "--job-dir" per my testing. It should be a URI start with "hdfs://". {quote} The issue is handled by YARN-8757. {quote}... Because the user is better to not know so such details {quote} I think it is fine since we don't suppose user set to the location if manual checkpoint_path being specified. {quote}Could you please elaborate on this? {quote} Sure, launch of worker/ps rely on these scripts localization. For default case, Submarine client exits when application submitted to YARN. And this is before worker/ps launch. If we do cleanup staging dir right before Submarine client exits, it is very likely that ps/worker launch will be failed. To be able to handle app failure, etc. cases, instead of adding cleanup logics to cli, it's better to have a server to handle this. > Submarine job staging directory has a lot of useless > PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple > times > -------------------------------------------------------------------------------------------------------------------------------------- > > Key: YARN-8725 > URL: https://issues.apache.org/jira/browse/YARN-8725 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Zac Zhou > Assignee: Zhankun Tang > Priority: Major > Attachments: YARN-8725-trunk.001.patch > > > Submarine jobs upload core-site.xml, hdfs-site.xml, job.info and > PRIMARY_WORKER-launch-script****.sh to staging dir. > The core-site.xml, hdfs-site.xml and job.info would be overwritten if a job > is submitted multiple times. > But PRIMARY_WORKER-launch-script****.sh would not be overwritten, as it has > random numbers in its name. > The files in the staging dir are as follows: > {code:java} > -rw-r----- 2 hadoop hdfs 580 2018-08-17 10:11 > hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script6954941665090337726.sh > -rw-r----- 2 hadoop hdfs 580 2018-08-17 10:02 > hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script7037369696166769734.sh > -rw-r----- 2 hadoop hdfs 580 2018-08-17 10:06 > hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8047707294763488040.sh > -rw-r----- 2 hadoop hdfs 15225 2018-08-17 18:46 > hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8122565781159446375.sh > -rw-r----- 2 hadoop hdfs 580 2018-08-16 20:48 > hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8598604480700049845.sh > -rw-r----- 2 hadoop hdfs 580 2018-08-17 14:53 > hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script971703616848859353.sh > -rw-r----- 2 hadoop hdfs 580 2018-08-17 10:16 > hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script990214235580089093.sh > -rw-r----- 2 hadoop hdfs 8815 2018-08-27 15:54 > hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/core-site.xml > -rw-r----- 2 hadoop hdfs 11583 2018-08-27 15:54 > hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/hdfs-site.xml > -rw-rw-rw- 2 hadoop hdfs 846 2018-08-22 10:56 > hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/job.info > {code} > > We should stop the staging dir from growing or have a way to clean it up -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org