[ https://issues.apache.org/jira/browse/YARN-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571719#comment-16571719 ]
Sunil Govindan edited comment on YARN-8561 at 8/7/18 3:00 PM:
--------------------------------------------------------------

Thanks [~leftnoteasy] for the effort. I have looked through the approach and the code. A few comments, a mix of major and minor :)

1. I think we can use the same CLI model as the client, where the CLI extends Configured and implements Tool. This helps with tests, and it also avoids the abstract run method, since Tool already declares it.
2. We could also stop a job from the CLI, correct? In that case, do we need to do anything more than a simple yarn app -kill appId?
3. I think we can use UnitsConversionUtil for unit conversion in CliUtils#parseResourcesString.
4. In CapSchedConfig, for absolute resources, we used pattern-matching code:
{code}
public static final String PATTERN_FOR_ABSOLUTE_RESOURCE = "^\\[[\\w\\.,\\-_=\\ /]+\\]$";
private static final Pattern RESOURCE_PATTERN = Pattern.compile(PATTERN_FOR_ABSOLUTE_RESOURCE);
{code}
Could we use the same in the CLI as well?
5. Maybe rename JobState to SubmarineJobState.
6. The command-line options look very clean and thorough. As we go forward, more CLI options will be added and it will become more complex. Could we load a profile into Submarine and then use the profile to get 80% of such config items? Given a profile, the user might only need to fill in 1 or 2 variable arguments.
7. DevelopperGuide.md ==> DeveloperGuide.md
8. In getServiceResourceFromYarnResource, I think we should get the resource list from ResourceUtils. It might also be better to use a common client/server util method to create the resource, something like Resource.newInstance(yarnResource) or Resources.createResource(yarnResource).
9. In verbose or debug mode, YarnServiceJobSubmitter could perhaps dump all contents of {{FileWriter fw}}.
10. It might be better to add a shutdown or interrupt signal to break out of JobMonitor#waitTrainingFinal if the job is faulty.
11. In fromServiceState, service state STOPPED is considered as JobState.SUCCEEDED.
12. Some commented-out code in JobStatusBuilder.
13. How could we increase the number of workers on a running job?

> [Submarine] Add initial implementation: training job submission and job history retrieve.
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-8561
>                 URL: https://issues.apache.org/jira/browse/YARN-8561
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Major
>         Attachments: YARN-8561.001.patch
>
>
> Added following parts:
> 1) New subcomponent of YARN, under applications/ project.
> 2) Tensorflow training job submission, including training (single node and distributed).
> - Support Docker container.
> - Support GPU isolation.
> - Support YARN registry DNS.
> 3) Retrieve job history.
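
Regarding comment 4 in the review: a minimal sketch of how the quoted pattern could be reused to validate a resource string on the CLI side. The pattern is copied from the comment; the class name AbsoluteResourceCheck, the helper isAbsoluteResource, and the sample strings are hypothetical and only illustrate the idea. This is not the patch's actual parsing code.

```java
import java.util.regex.Pattern;

public class AbsoluteResourceCheck {
  // Pattern copied verbatim from the review comment (item 4).
  public static final String PATTERN_FOR_ABSOLUTE_RESOURCE =
      "^\\[[\\w\\.,\\-_=\\ /]+\\]$";
  private static final Pattern RESOURCE_PATTERN =
      Pattern.compile(PATTERN_FOR_ABSOLUTE_RESOURCE);

  // Hypothetical helper: returns true when the value is a bracketed,
  // non-empty resource list made only of the characters the pattern allows.
  public static boolean isAbsoluteResource(String value) {
    return value != null && RESOURCE_PATTERN.matcher(value.trim()).matches();
  }

  public static void main(String[] args) {
    // Sample inputs are assumptions chosen for illustration.
    String[] samples = {"[memory=4G,vcores=2]", "memory=4G,vcores=2", "[]"};
    for (String s : samples) {
      System.out.println(s + " -> " + isAbsoluteResource(s));
    }
  }
}
```

The check only validates the overall shape of the string; splitting it into key=value pairs and converting units would still be a separate step.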