[jira] [Commented] (YARN-8141) YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec

2018-05-10 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470877#comment-16470877 ]

Wangda Tan commented on YARN-8141:
--

Thanks [~csingh], 

Overall the patch looks good; it would be better to make sure native service is 
not broken by this. Could you try it on a cluster and see if it works?

> YARN Native Service: Respect 
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
> --
>
> Key: YARN-8141
> URL: https://issues.apache.org/jira/browse/YARN-8141
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-8141.001.patch, YARN-8141.002.patch
>
>
> The existing YARN native service overwrites 
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS regardless of whether the 
> user specified it in the service spec. It is important to allow users to 
> mount local folders like /etc/passwd, etc.
> The following logic, inside AbstractLauncher.java, overwrites the 
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS environment variable:
> {code:java}
> StringBuilder sb = new StringBuilder();
> for (Entry<String, String> mount : mountPaths.entrySet()) {
>   if (sb.length() > 0) {
>     sb.append(",");
>   }
>   sb.append(mount.getKey());
>   sb.append(":");
>   sb.append(mount.getValue());
> }
> env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS",
>     sb.toString());{code}
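> A minimal sketch of the kind of fix this implies (reusing the {{env}} / 
> {{mountPaths}} variables from the snippet above; not the committed patch): 
> only set the variable when the user has not already specified it, and 
> otherwise append to the user-specified value.
> {code:java}
> String mountsVar = "YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS";
> StringBuilder sb = new StringBuilder();
> for (Entry<String, String> mount : mountPaths.entrySet()) {
>   if (sb.length() > 0) {
>     sb.append(",");
>   }
>   sb.append(mount.getKey()).append(":").append(mount.getValue());
> }
> // Respect mounts already present in the user's service spec instead of
> // overwriting them: append the service-generated mounts.
> String userMounts = env.get(mountsVar);
> if (userMounts != null && !userMounts.isEmpty()) {
>   env.put(mountsVar, userMounts + "," + sb);
> } else {
>   env.put(mountsVar, sb.toString());
> }{code}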






[jira] [Commented] (YARN-8141) YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec

2018-05-09 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469758#comment-16469758 ]

Wangda Tan commented on YARN-8141:
--

[~csingh],

Thanks for working on the fix. I don't think we need to keep the old env and 
related logic, since it is marked as private.

> YARN Native Service: Respect 
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
> --
>
> Key: YARN-8141
> URL: https://issues.apache.org/jira/browse/YARN-8141
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-8141.001.patch
>
>
> The existing YARN native service overwrites 
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS regardless of whether the 
> user specified it in the service spec. It is important to allow users to 
> mount local folders like /etc/passwd, etc.
> The following logic, inside AbstractLauncher.java, overwrites the 
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS environment variable:
> {code:java}
> StringBuilder sb = new StringBuilder();
> for (Entry<String, String> mount : mountPaths.entrySet()) {
>   if (sb.length() > 0) {
>     sb.append(",");
>   }
>   sb.append(mount.getKey());
>   sb.append(":");
>   sb.append(mount.getValue());
> }
> env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS",
>     sb.toString());{code}






[jira] [Created] (YARN-8272) Several items are missing from Hadoop 3.1.0 documentation

2018-05-09 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8272:


 Summary: Several items are missing from Hadoop 3.1.0 documentation
 Key: YARN-8272
 URL: https://issues.apache.org/jira/browse/YARN-8272
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Reporter: Wangda Tan


From what I can see, there are several missing items, like GPU / FPGA: 
http://hadoop.apache.org/docs/current/

We should add them to hadoop-project/src/site/site.xml in the next release.
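For example, entries of roughly this shape (the hrefs are assumptions and need 
to be verified against the generated docs):
{code:xml}
<!-- hadoop-project/src/site/site.xml -->
<item name="Using GPU" href="hadoop-yarn/hadoop-yarn-site/UsingGpus.html"/>
<item name="Using FPGA" href="hadoop-yarn/hadoop-yarn-site/UsingFPGA.html"/>
{code}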






[jira] [Assigned] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment

2018-05-08 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-8108:


Assignee: Eric Yang

> RM metrics rest API throws GSSException in kerberized environment
> -
>
> Key: YARN-8108
> URL: https://issues.apache.org/jira/browse/YARN-8108
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Kshitij Badani
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-8108.001.patch
>
>
> The test is trying to pull up metrics data from SHS after kinit'ing as 
> 'test_user'. It throws a GSSException, as follows:
> {code:java}
> b2b460b80713|RUNNING: curl --silent -k -X GET -D 
> /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : 
> http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15
>  07:15:48,757|INFO|MainThread|machine.py:194 - 
> run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0
> 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - 
> getMetricsJsonData()|metrics:
> Error 403 GSSException: Failure unspecified at GSS-API level 
> (Mechanism level: Request is a replay (34))
> HTTP ERROR 403
> Problem accessing /proxy/application_1518674952153_0070/metrics/json. 
> Reason:
>  GSSException: Failure unspecified at GSS-API level (Mechanism level: 
> Request is a replay (34))
> {code}
> Root cause: the proxy server on RM can't be supported in a Kerberos-enabled 
> cluster, because AuthenticationFilter is applied twice in the Hadoop code 
> (once in HttpServer2 for the RM, and another instance from 
> AmFilterInitializer for the proxy server). This will require code changes to 
> the hadoop-yarn-server-web-proxy project.






[jira] [Commented] (YARN-8255) Allow option to disable flex for a service component

2018-05-07 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466857#comment-16466857 ]

Wangda Tan commented on YARN-8255:
--

[~eyang], 

Thanks for commenting. Your suggestion makes sense and has less dev/testing 
overhead. I think we can do as you suggested: allow flexing when restart-policy 
= always / on-failure, and disallow flexing when restart-policy = never.

We can add a separate allow_flexing flag to the spec once we see solid 
requirements from users.

[~suma.shivaprasad], does this make sense to you? Please feel free to share 
your opinions.
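
In other words, roughly (a sketch only; the method name is illustrative):
{code:java}
// Validate a flex request against the component's restart policy.
boolean isFlexAllowed(Component component) {
  Component.RestartPolicyEnum policy = component.getRestartPolicy();
  // ALWAYS / ON_FAILURE components are long-running or recomputable: allow.
  // NEVER components are one-shot batch work: reject flex requests.
  return policy != Component.RestartPolicyEnum.NEVER;
}{code}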

> Allow option to disable flex for a service component 
> -
>
> Key: YARN-8255
> URL: https://issues.apache.org/jira/browse/YARN-8255
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
>
> YARN-8080 implements restart capabilities for service component instances. 
> YARN service components should add an option to disallow flexing, to support 
> workloads which are essentially batch/iterative jobs that terminate, i.e. 
> with restart_policy=NEVER/ON_FAILURE. Flexing could be disabled by default 
> for components where restart_policy=NEVER/ON_FAILURE, and enabled by default 
> when restart_policy=ALWAYS (which is the default restart_policy), unless 
> explicitly set in the service spec.
> The option could be exposed as part of the component spec as "allow_flexing". 
> cc [~billie.rinaldi] [~gsaha] [~eyang] [~csingh] [~wangda]






[jira] [Commented] (YARN-8257) Native service should automatically add escapes for environment/launch cmd before sending to YARN

2018-05-07 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466656#comment-16466656 ]

Wangda Tan commented on YARN-8257:
--

Just took a closer look: 

Since both the environment and the launch command will be written to a shell 
script and interpreted by bash, the following chars should be escaped (add a \ 
before them):
{code:java}
` : execute a command
$ : reference to environment
\ : all other escapes
" : double quotes{code}
Reference: 

[https://superuser.com/questions/163515/bash-how-to-pass-command-line-arguments-containing-special-characters]
 (search "per man bash")

> Native service should automatically add escapes for environment/launch cmd 
> before sending to YARN
> 
>
> Key: YARN-8257
> URL: https://issues.apache.org/jira/browse/YARN-8257
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Gour Saha
>Priority: Critical
>
> Noticed this issue while using native service: 
> Basically, when a string for the environment / launch command contains chars 
> like ", /, `: it needs to be escaped twice.
> The first escape is for the JSON spec: because JSON accepts double quotes 
> only, they need an escape.
> The second escape is for container launch; what we do for the command line is 
> (ContainerLaunch.java):
> {code:java}
> line("exec /bin/bash -c \"", StringUtils.join(" ", command), "\"");{code}
> And for the environment:
> {code:java}
> line("export ", key, "=\"", value, "\"");{code}
> An example of launch_command: 
> {code:java}
> "launch_command": "export CLASSPATH=\\`\\$HADOOP_HDFS_HOME/bin/hadoop 
> classpath --glob\\`"{code}
> And an example of an environment variable:
> {code:java}
> "TF_CONFIG" : "{\\\"cluster\\\": {\\\"master\\\": 
> [\\\"master-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"ps\\\": 
> [\\\"ps-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"worker\\\": 
> [\\\"worker-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"]}, 
> \\\"task\\\": {\\\"type\\\":\\\"${COMPONENT_NAME}\\\", 
> \\\"index\\\":${COMPONENT_ID}}, \\\"environment\\\":\\\"cloud\\\"}",{code}
> To improve usability, I think we should auto-escape the input string once. 
> For example, if the user specified:
> {code}
> "TF_CONFIG": "\"key\""
> {code}
> we would automatically escape it to:
> {code}
> "TF_CONFIG": \\\"key\\\"
> {code}






[jira] [Commented] (YARN-8257) Native service should automatically add escapes for environment/launch cmd before sending to YARN

2018-05-07 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466634#comment-16466634 ]

Wangda Tan commented on YARN-8257:
--

Talked to [~gsaha], and he mentioned he will help when he gets a chance. :)

cc: [~sunilg]

> Native service should automatically add escapes for environment/launch cmd 
> before sending to YARN
> 
>
> Key: YARN-8257
> URL: https://issues.apache.org/jira/browse/YARN-8257
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Gour Saha
>Priority: Critical
>
> Noticed this issue while using native service: 
> Basically, when a string for the environment / launch command contains chars 
> like ", /, `: it needs to be escaped twice.
> The first escape is for the JSON spec: because JSON accepts double quotes 
> only, they need an escape.
> The second escape is for container launch; what we do for the command line is 
> (ContainerLaunch.java):
> {code:java}
> line("exec /bin/bash -c \"", StringUtils.join(" ", command), "\"");{code}
> And for the environment:
> {code:java}
> line("export ", key, "=\"", value, "\"");{code}
> An example of launch_command: 
> {code:java}
> "launch_command": "export CLASSPATH=\\`\\$HADOOP_HDFS_HOME/bin/hadoop 
> classpath --glob\\`"{code}
> And an example of an environment variable:
> {code:java}
> "TF_CONFIG" : "{\\\"cluster\\\": {\\\"master\\\": 
> [\\\"master-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"ps\\\": 
> [\\\"ps-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"worker\\\": 
> [\\\"worker-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"]}, 
> \\\"task\\\": {\\\"type\\\":\\\"${COMPONENT_NAME}\\\", 
> \\\"index\\\":${COMPONENT_ID}}, \\\"environment\\\":\\\"cloud\\\"}",{code}
> To improve usability, I think we should auto-escape the input string once. 
> For example, if the user specified:
> {code}
> "TF_CONFIG": "\"key\""
> {code}
> we would automatically escape it to:
> {code}
> "TF_CONFIG": \\\"key\\\"
> {code}






[jira] [Created] (YARN-8257) Native service should automatically add escapes for environment/launch cmd before sending to YARN

2018-05-07 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8257:


 Summary: Native service should automatically add escapes for 
environment/launch cmd before sending to YARN
 Key: YARN-8257
 URL: https://issues.apache.org/jira/browse/YARN-8257
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: Wangda Tan
Assignee: Gour Saha


Noticed this issue while using native service: 

Basically, when a string for the environment / launch command contains chars 
like ", /, `: it needs to be escaped twice.

The first escape is for the JSON spec: because JSON accepts double quotes only, 
they need an escape.

The second escape is for container launch; what we do for the command line is 
(ContainerLaunch.java):
{code:java}
line("exec /bin/bash -c \"", StringUtils.join(" ", command), "\"");{code}
And for the environment:
{code:java}
line("export ", key, "=\"", value, "\"");{code}
An example of launch_command: 
{code:java}
"launch_command": "export CLASSPATH=\\`\\$HADOOP_HDFS_HOME/bin/hadoop classpath 
--glob\\`"{code}
And an example of an environment variable:
{code:java}
"TF_CONFIG" : "{\\\"cluster\\\": {\\\"master\\\": 
[\\\"master-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"ps\\\": 
[\\\"ps-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"worker\\\": 
[\\\"worker-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"]}, 
\\\"task\\\": {\\\"type\\\":\\\"${COMPONENT_NAME}\\\", 
\\\"index\\\":${COMPONENT_ID}}, \\\"environment\\\":\\\"cloud\\\"}",{code}

To improve usability, I think we should auto-escape the input string once. For 
example, if the user specified:
{code}
"TF_CONFIG": "\"key\""
{code}
we would automatically escape it to:
{code}
"TF_CONFIG": \\\"key\\\"
{code}






[jira] [Commented] (YARN-7892) Revisit NodeAttribute class structure

2018-05-07 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-7892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466618#comment-16466618 ]

Wangda Tan commented on YARN-7892:
--

Thanks [~Naganarasimha]. For id (identifier) and key, I think they're 
interchangeable in many scenarios, such as entity.id / entity.key. 

However, for map-like data (a 1 => 1 mapping), for example a map or 
environment variables, it should be named "key" instead of "id"; you can check 
{{org.apache.hadoop.yarn.api.resource.PlacementConstraint.TargetExpression}} as 
an example.

> Revisit NodeAttribute class structure
> -
>
> Key: YARN-7892
> URL: https://issues.apache.org/jira/browse/YARN-7892
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>Priority: Major
> Attachments: YARN-7892-YARN-3409.001.patch, 
> YARN-7892-YARN-3409.002.patch, YARN-7892-YARN-3409.003.WIP.patch, 
> YARN-7892-YARN-3409.003.patch, YARN-7892-YARN-3409.004.patch, 
> YARN-7892-YARN-3409.005.patch, YARN-7892-YARN-3409.006.patch
>
>
> In the existing structure, we keep the type and value along with the 
> attribute, which creates confusion for users of the APIs: it is not clear 
> what needs to be sent for type and value while fetching the mappings for 
> node(s).
> Also, equals will not make sense when we compare only prefix and name, 
> whereas the values for them might be different.  






[jira] [Commented] (YARN-8141) YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec

2018-05-07 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466576#comment-16466576 ]

Wangda Tan commented on YARN-8141:
--

Thanks [~shaneku...@gmail.com], I think we should consolidate the two. Backward 
compatibility is not an issue because: a) 3.1.0 is an unstable release; b) the 
variable itself is marked as {{@private}}.

> YARN Native Service: Respect 
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
> --
>
> Key: YARN-8141
> URL: https://issues.apache.org/jira/browse/YARN-8141
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Chandni Singh
>Priority: Critical
>
> The existing YARN native service overwrites 
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS regardless of whether the 
> user specified it in the service spec. It is important to allow users to 
> mount local folders like /etc/passwd, etc.
> The following logic, inside AbstractLauncher.java, overwrites the 
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS environment variable:
> {code:java}
> StringBuilder sb = new StringBuilder();
> for (Entry<String, String> mount : mountPaths.entrySet()) {
>   if (sb.length() > 0) {
>     sb.append(",");
>   }
>   sb.append(mount.getKey());
>   sb.append(":");
>   sb.append(mount.getValue());
> }
> env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS",
>     sb.toString());{code}






[jira] [Assigned] (YARN-8141) YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec

2018-05-07 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-8141:


Assignee: Chandni Singh

> YARN Native Service: Respect 
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
> --
>
> Key: YARN-8141
> URL: https://issues.apache.org/jira/browse/YARN-8141
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Chandni Singh
>Priority: Critical
>
> The existing YARN native service overwrites 
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS regardless of whether the 
> user specified it in the service spec. It is important to allow users to 
> mount local folders like /etc/passwd, etc.
> The following logic, inside AbstractLauncher.java, overwrites the 
> YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS environment variable:
> {code:java}
> StringBuilder sb = new StringBuilder();
> for (Entry<String, String> mount : mountPaths.entrySet()) {
>   if (sb.length() > 0) {
>     sb.append(",");
>   }
>   sb.append(mount.getKey());
>   sb.append(":");
>   sb.append(mount.getValue());
> }
> env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS",
>     sb.toString());{code}






[jira] [Commented] (YARN-8255) Allow option to disable flex for a service component

2018-05-07 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466462#comment-16466462 ]

Wangda Tan commented on YARN-8255:
--

Thanks [~suma.shivaprasad] for filing the JIRA, and thanks [~eyang] / 
[~billie.rinaldi] for the suggestions. 

I think service flexing is different from the restart policy: as mentioned by 
[~eyang], restart policy = on_failure / always means some part of the job can 
be *recomputed*. *Recomputable* is different from *expandable*. An example is 
MapReduce: the number of mappers and reducers is determined by InputFormat 
before the job gets launched, so allocating more mappers or reducers than 
pre-calculated while the job is running isn't helpful. Many computation 
frameworks follow this pattern, such as TensorFlow / OpenMPI: adding tasks 
while the job is running isn't helpful.

Considering this, I would prefer what Suma suggested: allow the user to 
specify allow_flexing, since sometimes adding a new instance to a component 
could lead to task or even master failure because it is unexpected. I tend to 
agree with making allow_flexing=false by default, but I'm also fine with the 
opposite.
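
To make it concrete, a component spec could look roughly like this 
(allow_flexing is the proposed field, not an existing one; other values are 
illustrative):
{code:json}
{
  "name": "worker",
  "number_of_containers": 4,
  "restart_policy": "NEVER",
  "allow_flexing": false,
  "launch_command": "./run-worker.sh"
}
{code}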

> Allow option to disable flex for a service component 
> -
>
> Key: YARN-8255
> URL: https://issues.apache.org/jira/browse/YARN-8255
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
>
> YARN-8080 implements restart capabilities for service component instances. 
> YARN service components should add an option to disallow flexing, to support 
> workloads which are essentially batch/iterative jobs that terminate, i.e. 
> with restart_policy=NEVER/ON_FAILURE. Flexing could be disabled by default 
> for components where restart_policy=NEVER/ON_FAILURE, and enabled by default 
> when restart_policy=ALWAYS (which is the default restart_policy), unless 
> explicitly set in the service spec.
> The option could be exposed as part of the component spec as "allow_flexing". 
> cc [~billie.rinaldi] [~gsaha] [~eyang] [~csingh] [~wangda]






[jira] [Updated] (YARN-8251) Clicking on app link at the header goes to Diagnostics Tab instead of AppAttempt Tab

2018-05-04 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8251:
-
Reporter: Sumana Sathish  (was: Yesha Vora)

> Clicking on app link at the header goes to Diagnostics Tab instead of 
> AppAttempt Tab
> 
>
> Key: YARN-8251
> URL: https://issues.apache.org/jira/browse/YARN-8251
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Affects Versions: 3.1.0
>Reporter: Sumana Sathish
>Assignee: Yesha Vora
>Priority: Major
> Attachments: YARN-8251.001.patch
>
>
> 1. Click on Application link under Application tab
> 2. It goes to Specific Application page with appAttempt Tab
> 3. Click on the "Application \[app ID\]" link at the top
> 4. It goes to Specific Application page with Diagnostic Tab instead of 
> appAttempt Tab
>  






[jira] [Updated] (YARN-8223) ClassNotFoundException when auxiliary service is loaded from HDFS

2018-05-04 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8223:
-
Fix Version/s: 3.1.1
   3.2.0

> ClassNotFoundException when auxiliary service is loaded from HDFS
> -
>
> Key: YARN-8223
> URL: https://issues.apache.org/jira/browse/YARN-8223
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Charan Hebri
>Assignee: Zian Chen
>Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8223.001.patch, YARN-8223.002.patch
>
>
> Loading an auxiliary jar from a local location on a node manager works as 
> expected,
> {noformat}
> 2018-04-26 15:09:26,179 INFO  util.ApplicationClassLoader 
> (ApplicationClassLoader.java:<init>(98)) - classpath: 
> [file:/grid/0/hadoop/yarn/local/aux-service-local.jar]
> 2018-04-26 15:09:26,179 INFO  util.ApplicationClassLoader 
> (ApplicationClassLoader.java:<init>(99)) - system classes: [java., 
> javax.accessibility., javax.activation., javax.activity., javax.annotation., 
> javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., 
> javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., 
> javax.net., javax.print., javax.rmi., javax.script., 
> -javax.security.auth.message., javax.security.auth., javax.security.cert., 
> javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., 
> javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., 
> org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., 
> -org.apache.hadoop.hbase., org.apache.hadoop., core-default.xml, 
> hdfs-default.xml, mapred-default.xml, yarn-default.xml]
> 2018-04-26 15:09:26,181 INFO  containermanager.AuxServices 
> (AuxServices.java:serviceInit(252)) - The aux service:test_aux_local are 
> using the custom classloader
> 2018-04-26 15:09:26,182 WARN  containermanager.AuxServices 
> (AuxServices.java:serviceInit(268)) - The Auxiliary Service named 
> 'test_aux_local' in the configuration is for class 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader
>  which has a name of 'org.apache.auxtest.AuxServiceFromLocal with custom 
> class loader'. Because these are not the same tools trying to send 
> ServiceData and read Service Meta Data may have issues unless the refer to 
> the name in the config.
> 2018-04-26 15:09:26,182 INFO  containermanager.AuxServices 
> (AuxServices.java:addService(103)) - Adding auxiliary service 
> org.apache.auxtest.AuxServiceFromLocal with custom class loader, 
> "test_aux_local"{noformat}
> But loading the same jar from a location on HDFS fails with a 
> ClassNotFoundException.
> {noformat}
> 2018-04-26 15:14:39,683 INFO  util.ApplicationClassLoader 
> (ApplicationClassLoader.java:<init>(98)) - classpath: []
> 2018-04-26 15:14:39,683 INFO  util.ApplicationClassLoader 
> (ApplicationClassLoader.java:<init>(99)) - system classes: [java., 
> javax.accessibility., javax.activation., javax.activity., javax.annotation., 
> javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., 
> javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., 
> javax.net., javax.print., javax.rmi., javax.script., 
> -javax.security.auth.message., javax.security.auth., javax.security.cert., 
> javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., 
> javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., 
> org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., 
> -org.apache.hadoop.hbase., org.apache.hadoop., core-default.xml, 
> hdfs-default.xml, mapred-default.xml, yarn-default.xml]
> 2018-04-26 15:14:39,687 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed 
> in state INITED
> java.lang.ClassNotFoundException: org.apache.auxtest.AuxServiceFromLocal
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
>   at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
>   at 
> 

[jira] [Commented] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2018-05-03 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463331#comment-16463331 ]

Wangda Tan commented on YARN-8234:
--

[~ziqian hu], mind checking the Jenkins report?

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Major
> Attachments: YARN-8234-branch-2.8.3.001.patch
>
>
> When the system metrics publisher is enabled, RM pushes events to the 
> timeline server via its RESTful API. If the cluster load is heavy, many 
> events are sent to the timeline server and the timeline server's event 
> handler thread gets locked. YARN-7266 discussed the details of this problem. 
> Because of the lock, the timeline server can't receive events as fast as they 
> are generated in RM, and lots of timeline events stay in RM's memory. 
> Eventually, those events consume all of RM's memory and RM starts a full GC 
> (which causes a JVM stop-the-world pause and a timeout from RM to ZooKeeper) 
> or even hits an OOM. 
> The main problem here is that the timeline server can't receive events as 
> fast as RM generates them. Currently, the RM system metrics publisher puts 
> only one event in a request, and most of the time is spent handling HTTP 
> headers and the network connection on the timeline side. Only a little time 
> is spent dealing with the timeline event itself, which is the truly valuable 
> part.
> In this issue, we add a buffer to the system metrics publisher and let the 
> publisher send events to the timeline server in batches, via one request per 
> batch. With the batch size set to 1000, in our experiment the speed at which 
> the timeline server receives events improved 100x. We have implemented this 
> function in our production environment, which accepts 2 apps in one hour, and 
> it works fine.
> We add the following configuration:
>  * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of 
> events the system metrics publisher sends in one request. Default value is 
> 1000.
>  * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the 
> event buffer in the system metrics publisher.
>  * yarn.resourcemanager.system-metrics-publisher.interval-seconds: when batch 
> publishing is enabled, we must avoid the publisher waiting for a batch to 
> fill up and holding events in the buffer for a long time. So we add another 
> thread which sends the events in the buffer periodically. This config sets 
> the interval of that periodic sending thread. The default value is 60s.
>  
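> A rough sketch of the batching idea (names and structure are illustrative, 
> assuming the TimelineClient#putEntities varargs API; not the exact patch):
> {code:java}
> // Events are enqueued instead of being posted one REST call at a time.
> private final BlockingQueue<TimelineEntity> buffer =
>     new LinkedBlockingQueue<>(bufferSize);
> 
> void publish(TimelineEntity entity) {
>   buffer.offer(entity);
> }
> 
> // A background thread drains up to batch-size events per request, and also
> // wakes up every interval-seconds so events never sit in the buffer too long.
> void flushLoop() throws InterruptedException, IOException, YarnException {
>   List<TimelineEntity> batch = new ArrayList<>();
>   while (running) {
>     TimelineEntity first = buffer.poll(intervalSeconds, TimeUnit.SECONDS);
>     if (first != null) {
>       batch.add(first);
>       buffer.drainTo(batch, batchSize - 1);
>     }
>     if (!batch.isEmpty()) {
>       timelineClient.putEntities(batch.toArray(new TimelineEntity[0]));
>       batch.clear();
>     }
>   }
> }{code}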






[jira] [Commented] (YARN-8232) RMContainer lost queue name when RM HA happens

2018-05-03 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463329#comment-16463329 ]

Wangda Tan commented on YARN-8232:
--

+1, thanks [~ziqian hu], will commit tomorrow if no objections.

> RMContainer lost queue name when RM HA happens
> --
>
> Key: YARN-8232
> URL: https://issues.apache.org/jira/browse/YARN-8232
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Major
> Attachments: YARN-8232-branch-2.8.3.001.patch, YARN-8232.001.patch, 
> YARN-8232.002.patch, YARN-8232.003.patch
>
>
> RMContainer has a member variable queueName that stores which queue the 
> container belongs to. When RM HA happens and RMContainers are recovered by 
> the scheduler based on NM reports, the queue name isn't recovered and is 
> always null.
> This situation causes some problems. Here is a case in preemption: preemption 
> uses the container's queue name to deduct preemptable resources when we use 
> more than one preemption selector (for example, with intra-queue preemption 
> enabled). The detail is in
> {code:java}
> CapacitySchedulerPreemptionUtils.deductPreemptableResourcesBasedSelectedCandidates(){code}
> If the container's queue name is null, this function throws a 
> YarnRuntimeException because it tries to get the container's 
> TempQueuePerPartition, and the preemption fails.
> Our patch solves this problem by setting the container queue name when 
> recovering containers. The patch is based on branch-2.8.3.
>  
>  






[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps

2018-05-03 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463324#comment-16463324 ]

Wangda Tan commented on YARN-4606:
--

Thanks [~maniraj...@gmail.com], 
Some questions: 

1) Does this patch handle the case where one user has multiple pending apps? 
(Since it doesn't store user-to-apps information.)

2) 
{code}
abstractUsersManager.decrNumActiveUsersOfPendingApps(); 
{code}
Should we call this inside 
{{SchedulerApplicationAttempt#pullNewlyUpdatedContainers}}? 
I think we should remove the active user from pending apps once the AM 
container gets allocated.

3)
{code} 
Resources.lessThan(rc, cr,
    metrics.getUsedAMResources(), metrics.getMaxAMResources())
{code} 
Instead of using metrics, it might be better to use 
{{SchedulerApplicationAttempt#getAppAttemptResourceUsage}}. 

> CapacityScheduler: applications could get starved because computation of 
> #activeUsers considers pending apps 
> -
>
> Key: YARN-4606
> URL: https://issues.apache.org/jira/browse/YARN-4606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Karam Singh
>Assignee: Manikandan R
>Priority: Critical
> Attachments: YARN-4606.001.patch, YARN-4606.1.poc.patch, 
> YARN-4606.POC.2.patch, YARN-4606.POC.patch
>
>
> Currently, if all applications belonging to the same user in a LeafQueue are 
> pending (caused by max-am-percent, etc.), ActiveUsersManager still considers 
> the user an active user. This could lead to starvation of active 
> applications, for example:
> - App1 (belongs to user1) / app2 (belongs to user2) are active; app3 (belongs 
> to user3) / app4 (belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, only two users (user1/user2) are able to allocate new resources, 
> so the computed user-limit-resource could be lower than expected.






[jira] [Updated] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery

2018-05-02 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8242:
-
Target Version/s: 3.1.1
Priority: Blocker  (was: Major)

> YARN NM: OOM error while reading back the state store on recovery
> -
>
> Key: YARN-8242
> URL: https://issues.apache.org/jira/browse/YARN-8242
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.2.0
>Reporter: Kanwaljeet Sachdev
>Assignee: Kanwaljeet Sachdev
>Priority: Blocker
> Attachments: YARN-8242.001.patch
>
>
> On startup the NM reads its state store and builds a list of applications in 
> the state store to process. If the number of applications in the state store 
> is large and they have a lot of "state" attached, the NM can hit an OOM and 
> never get to the point where it can start processing the recovery.
> Since it never starts the recovery, there is no way for the NM to ever get 
> past this point; it will require a change in heap size to get the NM started.
>  
> Following is the stack trace
> {code:java}
> at java.lang.OutOfMemoryError.<init> (OutOfMemoryError.java:48)
> at com.google.protobuf.ByteString.copyFrom (ByteString.java:192)
> at com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324)
> at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto.<init> (YarnProtos.java:47069)
> at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto.<init> (YarnProtos.java:47014)
> at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom (YarnProtos.java:47102)
> at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom (YarnProtos.java:47097)
> at com.google.protobuf.CodedInputStream.readMessage (CodedInputStream.java:309)
> at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto.<init> (YarnProtos.java:41016)
> at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto.<init> (YarnProtos.java:40942)
> at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom (YarnProtos.java:41080)
> at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom (YarnProtos.java:41075)
> at com.google.protobuf.CodedInputStream.readMessage (CodedInputStream.java:309)
> at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.<init> (YarnServiceProtos.java:24517)
> at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.<init> (YarnServiceProtos.java:24464)
> at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom (YarnServiceProtos.java:24568)
> at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom (YarnServiceProtos.java:24563)
> at com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141)
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176)
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188)
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193)
> at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49)
> at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom (YarnServiceProtos.java:24739)
> at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState (NMLeveldbStateStoreService.java:217)
> at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState (NMLeveldbStateStoreService.java:170)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover (ContainerManagerImpl.java:253)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit (ContainerManagerImpl.java:237)
> at org.apache.hadoop.service.AbstractService.init (AbstractService.java:163)
> at org.apache.hadoop.service.CompositeService.serviceInit (CompositeService.java:107)
> at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit (NodeManager.java:255)
> at org.apache.hadoop.service.AbstractService.init (AbstractService.java:163)
> at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager (NodeManager.java:474)
> at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main (NodeManager.java:521){code}
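> One direction for a fix, sketched purely as an illustration (assuming the 
> iq80 leveldb iterator API already used by NMLeveldbStateStoreService; 
> CONTAINERS_KEY_PREFIX and recoverOneContainer are illustrative names), is to 
> stream container records one at a time instead of materializing the whole 
> list in memory:
> {code:java}
> // Sketch: hand each container record to a callback as it is read, so only
> // one container's serialized state is held in memory at a time.
> try (DBIterator it = db.iterator()) {
>   it.seek(bytes(CONTAINERS_KEY_PREFIX));
>   while (it.hasNext()) {
>     Map.Entry<byte[], byte[]> entry = it.next();
>     String key = asString(entry.getKey());
>     if (!key.startsWith(CONTAINERS_KEY_PREFIX)) {
>       break;  // walked past the container section of the store
>     }
>     recoverOneContainer(entry.getValue());
>   }
> }{code}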






[jira] [Assigned] (YARN-4781) Support intra-queue preemption for fairness ordering policy.

2018-04-30 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-4781:


Assignee: Eric Payne  (was: Wangda Tan)

> Support intra-queue preemption for fairness ordering policy.
> 
>
> Key: YARN-4781
> URL: https://issues.apache.org/jira/browse/YARN-4781
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Wangda Tan
>Assignee: Eric Payne
>Priority: Major
> Attachments: YARN-4781.001.patch, YARN-4781.002.patch, 
> YARN-4781.003.patch, YARN-4781.004.patch
>
>
> We introduced the fairness ordering policy in YARN-3319, which lets large 
> applications make progress without starving small applications. However, if a 
> large application takes the queue's resources, and the containers of the 
> large app have a long lifespan, small applications could still wait a long 
> time for resources, and SLAs cannot be guaranteed.
> Instead of waiting for applications to release resources on their own, we 
> need to preempt resources of queues with the fairness policy enabled.






[jira] [Commented] (YARN-8232) RMContainer lost queue name when RM HA happens

2018-04-30 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458691#comment-16458691 ]

Wangda Tan commented on YARN-8232:
--

Thanks [~ziqian hu], could you add a unit test to avoid regressions in the 
future?

> RMContainer lost queue name when RM HA happens
> --
>
> Key: YARN-8232
> URL: https://issues.apache.org/jira/browse/YARN-8232
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Major
> Attachments: YARN-8232-branch-2.8.3.001.patch, YARN-8232.001.patch
>
>
> RMContainer has a member variable queueName that stores which queue the 
> container belongs to. When RM HA happens and RMContainers are recovered by 
> the scheduler based on NM reports, the queue name isn't recovered and is 
> always null.
> This situation causes some problems. Here is a case in preemption: preemption 
> uses the container's queue name to deduct preemptable resources when we use 
> more than one preemption selector (for example, with intra-queue preemption 
> enabled). The detail is in
> {code:java}
> CapacitySchedulerPreemptionUtils.deductPreemptableResourcesBasedSelectedCandidates(){code}
> If the container's queue name is null, this function throws a 
> YarnRuntimeException because it tries to get the container's 
> TempQueuePerPartition, and the preemption fails.
> Our patch solves this problem by setting the container queue name when 
> recovering containers. The patch is based on branch-2.8.3.
>  
>  






[jira] [Commented] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2018-04-29 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458292#comment-16458292 ]

Wangda Tan commented on YARN-8234:
--

Thanks [~ziqian hu], this is an interesting fix. I think it is reasonable to 
sacrifice metrics freshness during a short period of time (like 1 min) to get 
better performance. Similar to YARN-8232, could you check the branch / patch 
name? 

+ [~rohithsharma] to do a closer review of this. [~rohithsharma], could you 
answer: a) do we have a similar issue after enabling TSv2? b) is there any 
severe side effect (like causing failures, etc.) of losing RM metrics data?

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Major
> Attachments: YARN_8234.patch
>
>
> When the system metrics publisher is enabled, RM pushes events to the 
> timeline server via its RESTful API. If the cluster load is heavy, many 
> events are sent to the timeline server and the timeline server's event 
> handler thread gets locked. YARN-7266 discussed the details of this problem. 
> Because of the lock, the timeline server can't receive events as fast as they 
> are generated in RM, and lots of timeline events stay in RM's memory. 
> Eventually, those events consume all of RM's memory and RM starts a full GC 
> (which causes a JVM stop-the-world pause and a timeout from RM to ZooKeeper) 
> or even hits an OOM. 
> The main problem here is that the timeline server can't receive events as 
> fast as RM generates them. Currently, the RM system metrics publisher puts 
> only one event in a request, and most of the time is spent handling HTTP 
> headers and the network connection on the timeline side. Only a little time 
> is spent dealing with the timeline event itself, which is the truly valuable 
> part.
> In this issue, we add a buffer to the system metrics publisher and let the 
> publisher send events to the timeline server in batches, via one request per 
> batch. With the batch size set to 1000, in our experiment the speed at which 
> the timeline server receives events improved 100x. We have implemented this 
> function in our production environment, which accepts 2 apps in one hour, and 
> it works fine.
> We add the following configuration:
>  * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of 
> events the system metrics publisher sends in one request. Default value is 
> 1000.
>  * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the 
> event buffer in the system metrics publisher.
>  * yarn.resourcemanager.system-metrics-publisher.interval-seconds: when batch 
> publishing is enabled, we must avoid the publisher waiting for a batch to 
> fill up and holding events in the buffer for a long time. So we add another 
> thread which sends the events in the buffer periodically. This config sets 
> the interval of that periodic sending thread. The default value is 60s.
>  






[jira] [Commented] (YARN-8232) RMContainer lost queue name when RM HA happens

2018-04-29 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16458291#comment-16458291 ]

Wangda Tan commented on YARN-8232:
--

Thanks [~ziqian hu] for reporting and working on the patch. 

Could you create a patch on top of trunk? That's typically how we land fixes. 

The patch should be named JIRA_NUMBER.version.patch. You can check 
https://wiki.apache.org/hadoop/HowToContribute for details.

And for the patch, instead of getting the application inside the function, you 
can pass the queue name in from the external function 
({{recoverContainersOnNode}}), which saves one access to the scheduler 
application.
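
A minimal sketch of that shape (signatures simplified; {{recoverContainer}} 
here is an illustrative helper, not the exact method in the patch):
{code:java}
// In recoverContainersOnNode: resolve the queue once in the caller and pass
// it down, instead of looking the application up again inside the helper.
for (NMContainerStatus status : containerReports) {
  ApplicationId appId =
      status.getContainerId().getApplicationAttemptId().getApplicationId();
  SchedulerApplication<T> app = applications.get(appId);
  if (app == null) {
    continue; // application already finished; nothing to recover
  }
  String queueName = app.getQueue().getQueueName();
  // queueName is set on the recovered RMContainer so preemption can use it
  recoverContainer(status, app, queueName);
}{code}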

> RMContainer lost queue name when RM HA happens
> --
>
> Key: YARN-8232
> URL: https://issues.apache.org/jira/browse/YARN-8232
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Major
> Attachments: YARN_8232.patch
>
>
> RMContainer has a member variable queueName that stores which queue the 
> container belongs to. When RM HA happens and RMContainers are recovered by 
> the scheduler based on NM reports, the queue name isn't recovered and is 
> always null.
> This situation causes some problems. Here is a case in preemption: preemption 
> uses the container's queue name to deduct preemptable resources when we use 
> more than one preemption selector (for example, with intra-queue preemption 
> enabled). The detail is in
> {code:java}
> CapacitySchedulerPreemptionUtils.deductPreemptableResourcesBasedSelectedCandidates(){code}
> If the container's queue name is null, this function throws a 
> YarnRuntimeException because it tries to get the container's 
> TempQueuePerPartition, and the preemption fails.
> Our patch solves this problem by setting the container queue name when 
> recovering containers. The patch is based on branch-2.8.3.
>  
>  






[jira] [Assigned] (YARN-8232) RMContainer lost queue name when RM HA happens

2018-04-29 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-8232:


Assignee: Hu Ziqian

> RMContainer lost queue name when RM HA happens
> --
>
> Key: YARN-8232
> URL: https://issues.apache.org/jira/browse/YARN-8232
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Major
> Attachments: YARN_8232.patch
>
>
> RMContainer has a member variable queueName that stores which queue the 
> container belongs to. When RM HA happens and RMContainers are recovered by 
> the scheduler based on NM reports, the queue name isn't recovered and is 
> always null.
> This situation causes some problems. Here is a case in preemption: preemption 
> uses the container's queue name to deduct preemptable resources when we use 
> more than one preemption selector (for example, with intra-queue preemption 
> enabled). The detail is in
> {code:java}
> CapacitySchedulerPreemptionUtils.deductPreemptableResourcesBasedSelectedCandidates(){code}
> If the container's queue name is null, this function throws a 
> YarnRuntimeException because it tries to get the container's 
> TempQueuePerPartition, and the preemption fails.
> Our patch solves this problem by setting the container queue name when 
> recovering containers. The patch is based on branch-2.8.3.
>  
>  






[jira] [Assigned] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2018-04-29 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-8234:


Assignee: Hu Ziqian

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Major
> Attachments: YARN_8234.patch
>
>
> When the system metrics publisher is enabled, RM pushes events to the 
> timeline server via its RESTful API. If the cluster load is heavy, many 
> events are sent to the timeline server and the timeline server's event 
> handler thread gets locked. YARN-7266 discussed the details of this problem. 
> Because of the lock, the timeline server can't receive events as fast as they 
> are generated in RM, and lots of timeline events stay in RM's memory. 
> Eventually, those events consume all of RM's memory and RM starts a full GC 
> (which causes a JVM stop-the-world pause and a timeout from RM to ZooKeeper) 
> or even hits an OOM. 
> The main problem here is that the timeline server can't receive events as 
> fast as RM generates them. Currently, the RM system metrics publisher puts 
> only one event in a request, and most of the time is spent handling HTTP 
> headers and the network connection on the timeline side. Only a little time 
> is spent dealing with the timeline event itself, which is the truly valuable 
> part.
> In this issue, we add a buffer to the system metrics publisher and let the 
> publisher send events to the timeline server in batches, via one request per 
> batch. With the batch size set to 1000, in our experiment the speed at which 
> the timeline server receives events improved 100x. We have implemented this 
> function in our production environment, which accepts 2 apps in one hour, and 
> it works fine.
> We add the following configuration:
>  * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of 
> events the system metrics publisher sends in one request. Default value is 
> 1000.
>  * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the 
> event buffer in the system metrics publisher.
>  * yarn.resourcemanager.system-metrics-publisher.interval-seconds: when batch 
> publishing is enabled, we must avoid the publisher waiting for a batch to 
> fill up and holding events in the buffer for a long time. So we add another 
> thread which sends the events in the buffer periodically. This config sets 
> the interval of that periodic sending thread. The default value is 60s.
>  






[jira] [Comment Edited] (YARN-8005) Add unit tests for queue priority with dominant resource calculator  

2018-04-27 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456996#comment-16456996 ]

Wangda Tan edited comment on YARN-8005 at 4/27/18 8:25 PM:
---

Committed to trunk and branch-3.1. Thanks [~Zian Chen], and thanks for the 
reviews from [~sunilg]!


was (Author: leftnoteasy):
Committed to trunk, branch-3.0/1, thanks [~Zian Chen] and thanks reviews from 
[~sunilg]!

> Add unit tests for queue priority with dominant resource calculator  
> -
>
> Key: YARN-8005
> URL: https://issues.apache.org/jira/browse/YARN-8005
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Zian Chen
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8005.001.patch, YARN-8005.002.patch, 
> YARN-8005.003.patch
>
>







[jira] [Commented] (YARN-8005) Add unit tests for queue priority with dominant resource calculator  

2018-04-27 Thread Wangda Tan (JIRA)

[ https://issues.apache.org/jira/browse/YARN-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16457004#comment-16457004 ]

Wangda Tan commented on YARN-8005:
--

Update: 

Before pushing to branch-3.0, I found it causes a compilation error. [~Zian 
Chen], could you provide a patch based on branch-3.0 when you get a chance? 

> Add unit tests for queue priority with dominant resource calculator  
> -
>
> Key: YARN-8005
> URL: https://issues.apache.org/jira/browse/YARN-8005
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Zian Chen
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8005.001.patch, YARN-8005.002.patch, 
> YARN-8005.003.patch
>
>







[jira] [Updated] (YARN-8005) Add unit tests for queue priority with dominant resource calculator  

2018-04-27 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8005:
-
Target Version/s: 3.0.3

> Add unit tests for queue priority with dominant resource calculator  
> -
>
> Key: YARN-8005
> URL: https://issues.apache.org/jira/browse/YARN-8005
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Zian Chen
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8005.001.patch, YARN-8005.002.patch, 
> YARN-8005.003.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8005) Add unit tests for queue priority with dominant resource calculator  

2018-04-27 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8005:
-
Fix Version/s: (was: 3.0.3)

> Add unit tests for queue priority with dominant resource calculator  
> -
>
> Key: YARN-8005
> URL: https://issues.apache.org/jira/browse/YARN-8005
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Zian Chen
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8005.001.patch, YARN-8005.002.patch, 
> YARN-8005.003.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7574) Add support for Node Labels on Auto Created Leaf Queue Template

2018-04-27 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456925#comment-16456925
 ] 

Wangda Tan commented on YARN-7574:
--

[~suma.shivaprasad], this patch added 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.Allocation#toString and 
prints it for every allocation. 

In a large cluster this could generate tons of logs. I suggest putting the 
following logic under a LOG.isDebugEnabled() check: 

{code} 
LOG.info("Allocation for application " + applicationAttemptId + " : " +
allocation + " with cluster resource : " + getClusterResource());
{code} 
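One way to read that suggestion, as a sketch (the same message, downgraded to 
debug level and guarded):

{code:java}
if (LOG.isDebugEnabled()) {
  LOG.debug("Allocation for application " + applicationAttemptId + " : "
      + allocation + " with cluster resource : " + getClusterResource());
}
{code}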

Could you file a 3.1 critical issue for this?

> Add support for Node Labels on Auto Created Leaf Queue Template
> ---
>
> Key: YARN-7574
> URL: https://issues.apache.org/jira/browse/YARN-7574
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-7574.1.patch, YARN-7574.10.patch, 
> YARN-7574.11.patch, YARN-7574.12.patch, YARN-7574.2.patch, YARN-7574.3.patch, 
> YARN-7574.4.patch, YARN-7574.5.patch, YARN-7574.6.patch, YARN-7574.7.patch, 
> YARN-7574.8.patch, YARN-7574.9.patch
>
>
> YARN-7473 adds support for auto created leaf queues to inherit node label 
> capacities from parent queues. However there is no support for the leaf 
> queue template to allow different configured capacities for different node 
> labels. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8005) Add unit tests for queue priority with dominant resource calculator  

2018-04-27 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456855#comment-16456855
 ] 

Wangda Tan commented on YARN-8005:
--

+1, thanks [~Zian Chen]. will commit shortly.

> Add unit tests for queue priority with dominant resource calculator  
> -
>
> Key: YARN-8005
> URL: https://issues.apache.org/jira/browse/YARN-8005
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Zian Chen
>Priority: Critical
> Attachments: YARN-8005.001.patch, YARN-8005.002.patch, 
> YARN-8005.003.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8225) YARN precommit build failing in TestPlacementConstraintTransformations

2018-04-27 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456852#comment-16456852
 ] 

Wangda Tan commented on YARN-8225:
--

+1, thanks [~shaneku...@gmail.com], will commit shortly.

> YARN precommit build failing in TestPlacementConstraintTransformations
> --
>
> Key: YARN-8225
> URL: https://issues.apache.org/jira/browse/YARN-8225
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Billie Rinaldi
>Assignee: Shane Kumpf
>Priority: Critical
> Attachments: YARN-8225.001.patch
>
>
> The HashSet comparison is not working for some reason:
> {noformat}
> java.lang.AssertionError: expected: java.util.HashSet<[hb]> but was: 
> java.util.HashSet<[hb]>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:144)
>   at 
> org.apache.hadoop.yarn.api.resource.TestPlacementConstraintTransformations.testCardinalityConstraint(TestPlacementConstraintTransformations.java:116)
> {noformat}
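For context, a minimal standalone sketch (with a hypothetical element class, 
not the actual PlacementConstraint types) of how two HashSets can print 
identically yet fail assertEquals: set equality delegates to the elements' 
equals/hashCode, not to their toString.

{code:java}
import java.util.HashSet;
import java.util.Set;

public class SetEqualityDemo {
  // Hypothetical element: every instance prints as "hb", but equals/hashCode
  // are not overridden, so each instance is only equal to itself.
  static class Expr {
    @Override
    public String toString() { return "hb"; }
  }

  public static void main(String[] args) {
    Set<Expr> expected = new HashSet<>();
    expected.add(new Expr());
    Set<Expr> actual = new HashSet<>();
    actual.add(new Expr());
    // Both sets print as [hb], yet they are not equal.
    System.out.println(expected + " equals " + actual + " ? "
        + expected.equals(actual)); // prints: [hb] equals [hb] ? false
  }
}
{code}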



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8079) Support specify files to be downloaded (localized) before containers launched by YARN

2018-04-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455609#comment-16455609
 ] 

Wangda Tan edited comment on YARN-8079 at 4/27/18 12:49 AM:


I'm too packed recently to finish this patch, discussed with 
[~suma.shivaprasad] offline and she will offer help to finish this JIRA. Just 
reassigned.


was (Author: leftnoteasy):
I'm a bit packed recently to finish this patch, discussed with 
[~suma.shivaprasad] offline and she will offer help to finish this JIRA. Just 
reassigned.

> Support specify files to be downloaded (localized) before containers launched 
> by YARN
> -
>
> Key: YARN-8079
> URL: https://issues.apache.org/jira/browse/YARN-8079
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-8079.001.patch, YARN-8079.002.patch, 
> YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, 
> YARN-8079.006.patch
>
>
> Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
> read srcFile; instead it always constructs {{remoteFile}} using the 
> componentDir and the fileName of {{destFile}}:
> {code}
> Path remoteFile = new Path(compInstanceDir, fileName);
> {code} 
> To me it is a common use case that services have files in HDFS which need 
> to be localized when components get launched. (For example, if we want to 
> serve a Tensorflow model, we need to localize the Tensorflow model 
> (typically not huge, less than a GB) to local disk. Otherwise the launched 
> docker container has to access HDFS.)
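For illustration, a minimal sketch of the intended behavior (a hypothetical 
helper, not the committed patch): prefer the user-specified src_file when it 
is set, keeping the current construction as the fallback.

{code:java}
// Hypothetical resolution logic; srcFile/compInstanceDir/fileName as in the
// description above.
private Path resolveRemoteFile(String srcFile, Path compInstanceDir,
    String fileName) {
  if (srcFile != null && !srcFile.isEmpty()) {
    return new Path(srcFile);                 // respect the user-specified source
  }
  return new Path(compInstanceDir, fileName); // existing fallback behavior
}
{code}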



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8079) Support specify files to be downloaded (localized) before containers launched by YARN

2018-04-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455609#comment-16455609
 ] 

Wangda Tan commented on YARN-8079:
--

I'm a bit packed recently to finish this patch, discussed with 
[~suma.shivaprasad] offline and she will offer help to finish this JIRA. Just 
reassigned.

> Support specify files to be downloaded (localized) before containers launched 
> by YARN
> -
>
> Key: YARN-8079
> URL: https://issues.apache.org/jira/browse/YARN-8079
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-8079.001.patch, YARN-8079.002.patch, 
> YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, 
> YARN-8079.006.patch
>
>
> Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
> read srcFile; instead it always constructs {{remoteFile}} using the 
> componentDir and the fileName of {{destFile}}:
> {code}
> Path remoteFile = new Path(compInstanceDir, fileName);
> {code} 
> To me it is a common use case that services have files in HDFS which need 
> to be localized when components get launched. (For example, if we want to 
> serve a Tensorflow model, we need to localize the Tensorflow model 
> (typically not huge, less than a GB) to local disk. Otherwise the launched 
> docker container has to access HDFS.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8080) YARN native service should support component restart policy

2018-04-26 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-8080:


Assignee: Suma Shivaprasad  (was: Wangda Tan)

> YARN native service should support component restart policy
> ---
>
> Key: YARN-8080
> URL: https://issues.apache.org/jira/browse/YARN-8080
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-8080.001.patch, YARN-8080.002.patch, 
> YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch
>
>
> Existing native service assumes the service is long running and never 
> finishes. Containers will be restarted even if exit code == 0. 
> To support broader use cases, we need to allow users to specify the restart 
> policy of a component. Propose to have the following policies:
> 1) Always: containers are always restarted by the framework regardless of 
> container exit status. This is the existing/default behavior.
> 2) Never: do not restart containers in any case after a container finishes, 
> to support job-like workloads (for example a Tensorflow training job). If a 
> task exits with code == 0, we should not restart it. This can be used by 
> services which are not restartable/recoverable.
> 3) On-failure: similar to the above, only restart tasks with exit code != 0. 
> Behaviors after a component *instance* finalizes (Succeeded or Failed when 
> restart_policy != ALWAYS): 
> 1) For a single component, single instance: complete the service.
> 2) For a single component, multiple instances: other running instances of 
> the same component won't be affected by the finalized component instance. 
> The service will be terminated once all instances finalize. 
> 3) For multiple components: the service will be terminated once all 
> components finalize.
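As a sketch of the proposed semantics (the enum and method below are 
illustrative only, not the committed API):

{code:java}
// Policy names from the description above; shouldRelaunch() models the
// "should we relaunch this finished container?" decision.
public enum RestartPolicy {
  ALWAYS, NEVER, ON_FAILURE;

  public boolean shouldRelaunch(int exitCode) {
    switch (this) {
      case ALWAYS:     return true;           // existing/default behavior
      case NEVER:      return false;          // job-like workloads
      case ON_FAILURE: return exitCode != 0;  // restart only on failure
      default:         throw new IllegalStateException("unknown policy");
    }
  }
}
{code}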



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8079) Support specify files to be downloaded (localized) before containers launched by YARN

2018-04-26 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-8079:


Assignee: Suma Shivaprasad  (was: Wangda Tan)

> Support specify files to be downloaded (localized) before containers launched 
> by YARN
> -
>
> Key: YARN-8079
> URL: https://issues.apache.org/jira/browse/YARN-8079
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-8079.001.patch, YARN-8079.002.patch, 
> YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, 
> YARN-8079.006.patch
>
>
> Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
> read srcFile; instead it always constructs {{remoteFile}} using the 
> componentDir and the fileName of {{destFile}}:
> {code}
> Path remoteFile = new Path(compInstanceDir, fileName);
> {code} 
> To me it is a common use case that services have files in HDFS which need 
> to be localized when components get launched. (For example, if we want to 
> serve a Tensorflow model, we need to localize the Tensorflow model 
> (typically not huge, less than a GB) to local disk. Otherwise the launched 
> docker container has to access HDFS.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8079) Support specify files to be downloaded (localized) before containers launched by YARN

2018-04-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455562#comment-16455562
 ] 

Wangda Tan commented on YARN-8079:
--

Found the spec mentioned previously has some issues; for the files part, it 
should be: 
{code} 
  "components" : [ {
    "name" : "primary-worker",
    "configuration" : {
      "properties" : { },
      "files" : [ {
        "type" : "STATIC",
        "dest_file" : "run-PRIMARY_WORKER.sh",
        "src_file" : "hdfs://:8020/file"
      } ]
    }
  } ]
{code}

> Support specify files to be downloaded (localized) before containers launched 
> by YARN
> -
>
> Key: YARN-8079
> URL: https://issues.apache.org/jira/browse/YARN-8079
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8079.001.patch, YARN-8079.002.patch, 
> YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, 
> YARN-8079.006.patch
>
>
> Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
> read srcFile; instead it always constructs {{remoteFile}} using the 
> componentDir and the fileName of {{destFile}}:
> {code}
> Path remoteFile = new Path(compInstanceDir, fileName);
> {code} 
> To me it is a common use case that services have files in HDFS which need 
> to be localized when components get launched. (For example, if we want to 
> serve a Tensorflow model, we need to localize the Tensorflow model 
> (typically not huge, less than a GB) to local disk. Otherwise the launched 
> docker container has to access HDFS.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8080) YARN native service should support component restart policy

2018-04-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455547#comment-16455547
 ] 

Wangda Tan commented on YARN-8080:
--

The following spec can be used for testing:
{code}
{
  "version": "100",
  "name": "sleeper-service",
  "components": [
    {
      "name": "sleeper",
      "number_of_containers": 1,
      "launch_command": "sleep 1",
      "restart_policy": "NEVER",
      "resource": {
        "cpus": 1,
        "memory": "256"
      }
    }
  ]
}
{code}

> YARN native service should support component restart policy
> ---
>
> Key: YARN-8080
> URL: https://issues.apache.org/jira/browse/YARN-8080
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8080.001.patch, YARN-8080.002.patch, 
> YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch
>
>
> Existing native service assumes the service is long running and never 
> finishes. Containers will be restarted even if exit code == 0. 
> To support broader use cases, we need to allow users to specify the restart 
> policy of a component. Propose to have the following policies:
> 1) Always: containers are always restarted by the framework regardless of 
> container exit status. This is the existing/default behavior.
> 2) Never: do not restart containers in any case after a container finishes, 
> to support job-like workloads (for example a Tensorflow training job). If a 
> task exits with code == 0, we should not restart it. This can be used by 
> services which are not restartable/recoverable.
> 3) On-failure: similar to the above, only restart tasks with exit code != 0. 
> Behaviors after a component *instance* finalizes (Succeeded or Failed when 
> restart_policy != ALWAYS): 
> 1) For a single component, single instance: complete the service.
> 2) For a single component, multiple instances: other running instances of 
> the same component won't be affected by the finalized component instance. 
> The service will be terminated once all instances finalize. 
> 3) For multiple components: the service will be terminated once all 
> components finalize.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8079) Support specify files to be downloaded (localized) before containers launched by YARN

2018-04-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455541#comment-16455541
 ] 

Wangda Tan commented on YARN-8079:
--

Attached ver.6 patch, fixed all issues. The spec mentioned by Eric above 
(https://issues.apache.org/jira/browse/YARN-8079?focusedCommentId=16419665=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16419665) 
can be used for testing. 

> Support specify files to be downloaded (localized) before containers launched 
> by YARN
> -
>
> Key: YARN-8079
> URL: https://issues.apache.org/jira/browse/YARN-8079
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8079.001.patch, YARN-8079.002.patch, 
> YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, 
> YARN-8079.006.patch
>
>
> Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
> read srcFile; instead it always constructs {{remoteFile}} using the 
> componentDir and the fileName of {{destFile}}:
> {code}
> Path remoteFile = new Path(compInstanceDir, fileName);
> {code} 
> To me it is a common use case that services have files in HDFS which need 
> to be localized when components get launched. (For example, if we want to 
> serve a Tensorflow model, we need to localize the Tensorflow model 
> (typically not huge, less than a GB) to local disk. Otherwise the launched 
> docker container has to access HDFS.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8079) Support specify files to be downloaded (localized) before containers launched by YARN

2018-04-26 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8079:
-
Attachment: YARN-8079.006.patch

> Support specify files to be downloaded (localized) before containers launched 
> by YARN
> -
>
> Key: YARN-8079
> URL: https://issues.apache.org/jira/browse/YARN-8079
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8079.001.patch, YARN-8079.002.patch, 
> YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, 
> YARN-8079.006.patch
>
>
> Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
> read srcFile; instead it always constructs {{remoteFile}} using the 
> componentDir and the fileName of {{destFile}}:
> {code}
> Path remoteFile = new Path(compInstanceDir, fileName);
> {code} 
> To me it is a common use case that services have files in HDFS which need 
> to be localized when components get launched. (For example, if we want to 
> serve a Tensorflow model, we need to localize the Tensorflow model 
> (typically not huge, less than a GB) to local disk. Otherwise the launched 
> docker container has to access HDFS.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8080) YARN native service should support component restart policy

2018-04-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455539#comment-16455539
 ] 

Wangda Tan commented on YARN-8080:
--

Attached ver.6 patch, addressed all comments from Gour except the following 
comment and the documentation changes: 

bq. Will this cover the scenario of flex? ...
Yeah you're right, we should address the flex issue; one solution is to change 
succeededInstances/failedInstances after flexing down, as [~gsaha] mentioned. 

I couldn't find cycles to finish the patch, and since this is critical, it's 
better to have somebody else take over this one.

> YARN native service should support component restart policy
> ---
>
> Key: YARN-8080
> URL: https://issues.apache.org/jira/browse/YARN-8080
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8080.001.patch, YARN-8080.002.patch, 
> YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch
>
>
> Existing native service assumes the service is long running and never 
> finishes. Containers will be restarted even if exit code == 0. 
> To support broader use cases, we need to allow users to specify the restart 
> policy of a component. Propose to have the following policies:
> 1) Always: containers are always restarted by the framework regardless of 
> container exit status. This is the existing/default behavior.
> 2) Never: do not restart containers in any case after a container finishes, 
> to support job-like workloads (for example a Tensorflow training job). If a 
> task exits with code == 0, we should not restart it. This can be used by 
> services which are not restartable/recoverable.
> 3) On-failure: similar to the above, only restart tasks with exit code != 0. 
> Behaviors after a component *instance* finalizes (Succeeded or Failed when 
> restart_policy != ALWAYS): 
> 1) For a single component, single instance: complete the service.
> 2) For a single component, multiple instances: other running instances of 
> the same component won't be affected by the finalized component instance. 
> The service will be terminated once all instances finalize. 
> 3) For multiple components: the service will be terminated once all 
> components finalize.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8080) YARN native service should support component restart policy

2018-04-26 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8080:
-
Attachment: YARN-8080.006.patch

> YARN native service should support component restart policy
> ---
>
> Key: YARN-8080
> URL: https://issues.apache.org/jira/browse/YARN-8080
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8080.001.patch, YARN-8080.002.patch, 
> YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch
>
>
> Existing native service assumes the service is long running and never 
> finishes. Containers will be restarted even if exit code == 0. 
> To support broader use cases, we need to allow users to specify the restart 
> policy of a component. Propose to have the following policies:
> 1) Always: containers are always restarted by the framework regardless of 
> container exit status. This is the existing/default behavior.
> 2) Never: do not restart containers in any case after a container finishes, 
> to support job-like workloads (for example a Tensorflow training job). If a 
> task exits with code == 0, we should not restart it. This can be used by 
> services which are not restartable/recoverable.
> 3) On-failure: similar to the above, only restart tasks with exit code != 0. 
> Behaviors after a component *instance* finalizes (Succeeded or Failed when 
> restart_policy != ALWAYS): 
> 1) For a single component, single instance: complete the service.
> 2) For a single component, multiple instances: other running instances of 
> the same component won't be affected by the finalized component instance. 
> The service will be terminated once all instances finalize. 
> 3) For multiple components: the service will be terminated once all 
> components finalize.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8210) AMRMClient logging on every heartbeat to track updation of AM RM token causes too many log lines to be generated in AM logs

2018-04-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455511#comment-16455511
 ] 

Wangda Tan commented on YARN-8210:
--

+1, thanks [~suma.shivaprasad]

> AMRMClient logging on every heartbeat to track updation of AM RM token causes 
> too many log lines to be generated in AM logs
> ---
>
> Key: YARN-8210
> URL: https://issues.apache.org/jira/browse/YARN-8210
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.9.0, 3.0.0-alpha1
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-8210.1.patch
>
>
> YARN-4682 added logs to track when the AM RM token is updated, for 
> debuggability purposes. However, this is printed on every heartbeat and 
> could flood the AM logs whenever the RM's master key is rolled over, 
> especially for a long-running AM. Hence proposing to remove this log line. 
> As explained in 
> https://issues.apache.org/jira/browse/YARN-3104?focusedCommentId=14298692=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14298692
> , the AM-RM connection is not re-established, so the updated token in the 
> client's UGI is never re-sent to the RPC server, and the RM continues to 
> send the token on each heartbeat since it cannot be sure whether the client 
> really has the new token. Hence the log lines are printed on every heartbeat.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8213) Add Capacity Scheduler metrics

2018-04-26 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454579#comment-16454579
 ] 

Wangda Tan commented on YARN-8213:
--

Thanks [~cheersyang], took a quick look, haven't checked the details of the 
patch.

{code} 
CapacitySchedulerMetrics.destroy();
{code} 

We shouldn't add CS-specific logic to the RM. I think we should add an 
abstract SchedulerMetrics to: a. pull common scheduler metrics up to the base 
class; b. avoid the RM depending on any scheduler class. 

[~ywskycn] was working on some CS related metrics changes, he may have some 
thoughts. 
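A minimal sketch of that abstraction (hypothetical shape, not the committed 
API): the RM programs against the abstract base class, so it never depends on 
a concrete scheduler.

{code:java}
public abstract class SchedulerMetrics {

  // Common scheduler metrics pulled up to the base class.
  public abstract void addAllocateLatency(long millis);

  // Lifecycle hook the RM can call without knowing the scheduler type.
  public abstract void destroy();
}

class CapacitySchedulerMetrics extends SchedulerMetrics {
  @Override
  public void addAllocateLatency(long millis) {
    // CS-specific recording would go here.
  }

  @Override
  public void destroy() {
    // Unregister the CS-specific metrics source here.
  }
}
{code}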

> Add Capacity Scheduler metrics
> --
>
> Key: YARN-8213
> URL: https://issues.apache.org/jira/browse/YARN-8213
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler, metrics
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Attachments: YARN-8213.001.patch, YARN-8213.002.patch
>
>
> Currently, tuning CS performance is not that straightforward because of the 
> lack of metrics. Right now we only have {{QueueMetrics}}, which mostly 
> tracks queue-level resource counters. Propose to add CS metrics to collect 
> and display more fine-grained perf metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps

2018-04-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453297#comment-16453297
 ] 

Wangda Tan commented on YARN-4606:
--

Thanks [~eepayne] / [~maniraj...@gmail.com],

Here's my understanding of the proposed approach: 

1) When we compute {{max-am-resource-per-user}}, we use #active-users + 
#pending-users.
2) When we compute {{max-user-limit}}, we use #active-users only. 

To me this is correct and (seemingly) the same as what I proposed previously:
{code}
We should only consider a user is "active" if any of its application is active. 
And CS will use the "#active-user-which-has-at-least-one-active-app" to compute 
user-limit.

Computation of max-am-resource-per-user needs to be updated as well. We should 
get a #users-which-has-pending-apps to compute max-am-resource-per-user.
{code}

I haven't checked many details of the patch since [~maniraj...@gmail.com] is 
working on updating the tests, etc. Just one suggestion: AppSchedulingInfo is 
supposed to cache status for pending resources; it might be better to avoid 
invoking SchedulerAppAttempt's methods from AppSchedulingInfo.
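To make the two user counts concrete, a small worked example for the scenario 
in the description below (all numbers illustrative):

{code}
// Scenario from the description: user1/user2 have active apps,
// user3/user4 only have pending apps. Assume queue capacity = 100 GB.
//
// #active-users  = 2  (user1, user2)
// #pending-users = 2  (user3, user4)
//
// 2) max-user-limit divides among active users only:
//      100 GB / 2 = 50 GB per active user
// 1) max-am-resource-per-user divides among active + pending users
//    (with max-am-percent = 0.2):
//      (100 GB * 0.2) / (2 + 2) = 5 GB of AM resource per user
{code}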

> CapacityScheduler: applications could get starved because computation of 
> #activeUsers considers pending apps 
> -
>
> Key: YARN-4606
> URL: https://issues.apache.org/jira/browse/YARN-4606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Karam Singh
>Assignee: Manikandan R
>Priority: Critical
> Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, 
> YARN-4606.POC.patch
>
>
> Currently, if all applications belonging to the same user in a LeafQueue 
> are pending (caused by max-am-percent, etc.), ActiveUsersManager still 
> considers the user an active user. This could lead to starvation of active 
> applications, for example:
> - App1 (belongs to user1) / app2 (belongs to user2) are active; app3 
> (belongs to user3) / app4 (belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, only two users (user1/user2) are able to allocate new resources, 
> so the computed user-limit-resource could be lower than expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-04-25 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453046#comment-16453046
 ] 

Wangda Tan commented on YARN-8193:
--

+1, thanks [~Zian Chen], will commit by today if no objections.

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Critical
> Attachments: YARN-8193.001.patch, YARN-8193.002.patch
>
>
> When running massive queries successively, at some point the RM just hangs 
> and stops allocating resources. At the point the RM hangs, YARN throws a 
> NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.
> There's sufficient space given to yarn.nodemanager.local-dirs (not a node 
> health issue; the RM didn't report any node being unhealthy). There is no 
> fixed trigger for this (query or operation).
> This problem goes away on restarting ResourceManager. No NM restart is 
> required. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8183) Fix ConcurrentModificationException inside RMAppAttemptMetrics#convertAtomicLongMaptoLongMap

2018-04-24 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8183:
-
Summary: Fix ConcurrentModificationException inside 
RMAppAttemptMetrics#convertAtomicLongMaptoLongMap  (was: yClient for Kill 
Application stuck in infinite loop with message "Waiting for Application to be 
killed")

> Fix ConcurrentModificationException inside 
> RMAppAttemptMetrics#convertAtomicLongMaptoLongMap
> 
>
> Key: YARN-8183
> URL: https://issues.apache.org/jira/browse/YARN-8183
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Sumana Sathish
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-8183.1.patch, YARN-8183.2.patch
>
>
> The yarn client gets stuck killing the application, repeatedly printing the 
> following message
> {code}
> INFO impl.YarnClientImpl: Waiting for application 
> application_1523604760756_0001 to be killed.{code}
> RM shows following exception
> {code}
>  ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(995)) - 
> Error in handling event type APP_UPDATE_SAVED for application application_ID
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1442)
> at java.util.HashMap$EntryIterator.next(HashMap.java:1476)
> at java.util.HashMap$EntryIterator.next(HashMap.java:1474)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.convertAtomicLongMaptoLongMap(RMAppAttemptMetrics.java:212)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:133)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getRMAppMetrics(RMAppImpl.java:1660)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.appFinished(TimelineServiceV2Publisher.java:178)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.CombinedSystemMetricsPublisher.appFinished(CombinedSystemMetricsPublisher.java:73)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalTransition.transition(RMAppImpl.java:1470)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$AppKilledTransition.transition(RMAppImpl.java:1408)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$AppKilledTransition.transition(RMAppImpl.java:1400)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalStateSavedTransition.transition(RMAppImpl.java:1177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalStateSavedTransition.transition(RMAppImpl.java:1164)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:898)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:118)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:993)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:977)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2

2018-04-24 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451385#comment-16451385
 ] 

Wangda Tan commented on YARN-8200:
--

+1 to have a branch for this, which makes it easier to know which patches got 
backported.

> Backport resource types/GPU features to branch-2
> 
>
> Key: YARN-8200
> URL: https://issues.apache.org/jira/browse/YARN-8200
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>
> Currently we have a need for GPU scheduling on our YARN clusters to support 
> deep learning workloads. However, our main production clusters are running 
> older versions of branch-2 (2.7 in our case). To prevent supporting too many 
> very different hadoop versions across multiple clusters, we would like to 
> backport the resource types/resource profiles feature to branch-2, as well as 
> the GPU specific support.
>  
> We have done a trial backport of YARN-3926 and some miscellaneous patches in 
> YARN-7069 based on issues we uncovered, and the backport was fairly smooth. 
> We also did a trial backport of most of YARN-6223 (sans docker support).
>  
> Regarding the backports, perhaps we can do the development in a feature 
> branch and then merge to branch-2 when ready.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8183) yClient for Kill Application stuck in infinite loop with message "Waiting for Application to be killed"

2018-04-24 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16451339#comment-16451339
 ] 

Wangda Tan commented on YARN-8183:
--

Thanks [~suma.shivaprasad], +1, pending Jenkins.

> yClient for Kill Application stuck in infinite loop with message "Waiting for 
> Application to be killed"
> ---
>
> Key: YARN-8183
> URL: https://issues.apache.org/jira/browse/YARN-8183
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Sumana Sathish
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-8183.1.patch, YARN-8183.2.patch
>
>
> The yarn client gets stuck killing the application, repeatedly printing the 
> following message
> {code}
> INFO impl.YarnClientImpl: Waiting for application 
> application_1523604760756_0001 to be killed.{code}
> RM shows following exception
> {code}
>  ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(995)) - 
> Error in handling event type APP_UPDATE_SAVED for application application_ID
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1442)
> at java.util.HashMap$EntryIterator.next(HashMap.java:1476)
> at java.util.HashMap$EntryIterator.next(HashMap.java:1474)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.convertAtomicLongMaptoLongMap(RMAppAttemptMetrics.java:212)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:133)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getRMAppMetrics(RMAppImpl.java:1660)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.appFinished(TimelineServiceV2Publisher.java:178)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.CombinedSystemMetricsPublisher.appFinished(CombinedSystemMetricsPublisher.java:73)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalTransition.transition(RMAppImpl.java:1470)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$AppKilledTransition.transition(RMAppImpl.java:1408)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$AppKilledTransition.transition(RMAppImpl.java:1400)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalStateSavedTransition.transition(RMAppImpl.java:1177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalStateSavedTransition.transition(RMAppImpl.java:1164)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:898)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:118)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:993)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:977)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2

2018-04-23 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449312#comment-16449312
 ] 

Wangda Tan commented on YARN-8200:
--

[~chris.douglas], I think [~sunilg] has already pointed out that the multiple 
resource type backport could be very tricky. IIRC, [~templedf] spent lots of 
time last year backporting from trunk to branch-3.0, and the backport caused 
several issues. And now it diverges more: we have more changes (about 5+ 
months' worth) added to trunk, including many scheduler-related changes.

[~shv], I understand you want a bridge release. I'm still +1 to have a 2.x 
bridge release and to backport GPU-related changes to branch-2. But it might 
be worthwhile to look at the 3.x releases and fix migration issues so all 
users who want to migrate to 3.x can benefit from such efforts. Just my $0.02.

> Backport resource types/GPU features to branch-2
> 
>
> Key: YARN-8200
> URL: https://issues.apache.org/jira/browse/YARN-8200
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>
> Currently we have a need for GPU scheduling on our YARN clusters to support 
> deep learning workloads. However, our main production clusters are running 
> older versions of branch-2 (2.7 in our case). To prevent supporting too many 
> very different hadoop versions across multiple clusters, we would like to 
> backport the resource types/resource profiles feature to branch-2, as well as 
> the GPU specific support.
>  
> We have done a trial backport of YARN-3926 and some miscellaneous patches in 
> YARN-7069 based on issues we uncovered, and the backport was fairly smooth. 
> We also did a trial backport of most of YARN-6223 (sans docker support).
>  
> Regarding the backports, perhaps we can do the development in a feature 
> branch and then merge to branch-2 when ready.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps

2018-04-23 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449075#comment-16449075
 ] 

Wangda Tan commented on YARN-4606:
--

Thanks [~eepayne] / [~maniraj...@gmail.com] for working on the fix. I just 
unassigned myself; please feel free to assign it to yourself if you plan to 
work on it.

I'm going to check the patch / approach in the next two days.

> CapacityScheduler: applications could get starved because computation of 
> #activeUsers considers pending apps 
> -
>
> Key: YARN-4606
> URL: https://issues.apache.org/jira/browse/YARN-4606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Karam Singh
>Priority: Critical
> Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, 
> YARN-4606.POC.patch
>
>
> Currently, if all applications belonging to the same user in a LeafQueue 
> are pending (caused by max-am-percent, etc.), ActiveUsersManager still 
> considers the user an active user. This could lead to starvation of active 
> applications, for example:
> - App1 (belongs to user1) / app2 (belongs to user2) are active; app3 
> (belongs to user3) / app4 (belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, only two users (user1/user2) are able to allocate new resources, 
> so the computed user-limit-resource could be lower than expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps

2018-04-23 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-4606:


Assignee: (was: Wangda Tan)

> CapacityScheduler: applications could get starved because computation of 
> #activeUsers considers pending apps 
> -
>
> Key: YARN-4606
> URL: https://issues.apache.org/jira/browse/YARN-4606
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, capacityscheduler
>Affects Versions: 2.8.0, 2.7.1
>Reporter: Karam Singh
>Priority: Critical
> Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, 
> YARN-4606.POC.patch
>
>
> Currently, if all applications belonging to the same user in a LeafQueue 
> are pending (caused by max-am-percent, etc.), ActiveUsersManager still 
> considers the user an active user. This could lead to starvation of active 
> applications, for example:
> - App1 (belongs to user1) / app2 (belongs to user2) are active; app3 
> (belongs to user3) / app4 (belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, only two users (user1/user2) are able to allocate new resources, 
> so the computed user-limit-resource could be lower than expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8200) Backport resource types/GPU features to branch-2

2018-04-23 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448943#comment-16448943
 ] 

Wangda Tan commented on YARN-8200:
--

[~jhung], I would suggest trying 3.x instead of backporting this to 2.x, so 
everybody is on the same codebase and improving it. To me, the effort of 
backporting YARN-3926 + YARN-6223 will be comparable to upgrading to a 3.x 
release and fixing (incompatibility) issues. Both features are more than 0.5 
MB of changes and touch many files.

I'm fine with backporting this to branch-2, but the backport itself could be 
very tricky.

> Backport resource types/GPU features to branch-2
> 
>
> Key: YARN-8200
> URL: https://issues.apache.org/jira/browse/YARN-8200
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
>
> Currently we have a need for GPU scheduling on our YARN clusters to support 
> deep learning workloads. However, our main production clusters are running 
> older versions of branch-2 (2.7 in our case). To prevent supporting too many 
> very different hadoop versions across multiple clusters, we would like to 
> backport the resource types/resource profiles feature to branch-2, as well as 
> the GPU specific support.
>  
> We have done a trial backport of YARN-3926 and some miscellaneous patches in 
> YARN-7069 based on issues we uncovered, and the backport was fairly smooth. 
> We also did a trial backport of most of YARN-6223 (sans docker support).
>  
> Regarding the backports, perhaps we can do the development in a feature 
> branch and then merge to branch-2 when ready.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8183) yClient for Kill Application stuck in infinite loop with message "Waiting for Application to be killed"

2018-04-23 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448511#comment-16448511
 ] 

Wangda Tan commented on YARN-8183:
--

Thanks [~suma.shivaprasad], 

Overall the fix looks good.
Not related to the patch, but one piece of existing logic needs to be updated. 
{code}
  public Resource getResourcePreempted() {
try {
  readLock.lock();
  return resourcePreempted;
} finally {
  readLock.unlock();
}
  }
{code}
Instead of doing this, we should return resourcePreempted.clone() so the 
consumer gets a copy of the resource. 
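A minimal sketch of the suggested change (assuming 
{{resourcePreempted.clone()}} yields a copy, as the comment above suggests):

{code:java}
public Resource getResourcePreempted() {
  readLock.lock();
  try {
    // Hand back a copy so the caller cannot race with in-place updates.
    return resourcePreempted.clone();
  } finally {
    readLock.unlock();
  }
}
{code}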

> yClient for Kill Application stuck in infinite loop with message "Waiting for 
> Application to be killed"
> ---
>
> Key: YARN-8183
> URL: https://issues.apache.org/jira/browse/YARN-8183
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Sumana Sathish
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-8183.1.patch
>
>
> The yarn client gets stuck killing the application, repeatedly printing the 
> following message
> {code}
> INFO impl.YarnClientImpl: Waiting for application 
> application_1523604760756_0001 to be killed.{code}
> RM shows following exception
> {code}
>  ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(995)) - 
> Error in handling event type APP_UPDATE_SAVED for application application_ID
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1442)
> at java.util.HashMap$EntryIterator.next(HashMap.java:1476)
> at java.util.HashMap$EntryIterator.next(HashMap.java:1474)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.convertAtomicLongMaptoLongMap(RMAppAttemptMetrics.java:212)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:133)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.getRMAppMetrics(RMAppImpl.java:1660)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.appFinished(TimelineServiceV2Publisher.java:178)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.CombinedSystemMetricsPublisher.appFinished(CombinedSystemMetricsPublisher.java:73)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalTransition.transition(RMAppImpl.java:1470)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$AppKilledTransition.transition(RMAppImpl.java:1408)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$AppKilledTransition.transition(RMAppImpl.java:1400)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalStateSavedTransition.transition(RMAppImpl.java:1177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$FinalStateSavedTransition.transition(RMAppImpl.java:1164)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:898)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:118)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:993)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:977)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8169) Review RackResolver.java

2018-04-18 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442456#comment-16442456
 ] 

Wangda Tan commented on YARN-8169:
--

[~belugabehr], thanks for the clarification, very helpful!

> Review RackResolver.java
> 
>
> Key: YARN-8169
> URL: https://issues.apache.org/jira/browse/YARN-8169
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.0.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: YARN-8169.1.patch, YARN.8169.2.patch
>
>
> # Use SLF4J
> # Fix some checkstyle warnings
> # Minor clean up



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8169) Review RackResolver.java

2018-04-18 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16442087#comment-16442087
 ] 

Wangda Tan commented on YARN-8169:
--

[~belugabehr], 

it's better to keep the
{code:java}
if (LOG.isDebugEnabled()) { ... } {code}
check for performance.
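For what it's worth, with SLF4J the parameterized form already avoids string 
concatenation when debug is off; the explicit guard mainly pays off when 
computing an argument is itself expensive. A sketch, with hypothetical 
variables and helper:

{code:java}
// No concatenation or formatting happens unless debug logging is enabled.
LOG.debug("Resolved {} to rack {}", hostName, rackName);

// The explicit guard still helps when building an argument is costly.
if (LOG.isDebugEnabled()) {
  LOG.debug("Full topology: {}", dumpTopology()); // dumpTopology() is hypothetical
}
{code}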

> Review RackResolver.java
> 
>
> Key: YARN-8169
> URL: https://issues.apache.org/jira/browse/YARN-8169
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 3.0.1
>Reporter: BELUGA BEHR
>Assignee: BELUGA BEHR
>Priority: Trivial
> Attachments: YARN-8169.1.patch
>
>
> # Use SLF4J
> # Fix some checkstyle warnings
> # Minor clean up



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-14 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438548#comment-16438548
 ] 

Wangda Tan commented on YARN-8135:
--

I just attached the WIP POC patch (poc.001). I know this is very early and 
incomplete, but I want to post it here to get some feedback.

*What is completed:*

1) Run a training job (single node).

2) Support user-specified Docker images.

3) Support DNS for tasks (like worker0.tfjob001..).

4) Support easier access to HDFS.

5) Support GPU isolation.

*What to do next for POC:*

1) Model serving (WIP).

2) Distributed training (WIP).

3) Determine development plans.

I will be out at a conference in the upcoming week, so please expect some delay 
in my responses.

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8135.poc.001.patch
>
>
> Description:
> *Goals:*
>  - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support run distributed Tensorflow jobs with simple configs.
>  - Support run user-specified Docker images.
>  - Support specify GPU and other resources.
>  - Support launch tensorboard if user specified.
>  - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle can let human to explore deep 
> places. B-)
> h3. {color:#FF}Please refer to on-going design doc, and add your 
> thoughts: 
> {color:#33}[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-14 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8135:
-
Attachment: YARN-8135.poc.001.patch

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-8135.poc.001.patch
>
>
> Description:
> *Goals:*
>  - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support run distributed Tensorflow jobs with simple configs.
>  - Support run user-specified Docker images.
>  - Support specify GPU and other resources.
>  - Support launch tensorboard if user specified.
>  - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle can let human to explore deep 
> places. B-)
> h3. {color:#FF}Please refer to on-going design doc, and add your 
> thoughts: 
> {color:#33}[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-14 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438543#comment-16438543
 ] 

Wangda Tan commented on YARN-8135:
--

I just removed some content from the description and put a link to the ongoing 
design doc: 
[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]

Please feel free to add your thoughts / feedback.

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>
> Description:
> *Goals:*
>  - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support run distributed Tensorflow jobs with simple configs.
>  - Support run user-specified Docker images.
>  - Support specify GPU and other resources.
>  - Support launch tensorboard if user specified.
>  - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle can let human to explore deep 
> places. B-)
> h3. {color:#FF}Please refer to on-going design doc, and add your 
> thoughts: 
> {color:#33}[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-14 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8135:
-
Description: 
Description:

*Goals:*
 - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs 
on YARN.
 - Allow jobs easy access to data/models in HDFS and other storage systems.
 - Can launch services to serve Tensorflow/MXNet models.
 - Support running distributed Tensorflow jobs with simple configs.
 - Support running user-specified Docker images.
 - Support specifying GPU and other resources.
 - Support launching Tensorboard if the user requests it.
 - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because a submarine is the only vehicle that can take humans to explore deep 
places. B-)

h3. {color:#FF}Please refer to on-going design doc, and add your thoughts: 
{color:#33}[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}{color}

  was:
Description:

*Goals:*
 - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs on 
YARN.
 - Allow jobs easy access data/models in HDFS and other storages.
 - Can launch services to serve Tensorflow/MXNet models.
 - Support run distributed Tensorflow jobs with simple configs.
 - Support run user-specified Docker images.
 - Support specify GPU and other resources.
 - Support launch tensorboard if user specified.
 - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because Submarine is the only vehicle can let human to explore deep places. 
B-)

Please refer to on-going design doc, and add your thoughts: 
[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]


> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>
> Description:
> *Goals:*
>  - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support run distributed Tensorflow jobs with simple configs.
>  - Support run user-specified Docker images.
>  - Support specify GPU and other resources.
>  - Support launch tensorboard if user specified.
>  - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle can let human to explore deep 
> places. B-)
> h3. {color:#FF}Please refer to on-going design doc, and add your 
> thoughts: 
> {color:#33}[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-14 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8135:
-
Description: 
Description:

*Goals:*
 - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs 
on YARN.
 - Allow jobs easy access to data/models in HDFS and other storage systems.
 - Can launch services to serve Tensorflow/MXNet models.
 - Support running distributed Tensorflow jobs with simple configs.
 - Support running user-specified Docker images.
 - Support specifying GPU and other resources.
 - Support launching Tensorboard if the user requests it.
 - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because a submarine is the only vehicle that can take humans to explore deep 
places. B-)

Please refer to on-going design doc, and add your thoughts: 
[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]

  was:
Description:

*Goals:*
 - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs on 
YARN.
 - Allow jobs easy access data/models in HDFS and other storages.
 - Can launch services to serve Tensorflow/MXNet models.
 - Support run distributed Tensorflow jobs with simple configs.
 - Support run user-specified Docker images.
 - Support specify GPU and other resources.
 - Support launch tensorboard if user specified.
 - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because Submarine is the only vehicle can let human to explore deep places. 
B-)

Compare to other projects:

!image-2018-04-09-14-44-41-101.png!

*Notes:*

*GPU Isolation of XLearning project is achieved by patched YARN, which is 
different from community’s GPU isolation solution.

**XLearning needs few modification to read ClusterSpec from env.

*References:*
 - TensorflowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
 - TensorFlowOnYARN (Intel): [https://github.com/Intel-bigdata/TensorFlowOnYARN]
 - Spark Deep Learning (Databricks): 
[https://github.com/databricks/spark-deep-learning]
 - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
 - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]


> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>
> Description:
> *Goals:*
>  - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support run distributed Tensorflow jobs with simple configs.
>  - Support run user-specified Docker images.
>  - Support specify GPU and other resources.
>  - Support launch tensorboard if user specified.
>  - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle can let human to explore deep 
> places. B-)
> Please refer to on-going design doc, and add your thoughts: 
> [https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-14 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8135:
-
Attachment: (was: image-2018-04-09-14-44-41-101.png)

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>
> Description:
> *Goals:*
>  - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support run distributed Tensorflow jobs with simple configs.
>  - Support run user-specified Docker images.
>  - Support specify GPU and other resources.
>  - Support launch tensorboard if user specified.
>  - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle can let human to explore deep 
> places. B-)
> Compare to other projects:
> !image-2018-04-09-14-44-41-101.png!
> *Notes:*
> *GPU Isolation of XLearning project is achieved by patched YARN, which is 
> different from community’s GPU isolation solution.
> **XLearning needs few modification to read ClusterSpec from env.
> *References:*
>  - TensorflowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
>  - TensorFlowOnYARN (Intel): 
> [https://github.com/Intel-bigdata/TensorFlowOnYARN]
>  - Spark Deep Learning (Databricks): 
> [https://github.com/databricks/spark-deep-learning]
>  - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
>  - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-14 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8135:
-
Attachment: (was: image-2018-04-09-14-35-16-778.png)

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>
> Description:
> *Goals:*
>  - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support run distributed Tensorflow jobs with simple configs.
>  - Support run user-specified Docker images.
>  - Support specify GPU and other resources.
>  - Support launch tensorboard if user specified.
>  - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle can let human to explore deep 
> places. B-)
> Compare to other projects:
> !image-2018-04-09-14-44-41-101.png!
> *Notes:*
> *GPU Isolation of XLearning project is achieved by patched YARN, which is 
> different from community’s GPU isolation solution.
> **XLearning needs few modification to read ClusterSpec from env.
> *References:*
>  - TensorflowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
>  - TensorFlowOnYARN (Intel): 
> [https://github.com/Intel-bigdata/TensorFlowOnYARN]
>  - Spark Deep Learning (Databricks): 
> [https://github.com/databricks/spark-deep-learning]
>  - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
>  - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8138) Add unit test to validate queue priority preemption works under node partition.

2018-04-14 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8138:
-
Issue Type: Sub-task  (was: Bug)
Parent: YARN-8159

> Add unit test to validate queue priority preemption works under node 
> partition.
> ---
>
> Key: YARN-8138
> URL: https://issues.apache.org/jira/browse/YARN-8138
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Charan Hebri
>Assignee: Zian Chen
>Priority: Minor
> Attachments: YARN-8138.001.patch, YARN-8138.002.patch, 
> YARN-8138.003.patch
>
>
> Add unit test to validate queue priority preemption works under node 
> partition.
> Test configuration:
>  queue A (capacity=50, priority=1)
>  queue B (capacity=50, priority=2)
>  both have accessible-node-labels set to x
>  A.accessible-node-labels.x.capacity = 50
>  B.accessible-node-labels.x.capacity = 50
>  Along with this pre-emption related properties have been set.
> Test steps:
>  - Submit an application A1 to B, with am-container = container = 4096, no. 
> of containers = 4
>  - Submit an application A2 to A, with am-container = 1024, container = 2048, 
> no of containers = (NUM_NM-1)
>  - Kill application A1
>  - Submit an application A3 to B with am-container=container=5210, no. of 
> containers=NUM_NM
>  - Expectation is that containers are pre-empted from application A2 to A3
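As a rough sketch only, the queue setup from the description could be expressed 
with the standard Capacity Scheduler property keys like this (the label name 
{{x}} and queue paths follow the description above; the surrounding test 
harness is omitted):
{code:java}
import org.apache.hadoop.conf.Configuration;

public class PriorityPreemptionQueueSetup {
  static Configuration build() {
    Configuration conf = new Configuration();
    String prefix = "yarn.scheduler.capacity.root.";

    conf.set(prefix + "queues", "A,B");

    // queue A: capacity=50, priority=1, partition x accessible
    conf.setInt(prefix + "A.capacity", 50);
    conf.setInt(prefix + "A.priority", 1);
    conf.set(prefix + "A.accessible-node-labels", "x");
    conf.setInt(prefix + "A.accessible-node-labels.x.capacity", 50);

    // queue B: capacity=50, priority=2, partition x accessible
    conf.setInt(prefix + "B.capacity", 50);
    conf.setInt(prefix + "B.priority", 2);
    conf.set(prefix + "B.accessible-node-labels", "x");
    conf.setInt(prefix + "B.accessible-node-labels.x.capacity", 50);

    // the preemption monitor must be enabled for containers to be killed
    conf.setBoolean("yarn.resourcemanager.scheduler.monitor.enable", true);
    return conf;
  }
}
{code}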



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8159) [Umbrella] Fixes for Multiple Resource Type Preemption in Capacity Scheduler

2018-04-14 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8159:
-
Description: 
There're a couple of JIRAs open for multiple resource types preemption in CS. 
It might be better to group them to make sure everybody is on the same page. 

In addition to that, I don't believe our preemption logic can properly handle 
multiple resource type preemption when YARN-5881 is being used (different 
percentages of shares for different resource types). We may need some overhaul 
of the preemption logic for that.

  was:We see a couple of 


> [Umbrella] Fixes for Multiple Resource Type Preemption in Capacity Scheduler
> 
>
> Key: YARN-8159
> URL: https://issues.apache.org/jira/browse/YARN-8159
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Zian Chen
>Priority: Major
>
> There're a couple of JIRAs open for multiple resource types preemption in CS. 
> It might be better to group them to make sure everybody is on the same page. 
> In addition to that, I don't believe our preemption logic can properly handle 
> multiple resource type preemption when YARN-5881 is being used. (Different 
> percentage of shares for different resource types). We may need some overhaul 
> of the preemption logics for that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8159) [Umbrella] Fixes for Multiple Resource Type Preemption in Capacity Scheduler

2018-04-14 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8159:
-
Description: We see a couple of 

> [Umbrella] Fixes for Multiple Resource Type Preemption in Capacity Scheduler
> 
>
> Key: YARN-8159
> URL: https://issues.apache.org/jira/browse/YARN-8159
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Zian Chen
>Priority: Major
>
> We see a couple of 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6538) Inter Queue preemption is not happening when DRF is configured

2018-04-13 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-6538:
-
Issue Type: Sub-task  (was: Bug)
Parent: YARN-8159

> Inter Queue preemption is not happening when DRF is configured
> --
>
> Key: YARN-6538
> URL: https://issues.apache.org/jira/browse/YARN-6538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Major
>
> Cluster capacity of . Here memory is plentiful while vcores 
> are scarce. If applications have more demand, vcores might be exhausted. 
> Inter-queue preemption ideally has to kick in once vcores are over-utilized. 
> However, preemption is not happening.
> Analysis:
> In {{AbstractPreemptableResourceCalculator.computeFixpointAllocation}}, 
> {code}
> // assign all cluster resources until no more demand, or no resources are
> // left
> while (!orderedByNeed.isEmpty() && Resources.greaterThan(rc, totGuarant,
> unassigned, Resources.none())) {
> {code}
>  will loop even when vcores are 0 (because memory is still +ve). Hence we end 
> up with more vcores in idealAssigned, which causes no-preemption cases.
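As a hedged illustration of the analysis above (all numbers assumed), the DRF 
comparison stays "greater than none" even when one resource is exhausted:
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class FixpointLoopDemo {
  public static void main(String[] args) {
    ResourceCalculator rc = new DominantResourceCalculator();
    Resource total = Resource.newInstance(1000 * 1024, 1000); // MB, vcores
    Resource unassigned = Resource.newInstance(100 * 1024, 0); // vcores gone

    // The dominant share of <100 GB, 0 vcores> is still positive, so the
    // fix-point loop keeps iterating and keeps growing idealAssigned in
    // the vcores dimension it can no longer satisfy:
    System.out.println(
        Resources.greaterThan(rc, total, unassigned, Resources.none())); // true
  }
}
{code}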



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8020) when DRF is used, preemption does not trigger due to incorrect idealAssigned

2018-04-13 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8020:
-
Issue Type: Sub-task  (was: Bug)
Parent: YARN-8159

> when DRF is used, preemption does not trigger due to incorrect idealAssigned
> 
>
> Key: YARN-8020
> URL: https://issues.apache.org/jira/browse/YARN-8020
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: kyungwan nam
>Priority: Major
>
> I’ve met that Inter Queue Preemption does not work.
> It happens when DRF is used and submitting application with a large number of 
> vcores.
> IMHO, idealAssigned can be set incorrectly by following code.
> {code}
> // This function "accepts" all the resources it can (pending) and return
> // the unused ones
> Resource offer(Resource avail, ResourceCalculator rc,
> Resource clusterResource, boolean considersReservedResource) {
>   Resource absMaxCapIdealAssignedDelta = Resources.componentwiseMax(
>   Resources.subtract(getMax(), idealAssigned),
>   Resource.newInstance(0, 0));
>   // accepted = min{avail,
>   //   max - assigned,
>   //   current + pending - assigned,
>   //   # Make sure a queue will not get more than max of its
>   //   # used/guaranteed, this is to make sure preemption won't
>   //   # happen if all active queues are beyond their guaranteed
>   //   # This is for leaf queue only.
>   //   max(guaranteed, used) - assigned}
>   // remain = avail - accepted
>   Resource accepted = Resources.min(rc, clusterResource,
>   absMaxCapIdealAssignedDelta,
>   Resources.min(rc, clusterResource, avail, Resources
>   /*
>* When we're using FifoPreemptionSelector (considerReservedResource
>* = false).
>*
>* We should deduct reserved resource from pending to avoid 
> excessive
>* preemption:
>*
>* For example, if an under-utilized queue has used = reserved = 20.
>* Preemption policy will try to preempt 20 containers (which is not
>* satisfied) from different hosts.
>*
>* In FifoPreemptionSelector, there's no guarantee that preempted
>* resource can be used by pending request, so policy will preempt
>* resources repeatedly.
>*/
>   .subtract(Resources.add(getUsed(),
>   (considersReservedResource ? pending : pendingDeductReserved)),
>   idealAssigned)));
> {code}
> let’s say,
> * cluster resource : 
> * idealAssigned(assigned): 
> * avail: 
> * current: 
> * pending: 
> current + pending - assigned: 
> min ( avail, (current + pending - assigned) ) : 
> accepted: 
> as a result, idealAssigned will be , which does not 
> trigger preemption.
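Since the concrete values above were stripped in archiving, here is a 
self-contained illustration with assumed numbers of how a DRF-based 
{{Resources.min}} can accept more of one resource than is actually available:
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DominantResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class IdealAssignedDemo {
  public static void main(String[] args) {
    ResourceCalculator rc = new DominantResourceCalculator();
    Resource cluster = Resource.newInstance(1000 * 1024, 1000); // MB, vcores

    // Assumed numbers: plenty of memory left, almost no vcores.
    Resource avail = Resource.newInstance(500 * 1024, 2);
    // The queue's remaining demand (current + pending - assigned) is
    // vcore-heavy.
    Resource demand = Resource.newInstance(10 * 1024, 400);

    // DRF compares whole vectors by dominant share (0.5 vs 0.4) and
    // returns one of them wholesale -- here demand, including 400 vcores
    // even though only 2 are actually available. idealAssigned is then
    // inflated, and preemption never triggers:
    Resource accepted = Resources.min(rc, cluster, avail, demand);
    System.out.println(accepted); // <memory:10240, vCores:400>
  }
}
{code}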



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler

2018-04-12 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436488#comment-16436488
 ] 

Wangda Tan commented on YARN-8149:
--

[~tgraves], 

Preemption for large reserved containers is already handled by the existing code 
path. It won't guarantee that all reserved containers can be satisfied, but it 
can alleviate the problem a lot: https://issues.apache.org/jira/browse/YARN-4390.

I agree that we cannot remove this method anytime soon (sadly); let's think 
more about how to better do reservation + preemption. I added 
moveReservedContainer (swap) to the CS part of YARN-5864. It is possible that we 
can leverage that method to do better reservation. 

> Revisit behavior of Re-Reservation in Capacity Scheduler
> 
>
> Key: YARN-8149
> URL: https://issues.apache.org/jira/browse/YARN-8149
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Frankly speaking, I'm not sure why we need the re-reservation. The formula is 
> not that easy to understand:
> Inside: 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}
> {code:java}
> starvation = (re-reservation / #reserved-container) * 
>  (1 - min(requested-resource / max-alloc, 
>   (max-alloc - min-alloc) / max-alloc))
> should_allocate = starvation + requiredContainers - reservedContainers > 0{code}
> I think we should be able to remove the starvation computation; just checking 
> requiredContainers > reservedContainers should be enough.
> In a large cluster, we can easily overflow re-reservation to MAX_INT, see 
> YARN-7636. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8138) Add unit test to validate queue priority preemption works under node partition.

2018-04-12 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436342#comment-16436342
 ] 

Wangda Tan commented on YARN-8138:
--

[~Zian Chen], would you mind checking the failed unit tests as well as the 
checkstyle issues? 

> Add unit test to validate queue priority preemption works under node 
> partition.
> ---
>
> Key: YARN-8138
> URL: https://issues.apache.org/jira/browse/YARN-8138
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Charan Hebri
>Assignee: Zian Chen
>Priority: Minor
> Attachments: YARN-8138.001.patch, YARN-8138.002.patch
>
>
> Add unit test to validate queue priority preemption works under node 
> partition.
> Test configuration:
>  queue A (capacity=50, priority=1)
>  queue B (capacity=50, priority=2)
>  both have accessible-node-labels set to x
>  A.accessible-node-labels.x.capacity = 50
>  B.accessible-node-labels.x.capacity = 50
>  Along with this pre-emption related properties have been set.
> Test steps:
>  - Submit an application A1 to B, with am-container = container = 4096, no. 
> of containers = 4
>  - Submit an application A2 to A, with am-container = 1024, container = 2048, 
> no of containers = (NUM_NM-1)
>  - Kill application A1
>  - Submit an application A3 to B with am-container=container=5210, no. of 
> containers=NUM_NM
>  - Expectation is that containers are pre-empted from application A2 to A3



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler

2018-04-12 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436327#comment-16436327
 ] 

Wangda Tan commented on YARN-8149:
--

Thanks [~tgraves] for the suggestions. 

To your question:
{quote}are you going to do anything with starvation then or allocation a 
certain % more then what is required?
{quote}
Not yet.
{quote}Is in queue preemption on by default?
{quote}
No, but we see a large number of users / clusters enabling it.

What we should probably do is make it configurable, test it in a large cluster 
running for a long time, and remove it only once we're confident about it.

> Revisit behavior of Re-Reservation in Capacity Scheduler
> 
>
> Key: YARN-8149
> URL: https://issues.apache.org/jira/browse/YARN-8149
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Frankly speaking, I'm not sure why we need the re-reservation. The formula is 
> not that easy to understand:
> Inside: 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}
> {code:java}
> starvation = (re-reservation / #reserved-container) * 
>  (1 - min(requested-resource / max-alloc, 
>   (max-alloc - min-alloc) / max-alloc))
> should_allocate = starvation + requiredContainers - reservedContainers > 0{code}
> I think we should be able to remove the starvation computation; just checking 
> requiredContainers > reservedContainers should be enough.
> In a large cluster, we can easily overflow re-reservation to MAX_INT, see 
> YARN-7636. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7930) Add configuration to initialize RM with configured labels.

2018-04-12 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436322#comment-16436322
 ] 

Wangda Tan commented on YARN-7930:
--

[~asuresh], [~abmodi], 

We thought about this during the initial design of node labels.

The problem with doing this is that node labels could be removed after an RM 
restart. 

For example:
{code:java}
In config, preconfigure a/b/c
After RM start, add d and remove a {code}
In this case, what should we do after RM restart? Should we read from the config 
again, or skip the config and only read from the node label store? 

We had to deal with some other corner cases like this, so we didn't do it in the 
beginning. 

I will be fine with it if you think through the corner cases and how users 
would use this feature.
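A hedged sketch of one possible reconciliation policy (names assumed for 
illustration, not a proposed patch): treat the configured labels as seed data 
for the first start only, and let the label store win on every restart:
{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ConfiguredLabelSeeder {
  /**
   * Seed cluster node labels from configuration only when the label
   * store is empty (i.e. first RM start). On restart the store -- which
   * reflects later admin changes such as "add d, remove a" -- is
   * authoritative and the configured list is ignored.
   */
  static Set<String> resolveStartupLabels(Set<String> fromStore,
      String... fromConfig) {
    if (!fromStore.isEmpty()) {
      return fromStore; // store survives restarts; config is only a seed
    }
    return new HashSet<>(Arrays.asList(fromConfig));
  }
}
{code}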

> Add configuration to initialize RM with configured labels.
> --
>
> Key: YARN-7930
> URL: https://issues.apache.org/jira/browse/YARN-7930
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Abhishek Modi
>Assignee: Abhishek Modi
>Priority: Major
> Attachments: YARN-7930.001.patch, YARN-7930.002.patch, 
> YARN-7930.003.patch, YARN-7930.004.patch, YARN-7930.005.patch
>
>
> At present, the only way to create labels is using admin API. Sometimes, 
> there is a requirement to start the cluster with pre-configured node labels. 
> This Jira introduces yarn configurations to start RM with predefined node 
> labels.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler

2018-04-12 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436109#comment-16436109
 ] 

Wangda Tan commented on YARN-8149:
--

Thanks [~cheersyang] for pointing to the original Jira. 

I would say this could be more harmful than useful: re-reservation can be as 
large as MAX_INT, which means an app could reserve on many nodes even if the app 
has only one pending large resource request. With preemption enhancements like 
surgical preemption, I don't think we need this any more.

I still want to hear thoughts from others before taking action.

> Revisit behavior of Re-Reservation in Capacity Scheduler
> 
>
> Key: YARN-8149
> URL: https://issues.apache.org/jira/browse/YARN-8149
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Frankly speaking, I'm not sure why we need the re-reservation. The formula is 
> not that easy to understand:
> Inside: 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}
> {code:java}
> starvation = (re-reservation / #reserved-container) * 
>  (1 - min(requested-resource / max-alloc, 
>   (max-alloc - min-alloc) / max-alloc))
> should_allocate = starvation + requiredContainers - reservedContainers > 0{code}
> I think we should be able to remove the starvation computation; just checking 
> requiredContainers > reservedContainers should be enough.
> In a large cluster, we can easily overflow re-reservation to MAX_INT, see 
> YARN-7636. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8138) Add unit test to validate queue priority preemption works under node partition.

2018-04-11 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8138:
-
Target Version/s: 3.2.0, 3.1.1

> Add unit test to validate queue priority preemption works under node 
> partition.
> ---
>
> Key: YARN-8138
> URL: https://issues.apache.org/jira/browse/YARN-8138
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Charan Hebri
>Assignee: Zian Chen
>Priority: Minor
> Attachments: YARN-8138.001.patch, YARN-8138.002.patch
>
>
> There seems to be an issue with pre-emption when using node labels with queue 
> priority.
> Test configuration:
> queue A (capacity=50, priority=1)
> queue B (capacity=50, priority=2)
> both have accessible-node-labels set to x
> A.accessible-node-labels.x.capacity = 50
> B.accessible-node-labels.x.capacity = 50
> Along with this pre-emption related properties have been set.
> Test steps:
>  - Set NM memory = 6000MB and containerMemory = 750MB
>  - Submit an application A1 to B, with am-container = container = 
> (6000-750-1500), no. of containers = 2
>  - Submit an application A2 to A, with am-container = 750, container = 1500, 
> no of containers = (NUM_NM-1)
>  - Kill application A1
>  - Submit an application A3 to B with am-container=container=5000, no. of 
> containers=3
>  - Expectation is that containers are pre-empted from application A2 to A3 
> but there is no container pre-emption happening
> Container pre-emption is stuck with the message in the RM log,
> {noformat}
> 2018-02-02 11:41:36,974 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2673)) - Allocation proposal accepted
> 2018-02-02 11:41:36,984 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:allocateContainerOnSingleNode(1391)) - Trying to 
> fulfill reservation for application application_1517571510094_0003 on node: 
> XX:25454
> 2018-02-02 11:41:36,984 INFO allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(97)) - 
> Reserved container application=application_1517571510094_0003 
> resource= 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@3f04848e
>  cluster=
> 2018-02-02 11:41:36,984 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2673)) - Allocation proposal accepted
> 2018-02-02 11:41:36,984 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:allocateContainerOnSingleNode(1391)) - Trying to 
> fulfill reservation for application application_1517571510094_0003 on node: 
> XX:25454
> 2018-02-02 11:41:36,984 INFO allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(97)) - 
> Reserved container application=application_1517571510094_0003 
> resource= 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@3f04848e
>  cluster=
> 2018-02-02 11:41:36,984 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2673)) - Allocation proposal accepted
> 2018-02-02 11:41:36,994 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:allocateContainerOnSingleNode(1391)) - Trying to 
> fulfill reservation for application application_1517571510094_0003 on node: 
> XX:25454
> 2018-02-02 11:41:36,995 INFO allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(97)) - 
> Reserved container application=application_1517571510094_0003 
> resource= 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@3f04848e
>  cluster={noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8138) Add unit test to validate queue priority preemption works under node partition.

2018-04-11 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8138:
-
Summary: Add unit test to validate queue priority preemption works under 
node partition.  (was: No containers pre-empted from another queue when using 
node labels)

> Add unit test to validate queue priority preemption works under node 
> partition.
> ---
>
> Key: YARN-8138
> URL: https://issues.apache.org/jira/browse/YARN-8138
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Charan Hebri
>Assignee: Zian Chen
>Priority: Minor
> Attachments: YARN-8138.001.patch, YARN-8138.002.patch
>
>
> There seems to be an issue with pre-emption when using node labels with queue 
> priority.
> Test configuration:
> queue A (capacity=50, priority=1)
> queue B (capacity=50, priority=2)
> both have accessible-node-labels set to x
> A.accessible-node-labels.x.capacity = 50
> B.accessible-node-labels.x.capacity = 50
> Along with this pre-emption related properties have been set.
> Test steps:
>  - Set NM memory = 6000MB and containerMemory = 750MB
>  - Submit an application A1 to B, with am-container = container = 
> (6000-750-1500), no. of containers = 2
>  - Submit an application A2 to A, with am-container = 750, container = 1500, 
> no of containers = (NUM_NM-1)
>  - Kill application A1
>  - Submit an application A3 to B with am-container=container=5000, no. of 
> containers=3
>  - Expectation is that containers are pre-empted from application A2 to A3 
> but there is no container pre-emption happening
> Container pre-emption is stuck with the message in the RM log,
> {noformat}
> 2018-02-02 11:41:36,974 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2673)) - Allocation proposal accepted
> 2018-02-02 11:41:36,984 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:allocateContainerOnSingleNode(1391)) - Trying to 
> fulfill reservation for application application_1517571510094_0003 on node: 
> XX:25454
> 2018-02-02 11:41:36,984 INFO allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(97)) - 
> Reserved container application=application_1517571510094_0003 
> resource= 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@3f04848e
>  cluster=
> 2018-02-02 11:41:36,984 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2673)) - Allocation proposal accepted
> 2018-02-02 11:41:36,984 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:allocateContainerOnSingleNode(1391)) - Trying to 
> fulfill reservation for application application_1517571510094_0003 on node: 
> XX:25454
> 2018-02-02 11:41:36,984 INFO allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(97)) - 
> Reserved container application=application_1517571510094_0003 
> resource= 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@3f04848e
>  cluster=
> 2018-02-02 11:41:36,984 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2673)) - Allocation proposal accepted
> 2018-02-02 11:41:36,994 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:allocateContainerOnSingleNode(1391)) - Trying to 
> fulfill reservation for application application_1517571510094_0003 on node: 
> XX:25454
> 2018-02-02 11:41:36,995 INFO allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(97)) - 
> Reserved container application=application_1517571510094_0003 
> resource= 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@3f04848e
>  cluster={noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8138) No containers pre-empted from another queue when using node labels

2018-04-11 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8138:
-
Priority: Minor  (was: Blocker)

> No containers pre-empted from another queue when using node labels
> --
>
> Key: YARN-8138
> URL: https://issues.apache.org/jira/browse/YARN-8138
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Charan Hebri
>Assignee: Zian Chen
>Priority: Minor
> Attachments: YARN-8138.001.patch, YARN-8138.002.patch
>
>
> There seems to be an issue with pre-emption when using node labels with queue 
> priority.
> Test configuration:
> queue A (capacity=50, priority=1)
> queue B (capacity=50, priority=2)
> both have accessible-node-labels set to x
> A.accessible-node-labels.x.capacity = 50
> B.accessible-node-labels.x.capacity = 50
> Along with this pre-emption related properties have been set.
> Test steps:
>  - Set NM memory = 6000MB and containerMemory = 750MB
>  - Submit an application A1 to B, with am-container = container = 
> (6000-750-1500), no. of containers = 2
>  - Submit an application A2 to A, with am-container = 750, container = 1500, 
> no of containers = (NUM_NM-1)
>  - Kill application A1
>  - Submit an application A3 to B with am-container=container=5000, no. of 
> containers=3
>  - Expectation is that containers are pre-empted from application A2 to A3 
> but there is no container pre-emption happening
> Container pre-emption is stuck with the message in the RM log,
> {noformat}
> 2018-02-02 11:41:36,974 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2673)) - Allocation proposal accepted
> 2018-02-02 11:41:36,984 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:allocateContainerOnSingleNode(1391)) - Trying to 
> fulfill reservation for application application_1517571510094_0003 on node: 
> XX:25454
> 2018-02-02 11:41:36,984 INFO allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(97)) - 
> Reserved container application=application_1517571510094_0003 
> resource= 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@3f04848e
>  cluster=
> 2018-02-02 11:41:36,984 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2673)) - Allocation proposal accepted
> 2018-02-02 11:41:36,984 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:allocateContainerOnSingleNode(1391)) - Trying to 
> fulfill reservation for application application_1517571510094_0003 on node: 
> XX:25454
> 2018-02-02 11:41:36,984 INFO allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(97)) - 
> Reserved container application=application_1517571510094_0003 
> resource= 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@3f04848e
>  cluster=
> 2018-02-02 11:41:36,984 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2673)) - Allocation proposal accepted
> 2018-02-02 11:41:36,994 INFO capacity.CapacityScheduler 
> (CapacityScheduler.java:allocateContainerOnSingleNode(1391)) - Trying to 
> fulfill reservation for application application_1517571510094_0003 on node: 
> XX:25454
> 2018-02-02 11:41:36,995 INFO allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(97)) - 
> Reserved container application=application_1517571510094_0003 
> resource= 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@3f04848e
>  cluster={noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8018) Yarn Service Upgrade: Add support for initiating service upgrade

2018-04-11 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434524#comment-16434524
 ] 

Wangda Tan commented on YARN-8018:
--

Thanks [~eyang], I think the service-related APIs as a whole are marked as 
unstable in the 3.1.0 release. 

It will be fine to include incomplete fixes to native service as long as they 
comply with the Hadoop compatibility policy. I would prefer to make some of 
these changes earlier rather than doing them after 2 months and missing 
dependencies / fixes, etc. 

Thoughts?

> Yarn Service Upgrade: Add support for initiating service upgrade
> 
>
> Key: YARN-8018
> URL: https://issues.apache.org/jira/browse/YARN-8018
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8018-branch-3.1.007.patch, YARN-8018.001.patch, 
> YARN-8018.002.patch, YARN-8018.003.patch, YARN-8018.004.patch, 
> YARN-8018.005.patch, YARN-8018.006.patch, YARN-8018.007.patch
>
>
> Add support for initiating service upgrade which includes the following main 
> changes:
>  # Service API to initiate upgrade
>  # Persist service version on hdfs
>  # Start the upgraded version of service



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8127) Resource leak when async scheduling is enabled

2018-04-11 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434388#comment-16434388
 ] 

Wangda Tan commented on YARN-8127:
--

Nice catch! Thanks [~Tao Yang] / [~cheersyang]!

> Resource leak when async scheduling is enabled
> --
>
> Key: YARN-8127
> URL: https://issues.apache.org/jira/browse/YARN-8127
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Weiwei Yang
>Assignee: Tao Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8127.001.patch, YARN-8127.002.patch, 
> YARN-8127.003.patch, YARN-8127.004.patch
>
>
> Brief steps to reproduce
>  # Enable async scheduling, 5 threads
>  # Submit a lot of jobs trying to exhaust cluster resource
>  # After a while, observed NM allocated resource is more than resource 
> requested by allocated containers
> Looks like the commit phase is not synchronized when handling reserved 
> containers, causing some proposals to be incorrectly accepted; subsequently, 
> resources were deducted multiple times for a container.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler

2018-04-11 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434342#comment-16434342
 ] 

Wangda Tan commented on YARN-8149:
--

[~jlowe] / [~eepayne] / [~cheersyang] / [~Tao Yang] / [~sunilg]. 

Could you share your thoughts on this? If we can remove it, the reservation 
logic can be simplified a lot.

> Revisit behavior of Re-Reservation in Capacity Scheduler
> 
>
> Key: YARN-8149
> URL: https://issues.apache.org/jira/browse/YARN-8149
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Frankly speaking, I'm not sure why we need the re-reservation. The formula is 
> not that easy to understand:
> Inside: 
> {{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}
> {code:java}
> starvation = (re-reservation / #reserved-container) * 
>  (1 - min(requested-resource / max-alloc, 
>   (max-alloc - min-alloc) / max-alloc))
> should_allocate = starvation + requiredContainers - reservedContainers > 0{code}
> I think we should be able to remove the starvation computation; just checking 
> requiredContainers > reservedContainers should be enough.
> In a large cluster, we can easily overflow re-reservation to MAX_INT, see 
> YARN-7636. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8149) Revisit behavior of Re-Reservation in Capacity Scheduler

2018-04-11 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8149:


 Summary: Revisit behavior of Re-Reservation in Capacity Scheduler
 Key: YARN-8149
 URL: https://issues.apache.org/jira/browse/YARN-8149
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


Frankly speaking, I'm not sure why we need the re-reservation. The formula is 
not that easy to understand:

Inside: 
{{org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#shouldAllocOrReserveNewContainer}}
{code:java}
starvation = (re-reservation / #reserved-container) * 
 (1 - min(requested-resource / max-alloc, 
  (max-alloc - min-alloc) / max-alloc))
should_allocate = starvation + requiredContainers - reservedContainers > 0{code}
I think we should be able to remove the starvation computation; just checking 
requiredContainers > reservedContainers should be enough.

In a large cluster, we can easily overflow re-reservation to MAX_INT, see 
YARN-7636. 
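
To make the proposal concrete, a minimal sketch of the simplified check 
(signature assumed for illustration; not an actual patch):
{code:java}
// Proposed: allocate or reserve a new container only when outstanding
// asks exceed what is already reserved for this scheduler key, instead
// of adding a starvation bonus derived from the re-reservation count.
boolean shouldAllocOrReserveNewContainer(int requiredContainers,
    int reservedContainers) {
  return requiredContainers > reservedContainers;
}
{code}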

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7142) Support placement policy in yarn native services

2018-04-11 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434284#comment-16434284
 ] 

Wangda Tan commented on YARN-7142:
--

[~cheersyang], thanks for reviewing this Jira. I agree with [~gsaha]: unlike DS, 
which is mostly for dev testing, the placement spec of native service should be 
clearer, and the one proposed in this Jira is easier for end users to use than 
the DS spec. 

Currently we're planning to backport several dependencies to branch-3.1 so that 
YARN-7142 can be backported w/o modification, which reduces divergence in the 
native service implementation between trunk and branch-3.1. Once YARN-8118 is 
backported, we can backport this one.

> Support placement policy in yarn native services
> 
>
> Key: YARN-7142
> URL: https://issues.apache.org/jira/browse/YARN-7142
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Billie Rinaldi
>Assignee: Gour Saha
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7142-branch-3.1.004.patch, YARN-7142.001.patch, 
> YARN-7142.002.patch, YARN-7142.003.patch, YARN-7142.004.patch
>
>
> Placement policy exists in the API but is not implemented yet.
> I have filed YARN-8074 to move the composite constraints implementation out 
> of this phase-1 implementation of placement policy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7402) Federation V2: Global Optimizations

2018-04-10 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16433404#comment-16433404
 ] 

Wangda Tan commented on YARN-7402:
--

[~curino] / [~subru], thanks for working on this improvement. Is there any 
design/explanation doc so we can understand the overall idea and scope?

> Federation V2: Global Optimizations
> ---
>
> Key: YARN-7402
> URL: https://issues.apache.org/jira/browse/YARN-7402
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: federation
>Reporter: Carlo Curino
>Assignee: Carlo Curino
>Priority: Major
>
> YARN Federation today requires manual configuration of queues within each 
> sub-cluster, and each RM operates "in isolation". This has a few issues:
> # Preemption is computed locally (and might far exceed the global need)
> # Jobs within a queue are forced to consume their resources "evenly" based on 
> queue mapping
> This umbrella JIRA tracks a new feature that leverages the 
> FederationStateStore as a synchronization mechanism among RMs, and allows for 
> allocation and preemption decisions to be based on a (close to up-to-date) 
> global view of the cluster allocation and demand. The JIRA also tracks 
> algorithms to automatically generate policies for Router and AMRMProxy to 
> shape the traffic to each sub-cluster, and general "maintenance" of the 
> FederationStateStore.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8133) Doc link broken for yarn-service from overview page.

2018-04-10 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8133:
-
Fix Version/s: 3.2.0

> Doc link broken for yarn-service from overview page.
> 
>
> Key: YARN-8133
> URL: https://issues.apache.org/jira/browse/YARN-8133
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Rohith Sharma K S
>Assignee: Rohith Sharma K S
>Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8133.01.patch, YARN-8133.02.patch
>
>
> I see that the documentation link is broken from the overview page. 
> Clicking any link from the 
> http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html
>  page causes an error. 
> It looks like the Overview page redirects to a .md page which doesn't exist. 
> It should redirect to the *.html page



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8141) YARN Native Service: Respect YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec

2018-04-10 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8141:


 Summary: YARN Native Service: Respect 
YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS specified in service spec
 Key: YARN-8141
 URL: https://issues.apache.org/jira/browse/YARN-8141
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: Wangda Tan


The existing YARN native service overwrites 
YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS regardless of whether the 
user specified it in the service spec. It is important to allow users to mount 
local folders like /etc/passwd, etc.

The following logic overwrites the 
YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS environment variable:
{code:java}
StringBuilder sb = new StringBuilder();
for (Entry<String, String> mount : mountPaths.entrySet()) {
  if (sb.length() > 0) {
sb.append(",");
  }
  sb.append(mount.getKey());
  sb.append(":");
  sb.append(mount.getValue());
}
env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS", 
sb.toString());{code}
Inside AbstractLauncher.java
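
One possible direction (a sketch only, not the attached patch; the merge 
behavior and separator handling are assumptions) is to append the generated 
mounts to whatever the user already set instead of replacing the variable:
{code:java}
// Sketch: merge generated mounts with a user-supplied value rather than
// overwriting it. Assumes the same comma-separated src:dest format.
String userMounts =
    env.get("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS");
StringBuilder sb =
    new StringBuilder(userMounts == null ? "" : userMounts);
for (Entry<String, String> mount : mountPaths.entrySet()) {
  if (sb.length() > 0) {
    sb.append(",");
  }
  sb.append(mount.getKey()).append(":").append(mount.getValue());
}
env.put("YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS",
    sb.toString());
{code}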



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7494) Add multi node lookup support for better placement

2018-04-10 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432970#comment-16432970
 ] 

Wangda Tan commented on YARN-7494:
--

Thanks [~sunilg], 

In general the change looks good. Could you check the UT failures?

[~cheersyang], please commit the patch once you think it is ready.

> Add multi node lookup support for better placement
> -
>
> Key: YARN-7494
> URL: https://issues.apache.org/jira/browse/YARN-7494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Major
> Attachments: YARN-7494.001.patch, YARN-7494.002.patch, 
> YARN-7494.003.patch, YARN-7494.004.patch, YARN-7494.005.patch, 
> YARN-7494.006.patch, YARN-7494.v0.patch, YARN-7494.v1.patch, 
> multi-node-designProposal.png
>
>
> Instead of a single node, for effectiveness we can consider a multi-node lookup 
> based on partition to start with.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8116) Nodemanager fails with NumberFormatException: For input string: ""

2018-04-10 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432938#comment-16432938
 ] 

Wangda Tan commented on YARN-8116:
--

+1, thanks [~csingh], will commit shortly.

> Nodemanager fails with NumberFormatException: For input string: ""
> --
>
> Key: YARN-8116
> URL: https://issues.apache.org/jira/browse/YARN-8116
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-8116.001.patch, YARN-8116.002.patch
>
>
> Steps followed.
> 1) Update nodemanager debug delay config
> {code}
> <property>
>   <name>yarn.nodemanager.delete.debug-delay-sec</name>
>   <value>350</value>
> </property>
> {code}
> 2) Launch distributed shell application multiple times
> {code}
> /usr/hdp/current/hadoop-yarn-client/bin/yarn  jar 
> hadoop-yarn-applications-distributedshell-*.jar  -shell_command "sleep 120" 
> -num_containers 1 -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos/httpd-24-centos7:latest -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true -jar 
> hadoop-yarn-applications-distributedshell-*.jar{code}
> 3) restart NM
> Nodemanager fails to start with below error.
> {code:title=NM log}
> 2018-03-23 21:32:14,437 INFO  monitor.ContainersMonitorImpl 
> (ContainersMonitorImpl.java:serviceInit(181)) - ContainersMonitor enabled: 
> true
> 2018-03-23 21:32:14,439 INFO  logaggregation.LogAggregationService 
> (LogAggregationService.java:serviceInit(130)) - rollingMonitorInterval is set 
> as 3600. The logs will be aggregated every 3600 seconds
> 2018-03-23 21:32:14,455 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  failed in state INITED
> java.lang.NumberFormatException: For input string: ""
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:601)
>   at java.lang.Long.parseLong(Long.java:631)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState(NMLeveldbStateStoreService.java:350)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState(NMLeveldbStateStoreService.java:253)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:365)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:464)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:899)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:960)
> 2018-03-23 21:32:14,458 INFO  logaggregation.LogAggregationService 
> (LogAggregationService.java:serviceStop(148)) - 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
>  waiting for pending aggregation during exit
> 2018-03-23 21:32:14,460 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state 
> INITED
> java.lang.NumberFormatException: For input string: ""
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:601)
>   at java.lang.Long.parseLong(Long.java:631)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState(NMLeveldbStateStoreService.java:350)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState(NMLeveldbStateStoreService.java:253)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:365)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:464)
>   at 
> 

[jira] [Commented] (YARN-7530) hadoop-yarn-services-api should be part of hadoop-yarn-services

2018-04-10 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432934#comment-16432934
 ] 

Wangda Tan commented on YARN-7530:
--

[~eyang], thanks for sharing your thoughts.

To me, the current scope of native service is already beyond a single / 
self-contained app on YARN:

1) The YARN Service API is part of the RM. 

2) After YARN-8048, system services can be deployed before running any other 
applications.

I think we should move the API / client code to proper places so that we avoid 
loading native service client / API logic via reflection.

This doesn't block anything for now, but I think it will be important to clean 
it up to get more contributions from the community.

> hadoop-yarn-services-api should be part of hadoop-yarn-services
> ---
>
> Key: YARN-7530
> URL: https://issues.apache.org/jira/browse/YARN-7530
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Chandni Singh
>Priority: Trivial
> Fix For: yarn-native-services
>
> Attachments: YARN-7530.001.patch
>
>
> Hadoop-yarn-services-api is currently a parallel project to the 
> hadoop-yarn-services project.  It would be better if hadoop-yarn-services-api 
> were part of hadoop-yarn-services for correctness.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7974) Allow updating application tracking url after registration

2018-04-10 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432922#comment-16432922
 ] 

Wangda Tan commented on YARN-7974:
--

[~jhung],

Thanks for working on the feature, I can see its value. 

For implementation / API:

1) Have you considered only allowing the AM to update the tracking URL? That 
would address a couple of problems: a. the ACL check needed to authorize the 
change, and b. issues caused by concurrent writes of the tracking URL.

2) I think the updated tracking URL needs to be persisted as well, otherwise an 
RM restart will clear the updated information.
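
To make 1) concrete, a hypothetical AM-side call could look like the following 
(the method name comes from the description below; the exact signature and 
where it is exposed are assumptions, not the actual patch):
{code:java}
// Hypothetical sketch: once the in-container UI is up, the AM pushes the
// new tracking URL through a client wrapper over the proposed
// ApplicationClientProtocol#updateTrackingUrl API.
YarnClient yarnClient = YarnClient.createYarnClient();
yarnClient.init(conf);
yarnClient.start();
yarnClient.updateTrackingUrl(appId, "http://container-host:8080/ui");
{code}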

> Allow updating application tracking url after registration
> --
>
> Key: YARN-7974
> URL: https://issues.apache.org/jira/browse/YARN-7974
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-7974.001.patch, YARN-7974.002.patch
>
>
> Normally an application's tracking url is set on AM registration. We have a 
> use case for updating the tracking url after registration (e.g. the UI is 
> hosted on one of the containers).
> Currently we added a {{updateTrackingUrl}} API to ApplicationClientProtocol.
> We'll post the patch soon, assuming there are no issues with this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-10 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432492#comment-16432492
 ] 

Wangda Tan commented on YARN-8135:
--

[~oliverhuh...@gmail.com], 

There are no technical issues preventing a TF application from accessing HDFS. 
But it is a real overhead to use HDFS if the user has no prior Hadoop 
experience [https://www.tensorflow.org/deploy/hadoop]. We just want to make 
this step easier. 

 

[~asuresh], 

Thanks for your interest in this project. I'm not sure whether it should be 
Hadoop-Submarine or YARN-Submarine; let's decide once I finish the design. 

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: image-2018-04-09-14-35-16-778.png, 
> image-2018-04-09-14-44-41-101.png
>
>
> Description:
> *Goals:*
>  - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access to data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support running distributed Tensorflow jobs with simple configs.
>  - Support running user-specified Docker images.
>  - Support specifying GPU and other resources.
>  - Support launching Tensorboard if the user requests it.
>  - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle that can let humans explore deep 
> places. B-)
> Compare to other projects:
> !image-2018-04-09-14-44-41-101.png!
> *Notes:*
> * GPU Isolation of the XLearning project is achieved by a patched YARN, which 
> is different from the community’s GPU isolation solution.
> ** XLearning needs a few modifications to read ClusterSpec from env.
> *References:*
>  - TensorflowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
>  - TensorFlowOnYARN (Intel): 
> [https://github.com/Intel-bigdata/TensorFlowOnYARN]
>  - Spark Deep Learning (Databricks): 
> [https://github.com/databricks/spark-deep-learning]
>  - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
>  - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8079) Support specifying files to be downloaded (localized) before containers are launched by YARN

2018-04-09 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8079:
-
Summary: Support specifying files to be downloaded (localized) before 
containers are launched by YARN  (was: YARN native service should respect source 
file of ConfigFile inside Service/Component spec)

> Support specifying files to be downloaded (localized) before containers are 
> launched by YARN
> -
>
> Key: YARN-8079
> URL: https://issues.apache.org/jira/browse/YARN-8079
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8079.001.patch, YARN-8079.002.patch, 
> YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch
>
>
> Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
> read srcFile; instead it always constructs {{remoteFile}} by using 
> compInstanceDir and the fileName of {{destFile}}:
> {code}
> Path remoteFile = new Path(compInstanceDir, fileName);
> {code} 
> To me it is a common use case that services have files in HDFS which need to 
> be localized when components get launched. For example, if we want to serve a 
> Tensorflow model, we need to localize the model (typically not huge, less than 
> a GB) to local disk; otherwise the launched Docker container has to access 
> HDFS.
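
A minimal sketch of the implied fix, assuming the {{ConfigFile#getSrcFile}} 
accessor (the attached patches may differ in details):
{code:java}
// Sketch: prefer the user-provided srcFile when present, and only fall
// back to the per-instance directory when it is absent.
Path remoteFile;
String srcFile = configFile.getSrcFile();
if (srcFile != null && !srcFile.isEmpty()) {
  remoteFile = new Path(srcFile);
} else {
  remoteFile = new Path(compInstanceDir, fileName);
}
{code}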



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7530) hadoop-yarn-services-api should be part of hadoop-yarn-services

2018-04-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431548#comment-16431548
 ] 

Wangda Tan commented on YARN-7530:
--

A quick proposal for this:

 
- ApiServerClient/ServiceClient -> yarn-client
- ApiServer/WebApp -> yarn-server/native-service
- hadoop-yarn-services-core/api -> yarn-api/common

> hadoop-yarn-services-api should be part of hadoop-yarn-services
> ---
>
> Key: YARN-7530
> URL: https://issues.apache.org/jira/browse/YARN-7530
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Chandni Singh
>Priority: Trivial
> Fix For: yarn-native-services
>
> Attachments: YARN-7530.001.patch
>
>
> Hadoop-yarn-services-api is currently a parallel project to 
> hadoop-yarn-services project.  It would be better if hadoop-yarn-services-api 
> is part of hadoop-yarn-services for correctness.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431369#comment-16431369
 ] 

Wangda Tan commented on YARN-8135:
--

[~oliverhuh...@gmail.com], 

Thanks for the responses, 
{quote}what does w/o modification mean ?
{quote}
Without modifying a vanilla TF program in order to run it on the framework.
{quote}As far as Kubeflow is deployed in the same cluster as Hadoop, Kubeflow 
should be able to access HDFS, through libhdfs or webhdfs interface?
{quote}
Since Tensorflow supports reading from HDFS, ideally every platform can support 
this :). What I meant here is that making TF read HDFS needs lots of 
configuration, and some specific optimizations / considerations are needed to 
make HDFS access from a Docker container easier. Our ongoing prototype covers 
some of this. 
{quote}ToS kind of supports GPU scheduling (not isolation) base on memory: if 
you ask for 1 GPU and a machine has 4 GPU, it asks for total memory * the 
portion of GPU you asked.
{quote}
This is not easy for users and cannot guarantee proper isolation, so I didn't 
put a (√) for ToS.

 

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: image-2018-04-09-14-35-16-778.png, 
> image-2018-04-09-14-44-41-101.png
>
>
> Description:
> *Goals:*
>  - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access to data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support running distributed Tensorflow jobs with simple configs.
>  - Support running user-specified Docker images.
>  - Support specifying GPU and other resources.
>  - Support launching Tensorboard if the user requests it.
>  - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle that can let humans explore deep 
> places. B-)
> Compare to other projects:
> !image-2018-04-09-14-44-41-101.png!
> *Notes:*
> * GPU Isolation of the XLearning project is achieved by a patched YARN, which 
> is different from the community’s GPU isolation solution.
> ** XLearning needs a few modifications to read ClusterSpec from env.
> *References:*
>  - TensorflowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
>  - TensorFlowOnYARN (Intel): 
> [https://github.com/Intel-bigdata/TensorFlowOnYARN]
>  - Spark Deep Learning (Databricks): 
> [https://github.com/databricks/spark-deep-learning]
>  - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
>  - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-09 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8135:
-
Description: 
Description:

*Goals:*
 - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs on 
YARN.
 - Allow jobs easy access to data/models in HDFS and other storages.
 - Can launch services to serve Tensorflow/MXNet models.
 - Support running distributed Tensorflow jobs with simple configs.
 - Support running user-specified Docker images.
 - Support specifying GPU and other resources.
 - Support launching Tensorboard if the user requests it.
 - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because Submarine is the only vehicle that can let humans explore deep places. 
B-)

Compare to other projects:

!image-2018-04-09-14-44-41-101.png!

*Notes:*

* GPU Isolation of the XLearning project is achieved by a patched YARN, which is 
different from the community’s GPU isolation solution.

** XLearning needs a few modifications to read ClusterSpec from env.

*References:*
 - TensorflowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
 - TensorFlowOnYARN (Intel): [https://github.com/Intel-bigdata/TensorFlowOnYARN]
 - Spark Deep Learning (Databricks): 
[https://github.com/databricks/spark-deep-learning]
 - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
 - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]

  was:
Description:

*Goals:*
 - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs on 
YARN.
 - Allow jobs easy access to data/models in HDFS and other storages.
 - Can launch services to serve Tensorflow/MXNet models.
 - Support running distributed Tensorflow jobs with simple configs.
 - Support running user-specified Docker images.
 - Support specifying GPU and other resources.
 - Support launching Tensorboard if the user requests it.
 - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because Submarine is the only vehicle that can take humans to deep places. B-)

Compare to other projects:

!image-2018-04-09-14-44-41-101.png!

*Notes:*

* GPU Isolation of the XLearning project is achieved by a patched YARN, which is 
different from the community’s GPU isolation solution.

** XLearning needs a few modifications to read ClusterSpec from env.

*References:*
 - TensorflowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
 - TensorFlowOnYARN (Intel): [https://github.com/Intel-bigdata/TensorFlowOnYARN]
 - Spark Deep Learning (Databricks): 
[https://github.com/databricks/spark-deep-learning]
 - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
 - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]


> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: image-2018-04-09-14-35-16-778.png, 
> image-2018-04-09-14-44-41-101.png
>
>
> Description:
> *Goals:*
>  - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access to data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support running distributed Tensorflow jobs with simple configs.
>  - Support running user-specified Docker images.
>  - Support specifying GPU and other resources.
>  - Support launching Tensorboard if the user requests it.
>  - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle that can let humans explore deep 
> places. B-)
> Compare to other projects:
> !image-2018-04-09-14-44-41-101.png!
> *Notes:*
> * GPU Isolation of the XLearning project is achieved by a patched YARN, which 
> is different from the community’s GPU isolation solution.
> ** XLearning needs a few modifications to read ClusterSpec from env.
> *References:*
>  - TensorflowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
>  - TensorFlowOnYARN (Intel): 
> [https://github.com/Intel-bigdata/TensorFlowOnYARN]
>  - Spark Deep Learning (Databricks): 
> [https://github.com/databricks/spark-deep-learning]
>  - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
>  - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-09 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8135:
-
Description: 
Description:

*Goals:*
 - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs on 
YARN.
 - Allow jobs easy access to data/models in HDFS and other storages.
 - Can launch services to serve Tensorflow/MXNet models.
 - Support running distributed Tensorflow jobs with simple configs.
 - Support running user-specified Docker images.
 - Support specifying GPU and other resources.
 - Support launching Tensorboard if the user requests it.
 - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because Submarine is the only vehicle that can take humans to deep places. B-)

Compare to other projects:

!image-2018-04-09-14-44-41-101.png!

*Notes:*

* GPU Isolation of the XLearning project is achieved by a patched YARN, which is 
different from the community’s GPU isolation solution.

** XLearning needs a few modifications to read ClusterSpec from env.

*References:*
 - TensorflowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
 - TensorFlowOnYARN (Intel): [https://github.com/Intel-bigdata/TensorFlowOnYARN]
 - Spark Deep Learning (Databricks): 
[https://github.com/databricks/spark-deep-learning]
 - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
 - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]

  was:
Description:

*Goals:*
 - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs on 
YARN.
 - Allow jobs easy access to data/models in HDFS and other storages.
 - Can launch services to serve Tensorflow/MXNet models.
 - Support running distributed Tensorflow jobs with simple configs.
 - Support running user-specified Docker images.
 - Support specifying GPU and other resources.
 - Support launching Tensorboard if the user requests it.
 - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because Submarine is the only vehicle that can take humans to deep places. B-)

Compare to other projects:

!image-2018-04-09-14-35-16-778.png!

*Notes:*

* GPU Isolation of the XLearning project is achieved by a patched YARN, which is 
different from the community’s GPU isolation solution.

** XLearning needs a few modifications to read ClusterSpec from env.

*References:*

- TensorflowOnSpark (Yahoo): https://github.com/yahoo/TensorFlowOnSpark
- TensorFlowOnYARN (Intel): https://github.com/Intel-bigdata/TensorFlowOnYARN
- Spark Deep Learning (Databricks): 
https://github.com/databricks/spark-deep-learning
- XLearning (Qihoo360): https://github.com/Qihoo360/XLearning
- Kubeflow (Google): https://github.com/kubeflow/kubeflow


> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: image-2018-04-09-14-35-16-778.png, 
> image-2018-04-09-14-44-41-101.png
>
>
> Description:
> *Goals:*
>  - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access to data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support running distributed Tensorflow jobs with simple configs.
>  - Support running user-specified Docker images.
>  - Support specifying GPU and other resources.
>  - Support launching Tensorboard if the user requests it.
>  - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle that can take humans to deep places. B-)
> Compare to other projects:
> !image-2018-04-09-14-44-41-101.png!
> *Notes:*
> * GPU Isolation of the XLearning project is achieved by a patched YARN, which 
> is different from the community’s GPU isolation solution.
> ** XLearning needs a few modifications to read ClusterSpec from env.
> *References:*
>  - TensorflowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
>  - TensorFlowOnYARN (Intel): 
> [https://github.com/Intel-bigdata/TensorFlowOnYARN]
>  - Spark Deep Learning (Databricks): 
> [https://github.com/databricks/spark-deep-learning]
>  - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
>  - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-09 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8135:
-
Attachment: image-2018-04-09-14-44-41-101.png

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: image-2018-04-09-14-35-16-778.png, 
> image-2018-04-09-14-44-41-101.png
>
>
> Description:
> *Goals:*
>  - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access to data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support running distributed Tensorflow jobs with simple configs.
>  - Support running user-specified Docker images.
>  - Support specifying GPU and other resources.
>  - Support launching Tensorboard if the user requests it.
>  - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle that can take humans to deep places. B-)
> Compare to other projects:
> !image-2018-04-09-14-35-16-778.png!
> *Notes:*
> * GPU Isolation of the XLearning project is achieved by a patched YARN, which 
> is different from the community’s GPU isolation solution.
> ** XLearning needs a few modifications to read ClusterSpec from env.
> *References:*
> - TensorflowOnSpark (Yahoo): https://github.com/yahoo/TensorFlowOnSpark
> - TensorFlowOnYARN (Intel): https://github.com/Intel-bigdata/TensorFlowOnYARN
> - Spark Deep Learning (Databricks): 
> https://github.com/databricks/spark-deep-learning
> - XLearning (Qihoo360): https://github.com/Qihoo360/XLearning
> - Kubeflow (Google): https://github.com/kubeflow/kubeflow



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431333#comment-16431333
 ] 

Wangda Tan commented on YARN-8135:
--

I'm currently working on a design doc and a prototype; I will share more details 
in the next several days.

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: image-2018-04-09-14-35-16-778.png
>
>
> Description:
> *Goals:*
>  - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs easy access to data/models in HDFS and other storages.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support running distributed Tensorflow jobs with simple configs.
>  - Support running user-specified Docker images.
>  - Support specifying GPU and other resources.
>  - Support launching Tensorboard if the user requests it.
>  - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle that can take humans to deep places. B-)
> Compare to other projects:
> !image-2018-04-09-14-35-16-778.png!
> *Notes:*
> * GPU Isolation of the XLearning project is achieved by a patched YARN, which 
> is different from the community’s GPU isolation solution.
> ** XLearning needs a few modifications to read ClusterSpec from env.
> *References:*
> - TensorflowOnSpark (Yahoo): https://github.com/yahoo/TensorFlowOnSpark
> - TensorFlowOnYARN (Intel): https://github.com/Intel-bigdata/TensorFlowOnYARN
> - Spark Deep Learning (Databricks): 
> https://github.com/databricks/spark-deep-learning
> - XLearning (Qihoo360): https://github.com/Qihoo360/XLearning
> - Kubeflow (Google): https://github.com/kubeflow/kubeflow



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-04-09 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8135:


 Summary: Hadoop {Submarine} Project: Simple and scalable 
deployment of deep learning training / serving jobs on Hadoop
 Key: YARN-8135
 URL: https://issues.apache.org/jira/browse/YARN-8135
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Wangda Tan
Assignee: Wangda Tan
 Attachments: image-2018-04-09-14-35-16-778.png

Description:

*Goals:*
 - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs on 
YARN.
 - Allow jobs easy access to data/models in HDFS and other storages.
 - Can launch services to serve Tensorflow/MXNet models.
 - Support running distributed Tensorflow jobs with simple configs.
 - Support running user-specified Docker images.
 - Support specifying GPU and other resources.
 - Support launching Tensorboard if the user requests it.
 - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because Submarine is the only vehicle that can take humans to deep places. B-)

Compare to other projects:

!image-2018-04-09-14-35-16-778.png!

*Notes:*

* GPU Isolation of the XLearning project is achieved by a patched YARN, which is 
different from the community’s GPU isolation solution.

** XLearning needs a few modifications to read ClusterSpec from env.

*References:*

- TensorflowOnSpark (Yahoo): https://github.com/yahoo/TensorFlowOnSpark
- TensorFlowOnYARN (Intel): https://github.com/Intel-bigdata/TensorFlowOnYARN
- Spark Deep Learning (Databricks): 
https://github.com/databricks/spark-deep-learning
- XLearning (Qihoo360): https://github.com/Qihoo360/XLearning
- Kubeflow (Google): https://github.com/kubeflow/kubeflow



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8116) Nodemanager fails with NumberFormatException: For input string: ""

2018-04-09 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431063#comment-16431063
 ] 

Wangda Tan commented on YARN-8116:
--

[~csingh], thanks for working on the fix. It would be better to include a simple 
UT to avoid regressions, since this is on the critical path of NM recovery.
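
For context, a sketch of the kind of defensive read that would avoid the crash 
during recovery (names assumed, not the actual patch):
{code:java}
// Hypothetical sketch around NMLeveldbStateStoreService#loadContainerState:
// tolerate an empty stored value instead of failing the whole NM recovery
// with a NumberFormatException from Long.parseLong("").
String storedValue = asString(entry.getValue());  // "" in the bad case
long startTime = (storedValue == null || storedValue.isEmpty())
    ? 0L : Long.parseLong(storedValue);
{code}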

> Nodemanager fails with NumberFormatException: For input string: ""
> --
>
> Key: YARN-8116
> URL: https://issues.apache.org/jira/browse/YARN-8116
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Chandni Singh
>Priority: Critical
> Attachments: YARN-8116.001.patch
>
>
> Steps followed.
> 1) Update nodemanager debug delay config
> {code}
> <property>
>   <name>yarn.nodemanager.delete.debug-delay-sec</name>
>   <value>350</value>
> </property>
> {code}
> 2) Launch distributed shell application multiple times
> {code}
> /usr/hdp/current/hadoop-yarn-client/bin/yarn  jar 
> hadoop-yarn-applications-distributedshell-*.jar  -shell_command "sleep 120" 
> -num_containers 1 -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos/httpd-24-centos7:latest -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true -jar 
> hadoop-yarn-applications-distributedshell-*.jar{code}
> 3) restart NM
> Nodemanager fails to start with below error.
> {code:title=NM log}
> 2018-03-23 21:32:14,437 INFO  monitor.ContainersMonitorImpl 
> (ContainersMonitorImpl.java:serviceInit(181)) - ContainersMonitor enabled: 
> true
> 2018-03-23 21:32:14,439 INFO  logaggregation.LogAggregationService 
> (LogAggregationService.java:serviceInit(130)) - rollingMonitorInterval is set 
> as 3600. The logs will be aggregated every 3600 seconds
> 2018-03-23 21:32:14,455 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  failed in state INITED
> java.lang.NumberFormatException: For input string: ""
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:601)
>   at java.lang.Long.parseLong(Long.java:631)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState(NMLeveldbStateStoreService.java:350)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState(NMLeveldbStateStoreService.java:253)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:365)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:464)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:899)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:960)
> 2018-03-23 21:32:14,458 INFO  logaggregation.LogAggregationService 
> (LogAggregationService.java:serviceStop(148)) - 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
>  waiting for pending aggregation during exit
> 2018-03-23 21:32:14,460 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service NodeManager failed in state 
> INITED
> java.lang.NumberFormatException: For input string: ""
>   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>   at java.lang.Long.parseLong(Long.java:601)
>   at java.lang.Long.parseLong(Long.java:631)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState(NMLeveldbStateStoreService.java:350)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState(NMLeveldbStateStoreService.java:253)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:365)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> 

[jira] [Updated] (YARN-8133) Doc link broken for yarn-service from overview page.

2018-04-09 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8133:
-
Priority: Blocker  (was: Major)

> Doc link broken for yarn-service from overview page.
> 
>
> Key: YARN-8133
> URL: https://issues.apache.org/jira/browse/YARN-8133
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Rohith Sharma K S
>Priority: Blocker
> Attachments: YARN-8133.01.patch
>
>
> I see that the documentation link is broken from the overview page. 
> Clicking any link from the 
> http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html
>  page causes an error. 
> It looks like the Overview page redirects to a .md page which doesn't exist. 
> It should redirect to the *.html page



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8133) Doc link broken for yarn-service from overview page.

2018-04-09 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8133:
-
Target Version/s: 3.1.1

> Doc link broken for yarn-service from overview page.
> 
>
> Key: YARN-8133
> URL: https://issues.apache.org/jira/browse/YARN-8133
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Rohith Sharma K S
>Priority: Blocker
> Attachments: YARN-8133.01.patch
>
>
> I see that the documentation link is broken from the overview page. 
> Clicking any link from the 
> http://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html
>  page causes an error. 
> It looks like the Overview page redirects to a .md page which doesn't exist. 
> It should redirect to the *.html page



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-7142) Support placement policy in yarn native services

2018-04-05 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reopened YARN-7142:
--

Thanks [~gsaha], reopening to run Jenkins 

> Support placement policy in yarn native services
> 
>
> Key: YARN-7142
> URL: https://issues.apache.org/jira/browse/YARN-7142
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Billie Rinaldi
>Assignee: Gour Saha
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7142-branch-3.1.004.patch, YARN-7142.001.patch, 
> YARN-7142.002.patch, YARN-7142.003.patch, YARN-7142.004.patch
>
>
> Placement policy exists in the API but is not implemented yet.
> I have filed YARN-8074 to move the composite constraints implementation out 
> of this phase-1 implementation of placement policy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


