[jira] [Updated] (YARN-8881) [YARN-8851] Add basic pluggable device plugin framework

2018-11-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8881:
-
Fix Version/s: 3.3.0

> [YARN-8851] Add basic pluggable device plugin framework
> ---
>
> Key: YARN-8881
> URL: https://issues.apache.org/jira/browse/YARN-8881
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-8881-trunk.001.patch, YARN-8881-trunk.002.patch, 
> YARN-8881-trunk.003.patch, YARN-8881-trunk.004.patch, 
> YARN-8881-trunk.005.patch, YARN-8881-trunk.006.patch, 
> YARN-8881-trunk.007.patch, YARN-8881-trunk.008.patch, 
> YARN-8881-trunk.009.patch, YARN-8881-trunk.010.patch, 
> YARN-8881-trunk.011.patch, YARN-8881-trunk.012.patch
>
>
> It includes adding support in "ResourcePluginManager" to load plugin classes 
> based on configuration, an interface for vendors to implement, and an adapter 
> to decouple the plugin from YARN internals. Vendor device resource discovery 
> will be ready once this support is in place.
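For illustration only, here is a minimal sketch of the idea: a vendor-facing plugin 
interface plus a loader that instantiates plugin classes named in configuration. The 
interface, class, and property names below are hypothetical, not the actual YARN-8851 API.

{code:java}
// Illustrative sketch only: hypothetical vendor-facing interface and reflective loader.
import java.util.ArrayList;
import java.util.List;

interface DevicePlugin {                  // hypothetical interface for vendors to implement
  String getResourceName();               // e.g. "vendor.com/fpga"
  List<String> discoverDevices();         // vendor-specific device discovery
}

class DevicePluginLoader {
  // Loads plugin classes from a comma-separated list, e.g. read from an NM config property.
  static List<DevicePlugin> loadPlugins(String configuredClassNames) throws Exception {
    List<DevicePlugin> plugins = new ArrayList<>();
    for (String className : configuredClassNames.split(",")) {
      Class<?> clazz = Class.forName(className.trim());
      // The adapter described in the JIRA would wrap this instance so vendor code
      // stays decoupled from YARN internals.
      plugins.add((DevicePlugin) clazz.getDeclaredConstructor().newInstance());
    }
    return plugins;
  }
}
{code}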






[jira] [Updated] (YARN-8881) [YARN-8851] Add basic pluggable device plugin framework

2018-11-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8881:
-
Summary: [YARN-8851] Add basic pluggable device plugin framework  (was: 
Phase 1 - Add basic pluggable device plugin framework)

> [YARN-8851] Add basic pluggable device plugin framework
> ---
>
> Key: YARN-8881
> URL: https://issues.apache.org/jira/browse/YARN-8881
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8881-trunk.001.patch, YARN-8881-trunk.002.patch, 
> YARN-8881-trunk.003.patch, YARN-8881-trunk.004.patch, 
> YARN-8881-trunk.005.patch, YARN-8881-trunk.006.patch, 
> YARN-8881-trunk.007.patch, YARN-8881-trunk.008.patch, 
> YARN-8881-trunk.009.patch, YARN-8881-trunk.010.patch, 
> YARN-8881-trunk.011.patch, YARN-8881-trunk.012.patch
>
>
> It includes adding support in "ResourcePluginManager" to load plugin classes 
> based on configuration, an interface for vendors to implement, and an adapter 
> to decouple the plugin from YARN internals. Vendor device resource discovery 
> will be ready once this support is in place.






[jira] [Commented] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691972#comment-16691972
 ] 

Wangda Tan commented on YARN-8960:
--

+1, committing, thanks [~yuan_zac].

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch, YARN-8960.005.patch, 
> YARN-8960.006.patch, YARN-8960.007.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
> specified in the persisted service definition, fail to connect to AM.
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ... 68 more
> {code}






[jira] [Updated] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8299:
-
Target Version/s:   (was: 3.1.2)

> Yarn Service Upgrade: Add GET APIs that returns instances matching query 
> params
> ---
>
> Key: YARN-8299
> URL: https://issues.apache.org/jira/browse/YARN-8299
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8299-branch-3.1.001.patch, YARN-8299.001.patch, 
> YARN-8299.002.patch, YARN-8299.003.patch, YARN-8299.004.patch, 
> YARN-8299.005.patch
>
>
> We need APIs that return containers matching the query params. These are 
> needed so that we can find out which containers have been upgraded.
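As an illustration of the kind of query being added (the endpoint path and query 
parameter below are assumptions for this sketch, not necessarily the committed API):

{code:java}
// Illustration only: the REST path and the "states" query parameter are assumptions.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class QueryUpgradedInstances {
  public static void main(String[] args) throws Exception {
    // Assumed endpoint: list a service's component instances filtered by state.
    URI uri = URI.create(
        "http://rm-host:8088/app/v1/services/my-service/component-instances?states=UPGRADING");
    HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());  // JSON list of matching container instances
  }
}
{code}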






[jira] [Updated] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8299:
-
Fix Version/s: 3.1.2

> Yarn Service Upgrade: Add GET APIs that returns instances matching query 
> params
> ---
>
> Key: YARN-8299
> URL: https://issues.apache.org/jira/browse/YARN-8299
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0, 3.1.2
>
> Attachments: YARN-8299-branch-3.1.001.patch, YARN-8299.001.patch, 
> YARN-8299.002.patch, YARN-8299.003.patch, YARN-8299.004.patch, 
> YARN-8299.005.patch
>
>
> We need APIs that return containers matching the query params. These are 
> needed so that we can find out which containers have been upgraded.






[jira] [Commented] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params

2018-11-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689823#comment-16689823
 ] 

Wangda Tan commented on YARN-8299:
--

Committing to branch-3.1 now ..

> Yarn Service Upgrade: Add GET APIs that returns instances matching query 
> params
> ---
>
> Key: YARN-8299
> URL: https://issues.apache.org/jira/browse/YARN-8299
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: YARN-8299-branch-3.1.001.patch, YARN-8299.001.patch, 
> YARN-8299.002.patch, YARN-8299.003.patch, YARN-8299.004.patch, 
> YARN-8299.005.patch
>
>
> We need APIs that return containers matching the query params. These are 
> needed so that we can find out which containers have been upgraded.






[jira] [Updated] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8299:
-
Priority: Critical  (was: Major)

> Yarn Service Upgrade: Add GET APIs that returns instances matching query 
> params
> ---
>
> Key: YARN-8299
> URL: https://issues.apache.org/jira/browse/YARN-8299
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: YARN-8299-branch-3.1.001.patch, YARN-8299.001.patch, 
> YARN-8299.002.patch, YARN-8299.003.patch, YARN-8299.004.patch, 
> YARN-8299.005.patch
>
>
> We need APIs that return containers matching the query params. These are 
> needed so that we can find out which containers have been upgraded.






[jira] [Commented] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params

2018-11-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689805#comment-16689805
 ] 

Wangda Tan commented on YARN-8299:
--

Reopened to backport to 3.1.2

> Yarn Service Upgrade: Add GET APIs that returns instances matching query 
> params
> ---
>
> Key: YARN-8299
> URL: https://issues.apache.org/jira/browse/YARN-8299
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: YARN-8299-branch-3.1.001.patch, YARN-8299.001.patch, 
> YARN-8299.002.patch, YARN-8299.003.patch, YARN-8299.004.patch, 
> YARN-8299.005.patch
>
>
> We need APIs that return containers matching the query params. These are 
> needed so that we can find out which containers have been upgraded.






[jira] [Updated] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8299:
-
Target Version/s: 3.1.2

> Yarn Service Upgrade: Add GET APIs that returns instances matching query 
> params
> ---
>
> Key: YARN-8299
> URL: https://issues.apache.org/jira/browse/YARN-8299
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: YARN-8299-branch-3.1.001.patch, YARN-8299.001.patch, 
> YARN-8299.002.patch, YARN-8299.003.patch, YARN-8299.004.patch, 
> YARN-8299.005.patch
>
>
> We need APIs that return containers matching the query params. These are 
> needed so that we can find out which containers have been upgraded.






[jira] [Reopened] (YARN-8299) Yarn Service Upgrade: Add GET APIs that returns instances matching query params

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reopened YARN-8299:
--

> Yarn Service Upgrade: Add GET APIs that returns instances matching query 
> params
> ---
>
> Key: YARN-8299
> URL: https://issues.apache.org/jira/browse/YARN-8299
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8299-branch-3.1.001.patch, YARN-8299.001.patch, 
> YARN-8299.002.patch, YARN-8299.003.patch, YARN-8299.004.patch, 
> YARN-8299.005.patch
>
>
> We need APIs that return containers matching the query params. These are 
> needed so that we can find out which containers have been upgraded.






[jira] [Updated] (YARN-8779) Fix few discrepancies between YARN Service swagger spec and code

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8779:
-
Target Version/s: 3.2.0, 3.1.3  (was: 3.2.0, 3.1.2)

> Fix few discrepancies between YARN Service swagger spec and code
> 
>
> Key: YARN-8779
> URL: https://issues.apache.org/jira/browse/YARN-8779
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0, 3.1.1
>Reporter: Gour Saha
>Priority: Major
>
> The following issues were identified in the YARN Service swagger definition during an 
> effort to integrate with a running service by generating Java and Go 
> client-side stubs from the spec -
>  
> 1.
> *restartPolicy* is wrong and should be *restart_policy*
>  
> 2.
> A DELETE request to a non-existing service (or a previously existing but 
> deleted service) throws an ApiException instead of something like 
> NotFoundException (the equivalent of 404). Note, DELETE of an existing 
> service behaves fine.
>  
> 3.
> The response code of DELETE request is 200. The spec says 204. Since the 
> response has a payload, the spec should be updated to 200 instead of 204.
>  
> 4.
>  _DefaultApi.java_ client's _appV1ServicesServiceNameGetWithHttpInfo_ method 
> does not return a Service object. Swagger definition has the below bug in GET 
> response of */app/v1/services/\{service_name}* -
> {code:java}
> type: object
> items:
>   $ref: '#/definitions/Service'
> {code}
> It should be -
> {code:java}
> $ref: '#/definitions/Service'
> {code}
>  
> 5.
> Serialization issues were seen in all enum classes - ServiceState.java, 
> ContainerState.java, ComponentState.java, PlacementType.java and 
> PlacementScope.java.
> Java client threw the below exception for ServiceState -
> {code:java}
> Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: 
> Cannot construct instance of 
> `org.apache.cb.yarn.service.api.records.ServiceState` (although at least one 
> Creator exists): no String-argument constructor/factory method to deserialize 
> from String value ('ACCEPTED')
>  at [Source: 
> (org.glassfish.jersey.message.internal.ReaderInterceptorExecutor$UnCloseableInputStream);
>  line: 1, column: 121] (through reference chain: 
> org.apache.cb.yarn.service.api.records.Service["state"])
> {code}
> For Golang we saw this for ContainerState -
> {code:java}
> ERRO[2018-08-12T23:32:31.851-07:00] During GET request: json: cannot 
> unmarshal string into Go struct field Container.state of type 
> yarnmodel.ContainerState 
> {code}
>  
> 6.
> *launch_time* actually returns an integer but swagger definition says date. 
> Hence, the following exception is seen on the client side -
> {code:java}
> Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: 
> Unexpected token (VALUE_NUMBER_INT), expected START_ARRAY: Expected array or 
> string.
>  at [Source: 
> (org.glassfish.jersey.message.internal.ReaderInterceptorExecutor$UnCloseableInputStream);
>  line: 1, column: 477] (through reference chain: 
> org.apache.cb.yarn.service.api.records.Service["components"]->java.util.ArrayList[0]->org.apache.cb.yarn.service.api.records.Component["containers"]->java.util.ArrayList[0]->org.apache.cb.yarn.service.api.records.Container["launch_time"])
> {code}
>  
> 8.
> *user.name* query param with a valid value is required for all API calls to 
> an unsecured cluster. This is not defined in the spec.
>  






[jira] [Updated] (YARN-8161) ServiceState FLEX should be removed

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8161:
-
Target Version/s: 3.2.0, 3.1.3  (was: 3.2.0, 3.1.2)

> ServiceState FLEX should be removed
> ---
>
> Key: YARN-8161
> URL: https://issues.apache.org/jira/browse/YARN-8161
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Gour Saha
>Priority: Major
>
> ServiceState FLEX is not required to trigger flex up/down of containers and 
> should be removed






[jira] [Updated] (YARN-8366) Expose debug log information when user intend to enable GPU without setting nvidia-smi path

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8366:
-
Target Version/s: 3.2.0, 3.1.3  (was: 3.2.0, 3.1.2)

> Expose debug log information when user intend to enable GPU without setting 
> nvidia-smi path
> ---
>
> Key: YARN-8366
> URL: https://issues.apache.org/jira/browse/YARN-8366
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
>
> Expose debug information to help users find the root cause of failure when they 
> don't make these two settings manually before enabling GPU on YARN:
> 1. yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables in 
> yarn-site.xml
> 2. the environment variable LD_LIBRARY_PATH
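A minimal sketch of a pre-flight check along these lines; the property key is the one 
named above, while the helper class and log messages are purely illustrative:

{code:java}
// Hypothetical pre-flight check sketching the debug output suggested above.
import org.apache.hadoop.conf.Configuration;

public class GpuPreflightCheck {
  private static final String NVIDIA_SMI_PATH_KEY =
      "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables";

  public static void logGpuDiscoveryHints(Configuration conf) {
    String smiPath = conf.get(NVIDIA_SMI_PATH_KEY);
    if (smiPath == null || smiPath.isEmpty()) {
      System.out.println("DEBUG: " + NVIDIA_SMI_PATH_KEY
          + " is not set in yarn-site.xml; nvidia-smi based GPU discovery may fail.");
    }
    String ldLibraryPath = System.getenv("LD_LIBRARY_PATH");
    if (ldLibraryPath == null || ldLibraryPath.isEmpty()) {
      System.out.println("DEBUG: LD_LIBRARY_PATH is not set; "
          + "loading the NVIDIA libraries may fail.");
    }
  }
}
{code}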






[jira] [Updated] (YARN-8986) publish all exposed ports to random ports when using bridge network

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8986:
-
Fix Version/s: (was: 3.1.2)

> publish all exposed ports to random ports when using bridge network
> ---
>
> Key: YARN-8986
> URL: https://issues.apache.org/jira/browse/YARN-8986
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.1
>Reporter: Charo Zhang
>Assignee: Charo Zhang
>Priority: Minor
>  Labels: Docker
> Attachments: YARN-8986.patch
>
>
> It's better to publish all exposed ports to random ports (-P) or to support port 
> mapping (-p) when using the bridge network for Docker containers.
>  






[jira] [Updated] (YARN-8986) publish all exposed ports to random ports when using bridge network

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8986:
-
Target Version/s: 3.1.3  (was: 3.1.2)

> publish all exposed ports to random ports when using bridge network
> ---
>
> Key: YARN-8986
> URL: https://issues.apache.org/jira/browse/YARN-8986
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.1
>Reporter: Charo Zhang
>Assignee: Charo Zhang
>Priority: Minor
>  Labels: Docker
> Attachments: YARN-8986.patch
>
>
> It's better to publish all exposed ports to random ports (-P) or to support port 
> mapping (-p) when using the bridge network for Docker containers.
>  






[jira] [Updated] (YARN-8552) [DS] Container report fails for distributed containers

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8552:
-
Target Version/s: 3.1.3  (was: 3.1.2)

> [DS]  Container report fails for distributed containers
> ---
>
> Key: YARN-8552
> URL: https://issues.apache.org/jira/browse/YARN-8552
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
>
> 2018-07-19 19:15:02,281 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1531994217928_0003_01_1099511627753 Container Transitioned from 
> ACQUIRED to RUNNING
> 2018-07-19 19:15:02,384 ERROR 
> org.apache.hadoop.yarn.server.webapp.ContainerBlock: Failed to read the 
> container container_1531994217928_0003_01_1099511627773.
> Container report is failing for Distributed Scheduler containers. Currently all 
> the containers are fetched from the central RM, so we need to find an alternative 
> for the same.
> {code}
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.yarn.exceptions.ContainerNotFoundException: 
> Container with id 'container_1531994217928_0003_01_1099511627773' doesn't 
> exist in RM.
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:499)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMContainerBlock.getContainerReport(RMContainerBlock.java:44)
> at 
> org.apache.hadoop.yarn.server.webapp.ContainerBlock$1.run(ContainerBlock.java:82)
> at 
> org.apache.hadoop.yarn.server.webapp.ContainerBlock$1.run(ContainerBlock.java:79)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1688)
> ... 70 more
> {code}






[jira] [Updated] (YARN-8509) Total pending resource calculation in preemption should use user-limit factor instead of minimum-user-limit-percent

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8509:
-
Target Version/s: 3.2.0, 3.1.3  (was: 3.2.0, 3.1.2)

> Total pending resource calculation in preemption should use user-limit factor 
> instead of minimum-user-limit-percent
> ---
>
> Key: YARN-8509
> URL: https://issues.apache.org/jira/browse/YARN-8509
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
>  Labels: capacityscheduler
> Attachments: YARN-8509.001.patch, YARN-8509.002.patch, 
> YARN-8509.003.patch, YARN-8509.004.patch, YARN-8509.005.patch
>
>
> In LeafQueue#getTotalPendingResourcesConsideringUserLimit, we calculate the total 
> pending resource based on user-limit percent and user-limit factor, which caps the 
> pending resource for each user to the minimum of the user-limit pending and the 
> actual pending. This prevents a queue from taking more pending resource to 
> achieve queue balance after all queues are satisfied with their ideal allocation.
>   
>  We need to change the logic so that a queue's pending resource can go beyond the user limit.
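A rough sketch of the direction suggested by the summary, using hypothetical names 
rather than the actual LeafQueue code:

{code:java}
// Hypothetical sketch of the capping logic; not the actual LeafQueue implementation.
public final class PendingResourceExample {
  // Current behavior (per the summary): pending is capped by the user limit derived
  // from minimum-user-limit-percent, so queue pending cannot exceed that bound.
  static long cappedByMinimumUserLimit(long actualPending, long minUserLimit) {
    return Math.min(actualPending, minUserLimit);
  }

  // Proposed direction: cap using user-limit-factor instead, which lets a queue's
  // pending go beyond the minimum user limit and helps inter-queue balancing.
  static long cappedByUserLimitFactor(long actualPending, long minUserLimit,
      float userLimitFactor) {
    return Math.min(actualPending, (long) (minUserLimit * userLimitFactor));
  }

  public static void main(String[] args) {
    long pending = 100, minUserLimit = 40;
    System.out.println(cappedByMinimumUserLimit(pending, minUserLimit));     // 40
    System.out.println(cappedByUserLimitFactor(pending, minUserLimit, 2f));  // 80
  }
}
{code}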






[jira] [Updated] (YARN-8453) Additional Unit tests to verify queue limit and max-limit with multiple resource types

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8453:
-
Target Version/s: 3.0.4, 3.1.3  (was: 3.0.4, 3.1.2)

> Additional Unit  tests to verify queue limit and max-limit with multiple 
> resource types
> ---
>
> Key: YARN-8453
> URL: https://issues.apache.org/jira/browse/YARN-8453
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.0.2
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8453.001.patch
>
>
> With support for additional resource types other than CPU and memory, it is 
> possible that one such new resource has exhausted its quota on a given 
> queue while other resources such as memory / CPU are still available beyond the 
> guaranteed limit (under max-limit). Adding more unit tests to ensure we are 
> not starving such allocation requests.






[jira] [Updated] (YARN-8052) Move overwriting of service definition during flex to service master

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8052:
-
Target Version/s: 3.1.3  (was: 3.1.2)

> Move overwriting of service definition during flex to service master
> 
>
> Key: YARN-8052
> URL: https://issues.apache.org/jira/browse/YARN-8052
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
>
> The overwrite of the service definition during flex is currently done from the 
> ServiceClient. 
> During auto finalization of an upgrade, the current service definition gets 
> overwritten as well by the service master. This creates a potential conflict. 
> We need to move the overwrite of the service definition during flex to the 
> service master. 
> Discussed on YARN-8018.






[jira] [Updated] (YARN-8417) Should skip passing HDFS_HOME, HADOOP_CONF_DIR, JAVA_HOME, etc. to Docker container.

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8417:
-
Target Version/s: 3.1.3  (was: 3.1.2)

> Should skip passing HDFS_HOME, HADOOP_CONF_DIR, JAVA_HOME, etc. to Docker 
> container.
> 
>
> Key: YARN-8417
> URL: https://issues.apache.org/jira/browse/YARN-8417
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Critical
>
> Currently, the YARN NM passes the JAVA_HOME, HDFS_HOME, CLASSPATH environment 
> variables before launching a Docker container, no matter whether ENTRY_POINT is 
> used or not. This overwrites environment variables defined inside the Dockerfile 
> (by using \{{ENV}}). For a Docker container, it actually doesn't make sense to pass 
> JAVA_HOME, HDFS_HOME, etc., because inside the Docker image a separate Java/Hadoop 
> is either installed or mounted to exactly the same directory as on the host machine.






[jira] [Updated] (YARN-8657) User limit calculation should be read-lock-protected within LeafQueue

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8657:
-
Target Version/s: 3.2.1, 3.1.3  (was: 3.1.2, 3.2.1)

> User limit calculation should be read-lock-protected within LeafQueue
> -
>
> Key: YARN-8657
> URL: https://issues.apache.org/jira/browse/YARN-8657
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Critical
> Attachments: YARN-8657.001.patch, YARN-8657.002.patch
>
>
> When async scheduling is enabled, the user limit calculation could be wrong: 
> it is possible that the scheduler calculated a user_limit, but inside 
> {{canAssignToUser}} it has become stale. 
> We need to protect the user limit calculation.
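A minimal sketch of the kind of read-lock protection being proposed, assuming a much 
simplified user-limit computation (not the actual CapacityScheduler code):

{code:java}
// Simplified sketch of read-lock-protecting a user-limit computation.
import java.util.concurrent.locks.ReentrantReadWriteLock;

class UserLimitExample {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private long queueUsed;      // updated elsewhere under the write lock
  private int activeUsers = 1; // updated elsewhere under the write lock

  long computeUserLimit(long queueCapacity) {
    lock.readLock().lock();
    try {
      // Both fields are read under the same read lock, so an async scheduling
      // thread cannot observe a half-updated (stale) combination of the two.
      int users = Math.max(activeUsers, 1);
      return Math.max(queueCapacity / users, queueUsed / users);
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}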






[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8234:
-
Target Version/s: 3.1.3  (was: 3.1.2)

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Critical
> Attachments: YARN-8234-branch-2.8.3.001.patch, 
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, 
> YARN-8234-branch-2.8.3.004.patch, YARN-8234.001.patch, YARN-8234.002.patch, 
> YARN-8234.003.patch, YARN-8234.004.patch
>
>
> When the system metrics publisher is enabled, the RM pushes events to the timeline 
> server via a RESTful API. If the cluster load is heavy, many events are sent to 
> the timeline server and the timeline server's event handler thread gets locked. 
> YARN-7266 discusses the details of this problem. Because of the lock, the 
> timeline server can't receive events as fast as they are generated in the RM, and 
> lots of timeline events stay in the RM's memory. Eventually, those events consume 
> all of the RM's memory and the RM starts a full GC (which causes a JVM 
> stop-the-world pause and a timeout from the RM to ZooKeeper) or even hits an OOM. 
> The main problem here is that the timeline server can't receive events 
> as fast as they are generated. Currently, the RM system metrics publisher puts only 
> one event in each request, and most of the time is spent handling HTTP headers or 
> other aspects of the network connection on the timeline side. Only a small fraction 
> of the time is spent dealing with the timeline event itself, which is the truly 
> valuable part.
> In this issue, we add a buffer to the system metrics publisher and let the 
> publisher send events to the timeline server in batches via one request. With the 
> batch size set to 1000, in our experiment the speed at which the timeline server 
> receives events improved by 100x. We have implemented this function in our 
> production environment, which accepts 2 apps in one hour, and it works fine.
> We add the following configuration:
>  * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of events 
> the system metrics publisher sends in one request. The default value is 1000.
>  * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the 
> event buffer in the system metrics publisher.
>  * yarn.resourcemanager.system-metrics-publisher.interval-seconds: when batch 
> publishing is enabled, we must avoid the publisher waiting for a batch to fill up 
> and holding events in the buffer for a long time. So we add another thread which 
> sends the events in the buffer periodically. This config sets the interval of that 
> periodic sending thread. The default value is 60s.
>  
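A minimal sketch of the batching idea described above (bounded buffer plus periodic 
flush); the class and method names are illustrative, not the actual 
SystemMetricsPublisher code:

{code:java}
// Illustrative batching publisher: flush when the batch fills up or on a periodic interval.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class BatchingPublisher<E> {
  private final BlockingQueue<E> buffer;
  private final int batchSize;

  BatchingPublisher(int bufferSize, int batchSize, long intervalSeconds) {
    this.buffer = new LinkedBlockingQueue<>(bufferSize);
    this.batchSize = batchSize;
    // Periodic flush so events are not held in the buffer for too long.
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
        this::flush, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
  }

  void publish(E event) {
    buffer.offer(event);            // dropping or blocking on overflow is a policy choice
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  private synchronized void flush() {
    List<E> batch = new ArrayList<>(batchSize);
    buffer.drainTo(batch, batchSize);
    if (!batch.isEmpty()) {
      // One REST request carrying the whole batch instead of one request per event.
      System.out.println("sending " + batch.size() + " events in one request");
    }
  }
}
{code}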






[jira] [Updated] (YARN-8257) Native service should automatically adding escapes for environment/launch cmd before sending to YARN

2018-11-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8257:
-
Target Version/s: 3.1.3  (was: 3.1.2)

> Native service should automatically adding escapes for environment/launch cmd 
> before sending to YARN
> 
>
> Key: YARN-8257
> URL: https://issues.apache.org/jira/browse/YARN-8257
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Gour Saha
>Priority: Critical
>
> Noticed this issue while using native service: 
> Basically, when a string for an environment variable / launch command contains chars like 
> ", /, `: it needs to be escaped twice.
> The first time is for the JSON spec: because JSON accepts double quotes only, 
> it needs an escape.
> The second time is at container launch; what we do for the command line is 
> (ContainerLaunch.java):
> {code:java}
> line("exec /bin/bash -c \"", StringUtils.join(" ", command), "\"");{code}
> And for environment:
> {code:java}
> line("export ", key, "=\"", value, "\"");{code}
> An example of launch_command: 
> {code:java}
> "launch_command": "export CLASSPATH=\\`\\$HADOOP_HDFS_HOME/bin/hadoop 
> classpath --glob\\`"{code}
> And example of environment:
> {code:java}
> "TF_CONFIG" : "{\\\"cluster\\\": {\\\"master\\\": 
> [\\\"master-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"ps\\\": 
> [\\\"ps-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"], \\\"worker\\\": 
> [\\\"worker-0.distributed-tf.ambari-qa.tensorflow.site:8000\\\"]}, 
> \\\"task\\\": {\\\"type\\\":\\\"${COMPONENT_NAME}\\\", 
> \\\"index\\\":${COMPONENT_ID}}, \\\"environment\\\":\\\"cloud\\\"}",{code}
> To improve usability, I think we should auto-escape the input string once. 
> (For example, if the user specified 
> {code}
> "TF_CONFIG": "\"key\""
> {code}
> We will automatically escape it to:
> {code}
> "TF_CONFIG": \\\"key\\\"
> {code}
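A rough sketch of what escaping the input once could look like; the exact character 
set to escape and where this hooks into native services are assumptions to be settled 
in this JIRA:

{code:java}
// Illustrative sketch only: add one level of escaping for characters that are special
// inside the double-quoted export/exec lines generated by ContainerLaunch.
final class LaunchCmdEscaper {
  static String escapeOnce(String raw) {
    StringBuilder sb = new StringBuilder(raw.length());
    for (char c : raw.toCharArray()) {
      if (c == '"' || c == '`' || c == '\\') {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // "key" becomes \"key\" so it survives the extra shell quoting added at launch time.
    System.out.println(escapeOnce("\"key\""));
  }
}
{code}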






[jira] [Commented] (YARN-9030) Log aggregation changes to handle filesystems which do not support permissions

2018-11-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689755#comment-16689755
 ] 

Wangda Tan commented on YARN-9030:
--

[~suma.shivaprasad], it seems the logic of verifyAndCreateRemoteLogDir is wrong; 
here's my local edit of the method:

{code}
  /**
   * Verify and create the remote log directory.
   */
  public void verifyAndCreateRemoteLogDir() {
    // Checking the existence of the TLD
    FileSystem remoteFS;
    try {
      remoteFS = getFileSystem(conf);
    } catch (IOException e) {
      throw new YarnRuntimeException("Unable to get Remote FileSystem instance",
          e);
    }
    boolean remoteExists = true;
    Path remoteRootLogDir = getRemoteRootLogDir();
    try {
      FsPermission perms = remoteFS.getFileStatus(remoteRootLogDir)
          .getPermission();
      if (!perms.equals(TLDIR_PERMISSIONS)) {
        LOG.warn("Remote Root Log Dir [" + remoteRootLogDir
            + "] already exist, but with incorrect permissions. "
            + "Expected: [" + TLDIR_PERMISSIONS + "], Found: [" + perms + "]."
            + " The cluster may have problems with multiple users.");
      }
    } catch (FileNotFoundException e) {
      remoteExists = false;
    } catch (IOException e) {
      throw new YarnRuntimeException(
          "Failed to check permissions for dir [" + remoteRootLogDir + "]", e);
    }

    try {
      Path qualified = remoteRootLogDir.makeQualified(remoteFS.getUri(),
          remoteFS.getWorkingDirectory());

      if (!remoteExists) {
        LOG.warn("Remote Root Log Dir [" + remoteRootLogDir
            + "] does not exist. Attempting to create it.");
      }

      remoteFS.mkdirs(qualified, new FsPermission(TLDIR_PERMISSIONS));

      // Not possible to query FileSystem API to check if it supports
      // chmod, chown etc. Hence resorting to catching exceptions here.
      // Remove when FS API is ready
      try {
        remoteFS.setPermission(qualified, new FsPermission(TLDIR_PERMISSIONS));
      } catch (UnsupportedOperationException use) {
        LOG.info("Unable to set permissions for configured filesystem since"
            + " it does not support this", remoteFS.getScheme());
        fsSupportsChmod = false;
      }

      UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
      String primaryGroupName = null;
      try {
        primaryGroupName = loginUser.getPrimaryGroupName();
      } catch (IOException e) {
        LOG.warn("No primary group found. The remote root log directory"
            + " will be created with the HDFS superuser being its group "
            + "owner. JobHistoryServer may be unable to read the directory.");
      }
      // set owner on the remote directory only if the primary group exists
      if (primaryGroupName != null) {
        try {
          remoteFS.setOwner(qualified, loginUser.getShortUserName(),
              primaryGroupName);
        } catch (UnsupportedOperationException use) {
          LOG.info("File System does not support setting user/group" + remoteFS
              .getScheme(), use);
        }
      }
    } catch (IOException e) {
      throw new YarnRuntimeException(
          "Failed to create remoteLogDir [" + remoteRootLogDir + "]", e);
    }
  }
{code}

Several issues with the previous code: 
1)
{code}
  Path qualified = remoteRootLogDir.makeQualified(remoteFS.getUri(),
  remoteFS.getWorkingDirectory());
{code}
Needs to be called in any case (you're right, but it should be placed under the try).

2) 
{code}
   remoteFS.mkdirs(qualified, new FsPermission(TLDIR_PERMISSIONS));
{code} 
Needs to be called in any case (not only when remoteExists == false).

3) Removed duplicated else block at the end.

4) Removed unused parameters. 

Does this make sense to you?

> Log aggregation changes to handle filesystems which do not support permissions
> --
>
> Key: YARN-9030
> URL: https://issues.apache.org/jira/browse/YARN-9030
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-9030.1.patch
>
>
> Some cloud storages like ADLS do not support permissions, in which case they 
> throw an UnsupportedOperationException. Log aggregation should handle this 
> case and not set permissions for the log aggregation base dir / sub dirs.
> {noformat}
> 2018-11-12 15:37:28,726 WARN  logaggregation.LogAggregationService 
> (LogAggregationService.java:initApp(209)) - Application failed to init 
> aggregation
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to check 
> permissions for dir [abfs://testc...@test.blob.core.windows.net/app-logs]
> at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.LogAggregationFileCont

[jira] [Commented] (YARN-8881) Phase 1 - Add basic pluggable device plugin framework

2018-11-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16689699#comment-16689699
 ] 

Wangda Tan commented on YARN-8881:
--

+1 to the latest patch, will commit later today if no objections.

Thanks

> Phase 1 - Add basic pluggable device plugin framework
> -
>
> Key: YARN-8881
> URL: https://issues.apache.org/jira/browse/YARN-8881
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8881-trunk.001.patch, YARN-8881-trunk.002.patch, 
> YARN-8881-trunk.003.patch, YARN-8881-trunk.004.patch, 
> YARN-8881-trunk.005.patch, YARN-8881-trunk.006.patch, 
> YARN-8881-trunk.007.patch, YARN-8881-trunk.008.patch, 
> YARN-8881-trunk.009.patch, YARN-8881-trunk.010.patch, 
> YARN-8881-trunk.011.patch, YARN-8881-trunk.012.patch
>
>
> It includes adding support in "ResourcePluginManager" to load plugin classes 
> based on configuration, an interface for vendors to implement, and an adapter 
> to decouple the plugin from YARN internals. Vendor device resource discovery 
> will be ready once this support is in place.






[jira] [Updated] (YARN-8917) Absolute (maximum) capacity of level3+ queues is wrongly calculated for absolute resource

2018-11-14 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8917:
-
Target Version/s: 3.2.1

> Absolute (maximum) capacity of level3+ queues is wrongly calculated for 
> absolute resource
> -
>
> Key: YARN-8917
> URL: https://issues.apache.org/jira/browse/YARN-8917
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8917.001.patch, YARN-8917.002.patch
>
>
> Absolute capacity should be calculated by multiplying the queue's capacity by the 
> parent queue's absolute capacity,
> but currently it is calculated by dividing the capacity by the parent queue's 
> absolute capacity.
> Calculation for absolute-maximum-capacity has the same problem.
> For example: 
> root.a   capacity=0.4   maximum-capacity=0.8
> root.a.a1   capacity=0.5  maximum-capacity=0.6
> Absolute capacity of root.a.a1 should be 0.2 but is wrongly calculated as 1.25
> Absolute maximum capacity of root.a.a1 should be 0.48 but is wrongly 
> calculated as 0.75
> Moreover:
> {{childQueue.getQueueCapacities().getCapacity()}}  should be changed to 
> {{childQueue.getQueueCapacities().getCapacity(label)}} to avoid getting wrong 
> capacity from default partition when calculating for a non-default partition.
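A small worked check of the example above, just to make the multiply-versus-divide 
point concrete:

{code:java}
// Worked check of the example above: absolute capacity should multiply down the tree.
public class AbsoluteCapacityExample {
  public static void main(String[] args) {
    double parentCapacity = 0.4, parentMaxCapacity = 0.8;  // root.a
    double childCapacity = 0.5, childMaxCapacity = 0.6;    // root.a.a1

    // Correct: multiply by the parent's absolute (maximum) capacity.
    System.out.println(childCapacity * parentCapacity);        // ~0.2
    System.out.println(childMaxCapacity * parentMaxCapacity);  // ~0.48

    // Buggy: dividing instead yields the wrong values quoted in the description.
    System.out.println(childCapacity / parentCapacity);        // ~1.25
    System.out.println(childMaxCapacity / parentMaxCapacity);  // ~0.75
  }
}
{code}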






[jira] [Updated] (YARN-8917) Absolute (maximum) capacity of level3+ queues is wrongly calculated for absolute resource

2018-11-14 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8917:
-
Priority: Critical  (was: Major)

> Absolute (maximum) capacity of level3+ queues is wrongly calculated for 
> absolute resource
> -
>
> Key: YARN-8917
> URL: https://issues.apache.org/jira/browse/YARN-8917
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Critical
> Attachments: YARN-8917.001.patch, YARN-8917.002.patch
>
>
> Absolute capacity should be calculated by multiplying the queue's capacity by the 
> parent queue's absolute capacity,
> but currently it is calculated by dividing the capacity by the parent queue's 
> absolute capacity.
> Calculation for absolute-maximum-capacity has the same problem.
> For example: 
> root.a   capacity=0.4   maximum-capacity=0.8
> root.a.a1   capacity=0.5  maximum-capacity=0.6
> Absolute capacity of root.a.a1 should be 0.2 but is wrongly calculated as 1.25
> Absolute maximum capacity of root.a.a1 should be 0.48 but is wrongly 
> calculated as 0.75
> Moreover:
> {{childQueue.getQueueCapacities().getCapacity()}}  should be changed to 
> {{childQueue.getQueueCapacities().getCapacity(label)}} to avoid getting wrong 
> capacity from default partition when calculating for a non-default partition.






[jira] [Resolved] (YARN-9020) set a wrong AbsoluteCapacity when call ParentQueue#setAbsoluteCapacity

2018-11-14 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-9020.
--
Resolution: Duplicate

Thanks [~jutia] for reporting this. It is a valid issue.

This is dup of YARN-8917, [~Tao Yang] has put a patch already. Closing this as 
dup.

> set a wrong AbsoluteCapacity when call  ParentQueue#setAbsoluteCapacity
> ---
>
> Key: YARN-9020
> URL: https://issues.apache.org/jira/browse/YARN-9020
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: tianjuan
>Assignee: tianjuan
>Priority: Major
>
> A wrong AbsoluteCapacity is set when calling ParentQueue#setAbsoluteCapacity:
> private void deriveCapacityFromAbsoluteConfigurations(String label,
>  Resource clusterResource, ResourceCalculator rc, CSQueue childQueue) {
> // 3. Update absolute capacity as a float based on parent's minResource and
>  // cluster resource.
>  childQueue.getQueueCapacities().setAbsoluteCapacity(label,
>  (float) childQueue.getQueueCapacities().getCapacity()
>  / getQueueCapacities().getAbsoluteCapacity(label));
>  
> should be 
> childQueue.getQueueCapacities().setAbsoluteCapacity(label,
>  (float) childQueue.getQueueCapacities().getCapacity(label)
>  / getQueueCapacities().getAbsoluteCapacity(label));






[jira] [Commented] (YARN-8917) Absolute (maximum) capacity of level3+ queues is wrongly calculated for absolute resource

2018-11-14 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686903#comment-16686903
 ] 

Wangda Tan commented on YARN-8917:
--

This JIRA somehow dropped off our radar; retriggering the Jenkins job and will get 
it committed.

> Absolute (maximum) capacity of level3+ queues is wrongly calculated for 
> absolute resource
> -
>
> Key: YARN-8917
> URL: https://issues.apache.org/jira/browse/YARN-8917
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8917.001.patch, YARN-8917.002.patch
>
>
> Absolute capacity should be calculated by multiplying the queue's capacity by the 
> parent queue's absolute capacity,
> but currently it is calculated by dividing the capacity by the parent queue's 
> absolute capacity.
> Calculation for absolute-maximum-capacity has the same problem.
> For example: 
> root.a   capacity=0.4   maximum-capacity=0.8
> root.a.a1   capacity=0.5  maximum-capacity=0.6
> Absolute capacity of root.a.a1 should be 0.2 but is wrongly calculated as 1.25
> Absolute maximum capacity of root.a.a1 should be 0.48 but is wrongly 
> calculated as 0.75
> Moreover:
> {{childQueue.getQueueCapacities().getCapacity()}}  should be changed to 
> {{childQueue.getQueueCapacities().getCapacity(label)}} to avoid getting wrong 
> capacity from default partition when calculating for a non-default partition.






[jira] [Assigned] (YARN-6223) [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN

2018-11-14 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reassigned YARN-6223:


Assignee: Wangda Tan  (was: Antal Bálint Steinbach)

> [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation 
> on YARN
> 
>
> Key: YARN-6223
> URL: https://issues.apache.org/jira/browse/YARN-6223
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-6223.Natively-support-GPU-on-YARN-v1.pdf, 
> YARN-6223.wip.1.patch, YARN-6223.wip.2.patch, YARN-6223.wip.3.patch
>
>
> A variety of workloads are moving to YARN, including machine learning / 
> deep learning, which can be sped up by leveraging GPU computation power. 
> Workloads should be able to request GPUs from YARN as simply as CPU and memory.
> *To make a complete GPU story, we should support the following pieces:*
> 1) GPU discovery/configuration: the admin can either configure GPU resources and 
> architectures on each node, or, more advanced, the NodeManager can automatically 
> discover GPU resources and architectures and report them to the ResourceManager. 
> 2) GPU scheduling: the YARN scheduler should account for GPUs as a resource type just 
> like CPU and memory.
> 3) GPU isolation/monitoring: once a task is launched with GPU resources, 
> the NodeManager should properly isolate and monitor the task's resource usage.
> For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced 
> an extensible framework to support isolation for different resource types and 
> different runtimes.
> *Related JIRAs:*
> There are a couple of JIRAs (YARN-4122/YARN-5517) filed with similar goals but 
> different solutions:
> For scheduling:
> - YARN-4122/YARN-5517 both add a new GPU resource type to the Resource 
> protocol instead of leveraging YARN-3926.
> For isolation:
> - YARN-4122 proposed using CGroups for isolation, which cannot solve the 
> problems listed at 
> https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges, such as 
> minor device number mapping, loading the nvidia_uvm module, mismatched CUDA/driver 
> versions, etc.
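For reference, a hedged sketch of the NodeManager-side knob involved; the property key 
and value are as I recall from the GPU-on-YARN documentation and should be verified 
against the target release:

{code:java}
// Sketch only: enabling the GPU resource plugin on the NodeManager.
import org.apache.hadoop.conf.Configuration;

public class GpuConfigSketch {
  public static Configuration buildNodeManagerConf() {
    Configuration conf = new Configuration();
    // 1) Discovery/configuration: enable the GPU resource plugin on the NM.
    conf.set("yarn.nodemanager.resource-plugins", "yarn.io/gpu");
    // 2) Scheduling: the scheduler then accounts for "yarn.io/gpu" like CPU/memory.
    // 3) Isolation: delegated to the Linux container executor (cgroups) or Docker runtime.
    return conf;
  }
}
{code}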






[jira] [Commented] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to submit jobs

2018-11-13 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685750#comment-16685750
 ] 

Wangda Tan commented on YARN-9001:
--

Pushed to trunk, but the backport to branch-3.2 failed. [~yuan_zac], if you have 
bandwidth, could you check whether there are any Submarine changes which we forgot to 
push to branch-3.2? Ideally all Submarine fixes should have gone to 
branch-3.2.

> [Submarine] Use AppAdminClient instead of ServiceClient to submit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9001.001.patch, YARN-9001.002.patch, 
> YARN-9001.003.patch, YARN-9001.004.patch, YARN-9001.005.patch
>
>
> For now, Submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.
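For context, a hedged sketch of what using AppAdminClient could look like; the method 
names and signatures here are from memory and should be verified against the 
yarn-client API:

{code:java}
// Hedged sketch: submit and inspect a YARN service via AppAdminClient instead of ServiceClient.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.client.api.AppAdminClient;

public class SubmitViaAppAdminClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    AppAdminClient client =
        AppAdminClient.createAppAdminClient(AppAdminClient.DEFAULT_TYPE, conf);
    // Launch a service from a JSON spec file, then print its status.
    client.actionLaunch("/path/to/service-spec.json", "my-submarine-job", null, null);
    System.out.println(client.getStatusString("my-submarine-job"));
  }
}
{code}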






[jira] [Updated] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to submit jobs

2018-11-13 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-9001:
-
Fix Version/s: 3.3.0

> [Submarine] Use AppAdminClient instead of ServiceClient to submit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9001.001.patch, YARN-9001.002.patch, 
> YARN-9001.003.patch, YARN-9001.004.patch, YARN-9001.005.patch
>
>
> For now, Submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.






[jira] [Comment Edited] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-13 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685618#comment-16685618
 ] 

Wangda Tan edited comment on YARN-8960 at 11/13/18 7:50 PM:


Thanks [~yuan_zac], 

Some comments: 
1) doLoginIfSecure, could you print the login user if keytab/principal is empty? 
(Assume the user has logged in using kinit.) We should fail the job submission if 
the user hasn't logged in using kinit AND no keytab/principal is specified AND security 
is enabled. I'd also suggest using Log.info instead of debug. 

2) Regarding uploading the keytab, I'm a bit concerned about this behavior. Instead 
of doing that, should we assume keytabs will be placed under the same directory on 
all machines? For example, if the "zac" user has /security/keytabs/zac.keytab, the 
remote machine should have the same keytab in the same folder. Passing the 
keytab around could be a high risk for the cluster.

If you think #2 is necessary, please at least make uploading the keytab an 
optional parameter, and add a note to the command line description (such as 
"distributing the keytab to other machines is a risky operation for your 
credentials. Please consider having the admin pre-distribute your keytab as an 
alternative and safer solution"). 



was (Author: leftnoteasy):
Thanks [~yuan_zac], 

Some comments: 
1) doLoginIfSecure, could u print login user if keytab/principal is empty? 
(Assume the user has login using kinit). We should fail the job submission if 
user doesn't login using kinit AND no keytab/principal specified AND security 
is enabled. And suggest to use Log.info instead of debug. 

2) Regarding to upload keytab, I'm a bit concerned about this behavior, instead 
of doing that, should we assume keytabs will be placed under all machine's 
directory? For example, if "zac" user has /security/keytabs/zac.keytab, the 
remote machine should have the same keytab on the same folder. Passing around 
keytab could be a high risk of the cluster.



> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
> specified in the persisted service definitio
> n, fail to connect to AM.
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ... 68 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to submit jobs

2018-11-13 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685635#comment-16685635
 ] 

Wangda Tan commented on YARN-9001:
--

Rebased to latest trunk to run Jenkins.

> [Submarine] Use AppAdminClient instead of ServiceClient to submit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9001.001.patch, YARN-9001.002.patch, 
> YARN-9001.003.patch, YARN-9001.004.patch, YARN-9001.005.patch
>
>
> For now, Submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to submit jobs

2018-11-13 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-9001:
-
Attachment: YARN-9001.005.patch

> [Submarine] Use AppAdminClient instead of ServiceClient to submit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9001.001.patch, YARN-9001.002.patch, 
> YARN-9001.003.patch, YARN-9001.004.patch, YARN-9001.005.patch
>
>
> For now, Submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to submit jobs

2018-11-13 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685631#comment-16685631
 ] 

Wangda Tan commented on YARN-9001:
--

Thanks [~yuan_zac], +1, committing the patch. 

> [Submarine] Use AppAdminClient instead of ServiceClient to submit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9001.001.patch, YARN-9001.002.patch, 
> YARN-9001.003.patch, YARN-9001.004.patch
>
>
> For now, Submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-13 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685618#comment-16685618
 ] 

Wangda Tan commented on YARN-8960:
--

Thanks [~yuan_zac], 

Some comments: 
1) doLoginIfSecure: could you print the login user if keytab/principal is empty? 
(Assume the user has logged in using kinit.) We should fail the job submission if 
the user hasn't logged in using kinit AND no keytab/principal is specified AND 
security is enabled. Also, I suggest using Log.info instead of debug. 

2) Regarding uploading the keytab, I'm a bit concerned about this behavior. Instead 
of doing that, should we assume keytabs will be placed in the same directory on all 
machines? For example, if the "zac" user has /security/keytabs/zac.keytab, the 
remote machine should have the same keytab in the same folder. Passing the keytab 
around could be a high risk to the cluster.



> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
> specified in the persisted service definitio
> n, fail to connect to AM.
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ... 68 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-13 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8960:
-
Description: 
After submitting a submarine job, we tried to get service status using the 
following command:

yarn app -status ${service_name}

But we got the following error:

HTTP error code : 500

 

The stack in resourcemanager log is :

{code}
ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
java.lang.reflect.UndeclaredThrowableException
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
 at 
org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
 at 
org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ...
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
specified in the persisted service definitio
n, fail to connect to AM.
 at 
org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
 at 
org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
 at 
org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 ... 68 more
{code}

  was:
After submitting a submarine job, we tried to get service status using the 
following command:

yarn app -status ${service_name}

But we got the following error:

HTTP error code : 500

 

The stack in resourcemanager log is :

ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
java.lang.reflect.UndeclaredThrowableException
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
 at 
org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
 at 
org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
 at 
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker
._dispatch(AbstractResourceMethodDispatchProvider.java:205)
 at 
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodD
ispatcher.java:75)
 at 
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
 at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
 at 
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
 at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
 at 
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
 at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
 at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
 at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
 at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
 at 
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
 at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
 at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
 at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:179)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
 at 
com.google.inject.servlet.

[jira] [Commented] (YARN-8881) Phase 1 - Add basic pluggable device plugin framework

2018-11-13 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685605#comment-16685605
 ] 

Wangda Tan commented on YARN-8881:
--

Thanks [~tangzhankun], 

Regarding Integer vs. int, I would suggest using negative values to indicate it 
is not set. Using a null Integer will be confusing and could cause an NPE if other 
code paths aren't aware of this.

In general, patch looks good to me. 
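
A small illustrative sketch of the Integer vs. int suggestion above; the field name 
is hypothetical and only shows the negative-sentinel idea:

{code:java}
public class DeviceAttribute {

  // Hypothetical field: a primitive int with -1 meaning "not set" avoids the
  // unboxing NPE a nullable Integer can cause in code paths that forget to
  // check for null.
  private int minorNumber = -1;

  public boolean isMinorNumberSet() {
    return minorNumber >= 0;
  }

  public int getMinorNumber() {
    return minorNumber;
  }

  public void setMinorNumber(int minorNumber) {
    this.minorNumber = minorNumber;
  }

  // For contrast, the nullable alternative fails at runtime on unboxing:
  //   Integer minor = null;
  //   int m = minor; // NullPointerException
}
{code}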

> Phase 1 - Add basic pluggable device plugin framework
> -
>
> Key: YARN-8881
> URL: https://issues.apache.org/jira/browse/YARN-8881
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8881-trunk.001.patch, YARN-8881-trunk.002.patch, 
> YARN-8881-trunk.003.patch, YARN-8881-trunk.004.patch, 
> YARN-8881-trunk.005.patch, YARN-8881-trunk.006.patch, 
> YARN-8881-trunk.007.patch, YARN-8881-trunk.008.patch
>
>
> It includes adding support in "ResourcePluginManager" to load plugin classes 
> based on configuration, an interface for the vendor to implement and the 
> adapter to decouple plugin and YARN internals. And the vendor device resource 
> discovery will be ready after this support



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to submit jobs

2018-11-12 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16684468#comment-16684468
 ] 

Wangda Tan commented on YARN-9001:
--

[~yuan_zac], checked the patch; in general it looks good. Could you comment on 
what tests you have done? 

Thanks,

> [Submarine] Use AppAdminClient instead of ServiceClient to submit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9001.001.patch, YARN-9001.002.patch, 
> YARN-9001.003.patch, YARN-9001.004.patch
>
>
> For now, Submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8993) [Submarine] Add support to run deep learning workload in non-Docker containers

2018-11-08 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8993:


 Summary: [Submarine] Add support to run deep learning workload in 
non-Docker containers
 Key: YARN-8993
 URL: https://issues.apache.org/jira/browse/YARN-8993
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Wangda Tan


Now that Submarine supports Docker containers well, there are some needs to run TF 
without a Docker container. This JIRA is targeted at supporting deep learning 
workload orchestration in non-Docker containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8877) Extend service spec to allow setting resource attributes

2018-11-08 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680617#comment-16680617
 ] 

Wangda Tan commented on YARN-8877:
--

[~cheersyang], makes sense to me.

> Extend service spec to allow setting resource attributes
> 
>
> Key: YARN-8877
> URL: https://issues.apache.org/jira/browse/YARN-8877
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8877.001.patch, YARN-8877.002.patch
>
>
> Extend yarn native service spec to support setting resource attributes in the 
> spec file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-08 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680615#comment-16680615
 ] 

Wangda Tan commented on YARN-8714:
--

[~tangzhankun] , I'm still not quite sure about:
{code:java}
/opt/script2.py->script2.py{code}
Does it mean /opt/script2.py on the NM machine, or will Submarine upload the local 
file /opt/script2.py to the HDFS staging area first and then localize it to script2.py?

Regarding the "{{->dest}}" syntax for the local file name, it is a bit confusing 
to me; should we use ":" instead? 

Also, do you think Submarine should support uploading a local file to the HDFS 
staging area first and then localizing it to a local folder?
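
For discussion, here is a rough, hypothetical sketch of parsing the "src->dest" 
form used above; it is not the WIP patch code, just an illustration of why a 
multi-character separator avoids ambiguity with URIs that contain ':':

{code:java}
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

public final class LocalizationSpec {

  // Split "src->dest" into (source path/URI, local name). A plain ':' separator
  // would collide with the scheme separator in URIs like hdfs:///..., which is
  // one argument for keeping "->" (or another multi-character separator).
  static Map.Entry<String, String> parse(String spec) {
    int idx = spec.lastIndexOf("->");
    if (idx <= 0 || idx + 2 >= spec.length()) {
      throw new IllegalArgumentException("Expected <src>-><dest> but got: " + spec);
    }
    return new SimpleImmutableEntry<>(spec.substring(0, idx),
        spec.substring(idx + 2));
  }

  public static void main(String[] args) {
    System.out.println(parse("hdfs:///user/yarn/script1.py->algorithm1.py"));
    System.out.println(parse("/opt/script2.py->script2.py"));
  }
}
{code}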

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch
>
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-08 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680601#comment-16680601
 ] 

Wangda Tan commented on YARN-8960:
--

[~yuan_zac] , as we discussed offline, do we still need the service principal, 
or should we use the user principal instead?

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker
> ._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodD
> ispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:179)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
>  at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
>  at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
>  at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)

[jira] [Updated] (YARN-8135) Hadoop {Submarine} Project: Simple and scalable deployment of deep learning training / serving jobs on Hadoop

2018-11-08 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8135:
-
Description: 
Description:

*Goals:*
 - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs on 
YARN.
 - Allow jobs to easily access data/models in HDFS and other storage systems.
 - Can launch services to serve Tensorflow/MXNet models.
 - Support run distributed Tensorflow jobs with simple configs.
 - Support run user-specified Docker images.
 - Support specify GPU and other resources.
 - Support launch tensorboard if user specified.
 - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because Submarine is the only vehicle that can let humans explore deep places. B-)

h3. {color:#ff}Please refer to on-going design doc, and add your thoughts: 
[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}

 

*{color:#33}See Also:{color}*
 * {color:#33}Zeppelin integration with Submarine design: 
[https://docs.google.com/document/d/16YN8Kjmxt1Ym3clx5pDnGNXGajUT36hzQxjaik1cP4A/edit#heading=h.4jov859x47qe]{color}

  was:
Description:

*Goals:*
 - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs on 
YARN.
 - Allow jobs easy access data/models in HDFS and other storages.
 - Can launch services to serve Tensorflow/MXNet models.
 - Support run distributed Tensorflow jobs with simple configs.
 - Support run user-specified Docker images.
 - Support specify GPU and other resources.
 - Support launch tensorboard if user specified.
 - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)

*Why this name?*
 - Because Submarine is the only vehicle can let human to explore deep places. 
B-)

h3. {color:#FF}Please refer to on-going design doc, and add your thoughts: 
{color:#33}[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}{color}


> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -
>
> Key: YARN-8135
> URL: https://issues.apache.org/jira/browse/YARN-8135
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>
> Description:
> *Goals:*
>  - Allow infra engineer / data scientist to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs to easily access data/models in HDFS and other storage systems.
>  - Can launch services to serve Tensorflow/MXNet models.
>  - Support run distributed Tensorflow jobs with simple configs.
>  - Support run user-specified Docker images.
>  - Support specify GPU and other resources.
>  - Support launch tensorboard if user specified.
>  - Support customized DNS name for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because Submarine is the only vehicle that can let humans explore deep places. B-)
> h3. {color:#ff}Please refer to on-going design doc, and add your 
> thoughts: 
> [https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#|https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit?usp=sharing]{color}
>  
> *{color:#33}See Also:{color}*
>  * {color:#33}Zeppelin integration with Submarine design: 
> [https://docs.google.com/document/d/16YN8Kjmxt1Ym3clx5pDnGNXGajUT36hzQxjaik1cP4A/edit#heading=h.4jov859x47qe]{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8763) Add WebSocket logic to the Node Manager web server to establish servlet

2018-11-08 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16680570#comment-16680570
 ] 

Wangda Tan commented on YARN-8763:
--

[~sunilg] , I highly suggest reverting this from branch-3.2 if possible, given 
this will only be utilized by the interactive Docker shell in the short term. 
The changes in the patch (all the way to CE) are unnecessarily risky for 3.2.0. We 
can consider backporting the Docker interactive shell feature to 3.2.x 
once we finish development, testing, etc. 

If you need to roll another RC, please get it reverted. I saw you created 
branch-3.2 on Oct 2nd, but this patch was committed on Oct 5; maybe it was 
committed to branch-3.2 by mistake.

If you haven't finished your RC artifacts, please revert it. Otherwise, please 
update the fix version of the JIRA to 3.2.0 to reflect the truth, and revert 
it if another RC is required.

Thanks,

> Add WebSocket logic to the Node Manager web server to establish servlet
> ---
>
> Key: YARN-8763
> URL: https://issues.apache.org/jira/browse/YARN-8763
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Major
>  Labels: Docker
> Fix For: 3.3.0
>
> Attachments: YARN-8763-001.patch, YARN-8763.002.patch, 
> YARN-8763.003.patch, YARN-8763.004.patch, YARN-8763.005.patch
>
>
> The reason we want to use a WebSocket servlet to serve the backend instead of 
> establishing the connection through HTTP is that WebSocket solves a few 
> issues with HTTP that matter for our scenario:
>  # In HTTP, the request is always initiated by the client and the response is 
> processed by the server, making HTTP a unidirectional protocol; WebSocket 
> provides a bi-directional protocol, which means either the client or the server 
> can send a message to the other party.
>  # Full-duplex communication: the client and server can talk to each other 
> independently at the same time.
>  # Single TCP connection: after upgrading the HTTP connection at the 
> beginning, the client and server communicate over that same TCP connection 
> throughout the lifecycle of the WebSocket connection.
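
As a rough illustration of the kind of servlet described above (not the actual NM 
patch; class names are made up and the Jetty 9 annotation API is assumed):

{code:java}
import java.io.IOException;

import org.eclipse.jetty.websocket.api.Session;
import org.eclipse.jetty.websocket.api.annotations.OnWebSocketConnect;
import org.eclipse.jetty.websocket.api.annotations.OnWebSocketMessage;
import org.eclipse.jetty.websocket.api.annotations.WebSocket;
import org.eclipse.jetty.websocket.servlet.WebSocketServlet;
import org.eclipse.jetty.websocket.servlet.WebSocketServletFactory;

// Servlet registered on the web server; upgrades an HTTP request to a WebSocket.
public class ShellWebSocketServlet extends WebSocketServlet {
  @Override
  public void configure(WebSocketServletFactory factory) {
    factory.register(ShellWebSocket.class);
  }

  // One instance per upgraded connection; both sides share a single TCP
  // connection and can send independently (full duplex).
  @WebSocket
  public static class ShellWebSocket {
    @OnWebSocketConnect
    public void onConnect(Session session) {
      System.out.println("Connected: " + session.getRemoteAddress());
    }

    @OnWebSocketMessage
    public void onMessage(Session session, String message) throws IOException {
      // The server can push data back on the same connection at any time.
      session.getRemote().sendString("echo: " + message);
    }
  }
}
{code}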



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-11-06 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8220:
-
Description: 
-Tensorflow could be run on YARN and could leverage YARN's distributed 
features.-

-This spec fill will help to run Tensorflow on yarn with GPU/docker-

 

Please go to YARN-8135 Submarine for deep learning framework support on YARN. 

  was:
Tensorflow could be run on YARN and could leverage YARN's distributed features.

This spec fill will help to run Tensorflow on yarn with GPU/docker


> Running Tensorflow on YARN with GPU and Docker - Examples
> -
>
> Key: YARN-8220
> URL: https://issues.apache.org/jira/browse/YARN-8220
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Critical
> Attachments: YARN-8220.001.patch, YARN-8220.002.patch, 
> YARN-8220.003.patch, YARN-8220.004.patch
>
>
> -Tensorflow could be run on YARN and could leverage YARN's distributed 
> features.-
> -This spec fill will help to run Tensorflow on yarn with GPU/docker-
>  
> Please go to YARN-8135 Submarine for deep learning framework support on YARN. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8237) mxnet yarn spec file to add to native service examples

2018-11-06 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8237:
-
Description: 
Mxnet -could be run on YARN. This- jira -will help to add examples,- yarnfile-, 
docker files which are needed to run Mxnet on YARN.-

 

Please go to YARN-8135 Submarine for deep learning framework support on YARN. 

  was:Mxnet could be run on YARN. This jira will help to add examples, 
yarnfile, docker files which are needed to run Mxnet on YARN.


> mxnet yarn spec file to add to native service examples
> --
>
> Key: YARN-8237
> URL: https://issues.apache.org/jira/browse/YARN-8237
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
>
> Mxnet -could be run on YARN. This- jira -will help to add examples,- 
> yarnfile-, docker files which are needed to run Mxnet on YARN.-
>  
> Please go to YARN-8135 Submarine for deep learning framework support on YARN. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8238) [Umbrella] YARN deep learning framework examples to run on native service

2018-11-06 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8238:
-
Description: 
-Umbrella- jira -to track various deep learning frameworks which can run on 
yarn native services.-

 

Please go to YARN-8135 Submarine for deep learning framework support on YARN. 

  was:Umbrella jira to track various deep learning frameworks which can run on 
yarn native services.


> [Umbrella] YARN deep learning framework examples to run on native service
> -
>
> Key: YARN-8238
> URL: https://issues.apache.org/jira/browse/YARN-8238
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
>
> -Umbrella- jira -to track various deep learning frameworks which can run on 
> yarn native services.-
>  
> Please go to YARN-8135 Submarine for deep learning framework support on YARN. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8237) mxnet yarn spec file to add to native service examples

2018-11-06 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8237.
--
Resolution: Duplicate

> mxnet yarn spec file to add to native service examples
> --
>
> Key: YARN-8237
> URL: https://issues.apache.org/jira/browse/YARN-8237
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
>
> Mxnet could be run on YARN. This jira will help to add examples, yarnfile, 
> docker files which are needed to run Mxnet on YARN.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8238) [Umbrella] YARN deep learning framework examples to run on native service

2018-11-06 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8238.
--
Resolution: Fixed

Closing as dup of YARN-8135. 

> [Umbrella] YARN deep learning framework examples to run on native service
> -
>
> Key: YARN-8238
> URL: https://issues.apache.org/jira/browse/YARN-8238
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn-native-services
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
>
> Umbrella jira to track various deep learning frameworks which can run on yarn 
> native services.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8877) Extend service spec to allow setting resource attributes

2018-11-06 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677039#comment-16677039
 ] 

Wangda Tan commented on YARN-8877:
--

[~cheersyang], 

If YARN-8940 will satisfy all the needs for volumes, should we just go ahead and 
finish YARN-8940 instead of adding this one?

> Extend service spec to allow setting resource attributes
> 
>
> Key: YARN-8877
> URL: https://issues.apache.org/jira/browse/YARN-8877
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8877.001.patch, YARN-8877.002.patch
>
>
> Extend yarn native service spec to support setting resource attributes in the 
> spec file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-06 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677029#comment-16677029
 ] 

Wangda Tan commented on YARN-8714:
--

[~tangzhankun] , could you please explain a little bit about what this means? 
{quote}hdfs:///user/yarn/script1.py->algorithm1.py 
/opt/script2.py->script2.py{quote}

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch
>
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-06 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677030#comment-16677030
 ] 

Wangda Tan commented on YARN-8714:
--

+ [~liuxun323] / [~yuan_zac] to take a look at this as well.

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch
>
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8902) Add volume manager that manages CSI volume lifecycle

2018-11-06 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677027#comment-16677027
 ] 

Wangda Tan commented on YARN-8902:
--

{quote}I prefer not to do this rename. As the package already has 
"resourcemanager", adding extra "rm" before "volume" seems redundant to me. 
What do you think?
{quote}
I don't strongly prefer this name; it's your call here :) 

> Add volume manager that manages CSI volume lifecycle
> 
>
> Key: YARN-8902
> URL: https://issues.apache.org/jira/browse/YARN-8902
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8902.001.patch, YARN-8902.002.patch, 
> YARN-8902.003.patch, YARN-8902.004.patch, YARN-8902.005.patch, 
> YARN-8902.006.patch, YARN-8902.007.patch
>
>
> The CSI volume manager is a service running in RM process, that manages all 
> CSI volumes' lifecycle. The details about volume's lifecycle states can be 
> found in [CSI 
> spec|https://github.com/container-storage-interface/spec/blob/master/spec.md].
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8858) CapacityScheduler should respect maximum node resource when per-queue maximum-allocation is being used.

2018-11-05 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676232#comment-16676232
 ] 

Wangda Tan commented on YARN-8858:
--

Thanks [~cheersyang] / [~ajisakaa] for rebasing and committing the patch!

> CapacityScheduler should respect maximum node resource when per-queue 
> maximum-allocation is being used.
> ---
>
> Key: YARN-8858
> URL: https://issues.apache.org/jira/browse/YARN-8858
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2, 3.3.0, 2.8.6
>
> Attachments: YARN-8858-branch-2.8.001.patch, 
> YARN-8858-branch-2.8.002.patch, YARN-8858.001.patch, YARN-8858.002.patch
>
>
> This issue happens after YARN-8720.
> Before that, AMS uses scheduler.getMaximumAllocation to do the normalization. 
> After that, AMS uses LeafQueue.getMaximumAllocation. The scheduler one uses 
> nodeTracker.getMaximumAllocation, but the LeafQueue.getMaximum doesn't. 
> We should use the scheduler.getMaximumAllocation to cap the per-queue's 
> maximum-allocation every time.
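
A minimal sketch of the capping described above, assuming the standard Resources 
helper; the class and method names are illustrative only, not the actual fix:

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public final class MaxAllocationCap {

  // Cap the per-queue maximum-allocation by the scheduler-wide maximum (which
  // follows nodeTracker.getMaximumAllocation), so a queue can never normalize a
  // request to a container size that no node in the cluster can host.
  static Resource effectiveQueueMaxAllocation(Resource queueMaxAllocation,
      Resource schedulerMaxAllocation) {
    return Resources.componentwiseMin(queueMaxAllocation, schedulerMaxAllocation);
  }
}
{code}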



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8867) Retrieve the status of resource localization

2018-11-05 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675664#comment-16675664
 ] 

Wangda Tan commented on YARN-8867:
--

Thanks [~csingh] for working on this; from a high level I think the patch looks 
good. I may not have the bandwidth to review all the details, so I will let others 
continue reviewing the patch.

Regarding percentage vs. diagnostic, I think diagnostics should be good enough 
for now. Once we can support percentage progress (or absolute-value progress) 
from the backend, we can think more about how to add it to the protocol and 
avoid unnecessary protocol changes.

And one misc: could you please update your IDE preferences to not add import 
...* for packages? We typically discourage that to avoid backport conflicts.

> Retrieve the status of resource localization
> 
>
> Key: YARN-8867
> URL: https://issues.apache.org/jira/browse/YARN-8867
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8867.001.patch, YARN-8867.002.patch, 
> YARN-8867.wip.patch
>
>
> Refer YARN-3854.
> Currently NM does not have an API to retrieve the status of localization. 
> Unless the client can know when the localization of a resource is complete 
> irrespective of the type of the resource, it cannot take any appropriate 
> action. 
> We need an API in {{ContainerManagementProtocol}} to retrieve the status on 
> the localization. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8851) [Umbrella] A new pluggable device plugin framework to ease vendor plugin development

2018-11-05 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675606#comment-16675606
 ] 

Wangda Tan commented on YARN-8851:
--

Thanks [~tangzhankun] , 

1) Regarding NM_PLUGGABLE_DEVICE_FRAMEWORK_PREFER_CUSTOMIZED_SCHEDULER, 
should we just use the default scheduler if the device plugin doesn't provide its 
own customized scheduler? We should assume that a loaded device plugin runs 
"trusted" code, so we may not need to add extra protection here.


2) DeviceSchedulerManager sounds like it "manages schedulers"; however, it 
handles how to map devices to containers, and the scheduler is just an 
implementation detail. How about calling it DeviceMappingManager?
- internalAssignDevices should be private, and it is a bit long; it might be 
better for future maintenance if you can break it down into multiple methods.

I think we could move on to splitting this POC into sub-tasks and getting them 
done piece by piece. It would be helpful if you could highlight the required subtasks.
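
A rough, hypothetical sketch of the device-to-container bookkeeping that the 
DeviceMappingManager name refers to (not the actual patch code; types and method 
names are made up for illustration):

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Tracks which vendor devices are assigned to which container, which is why
// "DeviceMappingManager" describes the role better than "DeviceSchedulerManager".
public class DeviceMappingManager {

  // resource type (e.g. "vendor.com/device") -> device id -> container id
  private final Map<String, Map<Integer, String>> allocations = new HashMap<>();

  public synchronized void assign(String resourceType, int deviceId,
      String containerId) {
    allocations.computeIfAbsent(resourceType, k -> new HashMap<>())
        .put(deviceId, containerId);
  }

  public synchronized Set<Integer> devicesOf(String resourceType,
      String containerId) {
    Set<Integer> result = new HashSet<>();
    allocations.getOrDefault(resourceType, new HashMap<>())
        .forEach((dev, cid) -> {
          if (cid.equals(containerId)) {
            result.add(dev);
          }
        });
    return result;
  }
}
{code}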

> [Umbrella] A new pluggable device plugin framework to ease vendor plugin 
> development
> 
>
> Key: YARN-8851
> URL: https://issues.apache.org/jira/browse/YARN-8851
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8851-WIP2-trunk.001.patch, 
> YARN-8851-WIP3-trunk.001.patch, YARN-8851-WIP4-trunk.001.patch, 
> YARN-8851-WIP5-trunk.001.patch, YARN-8851-WIP6-trunk.001.patch, 
> YARN-8851-WIP7-trunk.001.patch, YARN-8851-WIP8-trunk.001.patch, 
> YARN-8851-WIP9-trunk.001.patch, YARN-8851-trunk.001.patch, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal-3.pdf, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal-4.pdf, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal.pdf
>
>
> At present, we support GPU/FPGA device in YARN through a native, coupling 
> way. But it's difficult for a vendor to implement such a device plugin 
> because the developer needs much knowledge of YARN internals. And this brings 
> burden to the community to maintain both YARN core and vendor-specific code.
> Here we propose a new device plugin framework to ease vendor device plugin 
> development and provide a more flexible way to integrate with YARN NM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8858) CapacityScheduler should respect maximum node resource when per-queue maximum-allocation is being used.

2018-11-05 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675572#comment-16675572
 ] 

Wangda Tan commented on YARN-8858:
--

Triggered Jenkins build to find flaky tests. 

> CapacityScheduler should respect maximum node resource when per-queue 
> maximum-allocation is being used.
> ---
>
> Key: YARN-8858
> URL: https://issues.apache.org/jira/browse/YARN-8858
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sumana Sathish
>Assignee: Wangda Tan
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2, 3.3.0
>
> Attachments: YARN-8858-branch-2.8.001.patch, 
> YARN-8858-branch-2.8.002.patch, YARN-8858.001.patch, YARN-8858.002.patch
>
>
> This issue happens after YARN-8720.
> Before that, AMS uses scheduler.getMaximumAllocation to do the normalization. 
> After that, AMS uses LeafQueue.getMaximumAllocation. The scheduler one uses 
> nodeTracker.getMaximumAllocation, but the LeafQueue.getMaximum doesn't. 
> We should use the scheduler.getMaximumAllocation to cap the per-queue's 
> maximum-allocation every time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8902) Add volume manager that manages CSI volume lifecycle

2018-11-05 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675532#comment-16675532
 ] 

Wangda Tan commented on YARN-8902:
--

Thanks [~cheersyang] , a couple of high-level questions and miscs: 

CsiAdaptorClient is not implemented; does this patch work end to end?

How do we handle a client asking for volumes in every allocate request (let's say 
the same volume id)? What will the expectation be for users: should they expect 
failures for the allocate() call, or will a duplicated volume id simply be ignored?

How do we handle the RM recovery case for volumes? Are we going to recover volume 
states, and do we need to do that?

*Miscs:* 
- org.apache.hadoop.yarn.server.resourcemanager.volume => ..rmvolume?

> Add volume manager that manages CSI volume lifecycle
> 
>
> Key: YARN-8902
> URL: https://issues.apache.org/jira/browse/YARN-8902
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8902.001.patch, YARN-8902.002.patch, 
> YARN-8902.003.patch, YARN-8902.004.patch, YARN-8902.005.patch, 
> YARN-8902.006.patch, YARN-8902.007.patch
>
>
> The CSI volume manager is a service running in RM process, that manages all 
> CSI volumes' lifecycle. The details about volume's lifecycle states can be 
> found in [CSI 
> spec|https://github.com/container-storage-interface/spec/blob/master/spec.md].
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8877) Extend service spec to allow setting resource attributes

2018-11-05 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675496#comment-16675496
 ] 

Wangda Tan commented on YARN-8877:
--

Thanks [~cheersyang], 

In general the patch looks good; could you update the doc as well: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/yarn-service/YarnServiceAPI.md?

> Extend service spec to allow setting resource attributes
> 
>
> Key: YARN-8877
> URL: https://issues.apache.org/jira/browse/YARN-8877
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8877.001.patch
>
>
> Extend yarn native service spec to support setting resource attributes in the 
> spec file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-10-30 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669392#comment-16669392
 ] 

Wangda Tan commented on YARN-8714:
--

[~tangzhankun] , sure, please go ahead.

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Priority: Major
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8944) TestContainerAllocation.testUserLimitAllocationMultipleContainers failure after YARN-8896

2018-10-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665445#comment-16665445
 ] 

Wangda Tan commented on YARN-8944:
--

Thanks [~wilfreds] , Patch LGTM, will commit today.

> TestContainerAllocation.testUserLimitAllocationMultipleContainers failure 
> after YARN-8896
> -
>
> Key: YARN-8944
> URL: https://issues.apache.org/jira/browse/YARN-8944
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: capacity scheduler
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-8944.001.patch
>
>
> YARN-8896 changes the behaviour of the CapacityScheduler by limiting the 
> number of containers that can be allocated in one heartbeat. It is an 
> undocumented change in behaviour.
> The change breaks the junit test: 
> {{TestContainerAllocation.testUserLimitAllocationMultipleContainers}}
> The maximum number of containers that gets assigned per heartbeat is 
> 100, but the test expects 199 to be assigned.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8866) Fix a parsing error for crossdomain.xml

2018-10-26 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665444#comment-16665444
 ] 

Wangda Tan commented on YARN-8866:
--

Thanks [~tasanuma0829], 

Patch LGTM, will commit today.

> Fix a parsing error for crossdomain.xml
> ---
>
> Key: YARN-8866
> URL: https://issues.apache.org/jira/browse/YARN-8866
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: build, yarn-ui-v2
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
> Attachments: YARN-8866.1.patch
>
>
> [QBT|https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/] reports 
> a parsing error for crossdomain.xml in hadoop-yarn-ui.
> {noformat}
> Parsing Error(s): 
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/src/main/webapp/public/crossdomain.xml
>  
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized

2018-10-24 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8513.
--
   Resolution: Duplicate
Fix Version/s: (was: 3.2.1)
   (was: 3.1.2)

Reopening and closing as a dup of YARN-8896.

> CapacityScheduler infinite loop when queue is near fully utilized
> -
>
> Key: YARN-8513
> URL: https://issues.apache.org/jira/browse/YARN-8513
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 3.1.0, 2.9.1
> Environment: Ubuntu 14.04.5 and 16.04.4
> YARN is configured with one label and 5 queues.
>Reporter: Chen Yufei
>Priority: Major
> Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, 
> jstack-5.log, top-during-lock.log, top-when-normal.log, yarn3-jstack1.log, 
> yarn3-jstack2.log, yarn3-jstack3.log, yarn3-jstack4.log, yarn3-jstack5.log, 
> yarn3-resourcemanager.log, yarn3-top
>
>
> ResourceManager does not respond to any request when queue is near fully 
> utilized sometimes. Sending SIGTERM won't stop RM, only SIGKILL can. After RM 
> restart, it can recover running jobs and start accepting new ones.
>  
> Seems like CapacityScheduler is in an infinite loop printing out the 
> following log messages (more than 25,000 lines in a second):
>  
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.99816763 
> absoluteUsedCapacity=0.99816763 used= 
> cluster=}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1530619767030_1652_01 
> container=null 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
>  clusterResource= type=NODE_LOCAL 
> requestedPartition=}}
>  
> I encounter this problem several times after upgrading to YARN 2.9.1, while 
> the same configuration works fine under version 2.7.3.
>  
> YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a 
> similar problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized

2018-10-24 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan reopened YARN-8513:
--

> CapacityScheduler infinite loop when queue is near fully utilized
> -
>
> Key: YARN-8513
> URL: https://issues.apache.org/jira/browse/YARN-8513
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 3.1.0, 2.9.1
> Environment: Ubuntu 14.04.5 and 16.04.4
> YARN is configured with one label and 5 queues.
>Reporter: Chen Yufei
>Priority: Major
> Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, 
> jstack-5.log, top-during-lock.log, top-when-normal.log, yarn3-jstack1.log, 
> yarn3-jstack2.log, yarn3-jstack3.log, yarn3-jstack4.log, yarn3-jstack5.log, 
> yarn3-resourcemanager.log, yarn3-top
>
>
> ResourceManager does not respond to any request when queue is near fully 
> utilized sometimes. Sending SIGTERM won't stop RM, only SIGKILL can. After RM 
> restart, it can recover running jobs and start accepting new ones.
>  
> Seems like CapacityScheduler is in an infinite loop printing out the 
> following log messages (more than 25,000 lines in a second):
>  
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.99816763 
> absoluteUsedCapacity=0.99816763 used= 
> cluster=}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1530619767030_1652_01 
> container=null 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
>  clusterResource= type=NODE_LOCAL 
> requestedPartition=}}
>  
> I encounter this problem several times after upgrading to YARN 2.9.1, while 
> the same configuration works fine under version 2.7.3.
>  
> YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a 
> similar problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8895) Improve YARN Error diagnostics

2018-10-24 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662914#comment-16662914
 ] 

Wangda Tan commented on YARN-8895:
--

[~youchen], thanks, I believe this will be a very useful Jira. My questions:

1) How do we keep it compatible? Typically we don't consider changing the "diagnostics" 
field an incompatible change; however, if everything changes, I would prefer 
to have a new field. 

2) What changes are required to produce the proposed structured message?

> Improve YARN  Error diagnostics
> ---
>
> Key: YARN-8895
> URL: https://issues.apache.org/jira/browse/YARN-8895
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Young Chen
>Assignee: Young Chen
>Priority: Minor
>
> Currently identifying error sources can be quite difficult, as they are 
> written into an unstructured string "diagnostics" field. This is present in 
> container statuses returned to the RM and in application attempts in the RM. 
> These errors are difficult to classify without hard-coding diagnostic string 
> searches.
> This Jira aims to add a structured error field in NM and RM that preserves 
> failure information and source component to enable faster and clearer error 
> diagnosis
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8927) Better handling of "docker.trusted.registries" in container-executor's "trusted_image_check" function

2018-10-23 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16660904#comment-16660904
 ] 

Wangda Tan commented on YARN-8927:
--

[~tangzhankun], thanks for filing the Jira. I encountered the issue before and 
couldn't figure out why. 

It would be worthwhile to add a description of how to use this field to 
container-executor.cfg. We use it as self-explaining documentation for c-e.cfg.

> Better handling of "docker.trusted.registries" in container-executor's 
> "trusted_image_check" function
> -
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if we run DistributedShell with "tangzhankun/tensorflow"
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But running a DistributedShell job with "centos", "centos[:tagName]", "ubuntu" 
> and "ubuntu[:tagName]" fails:
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need better handling the above cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8851) [Umbrella] A new pluggable device plugin framework to ease vendor plugin development

2018-10-22 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659747#comment-16659747
 ] 

Wangda Tan commented on YARN-8851:
--

[~tangzhankun],

Thanks for updating the patch, the latest patch looks much better now.

One suggestion: 
 * The DevicePluginAdapter extends/implements 4 interfaces. Instead of doing 
that, is it possible to just make the Adapter implement the ResourcePlugin 
interface, and add several "sub-adapters" to implement ResourceHandler, 
DockerCommandPlugin, and NMResourceUpdaterPlugin? By doing this, we get a 
more granular interface definition that stays much closer to the ResourcePlugin 
interface, so fewer changes to the integration code are required. (A rough sketch 
follows below.)
 * I understand most of the DevicePluginAdapter logic should be similar to the 
GPUResourcePlugin implementation, but some parts will come from 
DeviceRuntimeSpec. It would help to have a more concrete implementation to see 
whether our APIs are properly designed.

And I haven't dug into the details of code logic / naming, etc., while we're trying 
to sort out the overall code structure.
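
A rough, self-contained sketch of the sub-adapter idea (all interface and class shapes below are simplified assumptions for illustration only, not the actual YARN interfaces or this patch):
{code:java}
// Simplified stand-ins for the real interfaces, just to show the delegation layout.
interface DevicePlugin { /* vendor-implemented discovery/allocation hooks */ }

interface ResourceHandler {
  void preStart(String containerId);
  void postComplete(String containerId);
}

interface DockerCommandPlugin {
  void updateDockerRunCommand(String containerId);
}

interface ResourcePlugin {
  ResourceHandler getResourceHandler();
  DockerCommandPlugin getDockerCommandPlugin();
}

// Focused sub-adapters, each wrapping the vendor plugin for one concern.
class DeviceResourceHandler implements ResourceHandler {
  private final DevicePlugin plugin;
  DeviceResourceHandler(DevicePlugin plugin) { this.plugin = plugin; }
  public void preStart(String containerId) { /* allocate devices via the plugin */ }
  public void postComplete(String containerId) { /* release devices via the plugin */ }
}

class DeviceDockerCommandPlugin implements DockerCommandPlugin {
  private final DevicePlugin plugin;
  DeviceDockerCommandPlugin(DevicePlugin plugin) { this.plugin = plugin; }
  public void updateDockerRunCommand(String containerId) { /* mount device paths, etc. */ }
}

// The adapter itself only implements ResourcePlugin and delegates to the sub-adapters.
class DevicePluginAdapter implements ResourcePlugin {
  private final DeviceResourceHandler resourceHandler;
  private final DeviceDockerCommandPlugin dockerCommandPlugin;

  DevicePluginAdapter(DevicePlugin plugin) {
    this.resourceHandler = new DeviceResourceHandler(plugin);
    this.dockerCommandPlugin = new DeviceDockerCommandPlugin(plugin);
  }

  public ResourceHandler getResourceHandler() { return resourceHandler; }
  public DockerCommandPlugin getDockerCommandPlugin() { return dockerCommandPlugin; }
}
{code}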

> [Umbrella] A new pluggable device plugin framework to ease vendor plugin 
> development
> 
>
> Key: YARN-8851
> URL: https://issues.apache.org/jira/browse/YARN-8851
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8851-WIP2-trunk.001.patch, 
> YARN-8851-WIP3-trunk.001.patch, YARN-8851-WIP4-trunk.001.patch, 
> YARN-8851-WIP5-trunk.001.patch, YARN-8851-WIP6-trunk.001.patch, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal-3.pdf, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal.pdf
>
>
> At present, we support GPU/FPGA devices in YARN in a native, tightly coupled 
> way. But it's difficult for a vendor to implement such a device plugin 
> because the developer needs deep knowledge of YARN internals. And this brings a 
> burden to the community to maintain both YARN core and vendor-specific code.
> Here we propose a new device plugin framework to ease vendor device plugin 
> development and provide a more flexible way to integrate with the YARN NM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8918) [Submarine] Correct method usage of str.subString in CliUtils

2018-10-22 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659725#comment-16659725
 ] 

Wangda Tan commented on YARN-8918:
--

+1, thanks [~tangzhankun].

> [Submarine] Correct method usage of str.subString in CliUtils
> -
>
> Key: YARN-8918
> URL: https://issues.apache.org/jira/browse/YARN-8918
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Minor
> Attachments: YARN-8918-trunk.001.patch, YARN-8918-trunk.002.patch, 
> YARN-8918-trunk.003.patch
>
>
> In CliUtils.java (line 74), there's an incorrect code block:
> {code:java}
> if (resourcesStr.endsWith("]")) {
>  resourcesStr = resourcesStr.substring(0, resourcesStr.length());
> }{code}
> The above if block will just execute "resourceStr = resourceStr". It should be 
> "length() - 1".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8924) Refine the document or code related to legacy CPU isolation/enforcement

2018-10-22 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659723#comment-16659723
 ] 

Wangda Tan commented on YARN-8924:
--

Thanks [~tangzhankun] for filing the Jira and putting together the analysis. 
[~shaneku...@gmail.com], I remember there were some discussions about deprecating 
the old handler. Could you suggest what we should do here?

> Refine the document or code related to legacy CPU isolation/enforcement
> ---
>
> Key: YARN-8924
> URL: https://issues.apache.org/jira/browse/YARN-8924
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Minor
>
> This is to re-think the legacy configuration/code of CPU resource isolation. 
> In YARN-3542, we introduced _CGroupsCpuResourceHandlerImpl_ based on the new 
> _ResourceHandler_ mechanism but left the configuration 
> "yarn.nodemanager.linux-container-executor.resources-handler.class" in place for 
> a long time. Now it seems confusing to the end user.
> See YARN-6729: one sets "_DefaultLCEResourcesHandler_" and finds that giving 
> "percentage-physical-cpu-limit" a value less than "100" doesn't work.
> As far as I know, internally, _CgroupsLCEResourcesHandler_ and 
> _DefaultLCEResourcesHandler_ are both deprecated. YARN won't use them anymore.
> Instead, YARN uses _CGroupsCpuResourceHandlerImpl_ to do CPU isolation, and 
> only in LCE. If we want to enforce CPU usage, we must set LCE and 
> CgroupsLCEResourcesHandler like this:
> {noformat}
> <property>
>   <description>Who will execute(launch) the containers.</description>
>   <name>yarn.nodemanager.container-executor.class</name>
>   <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
> </property>
> <property>
>   <description>The class which should help the LCE handle resources.</description>
>   <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
>   <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
> </property>
> {noformat}
> Only with the above settings do CPU-related settings like 
> "percentage-physical-cpu-limit" work as expected.
> To avoid confusion like YARN-6729, we can do two things:
>  # Clearer documentation in "NodeManagerCgroups.md" about how users should configure CPU 
> isolation/enforcement
>  # Make "ResourceHandlerModuler" stable, remove the legacy code, and update the 
> document to recommend the new setting "yarn.nodemanager.resource.cpu.enabled"
> Thoughts? [~leftnoteasy], [~vinodkv], [~vvasudev]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6167) RM option to delegate NM loss container action to AM

2018-10-22 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659719#comment-16659719
 ] 

Wangda Tan commented on YARN-6167:
--

Thanks [~billie.rinaldi],

1) Inside releaseContainers, why add following if? 
{code} 
  } else if (amHandlesNMLoss(rmApp)) {
LOG.debug("Adding " + containerId + " to the release request cache.");
attempt.getPendingRelease().add(containerId);
  }
{code} 

2) Rename amHandlesNMLoss => amSkipContainerKillWhenNMLoss. To make it more 
specific. 

3) Why changes of RMContainerImpl required?
{code} 
.addTransition(RMContainerState.RUNNING, RMContainerState.RUNNING,
RMContainerEventType.LAUNCHED)
{code}

4) Inside {{AbstractYarnScheduler}}, it invokes {{clearPendingContainerCache}}. 
For our case, we may have to skip cleaning containers which come from a lost NM. 
I feel this makes things more complicated. Instead of handling "pendingRelease" 
for such containers, can we let the AM handle it once the NM comes back to a normal 
state? The AM should be notified after that.

> RM option to delegate NM loss container action to AM
> 
>
> Key: YARN-6167
> URL: https://issues.apache.org/jira/browse/YARN-6167
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: Billie Rinaldi
>Assignee: Billie Rinaldi
>Priority: Major
> Attachments: YARN-6167.01.patch
>
>
> Currently, if the RM times out an NM, the scheduler will kill all containers 
> that were running on the NM. For some applications, in the event of a 
> temporary NM outage, it might be better to delegate to the AM the decision 
> whether to kill the containers and request new containers from the RM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8920) LogAggregation should be configurable to allow writing to underlying storage as appOwner or yarn user

2018-10-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657413#comment-16657413
 ] 

Wangda Tan commented on YARN-8920:
--

Thanks [~suma.shivaprasad],

1) Inside YarnConfiguration: 
We should add a default field inside YarnConfiguration, as well as an entry in 
yarn-default.xml, for the new key. 
We should avoid using {{conf.get(key, true)}}, for easier maintenance (instead 
you can use conf.get(key, DEFAULT_KEY_VALUE)). 
You can also add the necessary documentation (via description) to 
yarn-default.xml for documentation purposes. A rough sketch of what this looks like is below.
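
For illustration, a minimal sketch of point 1 (the key and default names below are placeholders, not the actual configuration introduced by the patch):
{code:java}
// Hypothetical names for illustration only.
public static final String NM_LOG_AGG_WRITE_AS_APP_OWNER =
    NM_PREFIX + "log-aggregation.write-as-app-owner";
public static final boolean DEFAULT_NM_LOG_AGG_WRITE_AS_APP_OWNER = true;

// Callers then read the value with an explicit default constant
// instead of an inline literal:
boolean writeAsAppOwner = conf.getBoolean(
    NM_LOG_AGG_WRITE_AS_APP_OWNER, DEFAULT_NM_LOG_AGG_WRITE_AS_APP_OWNER);
{code}
The same key would also get an entry (with a description) in yarn-default.xml.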

2) LogAggregationIndexedFileController#initializeWriterForApp: 
I see that the user is now set to yarn when the config is set to false.
{{indexedLogsMeta.setUser(ugi.getShortUserName())}}

I'm concerned about this, since the log still belongs to the user when we 
come to view the UI, etc., but the file is read/written with the YARN user's 
credentials. We should still separate the two.

> LogAggregation should be configurable to allow writing to underlying storage 
> as appOwner or yarn user
> -
>
> Key: YARN-8920
> URL: https://issues.apache.org/jira/browse/YARN-8920
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation, yarn
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Attachments: YARN-8920.1.patch, YARN-8920.2.patch
>
>
> Currently NM Log Aggregation does not support writing to the underlying storage 
> as the "yarn" user. This would be needed when writing to storage like S3, which does 
> not support POSIX-compliant ACLs; a single access key would be used for 
> writes, and app owners would be allowed to read the logs with their own access 
> keys.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8917) Absolute (maximum) capacity of level3+ queues is wrongly calculated for absolute resource

2018-10-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657113#comment-16657113
 ] 

Wangda Tan commented on YARN-8917:
--

Nice catch [~Tao Yang]! Fix makes sense to me.

[~sunilg], this happens only under the absolute resource configuration code 
path; when the cluster resource changes, the percentage-based capacity needs to be 
updated as well.
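
For reference, the intended relationship, sketched with the numbers from the description below (variable names are illustrative only):
{code:java}
// Correct: multiply by the parent's absolute (maximum) capacity.
float absoluteCapacity    = 0.5f * 0.4f; // capacity * parent absolute capacity = 0.2
float absoluteMaxCapacity = 0.6f * 0.8f; // max-capacity * parent absolute max  = 0.48
// Buggy: dividing instead gives 0.5 / 0.4 = 1.25 and 0.6 / 0.8 = 0.75.
{code}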

> Absolute (maximum) capacity of level3+ queues is wrongly calculated for 
> absolute resource
> -
>
> Key: YARN-8917
> URL: https://issues.apache.org/jira/browse/YARN-8917
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.2.1
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-8917.001.patch
>
>
> Absolute capacity should be equal to the queue's capacity multiplied by the parent queue's 
> absolute-capacity,
> but currently it's calculated by dividing the capacity by the parent queue's 
> absolute-capacity.
> Calculation for absolute-maximum-capacity has the same problem.
> For example: 
> root.a   capacity=0.4   maximum-capacity=0.8
> root.a.a1   capacity=0.5  maximum-capacity=0.6
> Absolute capacity of root.a.a1 should be 0.2 but is wrongly calculated as 1.25
> Absolute maximum capacity of root.a.a1 should be 0.48 but is wrongly 
> calculated as 0.75
> Moreover:
> {{childQueue.getQueueCapacities().getCapacity()}}  should be changed to 
> {{childQueue.getQueueCapacities().getCapacity(label)}} to avoid getting wrong 
> capacity from default partition when calculating for a non-default partition.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8916) Define a constant "docker" string in "ContainerRuntimeConstants.java" for better maintainability

2018-10-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8916:
-
Fix Version/s: 3.3.0

> Define a constant "docker" string in "ContainerRuntimeConstants.java" for 
> better maintainability
> 
>
> Key: YARN-8916
> URL: https://issues.apache.org/jira/browse/YARN-8916
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Minor
> Fix For: 3.3.0, 3.2.1
>
> Attachments: YARN-8916-trunk.001.patch
>
>
> There are several hard-coded "docker" strings. It's better to use a 
> constant string in "ContainerRuntimeConstants" to make this container type 
> easy to use.
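
A minimal sketch of the idea (the constant name here is an assumption for illustration, not necessarily what the patch defines):
{code:java}
// Simplified stand-in for the existing ContainerRuntimeConstants class;
// only the new constant matters here.
public final class ContainerRuntimeConstants {
  // Single definition of the runtime type string, replacing scattered "docker" literals.
  public static final String CONTAINER_RUNTIME_DOCKER = "docker";

  private ContainerRuntimeConstants() { }
}
{code}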



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8918) [Submarine] Remove redundant method of str.subString(0, str.length())

2018-10-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8918:
-
Fix Version/s: (was: 3.2.1)
   (was: 3.3.0)
   (was: 3.1.2)

> [Submarine] Remove redundant method of str.subString(0, str.length())
> -
>
> Key: YARN-8918
> URL: https://issues.apache.org/jira/browse/YARN-8918
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Minor
> Attachments: YARN-8918-trunk.001.patch
>
>
> In CliUtils.java (line 74), there's a redundant code block that can be removed:
> {code:java}
> if (resourcesStr.endsWith("]")) {
>  resourcesStr = resourcesStr.substring(0, resourcesStr.length());
> }{code}
> Above if block will execute "resourceStr = resourceStr".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8918) [Submarine] Remove redundant method of str.subString(0, str.length())

2018-10-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8918:
-
Fix Version/s: 3.2.1
   3.3.0
   3.1.2

> [Submarine] Remove redundant method of str.subString(0, str.length())
> -
>
> Key: YARN-8918
> URL: https://issues.apache.org/jira/browse/YARN-8918
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Minor
> Attachments: YARN-8918-trunk.001.patch
>
>
> In CliUtils.java (line 74), there's a redundant code block that can be removed:
> {code:java}
> if (resourcesStr.endsWith("]")) {
>  resourcesStr = resourcesStr.substring(0, resourcesStr.length());
> }{code}
> Above if block will execute "resourceStr = resourceStr".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8908) Fix errors in yarn-default.xml related to GPU/FPGA

2018-10-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8908:
-
Fix Version/s: 3.2.1
   3.3.0
   3.1.2

> Fix errors in yarn-default.xml related to GPU/FPGA
> --
>
> Key: YARN-8908
> URL: https://issues.apache.org/jira/browse/YARN-8908
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.1.2, 3.3.0, 3.2.1
>
> Attachments: YARN-8908-trunk.001.patch, YARN-8908-trunk.002.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8916) Define a constant "docker" string in "ContainerRuntimeConstants.java" for better maintainability

2018-10-19 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8916:
-
Fix Version/s: (was: 3.2.0)
   3.2.1

> Define a constant "docker" string in "ContainerRuntimeConstants.java" for 
> better maintainability
> 
>
> Key: YARN-8916
> URL: https://issues.apache.org/jira/browse/YARN-8916
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Minor
> Fix For: 3.3.0, 3.2.1
>
> Attachments: YARN-8916-trunk.001.patch
>
>
> There are several hard-coded "docker" strings. It's better to use a 
> constant string in "ContainerRuntimeConstants" to make this container type 
> easy to use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8916) Define a constant "docker" string in "ContainerRuntimeConstants.java" for better maintainability

2018-10-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657057#comment-16657057
 ] 

Wangda Tan commented on YARN-8916:
--

+1, thanks [~tangzhankun].

> Define a constant "docker" string in "ContainerRuntimeConstants.java" for 
> better maintainability
> 
>
> Key: YARN-8916
> URL: https://issues.apache.org/jira/browse/YARN-8916
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Minor
> Attachments: YARN-8916-trunk.001.patch
>
>
> There are several hard-coded "docker" strings. It's better to use a 
> constant string in "ContainerRuntimeConstants" to make this container type 
> easy to use.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8908) Fix errors in yarn-default.xml related to GPU/FPGA

2018-10-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657005#comment-16657005
 ] 

Wangda Tan commented on YARN-8908:
--

+1, thanks [~tangzhankun]. 

> Fix errors in yarn-default.xml related to GPU/FPGA
> --
>
> Key: YARN-8908
> URL: https://issues.apache.org/jira/browse/YARN-8908
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8908-trunk.001.patch, YARN-8908-trunk.002.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8918) [Submarine] Remove redundant method of str.subString(0, str.length())

2018-10-19 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657004#comment-16657004
 ] 

Wangda Tan commented on YARN-8918:
--

[~tangzhankun], 

I think the correct logic should be: 
{code:java}
if (resourcesStr.endsWith("]")) {
  resourcesStr = resourcesStr.substring(0, resourcesStr.length() - 1);
}{code}
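
For example (with an illustrative input, not taken from the patch):
{code:java}
String resourcesStr = "memory=2G,vcores=2]";
// Original code: substring(0, length()) returns the string unchanged, so the
// trailing "]" is never stripped.
String unchanged = resourcesStr.substring(0, resourcesStr.length());     // "memory=2G,vcores=2]"
// Corrected code: substring(0, length() - 1) drops the trailing "]".
String stripped  = resourcesStr.substring(0, resourcesStr.length() - 1); // "memory=2G,vcores=2"
{code}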

> [Submarine] Remove redundant method of str.subString(0, str.length())
> -
>
> Key: YARN-8918
> URL: https://issues.apache.org/jira/browse/YARN-8918
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Minor
> Attachments: YARN-8918-trunk.001.patch
>
>
> In CliUtils.java (line 74), there's a redundant code block that can be removed:
> {code:java}
> if (resourcesStr.endsWith("]")) {
>  resourcesStr = resourcesStr.substring(0, resourcesStr.length());
> }{code}
> Above if block will execute "resourceStr = resourceStr".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6098) Add documentation for Delete Queue

2018-10-18 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655793#comment-16655793
 ] 

Wangda Tan commented on YARN-6098:
--

Backported to branch-3.1 as well.

> Add documentation for Delete Queue
> --
>
> Key: YARN-6098
> URL: https://issues.apache.org/jira/browse/YARN-6098
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, documentation
>Reporter: Naganarasimha G R
>Assignee: Suma Shivaprasad
>Priority: Major
> Fix For: 3.1.2, 3.2.1
>
> Attachments: YARN-6098.1.patch
>
>
> As per the discussion in YARN-5556, we need to document steps for  deleting a 
> queue.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6098) Add documentation for Delete Queue

2018-10-18 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-6098:
-
Fix Version/s: 3.1.2

> Add documentation for Delete Queue
> --
>
> Key: YARN-6098
> URL: https://issues.apache.org/jira/browse/YARN-6098
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, documentation
>Reporter: Naganarasimha G R
>Assignee: Suma Shivaprasad
>Priority: Major
> Fix For: 3.1.2, 3.2.1
>
> Attachments: YARN-6098.1.patch
>
>
> As per the discussion in YARN-5556, we need to document steps for  deleting a 
> queue.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8896) Limit the maximum number of container assignments per heartbeat

2018-10-18 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655790#comment-16655790
 ] 

Wangda Tan commented on YARN-8896:
--

Committed to trunk/branch-3.1/branch-3.2.

> Limit the maximum number of container assignments per heartbeat
> ---
>
> Key: YARN-8896
> URL: https://issues.apache.org/jira/browse/YARN-8896
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Weiwei Yang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.1.2, 3.2.1
>
> Attachments: YARN-8896-trunk.001.patch
>
>
> YARN-4161 adds a configuration 
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
> to control the max number of container assignments per heartbeat; however, the 
> default value is -1. This could potentially cause the CS to get stuck in the 
> while loop, causing issues like YARN-8513. We should change this to a finite 
> number, e.g. 100.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8896) Limit the maximum number of container assignments per heartbeat

2018-10-18 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8896:
-
Fix Version/s: 3.2.1
   3.1.2

> Limit the maximum number of container assignments per heartbeat
> ---
>
> Key: YARN-8896
> URL: https://issues.apache.org/jira/browse/YARN-8896
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Weiwei Yang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.1.2, 3.2.1
>
> Attachments: YARN-8896-trunk.001.patch
>
>
> YARN-4161 adds a configuration 
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
> to control the max number of container assignments per heartbeat; however, the 
> default value is -1. This could potentially cause the CS to get stuck in the 
> while loop, causing issues like YARN-8513. We should change this to a finite 
> number, e.g. 100.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8896) Limit the maximum number of container assignments per heartbeat

2018-10-18 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan resolved YARN-8896.
--
Resolution: Fixed

> Limit the maximum number of container assignments per heartbeat
> ---
>
> Key: YARN-8896
> URL: https://issues.apache.org/jira/browse/YARN-8896
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Weiwei Yang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.1.2, 3.2.1
>
> Attachments: YARN-8896-trunk.001.patch
>
>
> YARN-4161 adds a configuration 
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
> to control the max number of container assignments per heartbeat; however, the 
> default value is -1. This could potentially cause the CS to get stuck in the 
> while loop, causing issues like YARN-8513. We should change this to a finite 
> number, e.g. 100.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-18 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655670#comment-16655670
 ] 

Wangda Tan commented on YARN-8489:
--

[~billie.rinaldi]/[~eyang], 

Suggestions make sense to me. I will +1 to 
{{yarn.service.container-state-report-as-service-state}}.

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated. And NEVER 
> means that if all components terminate, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But in simple terms, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master goes to 
> a final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8456) Fix a configuration handling bug when user leave FPGA discover executable path configuration default but set OpenCL SDK path environment variable

2018-10-18 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655664#comment-16655664
 ] 

Wangda Tan commented on YARN-8456:
--

+1, thanks [~tangzhankun]. 

> Fix a configuration handling bug when user leave FPGA discover executable 
> path configuration default but set OpenCL SDK path environment variable
> -
>
> Key: YARN-8456
> URL: https://issues.apache.org/jira/browse/YARN-8456
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8456-trunk.001.patch, YARN-8456-trunk.002.patch
>
>
> *Issue:*
>  When the user doesn't configure 
> "yarn.nodemanager.resource-plugins.fpga.path-to-discovery-executables" in 
> yarn-site.xml and have "ALTERAOCLSDKROOT" environment variable set, the FPGA 
> discoverer cannot find the correct executable path (with 
> IntelFPGAOpenclPlugin).
> *Reason:*
> In IntelFPGAOpenclPlugin, the current code builds a wrong path string after 
> getting the environment variable value. It should append "/bin/<binary name>"; otherwise the FPGA resource discovery fails.
>  
> *Solution:*
> Fix the path construction code in IntelFPGAOpenclPlugin.
>  
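
A rough sketch of the kind of fix described above (method and variable names are assumptions for illustration, not the actual IntelFPGAOpenclPlugin code):
{code:java}
// Hypothetical helper: derive the discovery executable path from the SDK root.
static String getDiscoveryExecutable(String configuredPath, String binaryName) {
  if (configuredPath != null && !configuredPath.isEmpty()) {
    return configuredPath;                       // explicit configuration wins
  }
  String sdkRoot = System.getenv("ALTERAOCLSDKROOT");
  if (sdkRoot != null && !sdkRoot.isEmpty()) {
    // Append "/bin/<binary name>" to the SDK root instead of using the root directly.
    return sdkRoot + "/bin/" + binaryName;
  }
  return null;                                   // discovery fails and should be reported
}
{code}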



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8870) [Submarine] Add submarine installation scripts

2018-10-18 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8870:
-
Fix Version/s: 3.2.0

> [Submarine] Add submarine installation scripts
> --
>
> Key: YARN-8870
> URL: https://issues.apache.org/jira/browse/YARN-8870
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xun Liu
>Assignee: Xun Liu
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: YARN-8870.001.patch, YARN-8870.004.patch, 
> YARN-8870.005.patch, YARN-8870.006.patch, YARN-8870.007.patch
>
>
> In order to reduce the difficulty of deploying the Hadoop {Submarine} runtime 
> environment (DNS, Docker, GPU, network, graphics card, operating system kernel 
> modifications, and other components), I developed this installation script. It 
> provides one-click installation and can also 
> be used to install, uninstall, start, and stop individual components step by 
> step.
>  
> Design document: 
> [https://docs.google.com/document/d/1muCTGFuUXUvM4JaDYjKqX5liQEg-AsNgkxfLMIFxYHU/edit?usp=sharing]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8896) Limit the maximum number of container assignments per heartbeat

2018-10-18 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655577#comment-16655577
 ] 

Wangda Tan commented on YARN-8896:
--

+1, patch LGTM, thanks [~tangzhankun]. 

> Limit the maximum number of container assignments per heartbeat
> ---
>
> Key: YARN-8896
> URL: https://issues.apache.org/jira/browse/YARN-8896
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.9.0, 3.0.0
>Reporter: Weiwei Yang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8896-trunk.001.patch
>
>
> YARN-4161 adds a configuration 
> {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
> to control the max number of container assignments per heartbeat; however, the 
> default value is -1. This could potentially cause the CS to get stuck in the 
> while loop, causing issues like YARN-8513. We should change this to a finite 
> number, e.g. 100.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8908) Fix errors in yarn-default.xml related to GPU/FPGA

2018-10-18 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655575#comment-16655575
 ] 

Wangda Tan commented on YARN-8908:
--

+1, patch LGTM. 

Thanks [~tangzhankun].

> Fix errors in yarn-default.xml related to GPU/FPGA
> --
>
> Key: YARN-8908
> URL: https://issues.apache.org/jira/browse/YARN-8908
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8908-trunk.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8851) [Umbrella] A new pluggable device plugin framework to ease vendor plugin development

2018-10-17 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654192#comment-16654192
 ] 

Wangda Tan commented on YARN-8851:
--

Thanks [~tangzhankun], mostly high-level comments. Item #6 will be the most 
important and fundamental part of the feature. 

1) Regarding to version compatibility:
{code:java}
 
 // Check version for compatibility
 String pluginVersion = request.getVersion();
 if (!isVersionCompatible(pluginVersion)) {
 LOG.error("Class: " + pluginClassName + " version: " + pluginVersion +
 " is not compatible. Expected: " + DeviceConstants.version);
 }
{code}
What's the use case for this? My understanding is that the version match should happen 
when requests come to the NM. And I'm not sure it is the best idea to limit the 
format of the version; maybe we should just treat it as an identifier in addition 
to the name?

2) Instead of adding two configs:
{code:java}
 @Private
 public static final String 
NM_RESOURCE_PLUGINS_ENABLE_PLUGGABLE_DEVICE_FRAMEWORK =
 NM_RESOURCE_PLUGINS + ".pluggable-device-framework.enable";


 @Private
 public static final String NM_RESOURCE_PLUGINS_PLUGGABLE_CLASS =
 NM_RESOURCE_PLUGINS + ".pluggable-class";
{code}
Maybe leaving the pluggable-class is sufficient?

3) Set getAndWatch(), 
 I'm not sure what the "Watch" means. Should it be just getDevices?

4) It looks like you try to make DevicePlugin agnostic to Container itself, 
maybe we should change the name:
 preLaunchContainer => allocateDevices 
 postCompleteContainer => releaseDevices?

5) DeviceRuntimeSpec is empty, what you plan to add?

6) The purpose of {{DevicePluginAdapter}} is to handle all resource plugins; 
however, the DevicePlugin interface and DevicePluginAdapter do not quite 
match. It is very likely that we will need customized logic for 
DevicePluginAdapter, such as how to manipulate the Docker command, which could be quite 
different for GPU and FPGA. So instead of only making a pluggable interface for 
DevicePlugin itself, should we use the Factory pattern to make all required 
interfaces pluggable?

What I meant is,
 Change:
{code:java}
.pluggable-class
{code}
To {{.pluggable-factory-class}}. And the device provider should provide a factory 
method which can return {{DevicePluginAdapter}} and {{DevicePlugin}} instances.

I also feel it would be better to make the scheduler part of the 
factory, given that how to allocate resources could differ between 
devices.

So the Factory interface could have the following methods.
{code:java}
 
DevicePluginFactory {
 DevicePlugin getDevicePlugin();
 DevicePluginAdapter getDevicePluginAdapter();
 DevicePluginScheduler getDevicePluginScheduler();

}
{code}
Or, if you think DevicePlugin/DevicePluginScheduler should be internal 
implementation details of getDevicePluginAdapter, we can leave only 
getDevicePluginAdapter, and maybe rename it to getDevicePlugin().

And I think it is fine to leave a common implementation for the PluginAdapter 
which exists inside the NM, but the DevicePlugin interface should be at least close 
to the PluginAdapter interface; otherwise it is very hard to bridge the two 
interfaces.

> [Umbrella] A new pluggable device plugin framework to ease vendor plugin 
> development
> 
>
> Key: YARN-8851
> URL: https://issues.apache.org/jira/browse/YARN-8851
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8851-WIP2-trunk.001.patch, 
> YARN-8851-WIP3-trunk.001.patch, YARN-8851-WIP4-trunk.001.patch, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal-3.pdf, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal.pdf
>
>
> At present, we support GPU/FPGA devices in YARN in a native, tightly coupled 
> way. But it's difficult for a vendor to implement such a device plugin 
> because the developer needs deep knowledge of YARN internals. And this brings a 
> burden to the community to maintain both YARN core and vendor-specific code.
> Here we propose a new device plugin framework to ease vendor device plugin 
> development and provide a more flexible way to integrate with the YARN NM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652717#comment-16652717
 ] 

Wangda Tan commented on YARN-8489:
--

[~eyang], 
{quote} A safer approach to enable this logic is to have a boolean flag in 
component level to indicate "report_as_service_state":true. This requires no 
alteration to state transition logic, but add a check in the end.
{quote}
Yeah, I think you got the right point: we don't need to change the state transition 
logic, we just need an additional check there. I'm fine with adding the 
additional "report_as_service_state" flag instead of dominant. 

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated. And NEVER 
> means that if all components terminate, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But in simple terms, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master goes to 
> a final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652645#comment-16652645
 ] 

Wangda Tan commented on YARN-8513:
--

Sounds like a plan; setting the default value to 100 may make more sense. Thanks 
[~cheersyang]

> CapacityScheduler infinite loop when queue is near fully utilized
> -
>
> Key: YARN-8513
> URL: https://issues.apache.org/jira/browse/YARN-8513
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 3.1.0, 2.9.1
> Environment: Ubuntu 14.04.5 and 16.04.4
> YARN is configured with one label and 5 queues.
>Reporter: Chen Yufei
>Priority: Major
> Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, 
> jstack-5.log, top-during-lock.log, top-when-normal.log, yarn3-jstack1.log, 
> yarn3-jstack2.log, yarn3-jstack3.log, yarn3-jstack4.log, yarn3-jstack5.log, 
> yarn3-resourcemanager.log, yarn3-top
>
>
> ResourceManager does not respond to any request when queue is near fully 
> utilized sometimes. Sending SIGTERM won't stop RM, only SIGKILL can. After RM 
> restart, it can recover running jobs and start accepting new ones.
>  
> Seems like CapacityScheduler is in an infinite loop printing out the 
> following log messages (more than 25,000 lines in a second):
>  
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.99816763 
> absoluteUsedCapacity=0.99816763 used= 
> cluster=}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1530619767030_1652_01 
> container=null 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
>  clusterResource= type=NODE_LOCAL 
> requestedPartition=}}
>  
> I encounter this problem several times after upgrading to YARN 2.9.1, while 
> the same configuration works fine under version 2.7.3.
>  
> YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a 
> similar problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652606#comment-16652606
 ] 

Wangda Tan commented on YARN-8489:
--

[~eyang],

let me try to answer your questions: 
{quote}Data scientist specify the cluster spec in notebook, parameter server 
partitions the models and tasks to increase workers effectiveness.
{quote}
Actually people want to avoid using PS as much as possible in TF, given the poor 
performance of grpc and the overhead of network communication. However, because it 
is the only solution for distributed TF now, people will use it when needed. 
Compared to standalone TF, distributed TF has a much smaller user base. 

What I'm thinking now is that only the final status of the dominant component (not 
the component instance) will impact the service's state. Regarding your questions: 

bq. For example, what happen if during upgrade the dominant component is 
offline. Should the service terminate and clean up?
No, if the dominant component is not in a final state yet. Upgrading is not considered 
a final state.

bq. How about flex dominant component to lesser nodes?
Flexing is not a final state, so it will not be impacted by the patch. 

bq. What is the order to evaluate dominant component and component dependencies?
No additional evaluations are needed; once the dominant component has succeeded / failed, 
the service master will finalize the service.  

bq.  How to handle restart policy in place of dominant component?
If it is never, the dominant field will be ignored. Otherwise the dominant field is 
allowed. 

Hope this explanation makes the scope clear. Here is the logic for how the 
dominant component affects the state of the service: 

{code} 
Component.state: 
- Transition to SUCCEEDED && component.dominant == true: Set service state to 
SUCCEEDED. 
- Transition to FAILED && component.dominant == true. Set service state to 
FAILED. 
{code}

 

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated. And NEVER 
> means that if all components terminate, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But in simple terms, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master goes to 
> a final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8892) YARN UI2 doc changes to update security status (verified under security environment)

2018-10-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8892:
-
Target Version/s: 3.2.0, 3.1.2  (was: 3.2.0)

> YARN UI2 doc changes to update security status (verified under security 
> environment)
> 
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Blocker
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under kerberized env as well. update this in the doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652453#comment-16652453
 ] 

Wangda Tan commented on YARN-8489:
--

[~eyang], 

This is a bit different from Spark executors. 

For Spark, from an external view, it is a fully managed service which can run 
tasks inside the Spark executors. Livy is just responsible for sending code to the 
Spark service and waiting for the result.

For TF, the PS can be deployed outside of the workers like what you showed, but 
the computation is still executed inside the worker. In your example, it is inside the 
notebook. 

The separated PS deployment is not a widely used feature; AFAIK, only Google 
internally deploys it that way, and part of the reason is that they have super large 
models that require a distributed PS.

The separate PS deployment approach is not easy to manage and needs users to modify 
their source code, etc. And for most of the use cases, people avoid the 
distributed model given it is very hard to manage, serve, etc.

 

After talking to many companies, for Submarine, in the short to mid term, I prefer 
to only support PS within each job.

To your concern:

 
{quote}Isn't this the easiest way to iterate in notebook without going through 
ps/worker setup per iteration? The only thing that user needs to write is 
worker.py which is use case driven. Am I missing something?
{quote}
The easiest way is not to handle PS at all from the notebook; the user can choose 
Keras, etc., to build their model inside the notebook. Handling separate logic for PS 
inside the notebook is just an overhead for users.

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated. And NEVER 
> means that if all components terminate, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But in simple terms, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master goes to 
> a final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8892) YARN UI2 doc changes to update security status (verified under security environment)

2018-10-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8892:
-
Summary: YARN UI2 doc changes to update security status (verified under 
security environment)  (was: YARN UI2 doc improvement to update security status)

> YARN UI2 doc changes to update security status (verified under security 
> environment)
> 
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Blocker
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under kerberized env as well. update this in the doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8892) YARN UI2 doc improvement to update security status

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652405#comment-16652405
 ] 

Wangda Tan commented on YARN-8892:
--

+1, committing, thanks [~sunilg].

> YARN UI2 doc improvement to update security status
> --
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under kerberized env as well. update this in the doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


