[jira] [Commented] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-19 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692696#comment-16692696
 ] 

Zhankun Tang commented on YARN-8714:


[~liuxun323], [~leftnoteasy], [~yuan_zac]. Updated a new version. More tests 
are needed, but please help to review in case the changes are too big.
For both the ps and worker components, it provides --localization to support both 
an HDFS and a local directory (which may be downloaded, then zipped in a temp dir 
and uploaded to HDFS) as the first part of the parameter. It also supports 
mounting into the container with a permission (default is rw).

Goals are:

{code:java}
--localization hdfs:///user/yarn/mydir:.:ro # download, zip and upload the zip 
file to HDFS, then mount it into the container work dir's "mydir" folder: {PWD}/mydir
--localization /user/yarn/mydir2:/opt/dir2 # zip the local dir and upload the zip 
file to HDFS, then mount it into the container's /opt/dir2
--localization /user/yarn/script1.py:.:ro # upload a script to HDFS and mount it 
into the container as {PWD}/script1.py
--localization /user/yarn/script1.py:/opt/script2.py:rw # upload and mount into 
the container as /opt/script2.py
{code}
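For illustration, here is a minimal sketch (not the actual patch; all names are 
hypothetical) of how one --localization value of the form 
source:mountPath[:permission] could be parsed. The one subtlety it shows is 
splitting from the right so the colon inside an hdfs:// URI is not treated as a 
field separator:

{code:java}
public final class LocalizationSpecSketch {
  public final String source;      // HDFS URI or local path
  public final String mountPath;   // "." means the container work dir
  public final String permission;  // "rw" (default per the proposal) or "ro"

  private LocalizationSpecSketch(String source, String mountPath, String permission) {
    this.source = source;
    this.mountPath = mountPath;
    this.permission = permission;
  }

  public static LocalizationSpecSketch parse(String value) {
    String permission = "rw";
    String rest = value;
    int last = rest.lastIndexOf(':');
    String tail = last >= 0 ? rest.substring(last + 1) : "";
    if (tail.equals("rw") || tail.equals("ro")) {   // optional permission suffix
      permission = tail;
      rest = rest.substring(0, last);
    }
    // Split the remaining "source:mountPath" at the LAST colon, so the
    // scheme colon in "hdfs://..." stays inside the source part.
    int sep = rest.lastIndexOf(':');
    if (sep <= 0) {
      throw new IllegalArgumentException(
          "Expected source:mountPath[:ro|rw] but got: " + value);
    }
    return new LocalizationSpecSketch(
        rest.substring(0, sep), rest.substring(sep + 1), permission);
  }
}
{code}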


> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch, 
> YARN-8714-WIP1-trunk-002.patch
>
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}






[jira] [Comment Edited] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-19 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692696#comment-16692696
 ] 

Zhankun Tang edited comment on YARN-8714 at 11/20/18 5:58 AM:
--

[~sunilg], [~liuxun323], [~leftnoteasy], [~yuan_zac]. Updated a new version. 
More tests are in progress, but please help to review in case our direction is 
not right.
For both the ps and worker components, it provides --localization to support both 
an HDFS and a local directory (which may be downloaded, then zipped in a temp dir 
and uploaded to HDFS) as the first part of the parameter. It also supports 
mounting into the container with a permission (default is rw).

Goals are:
{code:java}
--localization hdfs:///user/yarn/mydir:.:ro # download, zip and upload the zip 
file to HDFS, then mount it into the container work dir's "mydir" folder: {PWD}/mydir
--localization /user/yarn/mydir2:/opt/dir2 # zip the local dir and upload the zip 
file to HDFS, then mount it into the container's /opt/dir2
--localization /user/yarn/script1.py:.:ro # upload a script to HDFS and mount it 
into the container as {PWD}/script1.py
--localization /user/yarn/script1.py:/opt/script2.py:rw # upload and mount into 
the container as /opt/script2.py
{code}


was (Author: tangzhankun):
[~liuxun323], [~leftnoteasy], [~yuan_zac]. Updated a new version. More tests 
are in progress, but please help to review in case our direction is not right.
For both the ps and worker components, it provides --localization to support both 
an HDFS and a local directory (which may be downloaded, then zipped in a temp dir 
and uploaded to HDFS) as the first part of the parameter. It also supports 
mounting into the container with a permission (default is rw).

Goals are:
{code:java}
--localization hdfs:///user/yarn/mydir:.:ro # download, zip and upload the zip 
file to HDFS, then mount it into the container work dir's "mydir" folder: {PWD}/mydir
--localization /user/yarn/mydir2:/opt/dir2 # zip the local dir and upload the zip 
file to HDFS, then mount it into the container's /opt/dir2
--localization /user/yarn/script1.py:.:ro # upload a script to HDFS and mount it 
into the container as {PWD}/script1.py
--localization /user/yarn/script1.py:/opt/script2.py:rw # upload and mount into 
the container as /opt/script2.py
{code}

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch, 
> YARN-8714-WIP1-trunk-002.patch
>
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}






[jira] [Comment Edited] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-19 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692696#comment-16692696
 ] 

Zhankun Tang edited comment on YARN-8714 at 11/20/18 5:55 AM:
--

[~liuxun323], [~leftnoteasy], [~yuan_zac]. Updated a new version. More tests 
are in progress, but please help to review in case our direction is not right.
For both the ps and worker components, it provides --localization to support both 
an HDFS and a local directory (which may be downloaded, then zipped in a temp dir 
and uploaded to HDFS) as the first part of the parameter. It also supports 
mounting into the container with a permission (default is rw).

Goals are:
{code:java}
--localization hdfs:///user/yarn/mydir:.:ro # download, zip and upload the zip 
file to HDFS, then mount it into the container work dir's "mydir" folder: {PWD}/mydir
--localization /user/yarn/mydir2:/opt/dir2 # zip the local dir and upload the zip 
file to HDFS, then mount it into the container's /opt/dir2
--localization /user/yarn/script1.py:.:ro # upload a script to HDFS and mount it 
into the container as {PWD}/script1.py
--localization /user/yarn/script1.py:/opt/script2.py:rw # upload and mount into 
the container as /opt/script2.py
{code}


was (Author: tangzhankun):
[~liuxun323], [~leftnoteasy], [~yuan_zac]. Updated a new version. More tests 
are needed, but please help to review in case the changes are too big.
For both the ps and worker components, it provides --localization to support both 
an HDFS and a local directory (which may be downloaded, then zipped in a temp dir 
and uploaded to HDFS) as the first part of the parameter. It also supports 
mounting into the container with a permission (default is rw).

Goals are:

{code:java}
--localization hdfs:///user/yarn/mydir:.:ro # download, zip and upload the zip 
file to HDFS, then mount it into the container work dir's "mydir" folder: {PWD}/mydir
--localization /user/yarn/mydir2:/opt/dir2 # zip the local dir and upload the zip 
file to HDFS, then mount it into the container's /opt/dir2
--localization /user/yarn/script1.py:.:ro # upload a script to HDFS and mount it 
into the container as {PWD}/script1.py
--localization /user/yarn/script1.py:/opt/script2.py:rw # upload and mount into 
the container as /opt/script2.py
{code}


> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch, 
> YARN-8714-WIP1-trunk-002.patch
>
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}






[jira] [Updated] (YARN-8882) Phase 1 - Add a shared device mapping manager for device plugin to use

2018-11-20 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8882:
---
Attachment: YARN-8882-trunk.004.patch

> Phase 1 - Add a shared device mapping manager for device plugin to use
> --
>
> Key: YARN-8882
> URL: https://issues.apache.org/jira/browse/YARN-8882
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8882-trunk.001.patch, YARN-8882-trunk.002.patch, 
> YARN-8882-trunk.003.patch, YARN-8882-trunk.004.patch
>
>
> Since quite a few devices use a FIFO policy to assign devices to the 
> container, we use a shared device manager to handle all types of devices.
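For illustration, a minimal sketch of the FIFO mapping idea (this is NOT the 
actual YARN-8882 code; all names are hypothetical, and a real manager would 
track per-container device sets rather than a single device):

{code:java}
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class FifoDeviceMapperSketch {
  private final Queue<String> freeDevices = new ArrayDeque<>(); // e.g. "/dev/fpga0"
  private final Map<String, String> containerToDevice = new HashMap<>();

  public FifoDeviceMapperSketch(Iterable<String> devices) {
    devices.forEach(freeDevices::add);
  }

  // Assign the oldest free device to the container (FIFO order).
  public synchronized String allocate(String containerId) {
    String device = freeDevices.poll();
    if (device == null) {
      throw new IllegalStateException("No free device for " + containerId);
    }
    containerToDevice.put(containerId, device);
    return device;
  }

  // Return the container's device to the tail of the free queue.
  public synchronized void release(String containerId) {
    String device = containerToDevice.remove(containerId);
    if (device != null) {
      freeDevices.add(device);
    }
  }
}
{code}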






[jira] [Updated] (YARN-8882) Phase 1 - Add a shared device mapping manager for device plugin to use

2018-11-20 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8882:
---
Attachment: YARN-8882-trunk.007.patch

> Phase 1 - Add a shared device mapping manager for device plugin to use
> --
>
> Key: YARN-8882
> URL: https://issues.apache.org/jira/browse/YARN-8882
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8882-trunk.001.patch, YARN-8882-trunk.002.patch, 
> YARN-8882-trunk.003.patch, YARN-8882-trunk.004.patch, 
> YARN-8882-trunk.005.patch, YARN-8882-trunk.006.patch, 
> YARN-8882-trunk.007.patch
>
>
> Since a few devices use a FIFO policy to assign devices to the container, we 
> use a shared device manager to handle all types of devices.






[jira] [Commented] (YARN-8882) Phase 1 - Add a shared device mapping manager for device plugin to use

2018-11-21 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694457#comment-16694457
 ] 

Zhankun Tang commented on YARN-8882:


[~goyal.sunil], [~leftnoteasy], could you please help to review?

> Phase 1 - Add a shared device mapping manager for device plugin to use
> --
>
> Key: YARN-8882
> URL: https://issues.apache.org/jira/browse/YARN-8882
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8882-trunk.001.patch, YARN-8882-trunk.002.patch, 
> YARN-8882-trunk.003.patch, YARN-8882-trunk.004.patch, 
> YARN-8882-trunk.005.patch, YARN-8882-trunk.006.patch, 
> YARN-8882-trunk.007.patch
>
>
> Since a few devices use a FIFO policy to assign devices to the container, we 
> use a shared device manager to handle all types of devices.






[jira] [Comment Edited] (YARN-8882) Phase 1 - Add a shared device mapping manager for device plugin to use

2018-11-21 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694457#comment-16694457
 ] 

Zhankun Tang edited comment on YARN-8882 at 11/21/18 9:28 AM:
--

[~sunilg], [~leftnoteasy], could you please help to review?


was (Author: tangzhankun):
[~goyal.sunil], [~leftnoteasy], could you please help to review?

> Phase 1 - Add a shared device mapping manager for device plugin to use
> --
>
> Key: YARN-8882
> URL: https://issues.apache.org/jira/browse/YARN-8882
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8882-trunk.001.patch, YARN-8882-trunk.002.patch, 
> YARN-8882-trunk.003.patch, YARN-8882-trunk.004.patch, 
> YARN-8882-trunk.005.patch, YARN-8882-trunk.006.patch, 
> YARN-8882-trunk.007.patch
>
>
> Since a few devices use a FIFO policy to assign devices to the container, we 
> use a shared device manager to handle all types of devices.






[jira] [Commented] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-21 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694779#comment-16694779
 ] 

Zhankun Tang commented on YARN-8714:


[~liuxun323], one question from me: do you need to support bind-mounting a 
localized file/dir into a container with RO permission?

Because YARN mounts the application/container dir with a hard-coded RW permission, 
every localized file/dir is writable to the Docker container process. In this 
case, RO permission for a localized file/dir seems impossible until we improve 
YARN's current implementation of this.

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch, 
> YARN-8714-WIP1-trunk-002.patch
>
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}






[jira] [Updated] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-22 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8714:
---
Attachment: YARN-8714-trunk.002.patch

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch, 
> YARN-8714-WIP1-trunk-002.patch, YARN-8714-trunk.001.patch, 
> YARN-8714-trunk.002.patch
>
>
> See 
> [https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7],
>  {{job run --localization ...}}






[jira] [Updated] (YARN-8887) Support isolation in pluggable device framework

2018-11-22 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8887:
---
Description: 
Device isolation needs a complete description in the API specs 
(DeviceRuntimeSpec) and a translator in the adapter to convert the requirements 
into uniform parameters passed to the native container-executor. It should 
support both default and Docker containers.

For the default container, we use a new device module in container-executor to 
isolate devices. For the Docker container, we depend on the current 
DockerLinuxContainerRuntime.

  was:Device isolation needs a complete description in API specs and a 
translator in the adapter to convert the requirements into uniform parameters 
passed to the native container-executor. It should support both cgroups and 
Docker runtimes.
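To make the adapter idea concrete, a hedged sketch (the field and flag names 
below are made up for illustration; they are NOT the real DeviceRuntimeSpec API 
or real container-executor flags):

{code:java}
import java.util.List;

class DeviceRuntimeSpecSketch {
  final List<String> deviceMounts;   // e.g. "/dev/fpga0:/dev/fpga0:rwm"
  final List<String> volumeMounts;   // e.g. "/usr/local/lib/fpga:/usr/local/lib/fpga:ro"
  final List<String> deniedDevices;  // "major:minor" entries for the devices cgroup

  DeviceRuntimeSpecSketch(List<String> deviceMounts, List<String> volumeMounts,
      List<String> deniedDevices) {
    this.deviceMounts = deviceMounts;
    this.volumeMounts = volumeMounts;
    this.deniedDevices = deniedDevices;
  }

  // Flatten the spec into uniform arguments for the native executor.
  // "--devices" / "--denied-devices" are hypothetical flag names.
  String toExecutorArgs() {
    return "--devices=" + String.join(",", deviceMounts)
        + " --denied-devices=" + String.join(",", deniedDevices);
  }
}
{code}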


> Support isolation in pluggable device framework
> ---
>
> Key: YARN-8887
> URL: https://issues.apache.org/jira/browse/YARN-8887
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> Device isolation needs a complete description in the API specs 
> (DeviceRuntimeSpec) and a translator in the adapter to convert the 
> requirements into uniform parameters passed to the native container-executor. 
> It should support both default and Docker containers.
> For the default container, we use a new device module in container-executor to 
> isolate devices. For the Docker container, we depend on the current 
> DockerLinuxContainerRuntime.






[jira] [Commented] (YARN-9042) Javadoc error in deviceplugin package

2018-11-22 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695693#comment-16695693
 ] 

Zhankun Tang commented on YARN-9042:


[~rohithsharma] , the patch is uploaded. Please help to review. Thanks!

> Javadoc error in deviceplugin package
> -
>
> Key: YARN-9042
> URL: https://issues.apache.org/jira/browse/YARN-9042
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9042-trunk.001.patch
>
>
> Many javadoc errors are in the deviceplugin package
> {noformat}
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/DeviceRuntimeSpec.java:29:
>  error: bad HTML entity
> [ERROR]  * This is a spec used to prepare & run container.
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/DeviceRuntimeSpec.java:35:
>  error: bad HTML entity
> [ERROR]  * The volume & device mounts describes key isolation requirements
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: domain
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: bus
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> [ERROR]  ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: slot
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: func
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> {noformat}
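For reference, the standard way to fix these two kinds of javadoc errors is to 
escape the ampersand and to stop javadoc from parsing the angle-bracket 
placeholders as HTML tags. A sketch of that style of fix (not necessarily 
exactly what the attached patch does):

{code:java}
/**
 * This is a spec used to prepare &amp; run container.
 *
 * PCI Bus ID in format {@literal [[[[<domain>]:]<bus>]:][<slot>][.[<func>]]}.
 */
class JavadocFixSketch {}
{code}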






[jira] [Commented] (YARN-8882) Phase 1 - Add a shared device mapping manager for device plugin to use

2018-11-21 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695504#comment-16695504
 ] 

Zhankun Tang commented on YARN-8882:


[~leftnoteasy], yeah. And it will include a list of device plugin schedulers if 
we merge the customized device scheduler interface. Do we have a better name 
than "DeviceScheduler"? Or the original "DeviceSchedulerManager"?

> Phase 1 - Add a shared device mapping manager for device plugin to use
> --
>
> Key: YARN-8882
> URL: https://issues.apache.org/jira/browse/YARN-8882
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8882-trunk.001.patch, YARN-8882-trunk.002.patch, 
> YARN-8882-trunk.003.patch, YARN-8882-trunk.004.patch, 
> YARN-8882-trunk.005.patch, YARN-8882-trunk.006.patch, 
> YARN-8882-trunk.007.patch
>
>
> Since a few devices use a FIFO policy to assign devices to the container, we 
> use a shared device manager to handle all types of devices.






[jira] [Updated] (YARN-9042) Javadoc error in deviceplugin package

2018-11-22 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9042:
---
Attachment: YARN-9042-trunk.001.patch

> Javadoc error in deviceplugin package
> -
>
> Key: YARN-9042
> URL: https://issues.apache.org/jira/browse/YARN-9042
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9042-trunk.001.patch
>
>
> Many javadoc errors are in the deviceplugin package
> {noformat}
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/DeviceRuntimeSpec.java:29:
>  error: bad HTML entity
> [ERROR]  * This is a spec used to prepare & run container.
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/DeviceRuntimeSpec.java:35:
>  error: bad HTML entity
> [ERROR]  * The volume & device mounts describes key isolation requirements
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: domain
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: bus
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> [ERROR]  ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: slot
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: func
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> {noformat}






[jira] [Updated] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-21 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8714:
---
Description: See 
[https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7],
 {{job run --localization ...}}  (was: See 
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
 {{job run --localizations ...}})

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch, 
> YARN-8714-WIP1-trunk-002.patch
>
>
> See 
> [https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7],
>  {{job run --localization ...}}






[jira] [Assigned] (YARN-9042) Javadoc error in deviceplugin package

2018-11-21 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang reassigned YARN-9042:
--

Assignee: Zhankun Tang

> Javadoc error in deviceplugin package
> -
>
> Key: YARN-9042
> URL: https://issues.apache.org/jira/browse/YARN-9042
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Assignee: Zhankun Tang
>Priority: Major
>
> Many javadoc errors are in the deviceplugin package
> {noformat}
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/DeviceRuntimeSpec.java:29:
>  error: bad HTML entity
> [ERROR]  * This is a spec used to prepare & run container.
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/DeviceRuntimeSpec.java:35:
>  error: bad HTML entity
> [ERROR]  * The volume & device mounts describes key isolation requirements
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: domain
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: bus
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> [ERROR]  ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: slot
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: func
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> {noformat}






[jira] [Commented] (YARN-9042) Javadoc error in deviceplugin package

2018-11-21 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695507#comment-16695507
 ] 

Zhankun Tang commented on YARN-9042:


Thanks [~rohithsharma] for pointing this out. I'll fix it. :)

> Javadoc error in deviceplugin package
> -
>
> Key: YARN-9042
> URL: https://issues.apache.org/jira/browse/YARN-9042
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Major
>
> Many javadoc errors are in the deviceplugin package
> {noformat}
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/DeviceRuntimeSpec.java:29:
>  error: bad HTML entity
> [ERROR]  * This is a spec used to prepare & run container.
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/DeviceRuntimeSpec.java:35:
>  error: bad HTML entity
> [ERROR]  * The volume & device mounts describes key isolation requirements
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: domain
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: bus
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> [ERROR]  ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: slot
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> [ERROR]   ^
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/api/deviceplugin/Device.java:56:
>  error: unknown tag: func
> [ERROR]    * PCI Bus ID in format [[[[<domain>]:]<bus>]:][<slot>][.[<func>]].
> {noformat}






[jira] [Updated] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-21 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8714:
---
Attachment: YARN-8714-trunk.001.patch

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch, 
> YARN-8714-WIP1-trunk-002.patch, YARN-8714-trunk.001.patch
>
>
> See 
> [https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7],
>  {{job run --localization ...}}






[jira] [Commented] (YARN-5106) Provide a builder interface for FairScheduler allocations for use in tests

2018-11-20 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692955#comment-16692955
 ] 

Zhankun Tang commented on YARN-5106:


Thanks, [~snemeth]. +1, this LGTM.

> Provide a builder interface for FairScheduler allocations for use in tests
> --
>
> Key: YARN-5106
> URL: https://issues.apache.org/jira/browse/YARN-5106
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Szilard Nemeth
>Priority: Major
>  Labels: newbie++
> Attachments: YARN-5106.001.patch, YARN-5106.002.patch, 
> YARN-5106.003.patch, YARN-5106.004.patch
>
>
> Most, if not all, fair scheduler tests create an allocations XML file. Having 
> a helper class that potentially uses a builder would make the tests cleaner. 






[jira] [Updated] (YARN-8882) Phase 1 - Add a shared device mapping manager for device plugin to use

2018-11-20 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8882:
---
Description: Since a few devices use a FIFO policy to assign devices to the 
container, we use a shared device manager to handle all types of devices.  
(was: Since quite a few devices use a FIFO policy to assign devices to the 
container, we use a shared device manager to handle all types of devices.)

> Phase 1 - Add a shared device mapping manager for device plugin to use
> --
>
> Key: YARN-8882
> URL: https://issues.apache.org/jira/browse/YARN-8882
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8882-trunk.001.patch, YARN-8882-trunk.002.patch, 
> YARN-8882-trunk.003.patch, YARN-8882-trunk.004.patch
>
>
> Since a few devices use a FIFO policy to assign devices to the container, we 
> use a shared device manager to handle all types of devices.






[jira] [Created] (YARN-9059) Support RESTful API in NM for query FPGA allocation

2018-11-26 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-9059:
--

 Summary: Support RESTful API in NM for query FPGA allocation
 Key: YARN-9059
 URL: https://issues.apache.org/jira/browse/YARN-9059
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


Support the user to be able to do:

curl <NM_HOST>:8042/ws/v1/node/resources/yarn.io%2Ffpga






[jira] [Created] (YARN-9060) [YARN-8851] Phase 1 - Support device isolation in native container-executor

2018-11-26 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-9060:
--

 Summary: [YARN-8851] Phase 1 - Support device isolation in native 
container-executor
 Key: YARN-9060
 URL: https://issues.apache.org/jira/browse/YARN-9060
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


Due to the cgroups v1 implementation policy in the Linux kernel, we cannot update 
the values of the devices cgroup controller unless we have root permission 
([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]).
 So we need to support this in container-executor for the Java layer to invoke.
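To show why root is needed, a minimal sketch (paths and device numbers are 
illustrative, not YARN's actual layout): updating the cgroups v1 devices 
controller means writing to devices.deny / devices.allow, and the kernel only 
permits a suitably privileged (root) writer, so an unprivileged NM process gets 
"Operation not permitted":

{code:java}
import java.io.FileWriter;
import java.io.IOException;

public class DeviceCgroupSketch {
  public static void main(String[] args) throws IOException {
    // Illustrative cgroup path for one container.
    String cgroup = "/sys/fs/cgroup/devices/hadoop-yarn/container_01";
    // Deny read/write/mknod on a hypothetical char device major 243, minor 0.
    try (FileWriter w = new FileWriter(cgroup + "/devices.deny")) {
      w.write("c 243:0 rwm");  // throws IOException (EPERM) unless run as root
    }
  }
}
{code}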






[jira] [Updated] (YARN-8851) [Umbrella] A pluggable device plugin framework to ease vendor plugin development

2018-11-26 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8851:
---
Summary: [Umbrella] A pluggable device plugin framework to ease vendor 
plugin development  (was: [Umbrella] A new pluggable device plugin framework to 
ease vendor plugin development)

> [Umbrella] A pluggable device plugin framework to ease vendor plugin 
> development
> 
>
> Key: YARN-8851
> URL: https://issues.apache.org/jira/browse/YARN-8851
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8851-WIP2-trunk.001.patch, 
> YARN-8851-WIP3-trunk.001.patch, YARN-8851-WIP4-trunk.001.patch, 
> YARN-8851-WIP5-trunk.001.patch, YARN-8851-WIP6-trunk.001.patch, 
> YARN-8851-WIP7-trunk.001.patch, YARN-8851-WIP8-trunk.001.patch, 
> YARN-8851-WIP9-trunk.001.patch, YARN-8851-trunk.001.patch, 
> YARN-8851-trunk.002.patch, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal-3.pdf, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal-4.pdf, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal.pdf
>
>
> At present, we support GPU/FPGA devices in YARN in a native, tightly coupled 
> way. But it's difficult for a vendor to implement such a device plugin 
> because the developer needs deep knowledge of YARN internals. And this burdens 
> the community with maintaining both YARN core and vendor-specific code.
> Here we propose a new device plugin framework to ease vendor device plugin 
> development and provide a more flexible way to integrate with YARN NM.






[jira] [Updated] (YARN-9060) [YARN-8851] Phase 1 - Support device isolation in native container-executor

2018-11-26 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9060:
---
Attachment: YARN-9060-trunk.001.patch

> [YARN-8851] Phase 1 - Support device isolation in native container-executor
> ---
>
> Key: YARN-9060
> URL: https://issues.apache.org/jira/browse/YARN-9060
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9060-trunk.001.patch
>
>
> Due to the cgroups v1 implementation policy in the Linux kernel, we cannot 
> update the values of the devices cgroup controller unless we have root 
> permission 
> ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]).
>  So we need to support this in container-executor for the Java layer to invoke.






[jira] [Commented] (YARN-8822) Nvidia-docker v2 support

2018-11-27 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700056#comment-16700056
 ] 

Zhankun Tang commented on YARN-8822:


[~Charo Zhang], Thanks for the patch! It looks good to me.
[~leftnoteasy], could you please review the patch?

> Nvidia-docker v2 support
> 
>
> Key: YARN-8822
> URL: https://issues.apache.org/jira/browse/YARN-8822
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.1
>Reporter: Zhankun Tang
>Assignee: Charo Zhang
>Priority: Major
>  Labels: Docker
> Fix For: 3.1.2
>
> Attachments: YARN-8822-branch-3.1.1.001.patch, YARN-8822.001.patch, 
> YARN-8822.002.patch
>
>
> To run a GPU container with Docker, we already have nvidia-docker v1 support, 
> but it is deprecated per 
> [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We 
> should support nvidia-docker v2.






[jira] [Updated] (YARN-9061) Improve the GPU/FPGA module log message of container-executor

2018-11-26 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9061:
---
Attachment: YARN-9061-trunk.001.patch

> Improve the GPU/FPGA module log message of container-executor
> -
>
> Key: YARN-9061
> URL: https://issues.apache.org/jira/browse/YARN-9061
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Minor
> Attachments: YARN-9061-trunk.001.patch
>
>
> The log message is not clear when an option's value is missing.
> {code:java}
> fprintf(ERRORFILE, "is not specified, skip cgroups call.\n");{code}






[jira] [Updated] (YARN-9061) Improve the GPU/FPGA module log message of container-executor

2018-11-26 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9061:
---
Attachment: YARN-9061-trunk.002.patch

> Improve the GPU/FPGA module log message of container-executor
> -
>
> Key: YARN-9061
> URL: https://issues.apache.org/jira/browse/YARN-9061
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Minor
> Attachments: YARN-9061-trunk.001.patch, YARN-9061-trunk.002.patch
>
>
> The log message is not clear when an option's value is missing.
> {code:java}
> fprintf(ERRORFILE, "is not specified, skip cgroups call.\n");{code}






[jira] [Created] (YARN-9061) Improve the GPU/FPGA module log message of container-executor

2018-11-26 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-9061:
--

 Summary: Improve the GPU/FPGA module log message of 
container-executor
 Key: YARN-9061
 URL: https://issues.apache.org/jira/browse/YARN-9061
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Zhankun Tang
Assignee: Zhankun Tang


The log message is not clear when an option's value is missing.
{code:java}
fprintf(ERRORFILE, "is not specified, skip cgroups call.\n");{code}
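A clearer message would name the missing option, e.g. something like 
{{fprintf(ERRORFILE, "%s is not specified, skip cgroups call.\n", option_name);}} 
(assuming a variable such as option_name is in scope at the call site; the 
actual patch may word it differently).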






[jira] [Commented] (YARN-8822) Nvidia-docker v2 support

2018-11-26 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698902#comment-16698902
 ] 

Zhankun Tang commented on YARN-8822:


[~Charo Zhang], thanks for the patch. Since a new pluggable device framework 
(YARN-8851) is in progress, we would prefer not to merge vendor-specific code 
into the YARN code base in the future.

I guess it would be better for us to implement this based on that framework.

> Nvidia-docker v2 support
> 
>
> Key: YARN-8822
> URL: https://issues.apache.org/jira/browse/YARN-8822
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.1
>Reporter: Zhankun Tang
>Assignee: Charo Zhang
>Priority: Major
>  Labels: Docker
> Fix For: 3.1.2
>
> Attachments: YARN-8822.001.patch
>
>
> To run a GPU container with Docker, we already have nvidia-docker v1 support, 
> but it is deprecated per 
> [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We 
> should support nvidia-docker v2.






[jira] [Commented] (YARN-8822) Nvidia-docker v2 support

2018-11-26 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699114#comment-16699114
 ] 

Zhankun Tang commented on YARN-8822:


[~Charo Zhang], the patch name should be like YARN-8822-branch-3.1.1.001.patch 
to trigger Yetus.

And having taken a look at the code, it generally looks good to me besides some 
checkstyle issues.

Have you tested it in a real GPU environment?

> Nvidia-docker v2 support
> 
>
> Key: YARN-8822
> URL: https://issues.apache.org/jira/browse/YARN-8822
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.1
>Reporter: Zhankun Tang
>Assignee: Charo Zhang
>Priority: Major
>  Labels: Docker
> Fix For: 3.1.2
>
> Attachments: YARN-8822.001.patch, YARN-8822.002.patch
>
>
> To run a GPU container with Docker, we already have nvidia-docker v1 support, 
> but it is deprecated per 
> [here|https://github.com/NVIDIA/nvidia-docker/wiki/About-version-2.0]. We 
> should support nvidia-docker v2.






[jira] [Updated] (YARN-8820) [Umbrella] GPU support on YARN - Phase 2

2018-11-26 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8820:
---
Description: 
In YARN-6223, we've built basic support for Nvidia GPUs on YARN, including 
resource discovery, allocation and cgroups isolation, as well as Docker support 
(Nvidia-docker v1). But there's still room for us to improve.

For instance, multiple GPU cards in one host bring requirements for GPU 
hierarchy scheduling. Nvidia-docker v2 has emerged and v1 has been deprecated. 
And we're planning a new device plugin framework in YARN which relates to GPU 
support too (maybe in the long term).

So here we converge the threads related to the above and open an umbrella to 
track the next-stage tasks for convenience.

One thing to note is that a pluggable device framework is in progress 
(YARN-8851); once that framework is mature, we should prefer to utilize its 
abilities to achieve this phase 2 support.

  was:
In YARN-6223, we've built basic support for Nvidia GPUs on YARN, including 
resource discovery, allocation and cgroups isolation, as well as Docker support 
(Nvidia-docker v1). But there's still room for us to improve.

For instance, multiple GPU cards in one host bring requirements for GPU 
hierarchy scheduling. Nvidia-docker v2 has emerged and v1 has been deprecated. 
And we're planning a new device plugin framework in YARN which relates to GPU 
support too (maybe in the long term).

So here we converge the threads related to the above and open an umbrella to 
track the next-stage tasks for convenience.


> [Umbrella] GPU support on YARN - Phase 2
> 
>
> Key: YARN-8820
> URL: https://issues.apache.org/jira/browse/YARN-8820
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Zhankun Tang
>Priority: Major
>
> In YARN-6223, we've built basic support for Nvidia GPUs on YARN, including 
> resource discovery, allocation and cgroups isolation, as well as Docker 
> support (Nvidia-docker v1). But there's still room for us to improve.
> For instance, multiple GPU cards in one host bring requirements for GPU 
> hierarchy scheduling. Nvidia-docker v2 has emerged and v1 has been 
> deprecated. And we're planning a new device plugin framework in YARN which 
> relates to GPU support too (maybe in the long term).
> So here we converge the threads related to the above and open an umbrella to 
> track the next-stage tasks for convenience.
> One thing to note is that a pluggable device framework is in progress 
> (YARN-8851); once that framework is mature, we should prefer to utilize its 
> abilities to achieve this phase 2 support.






[jira] [Assigned] (YARN-8823) Monitor the healthy state of GPU

2018-11-26 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang reassigned YARN-8823:
--

Assignee: Zhankun Tang

> Monitor the healthy state of GPU
> 
>
> Key: YARN-8823
> URL: https://issues.apache.org/jira/browse/YARN-8823
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> GPU resources are discovered when the NM bootstraps but are not updated 
> through later heartbeats with the RM. There should be a monitoring mechanism 
> to check GPU health status from time to time, plus the corresponding handling.
> And YARN-8851 will also handle device monitoring. There could be some 
> common parts between the two.






[jira] [Updated] (YARN-8821) GPU hierarchy scheduling support

2018-11-26 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Description: 
GPU topology affects performance dramatically. There's been a discussion in 
YARN-7481, but we'd like to move related discussions here.

Please note that YARN-8851 will provide a pluggable device framework with a 
shared scheduler which could support default topology scheduling. And based on 
the framework, the GPU plugin could have a custom scheduler too.

  was:GPU topology affects performance dramatically. There's been a discussion 
in YARN-7481, but we'd like to move related discussions here.


> GPU hierarchy scheduling support
> 
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Priority: Major
>
> GPU topology affects performance dramatically. There's been a discussion in 
> YARN-7481, but we'd like to move related discussions here.
> Please note that YARN-8851 will provide a pluggable device framework with a 
> shared scheduler which could support default topology scheduling. And based 
> on the framework, the GPU plugin could have a custom scheduler too.






[jira] [Created] (YARN-9033) ResourceHandlerChain#bootstrap is invoked twice during NM start if LinuxContainerExecutor enabled

2018-11-17 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-9033:
--

 Summary: ResourceHandlerChain#bootstrap is invoked twice during NM 
start if LinuxContainerExecutor enabled
 Key: YARN-9033
 URL: https://issues.apache.org/jira/browse/YARN-9033
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Reporter: Zhankun Tang
Assignee: Zhankun Tang


ResourceHandlerChain#bootstrap will always be invoked in the NM's 
ContainerScheduler#serviceInit.

So if LCE is enabled, ResourceHandlerChain#bootstrap will be invoked first by 
LCE and then invoked again in ContainerScheduler#serviceInit.
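One possible shape of a fix, as a hedged sketch (the class, method and field 
names are hypothetical, not the actual ResourceHandlerChain API): make the 
bootstrap idempotent so the second call becomes a no-op.

{code:java}
public class BootstrapGuardSketch {
  private boolean bootstrapped = false;

  public synchronized void bootstrap() {
    if (bootstrapped) {
      return;               // already done by LinuxContainerExecutor
    }
    // ... actual resource handler bootstrap work would go here ...
    bootstrapped = true;
  }
}
{code}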






[jira] [Updated] (YARN-9033) ResourceHandlerChain#bootstrap is invoked twice during NM start if LinuxContainerExecutor enabled

2018-11-17 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9033:
---
Description: 
ResourceHandlerChain#bootstrap will always be invoked in the NM's 
ContainerScheduler#serviceInit (introduced by YARN-7715).

So if LCE is enabled, ResourceHandlerChain#bootstrap will be invoked first by 
LCE and then invoked again in ContainerScheduler#serviceInit.

  was:
ResourceHandlerChain#bootstrap will always be invoked in the NM's 
ContainerScheduler#serviceInit.

So if LCE is enabled, ResourceHandlerChain#bootstrap will be invoked first by 
LCE and then invoked again in ContainerScheduler#serviceInit.


> ResourceHandlerChain#bootstrap is invoked twice during NM start if 
> LinuxContainerExecutor enabled
> -
>
> Key: YARN-9033
> URL: https://issues.apache.org/jira/browse/YARN-9033
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> ResourceHandlerChain#bootstrap will always be invoked in the NM's 
> ContainerScheduler#serviceInit (introduced by YARN-7715).
> So if LCE is enabled, ResourceHandlerChain#bootstrap will be invoked first 
> by LCE and then invoked again in ContainerScheduler#serviceInit.






[jira] [Updated] (YARN-8823) Monitor the healthy state of GPU

2018-11-26 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8823:
---
Description: 
GPU resources are discovered when the NM bootstraps but are not updated through 
later heartbeats with the RM. There should be a monitoring mechanism to check 
GPU health status from time to time, plus the corresponding handling.

And YARN-8851 will also handle device monitoring. There could be some common 
parts between the two.

  was:GPU resources are discovered when the NM bootstraps but are not updated 
through later heartbeats with the RM. There should be a monitoring mechanism to 
check GPU health status from time to time, plus the corresponding handling.
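A hedged sketch of the monitoring idea (the interfaces are hypothetical, not the 
actual NM API): periodically probe device health and report the usable count so 
the scheduler can stop placing work onto failed GPUs.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class GpuHealthMonitorSketch {
  interface HealthProbe { int countHealthyGpus(); }   // e.g. wraps nvidia-smi
  interface Reporter { void updateGpuCount(int n); }  // e.g. feeds the NM->RM heartbeat

  public static void start(HealthProbe probe, Reporter reporter) {
    ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
    // Re-check every 60 seconds; the interval is an illustrative choice.
    exec.scheduleAtFixedRate(
        () -> reporter.updateGpuCount(probe.countHealthyGpus()),
        0, 60, TimeUnit.SECONDS);
  }
}
{code}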


> Monitor the healthy state of GPU
> 
>
> Key: YARN-8823
> URL: https://issues.apache.org/jira/browse/YARN-8823
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Priority: Major
>
> GPU resources are discovered when the NM bootstraps but are not updated 
> through later heartbeats with the RM. There should be a monitoring mechanism 
> to check GPU health status from time to time, plus the corresponding handling.
> And YARN-8851 will also handle device monitoring. There could be some 
> common parts between the two.






[jira] [Assigned] (YARN-8821) GPU hierarchy scheduling support

2018-11-26 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang reassigned YARN-8821:
--

Assignee: Zhankun Tang

> GPU hierarchy scheduling support
> 
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> GPU topology affects performance dramatically. There's been a discussion in 
> YARN-7481, but we'd like to move related discussions here.
> Please note that YARN-8851 will provide a pluggable device framework with a 
> shared scheduler which could support default topology scheduling. And based 
> on the framework, the GPU plugin could have a custom scheduler too.






[jira] [Created] (YARN-9190) [Submarine] Submarine job will fail to run as a first job on a new created Hadoop 3.2.0 RC1

2019-01-10 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-9190:
--

 Summary: [Submarine] Submarine job will fail to run as a first job 
on a new created Hadoop 3.2.0 RC1
 Key: YARN-9190
 URL: https://issues.apache.org/jira/browse/YARN-9190
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Zhankun Tang
Assignee: Sunil Govindan


This issue was found when verifying submarine in Hadoop 3.2.0 RC1 planning. The 
reproduction steps are:
 # Init a new HDFS and YARN (LinuxContainerExecutor and Docker enabled)
 # Before running any other YARN service job, use the yarn user to submit a 
submarine job

The job will fail with the error below:

 
{code:java}
LogType:serviceam-err.txt
LogLastModifiedTime:Thu Jan 10 21:15:23 +0800 2019
LogLength:86
LogContents:
Error: Could not find or load main class 
org.apache.hadoop.yarn.service.ServiceMaster
End of LogType:serviceam-err.txt
{code}
This seems to be because the dependencies are not ready, as the service client 
reported:
{code:java}
2019-01-10 21:50:47,380 WARN client.ServiceClient: Property 
yarn.service.framework.path has a value 
/yarn-services/3.2.0/service-dep.tar.gz, but is not a valid file
2019-01-10 21:50:47,381 INFO client.ServiceClient: Uploading all dependency 
jars to HDFS. For faster submission of apps, set config property 
yarn.service.framework.path to the dependency tarball location. Dependency 
tarball can be uploaded to any HDFS path directly or by using command: yarn app 
-enableFastLaunch []{code}
 

When this error happens, I found that there is no “/yarn-services” directory 
created in HDFS.

But after I run “yarn app -launch my-sleeper sleeper”, the “/yarn-services” 
directory is created in HDFS and the submarine job can then run successfully.

It seems to be an issue of YARN service in 3.2.0 RC1, and I filed this Jira to 
track it.

 

I verified that the trunk branch doesn't have this issue.
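
For anyone hitting this, a possible workaround (based on the hint in the 
ServiceClient log above; paths are from my environment) is to upload the 
dependency tarball once before submitting the submarine job:
{code:java}
# One-time step on the new cluster, run as a user with HDFS write access.
# This pre-populates /yarn-services so the service AM classpath is complete.
yarn app -enableFastLaunch

# Verify the dependency tarball is in place:
hdfs dfs -ls /yarn-services/3.2.0/service-dep.tar.gz
{code}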



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9190) [Submarine] Submarine job will fail to run as a first job on a new created Hadoop 3.2.0 RC1 cluster

2019-01-10 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9190:
---
Summary: [Submarine] Submarine job will fail to run as a first job on a new 
created Hadoop 3.2.0 RC1 cluster  (was: [Submarine] Submarine job will fail to 
run as a first job on a new created Hadoop 3.2.0 RC1)

> [Submarine] Submarine job will fail to run as a first job on a new created 
> Hadoop 3.2.0 RC1 cluster
> ---
>
> Key: YARN-9190
> URL: https://issues.apache.org/jira/browse/YARN-9190
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Sunil Govindan
>Priority: Minor
>
> This issue was found when verifying submarine in Hadoop 3.2.0 RC1 planning. 
> The reproduction steps are:
>  # Init a new HDFS and YARN (LinuxContainerExecutor and Docker enabled)
>  # Before running any other YARN service job, use the yarn user to submit a 
> submarine job
> The job will fail with the error below:
>  
> {code:java}
> LogType:serviceam-err.txt
> LogLastModifiedTime:Thu Jan 10 21:15:23 +0800 2019
> LogLength:86
> LogContents:
> Error: Could not find or load main class 
> org.apache.hadoop.yarn.service.ServiceMaster
> End of LogType:serviceam-err.txt
> {code}
> This seems to be because the dependencies are not ready, as the service 
> client reported:
> {code:java}
> 2019-01-10 21:50:47,380 WARN client.ServiceClient: Property 
> yarn.service.framework.path has a value 
> /yarn-services/3.2.0/service-dep.tar.gz, but is not a valid file
> 2019-01-10 21:50:47,381 INFO client.ServiceClient: Uploading all dependency 
> jars to HDFS. For faster submission of apps, set config property 
> yarn.service.framework.path to the dependency tarball location. Dependency 
> tarball can be uploaded to any HDFS path directly or by using command: yarn 
> app -enableFastLaunch []{code}
>  
> When this error happens, I found that there is no “/yarn-services” directory 
> created in HDFS.
> But after I run “yarn app -launch my-sleeper sleeper”, the “/yarn-services” 
> directory is created in HDFS and the submarine job can then run successfully.
> It seems to be an issue of YARN service in 3.2.0 RC1, and I filed this Jira 
> to track it.
>  
> I verified that the trunk branch doesn't have this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"

2019-01-07 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735941#comment-16735941
 ] 

Zhankun Tang commented on YARN-8927:


A draft WIP patch. Please comment in case the direction is wrong. [~eyang], 
[~ebadger]

> Support trust top-level image like "centos" when "library" is configured in 
> "docker.trusted.registries"
> ---
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8927-trunk.001.patch
>
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if we run DistributedShell with "tangzhankun/tensorflow":
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But running a DistributedShell job with "centos", "centos[:tagName]", 
> "ubuntu" or "ubuntu[:tagName]" fails.
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need to handle the above cases better.
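
To illustrate the intended matching, a rough Java sketch of the check is below 
(the real check lives in the native container-executor; names here are 
illustrative only). A top-level image such as "centos" or "centos:7" has no 
registry prefix, so Docker implicitly resolves it from the "library" 
namespace, and it should be trusted when "library" is in the trusted list:
{code:java}
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of trusted-image matching (the real code is C).
public class TrustedImageCheck {
  public static boolean isTrusted(String image, List<String> trusted) {
    // Strip the tag, if any: "centos:7" -> "centos".
    // (Simplified: ignores registries that carry a port, e.g. host:5000/img.)
    String noTag = image.contains(":")
        ? image.substring(0, image.indexOf(':')) : image;
    // No '/' means a top-level Docker Hub image, i.e. the "library" namespace.
    String registry = noTag.contains("/")
        ? noTag.substring(0, noTag.indexOf('/')) : "library";
    return trusted.contains(registry);
  }

  public static void main(String[] args) {
    List<String> trusted =
        Arrays.asList("tangzhankun", "ubuntu", "centos", "library");
    System.out.println(isTrusted("centos:7", trusted));               // true
    System.out.println(isTrusted("tangzhankun/tensorflow", trusted)); // true
    System.out.println(isTrusted("evil/image", trusted));             // false
  }
}
{code}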



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9168) DistributedShell client timeout should be -1 by default

2019-01-07 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9168:
---
Attachment: YARN-9168-trunk.001.patch

> DistributedShell client timeout should be -1 by default
> ---
>
> Key: YARN-9168
> URL: https://issues.apache.org/jira/browse/YARN-9168
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Minor
> Attachments: YARN-9168-trunk.001.patch
>
>
> The DS client will force kill an application due to a default 600s timeout. 
> This is confusing. It should be -1 by default to indicate an infinite 
> timeout. If a user wants a timeout, he/she should set it explicitly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9168) DistributedShell client timeout should be -1 by default

2019-01-07 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735902#comment-16735902
 ] 

Zhankun Tang commented on YARN-9168:


[~cheersyang], yeah, agreed. Please take a look at the patch. I made 
timeout > 0 the condition to simplify the code.
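
A minimal sketch of what that looks like in the client's monitoring loop 
(simplified; not the exact DistributedShell Client code):
{code:java}
// Sketch: clientTimeout <= 0 (the new -1 default) means "never force kill".
public class TimeoutSketch {
  public static void main(String[] args) throws InterruptedException {
    long clientStartTime = System.currentTimeMillis();
    long clientTimeout = -1; // new default: no timeout unless the user sets one

    boolean appFinished = false;
    while (!appFinished) {
      Thread.sleep(1000);
      // In the real client: fetch the ApplicationReport here and set
      // appFinished when the app reaches a terminal state.
      appFinished = true; // placeholder so this sketch terminates

      if (clientTimeout > 0
          && System.currentTimeMillis() > clientStartTime + clientTimeout) {
        // Only a user-supplied positive timeout triggers forceKillApplication().
        break;
      }
    }
  }
}
{code}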

> DistributedShell client timeout should be -1 by default
> ---
>
> Key: YARN-9168
> URL: https://issues.apache.org/jira/browse/YARN-9168
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Minor
> Attachments: YARN-9168-trunk.001.patch
>
>
> The DS client will force kill an application due to a default 600s timeout. 
> This is confusing. It should be -1 by default to indicate an infinite 
> timeout. If a user wants a timeout, he/she should set it explicitly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"

2019-01-07 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8927:
---
Summary: Support trust top-level image like "centos" when "library" is 
configured in "docker.trusted.registries"  (was: Better handling of 
"docker.trusted.registries" in container-executor's "trusted_image_check" 
function)

> Support trust top-level image like "centos" when "library" is configured in 
> "docker.trusted.registries"
> ---
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8927-trunk.001.patch
>
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if we run DistributedShell with "tangzhankun/tensorflow":
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But running a DistributedShell job with "centos", "centos[:tagName]", 
> "ubuntu" or "ubuntu[:tagName]" fails.
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need to handle the above cases better.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"

2019-01-07 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8927:
---
Attachment: YARN-8927-trunk.001.patch

> Support trust top-level image like "centos" when "library" is configured in 
> "docker.trusted.registries"
> ---
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8927-trunk.001.patch
>
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if we run DistributedShell with "tangzhankun/tensorflow":
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But running a DistributedShell job with "centos", "centos[:tagName]", 
> "ubuntu" or "ubuntu[:tagName]" fails.
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need to handle the above cases better.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"

2019-01-07 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8927:
---
Attachment: YARN-8927-trunk.002.patch

> Support trust top-level image like "centos" when "library" is configured in 
> "docker.trusted.registries"
> ---
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8927-trunk.001.patch, YARN-8927-trunk.002.patch
>
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if we run DistributedShell with "tangzhankun/tensorflow":
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But running a DistributedShell job with "centos", "centos[:tagName]", 
> "ubuntu" or "ubuntu[:tagName]" fails.
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need to handle the above cases better.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"

2019-01-07 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736650#comment-16736650
 ] 

Zhankun Tang commented on YARN-8927:


[~eyang], thanks for the review! Yeah, this JIRA doesn't consider the local 
image list. If I remember correctly, that will be handled in YARN-8955?

> Support trust top-level image like "centos" when "library" is configured in 
> "docker.trusted.registries"
> ---
>
> Key: YARN-8927
> URL: https://issues.apache.org/jira/browse/YARN-8927
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8927-trunk.001.patch, YARN-8927-trunk.002.patch
>
>
> There are some missing cases that we need to catch when handling 
> "docker.trusted.registries".
> The container-executor.cfg configuration is as follows:
> {code:java}
> docker.trusted.registries=tangzhankun,ubuntu,centos{code}
> It works if we run DistributedShell with "tangzhankun/tensorflow":
> {code:java}
> "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow
> {code}
> But running a DistributedShell job with "centos", "centos[:tagName]", 
> "ubuntu" or "ubuntu[:tagName]" fails.
> The error message is like:
> {code:java}
> "image: centos is not trusted"
> {code}
> We need to handle the above cases better.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9176) [Submarine] Repair 404 error of links in documentation

2019-01-04 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733924#comment-16733924
 ] 

Zhankun Tang commented on YARN-9176:


[~hongdd], thanks for raising this. Could you post a screenshot of the 
corrected links after your patch?

> [Submarine] Repair 404 error of links in documentation 
> -
>
> Key: YARN-9176
> URL: https://issues.apache.org/jira/browse/YARN-9176
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 3.2.0
>Reporter: Dongdong Hong
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: 404 error.jpg, 404.jpg, 
> YARN-9176.-Submarine-Repair-404-error-of-links-in-do.patch
>
>
> Links in src/site/markdown/Examples.md and src/site/markdown/QuickStart.md 
> return 404; repair these links. !404.jpg!!404 error.jpg!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9168) DistributedShell client timeout should be -1 by default

2019-01-02 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-9168:
--

 Summary: DistributedShell client timeout should be -1 by default
 Key: YARN-9168
 URL: https://issues.apache.org/jira/browse/YARN-9168
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Zhankun Tang
Assignee: Zhankun Tang


The DS client will force kill an application due to a default 600s timeout. 
This is confusing. It should be -1 by default to indicate an infinite timeout. 
If a user wants a timeout, he/she should set it explicitly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9190) [Submarine] Submarine job will fail to run as a first job on a new created Hadoop 3.2.0 RC1 cluster

2019-01-10 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739899#comment-16739899
 ] 

Zhankun Tang commented on YARN-9190:


[~billie.rinaldi] , [~eyang] , [~csingh] . Do you know which patch is needed to 
backport to solve this issue? Thanks.

> [Submarine] Submarine job will fail to run as a first job on a new created 
> Hadoop 3.2.0 RC1 cluster
> ---
>
> Key: YARN-9190
> URL: https://issues.apache.org/jira/browse/YARN-9190
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Sunil Govindan
>Priority: Minor
>
> This issue was found when verifying submarine in Hadoop 3.2.0 RC1 planning. 
> The reproduction steps are:
>  # Init a new HDFS and YARN (LinuxContainerExecutor and Docker enabled)
>  # Before running any other YARN service job, use the yarn user to submit a 
> submarine job
> The job will fail with the error below:
>  
> {code:java}
> LogType:serviceam-err.txt
> LogLastModifiedTime:Thu Jan 10 21:15:23 +0800 2019
> LogLength:86
> LogContents:
> Error: Could not find or load main class 
> org.apache.hadoop.yarn.service.ServiceMaster
> End of LogType:serviceam-err.txt
> {code}
> This seems to be because the dependencies are not ready, as the service 
> client reported:
> {code:java}
> 2019-01-10 21:50:47,380 WARN client.ServiceClient: Property 
> yarn.service.framework.path has a value 
> /yarn-services/3.2.0/service-dep.tar.gz, but is not a valid file
> 2019-01-10 21:50:47,381 INFO client.ServiceClient: Uploading all dependency 
> jars to HDFS. For faster submission of apps, set config property 
> yarn.service.framework.path to the dependency tarball location. Dependency 
> tarball can be uploaded to any HDFS path directly or by using command: yarn 
> app -enableFastLaunch []{code}
>  
> When this error happens, I found that there is no “/yarn-services” directory 
> created in HDFS.
> But after I run “yarn app -launch my-sleeper sleeper”, the “/yarn-services” 
> directory is created in HDFS and the submarine job can then run successfully.
> {code:java}
> yarn@master0-VirtualBox:~/apache-hadoop-install-dir/hadoop-dev-workspace$ 
> hdfs dfs -ls /yarn-services/3.2.0/*
> -rwxr-xr-x 1 yarn supergroup 93596476 2019-01-11 08:23 
> /yarn-services/3.2.0/service-dep.tar.gz{code}
> It seems to be an issue of YARN service in 3.2.0 RC1, and I filed this Jira 
> to track it.
>  
> I verified that the trunk branch doesn't have this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9190) [Submarine] Submarine job will fail to run as a first job on a new created Hadoop 3.2.0 RC1 cluster

2019-01-10 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9190:
---
Description: 
This issue was found when verifying submarine in Hadoop 3.2.0 RC1 planning. The 
reproduction steps are:
 # Init a new HDFS and YARN (LinuxContainerExecutor and Docker enabled)
 # Before running any other YARN service job, use the yarn user to submit a 
submarine job

The job will fail with the error below:

 
{code:java}
LogType:serviceam-err.txt
LogLastModifiedTime:Thu Jan 10 21:15:23 +0800 2019
LogLength:86
LogContents:
Error: Could not find or load main class 
org.apache.hadoop.yarn.service.ServiceMaster
End of LogType:serviceam-err.txt
{code}
This seems to be because the dependencies are not ready, as the service client 
reported:
{code:java}
2019-01-10 21:50:47,380 WARN client.ServiceClient: Property 
yarn.service.framework.path has a value 
/yarn-services/3.2.0/service-dep.tar.gz, but is not a valid file
2019-01-10 21:50:47,381 INFO client.ServiceClient: Uploading all dependency 
jars to HDFS. For faster submission of apps, set config property 
yarn.service.framework.path to the dependency tarball location. Dependency 
tarball can be uploaded to any HDFS path directly or by using command: yarn app 
-enableFastLaunch []{code}
 

When this error happens, I found that there is no “/yarn-services” directory 
created in HDFS.

But after I run “yarn app -launch my-sleeper sleeper”, the “/yarn-services” 
directory is created in HDFS and the submarine job can then run successfully.
{code:java}
yarn@master0-VirtualBox:~/apache-hadoop-install-dir/hadoop-dev-workspace$ hdfs 
dfs -ls /yarn-services/3.2.0/*
-rwxr-xr-x 1 yarn supergroup 93596476 2019-01-11 08:23 
/yarn-services/3.2.0/service-dep.tar.gz{code}
It seems to be an issue of YARN service in 3.2.0 RC1, and I filed this Jira to 
track it.

 

I verified that the trunk branch doesn't have this issue.

  was:
This issue was found when verifying submarine in Hadoop 3.2.0 RC1 planning. The 
reproduction steps are:
 # Init a new HDFS and YARN (LinuxContainerExecutor and Docker enabled)
 # Before running any other YARN service job, use the yarn user to submit a 
submarine job

The job will fail with the error below:

 
{code:java}
LogType:serviceam-err.txt
LogLastModifiedTime:Thu Jan 10 21:15:23 +0800 2019
LogLength:86
LogContents:
Error: Could not find or load main class 
org.apache.hadoop.yarn.service.ServiceMaster
End of LogType:serviceam-err.txt
{code}
This seems to be because the dependencies are not ready, as the service client 
reported:
{code:java}
2019-01-10 21:50:47,380 WARN client.ServiceClient: Property 
yarn.service.framework.path has a value 
/yarn-services/3.2.0/service-dep.tar.gz, but is not a valid file
2019-01-10 21:50:47,381 INFO client.ServiceClient: Uploading all dependency 
jars to HDFS. For faster submission of apps, set config property 
yarn.service.framework.path to the dependency tarball location. Dependency 
tarball can be uploaded to any HDFS path directly or by using command: yarn app 
-enableFastLaunch []{code}
 

When this error happens, I found that there is no “/yarn-services” directory 
created in HDFS.

But after I run “yarn app -launch my-sleeper sleeper”, the “/yarn-services” 
directory is created in HDFS and the submarine job can then run successfully.

It seems to be an issue of YARN service in 3.2.0 RC1, and I filed this Jira to 
track it.

 

I verified that the trunk branch doesn't have this issue.


> [Submarine] Submarine job will fail to run as a first job on a new created 
> Hadoop 3.2.0 RC1 cluster
> ---
>
> Key: YARN-9190
> URL: https://issues.apache.org/jira/browse/YARN-9190
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Sunil Govindan
>Priority: Minor
>
> This issue was found when verifying submarine in Hadoop 3.2.0 RC1 planning. 
> The reproduction steps are:
>  # Init a new HDFS and YARN (LinuxContainerExecutor and Docker enabled)
>  # Before running any other YARN service job, use the yarn user to submit a 
> submarine job
> The job will fail with the error below:
>  
> {code:java}
> LogType:serviceam-err.txt
> LogLastModifiedTime:Thu Jan 10 21:15:23 +0800 2019
> LogLength:86
> LogContents:
> Error: Could not find or load main class 
> org.apache.hadoop.yarn.service.ServiceMaster
> End of LogType:serviceam-err.txt
> {code}
> This seems to be because the dependencies are not ready, as the service 
> client reported:
> {code:java}
> 2019-01-10 21:50:47,380 WARN client.ServiceClient: Property 
> yarn.service.framework.path has a value 
> /yarn-services/3.2.0/service-dep.tar.gz, but is not a valid file
> 2019-01-10 21:50:47,381 INFO client.ServiceClient: Uploading all dependency 
> jars to HDFS. For faster submission of apps, set config property 
> yarn.service.framework.path to the dependency tarball location. Dependency 
> tarball can be uploaded to any HDFS path directly or by using command: yarn 
> app -enableFastLaunch []{code}

[jira] [Commented] (YARN-9190) [Submarine] Submarine job will fail to run as a first job on a new created Hadoop 3.2.0 RC1 cluster

2019-01-13 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16741752#comment-16741752
 ] 

Zhankun Tang commented on YARN-9190:


[~billie.rinaldi], thanks for the reply!

One thing I forgot to mention is that I use the yarn script to run the 
submarine job for both the 3.2.0 RC1 and trunk (3.3) builds.
{code:java}
yarn jar 
$HADOOP_COMMON_HOME/share/hadoop/yarn/hadoop-yarn-submarine-${VERSION}.jar job 
run ...{code}
It's not clear to me why, given that both branches' "yarn" scripts already set 
"service.libdir", the same submarine run script fails on 3.2.0 RC1.

> [Submarine] Submarine job will fail to run as a first job on a new created 
> Hadoop 3.2.0 RC1 cluster
> ---
>
> Key: YARN-9190
> URL: https://issues.apache.org/jira/browse/YARN-9190
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Zhankun Tang
>Assignee: Sunil Govindan
>Priority: Minor
>
> This issue was found when verifying submarine in Hadoop 3.2.0 RC1 planning. 
> The reproduction steps are:
>  # Init a new HDFS and YARN (LinuxContainerExecutor and Docker enabled)
>  # Before running any other YARN service job, use the yarn user to submit a 
> submarine job
> The job will fail with the error below:
>  
> {code:java}
> LogType:serviceam-err.txt
> LogLastModifiedTime:Thu Jan 10 21:15:23 +0800 2019
> LogLength:86
> LogContents:
> Error: Could not find or load main class 
> org.apache.hadoop.yarn.service.ServiceMaster
> End of LogType:serviceam-err.txt
> {code}
> This seems to be because the dependencies are not ready, as the service 
> client reported:
> {code:java}
> 2019-01-10 21:50:47,380 WARN client.ServiceClient: Property 
> yarn.service.framework.path has a value 
> /yarn-services/3.2.0/service-dep.tar.gz, but is not a valid file
> 2019-01-10 21:50:47,381 INFO client.ServiceClient: Uploading all dependency 
> jars to HDFS. For faster submission of apps, set config property 
> yarn.service.framework.path to the dependency tarball location. Dependency 
> tarball can be uploaded to any HDFS path directly or by using command: yarn 
> app -enableFastLaunch []{code}
>  
> When this error happens, I found that there is no “/yarn-services” directory 
> created in HDFS.
> But after I run “yarn app -launch my-sleeper sleeper”, the “/yarn-services” 
> directory is created in HDFS and the submarine job can then run successfully.
> {code:java}
> yarn@master0-VirtualBox:~/apache-hadoop-install-dir/hadoop-dev-workspace$ 
> hdfs dfs -ls /yarn-services/3.2.0/*
> -rwxr-xr-x 1 yarn supergroup 93596476 2019-01-11 08:23 
> /yarn-services/3.2.0/service-dep.tar.gz{code}
> It seems to be an issue of YARN service in 3.2.0 RC1, and I filed this Jira 
> to track it.
>  
> I verified that the trunk branch doesn't have this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8725) Submarine job staging directory has a lot of useless PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple times

2018-09-17 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8725:
---
Attachment: YARN-8725-trunk.001.patch

> Submarine job staging directory has a lot of useless 
> PRIMARY_WORKER-launch-script-***.sh  scripts when submitting a job multiple 
> times
> --
>
> Key: YARN-8725
> URL: https://issues.apache.org/jira/browse/YARN-8725
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8725-trunk.001.patch
>
>
> Submarine jobs upload core-site.xml, hdfs-site.xml, job.info and 
> PRIMARY_WORKER-launch-script.sh to the staging dir.
> The core-site.xml, hdfs-site.xml and job.info would be overwritten if a job 
> is submitted multiple times.
> But PRIMARY_WORKER-launch-script.sh would not be overwritten, as it has 
> random numbers in its name.
> The files in the staging dir are as follows:
> {code:java}
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:11 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script6954941665090337726.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:02 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script7037369696166769734.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:06 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8047707294763488040.sh
> -rw-r- 2 hadoop hdfs 15225 2018-08-17 18:46 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8122565781159446375.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-16 20:48 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8598604480700049845.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 14:53 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script971703616848859353.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:16 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script990214235580089093.sh
> -rw-r- 2 hadoop hdfs 8815 2018-08-27 15:54 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/core-site.xml
> -rw-r- 2 hadoop hdfs 11583 2018-08-27 15:54 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/hdfs-site.xml
> -rw-rw-rw- 2 hadoop hdfs 846 2018-08-22 10:56 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/job.info
> {code}
>  
> We should stop the staging dir from growing or have a way to clean it up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8725) Submarine job staging directory has a lot of useless PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple times

2018-09-17 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617600#comment-16617600
 ] 

Zhankun Tang commented on YARN-8725:


Added a patch which does the following:
 # Add a new option "--keep_staging_dir". It's false by default, so we clean 
up the staging directory after the job finishes.
 # Add a unit test case through "MockRemoteDirectoryManager".
 # Change the staging dir creation in the existing unit test, because 
"cleanupStagingDir" needs a real directory in the local fs to work.

Please help review. [~wangda] [~sunilg] [~yuan_zac]
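
Conceptually, the cleanup is small; a hedged sketch (not the exact patch code):
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch of the staging-dir cleanup idea (not the exact patch).
public class StagingDirCleanup {
  public static void cleanupStagingDir(Configuration conf, Path stagingDir,
      boolean keepStagingDir) throws IOException {
    if (keepStagingDir) {
      return; // the user asked to keep artifacts, e.g. for debugging
    }
    FileSystem fs = stagingDir.getFileSystem(conf);
    // Recursively remove the per-job staging directory after the job
    // finishes, so PRIMARY_WORKER-launch-script*.sh files stop accumulating.
    fs.delete(stagingDir, true);
  }
}
{code}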

> Submarine job staging directory has a lot of useless 
> PRIMARY_WORKER-launch-script-***.sh  scripts when submitting a job multiple 
> times
> --
>
> Key: YARN-8725
> URL: https://issues.apache.org/jira/browse/YARN-8725
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8725-trunk.001.patch
>
>
> Submarine jobs upload core-site.xml, hdfs-site.xml, job.info and 
> PRIMARY_WORKER-launch-script.sh to the staging dir.
> The core-site.xml, hdfs-site.xml and job.info would be overwritten if a job 
> is submitted multiple times.
> But PRIMARY_WORKER-launch-script.sh would not be overwritten, as it has 
> random numbers in its name.
> The files in the staging dir are as follows:
> {code:java}
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:11 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script6954941665090337726.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:02 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script7037369696166769734.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:06 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8047707294763488040.sh
> -rw-r- 2 hadoop hdfs 15225 2018-08-17 18:46 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8122565781159446375.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-16 20:48 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8598604480700049845.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 14:53 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script971703616848859353.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:16 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script990214235580089093.sh
> -rw-r- 2 hadoop hdfs 8815 2018-08-27 15:54 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/core-site.xml
> -rw-r- 2 hadoop hdfs 11583 2018-08-27 15:54 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/hdfs-site.xml
> -rw-rw-rw- 2 hadoop hdfs 846 2018-08-22 10:56 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/job.info
> {code}
>  
> We should stop the staging dir from growing or have a way to clean it up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8725) Submarine job staging directory has a lot of useless PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple times

2018-09-17 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617600#comment-16617600
 ] 

Zhankun Tang edited comment on YARN-8725 at 9/17/18 2:36 PM:
-

Added a patch which does the following:
 # Add a new option "--keep_staging_dir". It's false by default, so we clean 
up the staging directory after the job finishes.
 # Add a unit test case through "MockRemoteDirectoryManager".
 # Change the staging dir creation in the existing unit test, because 
"cleanupStagingDir" needs a real directory in the local fs to work.

Please help review. [~wangda] [~sunilg] [~yuan_zac]


was (Author: tangzhankun):
Added a patch which does following:
 # add a new option "--keep_staging_dir". It's false by default so that we'll 
clean up the staging directory after job finish
 # added unit test case through "MockRemoteDirectoryManager".
 # Changes(staging dir creation) to existing unit test due to the need for a 
real directory in local fs for "cleanupStagingDir" to work

Please help review. [~wangda] [~sunilg] [~yuan_zac]

> Submarine job staging directory has a lot of useless 
> PRIMARY_WORKER-launch-script-***.sh  scripts when submitting a job multiple 
> times
> --
>
> Key: YARN-8725
> URL: https://issues.apache.org/jira/browse/YARN-8725
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8725-trunk.001.patch
>
>
> Submarine jobs upload core-site.xml, hdfs-site.xml, job.info and 
> PRIMARY_WORKER-launch-script.sh to the staging dir.
> The core-site.xml, hdfs-site.xml and job.info would be overwritten if a job 
> is submitted multiple times.
> But PRIMARY_WORKER-launch-script.sh would not be overwritten, as it has 
> random numbers in its name.
> The files in the staging dir are as follows:
> {code:java}
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:11 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script6954941665090337726.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:02 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script7037369696166769734.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:06 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8047707294763488040.sh
> -rw-r- 2 hadoop hdfs 15225 2018-08-17 18:46 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8122565781159446375.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-16 20:48 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8598604480700049845.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 14:53 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script971703616848859353.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:16 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script990214235580089093.sh
> -rw-r- 2 hadoop hdfs 8815 2018-08-27 15:54 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/core-site.xml
> -rw-r- 2 hadoop hdfs 11583 2018-08-27 15:54 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/hdfs-site.xml
> -rw-rw-rw- 2 hadoop hdfs 846 2018-08-22 10:56 
> hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/job.info
> {code}
>  
> We should stop the staging dir from growing or have a way to clean it up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7715) Support NM promotion/demotion of running containers.

2018-12-10 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714443#comment-16714443
 ] 

Zhankun Tang commented on YARN-7715:


[~miklos.szeg...@cloudera.com], [~asuresh],
Does this JIRA depend on YARN-5085? Why is YARN-5085 merged into branches 2.9.0 
and 3.0.0, while this JIRA is only merged into branch 3.2.0?

> Support NM promotion/demotion of running containers.
> 
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch, YARN-7715.003.patch, YARN-7715.004.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9060) [YARN-8851] Phase 1 - Support device isolation in native container-executor

2018-12-10 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9060:
---
Attachment: YARN-9060-trunk.004.patch

> [YARN-8851] Phase 1 - Support device isolation in native container-executor
> ---
>
> Key: YARN-9060
> URL: https://issues.apache.org/jira/browse/YARN-9060
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9060-trunk.001.patch, YARN-9060-trunk.002.patch, 
> YARN-9060-trunk.003.patch, YARN-9060-trunk.004.patch
>
>
> Due to the cgroups v1 implementation policy in the Linux kernel, we cannot 
> update the value of the device cgroups controller unless we have root 
> permission 
> ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]).
>  So we need to support this in container-executor for the Java layer to 
> invoke.
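
For context, a toy illustration of why root is needed (hypothetical paths and 
names; not the actual container-executor interface):
{code:java}
// Illustrative only: the NM must write device cgroup rules such as
// "c 195:1 rwm" into devices.deny under the container's cgroup path, and the
// kernel only permits that write for root. Hence a setuid native helper
// (container-executor) performs it on behalf of the Java layer.
public class DeviceCgroupSketch {
  public static void main(String[] args) {
    String cgroupFile =
        "/sys/fs/cgroup/devices/hadoop-yarn/container_x/devices.deny";
    String denyRule = "c 195:1 rwm"; // example: deny one NVIDIA device node
    System.out.println("would ask container-executor to write '" + denyRule
        + "' into " + cgroupFile);
  }
}
{code}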



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9099) GpuResourceAllocator.getReleasingGpus calculates number of GPUs in a wrong way

2018-12-10 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714513#comment-16714513
 ] 

Zhankun Tang commented on YARN-9099:


[~snemeth], thanks for catching this! The patch looks good to me, and a test 
case would make it better.

> GpuResourceAllocator.getReleasingGpus calculates number of GPUs in a wrong way
> --
>
> Key: YARN-9099
> URL: https://issues.apache.org/jira/browse/YARN-9099
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9099.001.patch
>
>
> getReleasingGpus plays an important role in the calculation which happens 
> when GpuAllocator assigns GPUs to a container; see 
> GpuResourceAllocator#internalAssignGpus.
> If multiple GPUs are assigned to the same container, getReleasingGpus will 
> return an invalid number.
> The iterator goes over the mappings of (GPU device, container ID) and 
> retrieves the container by its ID as many times as the container ID is 
> mapped to any device.
> Then for every container, the resource value for the GPU resource is added to 
> a running sum.
> Obviously, if a container is mapped to 2 or more devices, then the 
> container's GPU resource counter is added to the running sum as many times as 
> the number of GPU devices the container has.
> Example: 
> Let's suppose {{usedDevices}} contains these mappings: 
> - (GPU1, container1)
> - (GPU2, container1)
> - (GPU3, container2)
> GPU resource value is 2 for container1 and 
> GPU resource value is 1 for container2.
> Then, if container1 is in a running state, getReleasingGpus will return 4 
> instead of 2.
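
A minimal sketch of the fix idea, using simplified types rather than the real 
scheduler classes: deduplicate container IDs before summing, so each 
container's GPU count is added once.
{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: count each releasing container only once (simplified types).
public class ReleasingGpusSketch {
  static long getReleasingGpus(Map<Integer, Long> usedDevices,
      Map<Long, Long> gpusPerContainer, Set<Long> releasingContainers) {
    long releasing = 0;
    Set<Long> seen = new HashSet<>();
    for (Long containerId : usedDevices.values()) {
      // A container may own several devices; seen.add() dedups it.
      if (releasingContainers.contains(containerId) && seen.add(containerId)) {
        releasing += gpusPerContainer.get(containerId);
      }
    }
    return releasing;
  }

  public static void main(String[] args) {
    Map<Integer, Long> usedDevices = new HashMap<>();
    usedDevices.put(1, 1L); // GPU1 -> container1
    usedDevices.put(2, 1L); // GPU2 -> container1
    usedDevices.put(3, 2L); // GPU3 -> container2
    Map<Long, Long> gpus = new HashMap<>();
    gpus.put(1L, 2L); // container1's GPU resource value is 2
    gpus.put(2L, 1L); // container2's GPU resource value is 1
    Set<Long> releasing = new HashSet<>();
    releasing.add(1L); // container1 is releasing
    // Prints 2 (not 4), matching the example in the description.
    System.out.println(getReleasingGpus(usedDevices, gpus, releasing));
  }
}
{code}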



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9104) Fix the bug in DeviceMappingManager#getReleasingDevices

2018-12-10 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-9104:
--

 Summary: Fix the bug in DeviceMappingManager#getReleasingDevices
 Key: YARN-9104
 URL: https://issues.apache.org/jira/browse/YARN-9104
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


When one container is assigned multiple devices and is in the releasing state, 
looping over the same containerId sums its releasing device count multiple 
times. This is the same bug as the one mentioned in YARN-9099.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9103) Fix the bug in DeviceMappingManager#getReleasingDevices

2018-12-10 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-9103:
--

 Summary: Fix the bug in DeviceMappingManager#getReleasingDevices
 Key: YARN-9103
 URL: https://issues.apache.org/jira/browse/YARN-9103
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


When one container is assigned multiple devices and is in the releasing state, 
looping over the same containerId sums its releasing device count multiple 
times. This is the same bug as the one mentioned in YARN-9099.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9104) Fix the bug in DeviceMappingManager#getReleasingDevices

2018-12-10 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-9104.

Resolution: Duplicate

Resolving this because the JIRA was accidentally created twice.

> Fix the bug in DeviceMappingManager#getReleasingDevices
> ---
>
> Key: YARN-9104
> URL: https://issues.apache.org/jira/browse/YARN-9104
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> When one container is assigned multiple devices and is in the releasing 
> state, looping over the same containerId sums its releasing device count 
> multiple times. This is the same bug as the one mentioned in YARN-9099.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9033) ResourceHandlerChain#bootstrap is invoked twice during NM start if LinuxContainerExecutor enabled

2018-12-19 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725010#comment-16725010
 ] 

Zhankun Tang commented on YARN-9033:


[~snemeth], thanks for looking at this. 
{quote}"But actually, the "updateContainer" invocation in YARN-7715 depend on 
containerId's cgroups path creation in "preStart" method which only happens 
when we use "LinuxContainerExecutor"."

Where can I find this code part / what should I check?
{quote}
You can just test with LCE disabled but cGroupsMemoryResourceHandlerImpl 
enabled to try if YARN-7715 works. Per my testing, it doesn't work.

Or understand that "updateContainer" in YARN-7715 is actually doing cgroups 
update. This cgroups update depend on an existing cgroups path. Take 
cGroupsMemoryResourceHandlerImpl for instance,

The cGroupsMemoryResourceHandlerImpl#preStart created the memory cgroups path. 
And cGroupsMemoryResourceHandlerImpl#updateContainer update cgroups value in 
this path.

But the preStart can only be invoked by LCE using ResourceHandlerChain's 
preStart. So YARN-7715 depend on LCE enabled. It shouldn't bootstrap 
ResourceHandleChain again. Not sure if this makes sense to you.
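
In other words, the dependency looks like this (a conceptual sketch, not the 
real handler code):
{code:java}
// Conceptual sketch of the preStart/updateContainer dependency.
public class MemoryHandlerSketch {
  // preStart creates the per-container cgroup path. It is only reached
  // through ResourceHandlerChain#preStart, which only LCE invokes.
  public void preStart(String containerId) {
    createCgroupDir("/sys/fs/cgroup/memory/hadoop-yarn/" + containerId);
  }

  // updateContainer writes into that same path; without a prior preStart
  // (i.e. without LCE), the path does not exist and the update fails.
  public void updateContainer(String containerId, long limitBytes) {
    writeCgroupFile("/sys/fs/cgroup/memory/hadoop-yarn/" + containerId
        + "/memory.limit_in_bytes", Long.toString(limitBytes));
  }

  private void createCgroupDir(String path) { /* mkdir, omitted */ }
  private void writeCgroupFile(String path, String value) { /* write, omitted */ }
}
{code}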

> ResourceHandlerChain#bootstrap is invoked twice during NM start if 
> LinuxContainerExecutor enabled
> -
>
> Key: YARN-9033
> URL: https://issues.apache.org/jira/browse/YARN-9033
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9033-trunk.001.patch, YARN-9033-trunk.002.patch
>
>
> ResourceHandlerChain#bootstrap is always invoked in the NM's 
> ContainerScheduler#serviceInit (introduced by YARN-7715).
> So if LCE is enabled, ResourceHandlerChain#bootstrap is invoked first and 
> then invoked again in ContainerScheduler#serviceInit.
> But actually, the "updateContainer" invocation in YARN-7715 depends on the 
> containerId's cgroups path creation in the "preStart" method, which only 
> happens when we use "LinuxContainerExecutor". So the bootstrap of 
> ResourceHandlerChain shouldn't happen in ContainerScheduler#serviceInit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9060) [YARN-8851] Phase 1 - Support device isolation in native container-executor

2019-01-26 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9060:
---
Attachment: YARN-9060-trunk.012.patch

> [YARN-8851] Phase 1 - Support device isolation in native container-executor
> ---
>
> Key: YARN-9060
> URL: https://issues.apache.org/jira/browse/YARN-9060
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9060-trunk.001.patch, YARN-9060-trunk.002.patch, 
> YARN-9060-trunk.003.patch, YARN-9060-trunk.004.patch, 
> YARN-9060-trunk.005.patch, YARN-9060-trunk.006.patch, 
> YARN-9060-trunk.007.patch, YARN-9060-trunk.008.patch, 
> YARN-9060-trunk.009.patch, YARN-9060-trunk.010.patch, 
> YARN-9060-trunk.011.patch, YARN-9060-trunk.012.patch
>
>
> Due to the cgroups v1 implementation policy in the Linux kernel, we cannot 
> update the value of the device cgroups controller unless we have root 
> permission 
> ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]).
>  So we need to support this in container-executor for the Java layer to 
> invoke.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9060) [YARN-8851] Phase 1 - Support device isolation in native container-executor

2019-01-24 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9060:
---
Attachment: YARN-9060-trunk.011.patch

> [YARN-8851] Phase 1 - Support device isolation in native container-executor
> ---
>
> Key: YARN-9060
> URL: https://issues.apache.org/jira/browse/YARN-9060
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9060-trunk.001.patch, YARN-9060-trunk.002.patch, 
> YARN-9060-trunk.003.patch, YARN-9060-trunk.004.patch, 
> YARN-9060-trunk.005.patch, YARN-9060-trunk.006.patch, 
> YARN-9060-trunk.007.patch, YARN-9060-trunk.008.patch, 
> YARN-9060-trunk.009.patch, YARN-9060-trunk.010.patch, 
> YARN-9060-trunk.011.patch
>
>
> Due to the cgroups v1 implementation policy in the Linux kernel, we cannot 
> update the value of the device cgroups controller unless we have root 
> permission 
> ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]).
>  So we need to support this in container-executor for the Java layer to 
> invoke.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9060) [YARN-8851] Phase 1 - Support device isolation in native container-executor

2019-01-24 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9060:
---
Attachment: YARN-9060-trunk.010.patch

> [YARN-8851] Phase 1 - Support device isolation in native container-executor
> ---
>
> Key: YARN-9060
> URL: https://issues.apache.org/jira/browse/YARN-9060
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9060-trunk.001.patch, YARN-9060-trunk.002.patch, 
> YARN-9060-trunk.003.patch, YARN-9060-trunk.004.patch, 
> YARN-9060-trunk.005.patch, YARN-9060-trunk.006.patch, 
> YARN-9060-trunk.007.patch, YARN-9060-trunk.008.patch, 
> YARN-9060-trunk.009.patch, YARN-9060-trunk.010.patch
>
>
> Due to the cgroups v1 implementation policy in the Linux kernel, we cannot 
> update the value of the device cgroups controller unless we have root 
> permission 
> ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]).
>  So we need to support this in container-executor for the Java layer to 
> invoke.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8821) GPU hierarchy/topology scheduling support

2019-01-26 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Attachment: YARN-8821-trunk.001.patch

> GPU hierarchy/topology scheduling support
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8821-trunk.001.patch
>
>
> GPU topology affects performance dramatically. There's been a discussion in 
> YARN-7481, but we'd like to move related discussions here.
> Please note that YARN-8851 will provide a pluggable device framework which 
> can support plugging in a custom scheduler. Based on the framework, the GPU 
> plugin could have its own topology scheduler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8821) GPU hierarchy/topology scheduling support

2019-01-26 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Description: 
GPU topology affects performance dramatically. There's been a discussion in 
YARN-7481, but we'd like to move related discussions here.

Please note that YARN-8851 will provide a pluggable device framework which can 
support plugging in a custom scheduler. Based on the framework, the GPU plugin 
could have its own topology scheduler.

  was:
GPU topology affects performance dramatically. There's been a discussion in 
YARN-7481, but we'd like to move related discussions here.

Please note that YARN-8851 will provide a pluggable device framework with a 
shared scheduler that could support default topology scheduling. Based on the 
framework, the GPU plugin could have a custom scheduler too.


> GPU hierarchy/topology scheduling support
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> GPU topology affects performance dramatically. There's been a discussion in 
> YARN-7481, but we'd like to move related discussions here.
> Please note that YARN-8851 will provide a pluggable device framework which 
> can support plugging in a custom scheduler. Based on the framework, the GPU 
> plugin could have its own topology scheduler.
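
To show what an "own topology scheduler" could mean concretely, here is a 
hedged sketch of a plugin-side scheduler hook (the interface shape is assumed 
for illustration; the real API is defined by the YARN-8851 framework):
{code:java}
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Assumed interface shape, for illustration only.
interface DeviceScheduler {
  Set<Integer> allocate(Set<Integer> availableMinorNumbers, int count);
}

// Toy topology-aware policy: prefer devices likely to be "close" together.
class PairPreferringGpuScheduler implements DeviceScheduler {
  @Override
  public Set<Integer> allocate(Set<Integer> available, int count) {
    // Placeholder heuristic: take the lowest minor numbers, which on many
    // boxes sit under the same PCIe switch. A real implementation would
    // consult the actual topology (e.g. "nvidia-smi topo -m").
    Set<Integer> picked = new LinkedHashSet<>();
    Iterator<Integer> it = available.stream().sorted().iterator();
    while (it.hasNext() && picked.size() < count) {
      picked.add(it.next());
    }
    return picked;
  }
}
{code}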



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9205) When using custom resource type, application will fail to run due to the CapacityScheduler throws InvalidResourceRequestException(GREATER_THEN_MAX_ALLOCATION)

2019-01-22 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748839#comment-16748839
 ] 

Zhankun Tang commented on YARN-9205:


[~sunilg], [~leftnoteasy],

The v08.patch is the latest patch for review, pending the Jenkins result.

It changes YarnConfiguration to load resource-types.xml, and the test cases 
related to it are moved to a new test file to avoid unknown conflicts with 
other test cases.

The errors in TestVolumeProcessor are resolved with the help of [~cheersyang].

The testCSMetrics test case passes locally, but I haven't found the reason why 
it fails in Jenkins.
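
For reference, the resource type setup from the reproduction steps looks 
roughly like this (values taken from the description; file contents otherwise 
assumed typical):
{code:xml}
<!-- resource-types.xml (RM/client side) -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>cmp.com/hdw</value>
  </property>
</configuration>

<!-- node-resources.xml (on each NM) -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource-type.cmp.com/hdw</name>
    <value>10</value>
  </property>
</configuration>
{code}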

> When using custom resource type, application will fail to run due to the 
> CapacityScheduler throws 
> InvalidResourceRequestException(GREATER_THEN_MAX_ALLOCATION) 
> ---
>
> Key: YARN-9205
> URL: https://issues.apache.org/jira/browse/YARN-9205
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Critical
> Attachments: YARN-9205-trunk.001.patch, YARN-9205-trunk.002.patch, 
> YARN-9205-trunk.003.patch, YARN-9205-trunk.004.patch, 
> YARN-9205-trunk.005.patch, YARN-9205-trunk.006.patch, 
> YARN-9205-trunk.007.patch, YARN-9205-trunk.008.patch
>
>
> In a non-secure cluster, reproduce it as follows:
>  # Set the capacity scheduler in yarn-site.xml
>  # Use the default capacity-scheduler.xml
>  # Set the custom resource type "cmp.com/hdw" in resource-types.xml
>  # Set a value, say 10, in node-resources.xml
>  # Start the cluster
>  # Submit a distributed shell application which requests some "cmp.com/hdw" 
> (a request sketch follows the list)
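> A minimal sketch of such an AM-side request (illustrative values; it assumes 
> "cmp.com/hdw" is already declared in resource-types.xml and that amRMClient 
> is an initialized AMRMClient):
> {code:java}
> import org.apache.hadoop.yarn.api.records.Priority;
> import org.apache.hadoop.yarn.api.records.Resource;
> import org.apache.hadoop.yarn.client.api.AMRMClient;
> 
> // Ask for one "cmp.com/hdw" alongside memory and vcores; on the
> // CapacityScheduler this request triggers the exception below.
> Resource capability = Resource.newInstance(2048, 2);
> capability.setResourceValue("cmp.com/hdw", 1);
> amRMClient.addContainerRequest(
>     new AMRMClient.ContainerRequest(capability, null, null,
>         Priority.newInstance(0)));
> {code}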
> The AM will get an exception from CapacityScheduler and then fail. This bug 
> doesn't exist in FairScheduler.
> {code:java}
> 2019-01-17 22:12:11,286 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[ 2>]Priority[0]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: 
> GUARANTEED, Enforce Execution Type: false}]Resource Profile[]
> 2019-01-17 22:12:12,326 ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[cmp.com/hdw], 
> Requested resource=, maximum allowed 
> allocation=, please note that maximum allowed 
> allocation is calculated by scheduler based on maximum resource of registered 
> NodeManagers, which might be less than configured maximum 
> allocation=
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.throwInvalidResourceException(SchedulerUtils.java:492)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkResourceRequestAgainstAvailableResource(SchedulerUtils.java:388)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:315)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:293)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:301)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:240)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> ...{code}
> After some rough debugging, the method below returns the wrong maximum 
> capacity.
> DefaultAMSProcessor.java, Line 234.
> {code:java}
> Resource maximumCapacity =
>  getScheduler().getMaximumResourceCapability(app.getQueue());{code}
> The above code seemingly should return "" 
> but returns "".
> This incorrect value might be caused by the queue maximum allocation 
> calculation introduced in YARN-8720:
> AbstractCSQueue.java Line364
> {code:java}
> this.maximumAllocation =
>  configuration.getMaximumAllocationPerQueue(
>  getQueuePath());{code}
> And this invokes CapacitySchedulerConfiguration.java Line 895:
> {code:java}
> Resource clusterMax = 

[jira] [Updated] (YARN-9060) [YARN-8851] Phase 1 - Support device isolation in native container-executor

2019-01-27 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9060:
---
Description: 
Due to the cgroups v1 implementation policy in the Linux kernel, we cannot 
update the values of the device cgroups controller unless we have root 
permission 
([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]).
 So we need to support this in container-executor for the Java layer to invoke 
(a sketch of the device-entry format follows the list below).

This Jira will have three parts:
 # native c-e module
 # Java layer code to isolate devices for container (docker and non-docker)
 # A sample Nvidia GPU plugin
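
For illustration only (not the patch code), the cgroups v1 device-controller 
entries involved follow the kernel's "TYPE MAJOR:MINOR ACCESS" syntax; the 
class and method below are hypothetical. Only a root process such as 
container-executor can write these entries into a container's devices.deny 
file:
{code:java}
import java.util.List;
import java.util.stream.Collectors;

public class DeviceDenyEntries {
  // Conventional major number of Nvidia device nodes (/dev/nvidia0, ...).
  private static final int NVIDIA_MAJOR = 195;

  // Build "c 195:<minor> rwm" deny entries for the GPUs a container must
  // not access; the privileged container-executor then writes them into
  // the container's devices.deny file under /sys/fs/cgroup/devices.
  public static List<String> forDeniedGpus(List<Integer> deniedMinors) {
    return deniedMinors.stream()
        .map(minor -> "c " + NVIDIA_MAJOR + ":" + minor + " rwm")
        .collect(Collectors.toList());
  }
}
{code}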

  was:Due to the cgroups v1 implementation policy in the Linux kernel, we 
cannot update the values of the device cgroups controller unless we have root 
permission 
([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]).
 So we need to support this in container-executor for the Java layer to invoke.


> [YARN-8851] Phase 1 - Support device isolation in native container-executor
> ---
>
> Key: YARN-9060
> URL: https://issues.apache.org/jira/browse/YARN-9060
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9060-trunk.001.patch, YARN-9060-trunk.002.patch, 
> YARN-9060-trunk.003.patch, YARN-9060-trunk.004.patch, 
> YARN-9060-trunk.005.patch, YARN-9060-trunk.006.patch, 
> YARN-9060-trunk.007.patch, YARN-9060-trunk.008.patch, 
> YARN-9060-trunk.009.patch, YARN-9060-trunk.010.patch, 
> YARN-9060-trunk.011.patch, YARN-9060-trunk.012.patch
>
>
> Due to the cgroups v1 implementation policy in the Linux kernel, we cannot 
> update the values of the device cgroups controller unless we have root 
> permission 
> ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]).
>  So we need to support this in container-executor for the Java layer to 
> invoke.
> This Jira will have three parts:
>  # native c-e module
>  # Java layer code to isolate devices for container (docker and non-docker)
>  # A sample Nvidia GPU plugin



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9060) [YARN-8851] Phase 1 - Support device isolation and use the Nvidia GPU plugin as an example

2019-01-27 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9060:
---
Summary: [YARN-8851] Phase 1 - Support device isolation and use the Nvidia 
GPU plugin as an example  (was: [YARN-8851] Phase 1 - Support device isolation 
in native container-executor)

> [YARN-8851] Phase 1 - Support device isolation and use the Nvidia GPU plugin 
> as an example
> --
>
> Key: YARN-9060
> URL: https://issues.apache.org/jira/browse/YARN-9060
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9060-trunk.001.patch, YARN-9060-trunk.002.patch, 
> YARN-9060-trunk.003.patch, YARN-9060-trunk.004.patch, 
> YARN-9060-trunk.005.patch, YARN-9060-trunk.006.patch, 
> YARN-9060-trunk.007.patch, YARN-9060-trunk.008.patch, 
> YARN-9060-trunk.009.patch, YARN-9060-trunk.010.patch, 
> YARN-9060-trunk.011.patch, YARN-9060-trunk.012.patch
>
>
> Due to the cgroups v1 implementation policy in the Linux kernel, we cannot 
> update the values of the device cgroups controller unless we have root 
> permission 
> ([here|https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/security/device_cgroup.c#L604]).
>  So we need to support this in container-executor for the Java layer to 
> invoke.
> This Jira will have three parts:
>  # native c-e module
>  # Java layer code to isolate devices for container (docker and non-docker)
>  # A sample Nvidia GPU plugin



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8821) GPU hierarchy/topology scheduling support

2019-02-18 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Attachment: YARN-8821-trunk.007.patch

> GPU hierarchy/topology scheduling support
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8821-trunk.001.patch, YARN-8821-trunk.002.patch, 
> YARN-8821-trunk.003.patch, YARN-8821-trunk.004.patch, 
> YARN-8821-trunk.005.patch, YARN-8821-trunk.006.patch, 
> YARN-8821-trunk.007.patch
>
>
> h2. Background
> GPU topology affects performance. There's been a discussion in YARN-7481, 
> but we'd like to move the related discussions here.
> And please note that YARN-8851 will provide a pluggable device framework 
> which can support plugging in a custom scheduler. Based on that framework, 
> the GPU plugin could have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch has a topology algorithm implemented as below:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are all pairs of GPUs and whose values are 
> the communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
> cost is set based on the connection type.
> *Step 2*. It then constructs a _+cost table+_ which caches all combinations 
> of GPUs and the corresponding cost between them. The cost table is a map 
> whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the outer map is the count of GPUs; its value is a map whose key 
> is a combination of GPUs and whose value is the calculated communication 
> cost of that combination. The cost calculation sums the costs of all 
> non-duplicate pairs of GPUs. For instance, the total cost of GPUs [0,1,2] is 
> the sum of the costs "0 - 1", "0 - 2" and "1 - 2". Each pair's cost comes 
> from the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies which a container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" means it prefers faster GPU-GPU communication. "SPREAD" 
> means it prefers faster CPU-GPU communication (since the GPUs then do not 
> share the same bus to the CPU). The key difference between the two policies 
> is the sort order of the inner map in the cost table. For instance, let's 
> assume 2 GPUs are wanted. costTable.get(2) would return a map containing all 
> combinations of two GPUs and their cost. If the policy is "PACK", we sort 
> the map by cost in ascending order; the first entry will be the GPUs with 
> the minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in 
> descending order and take the first entry, which has the highest GPU-GPU 
> cost and hence the lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK 
> policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
> (P3), has been done. 
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU 
> combination can get a *5% to 185%* *performance gain* among the test cases 
> with various factors including CNN model, batch size, GPU subset, etc.
> 2. The "inception3" and "resnet50" networks seem not to be 
> topology-sensitive. Topology scheduling can only potentially get *about 10%* 
> speedup for them.
> 3. Our current version of the topology scheduling algorithm can achieve a 
> *3% to 140%* *performance gain, and the algorithm's allocations match the 
> fastest GPUs needed by "vgg16"*.
>     For "alexnet", although the fastest GPU subset is not the algorithm's 
> allocation, that subset ranks in the first 5 of the algorithm's candidates 
> and has the same cost as the one the algorithm picked. We may improve this 
> by selecting a random combination among the first 5 candidates since they 
> have the same cost.
>  
> In summary, the GPU topology scheduling algorithm is effective and can 
> potentially get a 5% to 185% performance gain after more optimization.
>  *That means up to about a 3X gain compared to a random GPU scheduling 
> algorithm in a specific scenario*.
>  
> The spreadsheets are here for your reference.
>  
> [https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Updated] (YARN-8821) GPU hierarchy/topology scheduling support

2019-02-18 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Description: 
h2. Background

GPU topology affects performance. There's been a discussion in YARN-7481, but 
we'd like to move the related discussions here.

And please note that YARN-8851 will provide a pluggable device framework which 
can support plugging in a custom scheduler. Based on that framework, the GPU 
plugin could have its own topology scheduler.
h2. Details of the proposed scheduling algorithm

The proposed patch has a topology algorithm implemented as below:
 *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
to build a hash map whose keys are all pairs of GPUs and whose values are the 
communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
cost is set based on the connection type.
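
To make step 1 concrete, here is a minimal standalone sketch (not the patch 
code; the connection-type names follow the "nvidia-smi topo -m" legend, but 
the cost weights are made-up placeholders):
{code:java}
import java.util.HashMap;
import java.util.Map;

public class GpuPairCosts {
  // Map a "nvidia-smi topo -m" connection type to a relative cost.
  // The weights are illustrative only.
  private static int costOf(String connType) {
    if (connType.startsWith("NV")) { // NVLink, e.g. NV1, NV2
      return 2;
    }
    switch (connType) {
      case "PIX": return 4;  // same PCIe switch
      case "PXB": return 6;  // multiple PCIe switches
      case "PHB": return 8;  // through the PCIe host bridge
      default:    return 10; // "SYS" and anything else: across sockets
    }
  }

  // topoMatrix[i][j] holds the connection type between GPU i and GPU j.
  public static Map<String, Integer> build(String[][] topoMatrix) {
    Map<String, Integer> pairCost = new HashMap<>();
    for (int i = 0; i < topoMatrix.length; i++) {
      for (int j = i + 1; j < topoMatrix.length; j++) {
        pairCost.put(i + " - " + j, costOf(topoMatrix[i][j]));
      }
    }
    return pairCost;
  }
}
{code}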

*Step 2*. It then constructs a _+cost table+_ which caches all combinations of 
GPUs and the corresponding cost between them. The cost table is a map whose 
structure is like
{code:java}
{ 2=>{[0,1]=>2,..},
  3=>{[0,1,2]=>10,..},
  4=>{[0,1,2,3]=>18}}.
{code}
The key of the outer map is the count of GPUs; its value is a map whose key is 
a combination of GPUs and whose value is the calculated communication cost of 
that combination. The cost calculation sums the costs of all non-duplicate 
pairs of GPUs. For instance, the total cost of GPUs [0,1,2] is the sum of the 
costs "0 - 1", "0 - 2" and "1 - 2". Each pair's cost comes from the map built 
in step 1.
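
A matching sketch of the step 2 construction (illustrative, not the patch 
code; it reuses the "i - j" pair keys from the sketch above):
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GpuCostTable {
  // Sum the costs of all non-duplicate pairs inside one GPU combination.
  static int subsetCost(List<Integer> gpus, Map<String, Integer> pairCost) {
    int total = 0;
    for (int i = 0; i < gpus.size(); i++) {
      for (int j = i + 1; j < gpus.size(); j++) {
        total += pairCost.get(gpus.get(i) + " - " + gpus.get(j));
      }
    }
    return total;
  }

  // Build {count of GPUs => {combination => total cost}} for counts 2..n.
  static Map<Integer, Map<List<Integer>, Integer>> build(
      int n, Map<String, Integer> pairCost) {
    Map<Integer, Map<List<Integer>, Integer>> table = new HashMap<>();
    for (int k = 2; k <= n; k++) {
      Map<List<Integer>, Integer> perCount = new HashMap<>();
      for (List<Integer> combo : combinations(n, k)) {
        perCount.put(combo, subsetCost(combo, pairCost));
      }
      table.put(k, perCount);
    }
    return table;
  }

  // All k-element subsets of {0..n-1}, each in ascending order.
  static List<List<Integer>> combinations(int n, int k) {
    List<List<Integer>> out = new ArrayList<>();
    collect(out, new ArrayList<>(), 0, n, k);
    return out;
  }

  private static void collect(List<List<Integer>> out, List<Integer> cur,
      int start, int n, int k) {
    if (cur.size() == k) {
      out.add(new ArrayList<>(cur));
      return;
    }
    for (int i = start; i < n; i++) {
      cur.add(i);
      collect(out, cur, i + 1, n, k);
      cur.remove(cur.size() - 1);
    }
  }
}
{code}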

*Step 3*. After the cost table is built, when allocating GPUs based on 
topology, we provide two policies which a container can set through the 
environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
"SPREAD". "PACK" means it prefers faster GPU-GPU communication. "SPREAD" means 
it prefers faster CPU-GPU communication (since the GPUs then do not share the 
same bus to the CPU). The key difference between the two policies is the sort 
order of the inner map in the cost table. For instance, let's assume 2 GPUs 
are wanted. costTable.get(2) would return a map containing all combinations of 
two GPUs and their cost. If the policy is "PACK", we sort the map by cost in 
ascending order; the first entry will be the GPUs with the minimum GPU-GPU 
cost. If the policy is "SPREAD", we sort it in descending order and take the 
first entry, which has the highest GPU-GPU cost and hence the lowest CPU-GPU 
cost.
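
And a sketch of the step 3 selection (illustrative; only the 
NVIDIA_TOPO_POLICY values "PACK" and "SPREAD" come from the description):
{code:java}
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class TopologyPolicy {
  // PACK sorts ascending by cost, so the first entry has the minimum
  // GPU-GPU cost; SPREAD sorts descending, so the first entry has the
  // maximum GPU-GPU cost (GPUs spread across buses, favoring CPU-GPU
  // bandwidth).
  static List<Integer> select(
      Map<Integer, Map<List<Integer>, Integer>> costTable,
      int wanted, String policy) {
    Comparator<Map.Entry<List<Integer>, Integer>> byCost =
        Map.Entry.comparingByValue();
    if ("SPREAD".equals(policy)) {
      byCost = byCost.reversed();
    }
    return costTable.get(wanted).entrySet().stream()
        .sorted(byCost)
        .map(Map.Entry::getKey)
        .findFirst()
        .orElse(Collections.emptyList());
  }
}
{code}
Since the table is precomputed once per node, each allocation only costs a 
sort over the cached combinations for the requested GPU count.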
h2. Estimation of the algorithm

An initial analysis of the topology scheduling algorithm (using the PACK 
policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
(P3), has been done.

Some of the conclusions are:
1. The topology between GPUs impacts performance dramatically. The best GPU 
combination can get a *5% to 185%* *performance gain* among the test cases 
with various factors including CNN model, batch size, GPU subset, etc.
2. The "inception3" and "resnet50" networks seem not to be topology-sensitive. 
Topology scheduling can only potentially get *about 10%* speedup for them.
3. Our current version of the topology scheduling algorithm can achieve a *3% 
to 140%* *performance gain, and the algorithm's allocations match the fastest 
GPUs needed by "vgg16"*.
 
In summary, the GPU topology scheduling algorithm is effective and can 
potentially get a 5% to 185% performance gain after more optimization.
 *That means up to about a 3X gain compared to a random GPU scheduling 
algorithm in a specific scenario*.
 
The spreadsheets are here for your reference.
 
[https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]

  was:
h2. Background

GPU topology affects performance. There's been a discussion in YARN-7481, but 
we'd like to move the related discussions here.

And please note that YARN-8851 will provide a pluggable device framework which 
can support plugging in a custom scheduler. Based on that framework, the GPU 
plugin could have its own topology scheduler.
h2. Details of the proposed scheduling algorithm

The proposed patch has a topology algorithm implemented as below:
 *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
to build a hash map whose keys are all pairs of GPUs and whose values are the 
communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
cost is set based on the connection type.

*Step 2*. It then constructs a _+cost table+_ which caches all combinations of 
GPUs and the corresponding cost between them. The cost table is a map whose 
structure is like
{code:java}
{ 2=>{[0,1]=>2,..},
  3=>{[0,1,2]=>10,..},
  4=>{[0,1,2,3]=>18}}.
{code}
The key of the outer map is the count of GPUs; its value is a map whose key is 
a combination of GPUs and whose value is the calculated communication cost of 
that combination. The cost 

[jira] [Updated] (YARN-8821) GPU hierarchy/topology scheduling support

2019-02-18 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Attachment: GPUTopologyPerformance.png

> GPU hierarchy/topology scheduling support
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch
>
>
> h2. Background
> GPU topology affects performance. There's been a discussion in YARN-7481, 
> but we'd like to move the related discussions here.
> And please note that YARN-8851 will provide a pluggable device framework 
> which can support plugging in a custom scheduler. Based on that framework, 
> the GPU plugin could have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch has a topology algorithm implemented as below:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are all pairs of GPUs and whose values are 
> the communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
> cost is set based on the connection type.
> *Step 2*. It then constructs a _+cost table+_ which caches all combinations 
> of GPUs and the corresponding cost between them. The cost table is a map 
> whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the outer map is the count of GPUs; its value is a map whose key 
> is a combination of GPUs and whose value is the calculated communication 
> cost of that combination. The cost calculation sums the costs of all 
> non-duplicate pairs of GPUs. For instance, the total cost of GPUs [0,1,2] is 
> the sum of the costs "0 - 1", "0 - 2" and "1 - 2". Each pair's cost comes 
> from the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies which a container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" means it prefers faster GPU-GPU communication. "SPREAD" 
> means it prefers faster CPU-GPU communication (since the GPUs then do not 
> share the same bus to the CPU). The key difference between the two policies 
> is the sort order of the inner map in the cost table. For instance, let's 
> assume 2 GPUs are wanted. costTable.get(2) would return a map containing all 
> combinations of two GPUs and their cost. If the policy is "PACK", we sort 
> the map by cost in ascending order; the first entry will be the GPUs with 
> the minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in 
> descending order and take the first entry, which has the highest GPU-GPU 
> cost and hence the lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK 
> policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
> (P3), has been done.
>  
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU 
> combination can get a *5% to 185%* *performance gain* among the test cases 
> with various factors including CNN model, batch size, GPU subset, etc.
> 2. The "inception3" and "resnet50" networks seem not to be 
> topology-sensitive. Topology scheduling can only potentially get *about 10%* 
> speedup for them.
> 3. Our current version of the topology scheduling algorithm can achieve a 
> *3% to 140%* *performance gain, and the algorithm's allocations match the 
> fastest GPUs needed by "vgg16"*.
>  
> In summary, the GPU topology scheduling algorithm is effective and can 
> potentially get a 5% to 185% performance gain after more optimization.
>  *That means up to about a 3X gain compared to a random GPU scheduling 
> algorithm in a specific scenario*.
>  
> The spreadsheets are here for your reference.
>  
> [https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8821) GPU hierarchy/topology scheduling support

2019-02-18 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Description: 
h2. Background

GPU topology affects performance. There's been a discussion in YARN-7481, but 
we'd like to move the related discussions here.

And please note that YARN-8851 will provide a pluggable device framework which 
can support plugging in a custom scheduler. Based on that framework, the GPU 
plugin could have its own topology scheduler.
h2. Details of the proposed scheduling algorithm

The proposed patch has a topology algorithm implemented as below:
 *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
to build a hash map whose keys are all pairs of GPUs and whose values are the 
communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
cost is set based on the connection type.

*Step 2*. It then constructs a _+cost table+_ which caches all combinations of 
GPUs and the corresponding cost between them. The cost table is a map whose 
structure is like
{code:java}
{ 2=>{[0,1]=>2,..},
  3=>{[0,1,2]=>10,..},
  4=>{[0,1,2,3]=>18}}.
{code}
The key of the outer map is the count of GPUs; its value is a map whose key is 
a combination of GPUs and whose value is the calculated communication cost of 
that combination. The cost calculation sums the costs of all non-duplicate 
pairs of GPUs. For instance, the total cost of GPUs [0,1,2] is the sum of the 
costs "0 - 1", "0 - 2" and "1 - 2". Each pair's cost comes from the map built 
in step 1.

*Step 3*. After the cost table is built, when allocating GPUs based on 
topology, we provide two policies which a container can set through the 
environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
"SPREAD". "PACK" means it prefers faster GPU-GPU communication. "SPREAD" means 
it prefers faster CPU-GPU communication (since the GPUs then do not share the 
same bus to the CPU). The key difference between the two policies is the sort 
order of the inner map in the cost table. For instance, let's assume 2 GPUs 
are wanted. costTable.get(2) would return a map containing all combinations of 
two GPUs and their cost. If the policy is "PACK", we sort the map by cost in 
ascending order; the first entry will be the GPUs with the minimum GPU-GPU 
cost. If the policy is "SPREAD", we sort it in descending order and take the 
first entry, which has the highest GPU-GPU cost and hence the lowest CPU-GPU 
cost.
h2. Estimation of the algorithm

An initial analysis of the topology scheduling algorithm (using the PACK 
policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
(P3), has been done.

!GPUTopologyPerformance.png!

Some of the conclusions are:
1. The topology between GPUs impacts performance dramatically. The best GPU 
combination can get a *5% to 185%* *performance gain* among the test cases 
with various factors including CNN model, batch size, GPU subset, etc.
2. The "inception3" and "resnet50" networks seem not to be topology-sensitive. 
Topology scheduling can only potentially get *about 10%* speedup for them.
3. Our current version of the topology scheduling algorithm can achieve a *3% 
to 140%* *performance gain, and the algorithm's allocations match the fastest 
GPUs needed by "vgg16"*.
 
In summary, the GPU topology scheduling algorithm is effective and can 
potentially get a 5% to 185% performance gain after more optimization.
 *That means up to about a 3X gain compared to a random GPU scheduling 
algorithm in a specific scenario*.
 
The spreadsheets are here for your reference.
 
[https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]

  was:
h2. Background

GPU topology affects performance. There's been a discussion in YARN-7481, but 
we'd like to move the related discussions here.

And please note that YARN-8851 will provide a pluggable device framework which 
can support plugging in a custom scheduler. Based on that framework, the GPU 
plugin could have its own topology scheduler.
h2. Details of the proposed scheduling algorithm

The proposed patch has a topology algorithm implemented as below:
 *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
to build a hash map whose keys are all pairs of GPUs and whose values are the 
communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
cost is set based on the connection type.

*Step 2*. It then constructs a _+cost table+_ which caches all combinations of 
GPUs and the corresponding cost between them. The cost table is a map whose 
structure is like
{code:java}
{ 2=>{[0,1]=>2,..},
  3=>{[0,1,2]=>10,..},
  4=>{[0,1,2,3]=>18}}.
{code}
The key of the outer map is the count of GPUs; its value is a map whose key is 
a combination of GPUs and whose value is the calculated communication cost of 
the 

[jira] [Updated] (YARN-8821) GPU hierarchy/topology scheduling support

2019-02-18 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Description: 
h2. Background

GPU topology affects performance. There's been a discussion in YARN-7481, but 
we'd like to move the related discussions here.

And please note that YARN-8851 will provide a pluggable device framework which 
can support plugging in a custom scheduler. Based on that framework, the GPU 
plugin could have its own topology scheduler.
h2. Details of the proposed scheduling algorithm

The proposed patch has a topology algorithm implemented as below:
 *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
to build a hash map whose keys are all pairs of GPUs and whose values are the 
communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
cost is set based on the connection type.

*Step 2*. It then constructs a _+cost table+_ which caches all combinations of 
GPUs and the corresponding cost between them. The cost table is a map whose 
structure is like
{code:java}
{ 2=>{[0,1]=>2,..},
  3=>{[0,1,2]=>10,..},
  4=>{[0,1,2,3]=>18}}.
{code}
The key of the outer map is the count of GPUs; its value is a map whose key is 
a combination of GPUs and whose value is the calculated communication cost of 
that combination. The cost calculation sums the costs of all non-duplicate 
pairs of GPUs. For instance, the total cost of GPUs [0,1,2] is the sum of the 
costs "0 - 1", "0 - 2" and "1 - 2". Each pair's cost comes from the map built 
in step 1.

*Step 3*. After the cost table is built, when allocating GPUs based on 
topology, we provide two policies which a container can set through the 
environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
"SPREAD". "PACK" means it prefers faster GPU-GPU communication. "SPREAD" means 
it prefers faster CPU-GPU communication (since the GPUs then do not share the 
same bus to the CPU). The key difference between the two policies is the sort 
order of the inner map in the cost table. For instance, let's assume 2 GPUs 
are wanted. costTable.get(2) would return a map containing all combinations of 
two GPUs and their cost. If the policy is "PACK", we sort the map by cost in 
ascending order; the first entry will be the GPUs with the minimum GPU-GPU 
cost. If the policy is "SPREAD", we sort it in descending order and take the 
first entry, which has the highest GPU-GPU cost and hence the lowest CPU-GPU 
cost.
h2. Estimation of the algorithm

An initial analysis of the topology scheduling algorithm (using the PACK 
policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
(P3), has been done. The figure below shows the performance of the topology 
scheduling algorithm's allocation.

!GPUTopologyPerformance.png!

Some of the conclusions are:
1. The topology between GPUs impacts performance dramatically. The best GPU 
combination can get a *5% to 185%* *performance gain* among the test cases 
with various factors including CNN model, batch size, GPU subset, etc. The 
scheduling algorithm should get close to this ideal.
2. The "inception3" and "resnet50" networks seem not to be topology-sensitive. 
Topology scheduling can only potentially get a *6.8% to 10%* speedup for them 
in the best cases.
3. Our current version of the topology scheduling algorithm can achieve a 
*6.8% to 177.1%* *performance gain in the best cases. On average, it also 
outperforms the median performance (0.8% to 28.2%).*

*4. And the algorithm's allocations match the fastest GPUs needed by "vgg16" 
best*.
 
In summary, the GPU topology scheduling algorithm is effective and can 
potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
30% on average.
 *That means up to about a 3X gain compared to a random GPU scheduling 
algorithm in a specific scenario*.
 
The spreadsheets are here for your reference.
 
[https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]

  was:
h2. Background

GPU topology affects performance. There's been a discussion in YARN-7481, but 
we'd like to move the related discussions here.

And please note that YARN-8851 will provide a pluggable device framework which 
can support plugging in a custom scheduler. Based on that framework, the GPU 
plugin could have its own topology scheduler.
h2. Details of the proposed scheduling algorithm

The proposed patch has a topology algorithm implemented as below:
 *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
to build a hash map whose keys are all pairs of GPUs and whose values are the 
communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
cost is set based on the connection type.

*Step 2*. It then constructs a _+cost table+_ which caches all combinations of 
GPUs and the corresponding cost between them. The cost 

[jira] [Updated] (YARN-8821) GPU hierarchy/topology scheduling support

2019-02-18 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Attachment: YARN-8821-trunk.008.patch

> GPU hierarchy/topology scheduling support
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, 
> YARN-8821-trunk.008.patch
>
>
> h2. Background
> GPU topology affects performance. There's been a discussion in YARN-7481, 
> but we'd like to move the related discussions here.
> And please note that YARN-8851 will provide a pluggable device framework 
> which can support plugging in a custom scheduler. Based on that framework, 
> the GPU plugin could have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch has a topology algorithm implemented as below:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are all pairs of GPUs and whose values are 
> the communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
> cost is set based on the connection type.
> *Step 2*. It then constructs a _+cost table+_ which caches all combinations 
> of GPUs and the corresponding cost between them. The cost table is a map 
> whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the outer map is the count of GPUs; its value is a map whose key 
> is a combination of GPUs and whose value is the calculated communication 
> cost of that combination. The cost calculation sums the costs of all 
> non-duplicate pairs of GPUs. For instance, the total cost of GPUs [0,1,2] is 
> the sum of the costs "0 - 1", "0 - 2" and "1 - 2". Each pair's cost comes 
> from the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies which a container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" means it prefers faster GPU-GPU communication. "SPREAD" 
> means it prefers faster CPU-GPU communication (since the GPUs then do not 
> share the same bus to the CPU). The key difference between the two policies 
> is the sort order of the inner map in the cost table. For instance, let's 
> assume 2 GPUs are wanted. costTable.get(2) would return a map containing all 
> combinations of two GPUs and their cost. If the policy is "PACK", we sort 
> the map by cost in ascending order; the first entry will be the GPUs with 
> the minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in 
> descending order and take the first entry, which has the highest GPU-GPU 
> cost and hence the lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK 
> policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
> (P3), has been done. The figure below shows the performance of the topology 
> scheduling algorithm's allocation.
> !GPUTopologyPerformance.png!  
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU 
> combination can get a *5% to 185%* *performance gain* among the test cases 
> with various factors including CNN model, batch size, GPU subset, etc. The 
> scheduling algorithm should get close to this ideal.
> 2. The "inception3" and "resnet50" networks seem not to be 
> topology-sensitive. Topology scheduling can only potentially get a *6.8% to 
> 10%* speedup for them in the best cases.
> 3. Our current version of the topology scheduling algorithm can achieve a 
> *6.8% to 177.1%* *performance gain in the best cases. On average, it also 
> outperforms the median performance (0.8% to 28.2%).*
> *4. And the algorithm's allocations match the fastest GPUs needed by "vgg16" 
> best*.
>  
> In summary, the GPU topology scheduling algorithm is effective and can 
> potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
> 30% on average.
>  *That means up to about a 3X gain compared to a random GPU scheduling 
> algorithm in a specific scenario*.
>  
> The spreadsheets are here for your reference.
>  
> [https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-

[jira] [Updated] (YARN-8821) GPU hierarchy/topology scheduling support

2019-02-18 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Attachment: YARN-8821-trunk.009.patch

> GPU hierarchy/topology scheduling support
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, 
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch
>
>
> h2. Background
> GPU topology affects performance. There's been a discussion in YARN-7481, 
> but we'd like to move the related discussions here.
> And please note that YARN-8851 will provide a pluggable device framework 
> which can support plugging in a custom scheduler. Based on that framework, 
> the GPU plugin could have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch has a topology algorithm implemented as below:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are all pairs of GPUs and whose values are 
> the communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
> cost is set based on the connection type.
> *Step 2*. It then constructs a _+cost table+_ which caches all combinations 
> of GPUs and the corresponding cost between them. The cost table is a map 
> whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the outer map is the count of GPUs; its value is a map whose key 
> is a combination of GPUs and whose value is the calculated communication 
> cost of that combination. The cost calculation sums the costs of all 
> non-duplicate pairs of GPUs. For instance, the total cost of GPUs [0,1,2] is 
> the sum of the costs "0 - 1", "0 - 2" and "1 - 2". Each pair's cost comes 
> from the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies which a container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" means it prefers faster GPU-GPU communication. "SPREAD" 
> means it prefers faster CPU-GPU communication (since the GPUs then do not 
> share the same bus to the CPU). The key difference between the two policies 
> is the sort order of the inner map in the cost table. For instance, let's 
> assume 2 GPUs are wanted. costTable.get(2) would return a map containing all 
> combinations of two GPUs and their cost. If the policy is "PACK", we sort 
> the map by cost in ascending order; the first entry will be the GPUs with 
> the minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in 
> descending order and take the first entry, which has the highest GPU-GPU 
> cost and hence the lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK 
> policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
> (P3), has been done. The figure below shows the performance of the topology 
> scheduling algorithm's allocation.
> !GPUTopologyPerformance.png!  
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU 
> combination can get a *5% to 185%* *performance gain* among the test cases 
> with various factors including CNN model, batch size, GPU subset, etc. The 
> scheduling algorithm should get close to this ideal.
> 2. The "inception3" and "resnet50" networks seem not to be 
> topology-sensitive. Topology scheduling can only potentially get a *6.8% to 
> 10%* speedup for them in the best cases.
> 3. Our current version of the topology scheduling algorithm can achieve a 
> *6.8% to 177.1%* *performance gain in the best cases. On average, it also 
> outperforms the median performance (0.8% to 28.2%).*
> *4. And the algorithm's allocations match the fastest GPUs needed by "vgg16" 
> best*.
>  
> In summary, the GPU topology scheduling algorithm is effective and can 
> potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
> 30% on average.
>  *That means up to about a 3X gain compared to a random GPU scheduling 
> algorithm in a specific scenario*.
>  
> The spreadsheets are here for your reference.
>  
> [https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (YARN-8821) GPU hierarchy/topology scheduling support

2019-02-18 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Description: 
h2. Background

GPU topology affects performance. There's been a discussion in YARN-7481, but 
we'd like to move the related discussions here.

And please note that YARN-8851 will provide a pluggable device framework which 
can support plugging in a custom scheduler. Based on that framework, the GPU 
plugin could have its own topology scheduler.
h2. Details of the proposed scheduling algorithm

The proposed patch has a topology algorithm implemented as below:
 *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
to build a hash map whose keys are all pairs of GPUs and whose values are the 
communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
cost is set based on the connection type.

*Step 2*. It then constructs a _+cost table+_ which caches all combinations of 
GPUs and the corresponding cost between them. The cost table is a map whose 
structure is like
{code:java}
{ 2=>{[0,1]=>2,..},
  3=>{[0,1,2]=>10,..},
  4=>{[0,1,2,3]=>18}}.
{code}
The key of the outer map is the count of GPUs; its value is a map whose key is 
a combination of GPUs and whose value is the calculated communication cost of 
that combination. The cost calculation sums the costs of all non-duplicate 
pairs of GPUs. For instance, the total cost of GPUs [0,1,2] is the sum of the 
costs "0 - 1", "0 - 2" and "1 - 2". Each pair's cost comes from the map built 
in step 1.

*Step 3*. After the cost table is built, when allocating GPUs based on 
topology, we provide two policies which a container can set through the 
environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
"SPREAD". "PACK" means it prefers faster GPU-GPU communication. "SPREAD" means 
it prefers faster CPU-GPU communication (since the GPUs then do not share the 
same bus to the CPU). The key difference between the two policies is the sort 
order of the inner map in the cost table. For instance, let's assume 2 GPUs 
are wanted. costTable.get(2) would return a map containing all combinations of 
two GPUs and their cost. If the policy is "PACK", we sort the map by cost in 
ascending order; the first entry will be the GPUs with the minimum GPU-GPU 
cost. If the policy is "SPREAD", we sort it in descending order and take the 
first entry, which has the highest GPU-GPU cost and hence the lowest CPU-GPU 
cost.
h2. Estimation of the algorithm

An initial analysis of the topology scheduling algorithm (using the PACK 
policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
(P3), has been done. The figure below shows the performance gain of the 
topology scheduling algorithm's allocation (PACK policy).

!GPUTopologyPerformance.png!

Some of the conclusions are:
1. The topology between GPUs impacts performance dramatically. The best GPU 
combination can get a *5% to 185%* *performance gain* among the test cases 
with various factors including CNN model, batch size, GPU subset, etc. The 
scheduling algorithm should get close to this ideal.
2. The "inception3" and "resnet50" networks seem not to be topology-sensitive. 
Topology scheduling can only potentially get a *6.8% to 10%* speedup for them 
in the best cases.
3. Our current version of the topology scheduling algorithm can achieve a 
*6.8% to 177.1%* *performance gain in the best cases. On average, it also 
outperforms the median performance (0.8% to 28.2%).*

*4. And the algorithm's allocations match the fastest GPUs needed by "vgg16" 
best*.
 
In summary, the GPU topology scheduling algorithm is effective and can 
potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
30% on average.
 *That means up to about a 3X gain compared to a random GPU scheduling 
algorithm in a specific scenario*.
 
The spreadsheets are here for your reference.
 
[https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]

  was:
h2. Background

GPU topology affects performance. There's been a discussion in YARN-7481, but 
we'd like to move the related discussions here.

And please note that YARN-8851 will provide a pluggable device framework which 
can support plugging in a custom scheduler. Based on that framework, the GPU 
plugin could have its own topology scheduler.
h2. Details of the proposed scheduling algorithm

The proposed patch has a topology algorithm implemented as below:
 *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
to build a hash map whose keys are all pairs of GPUs and whose values are the 
communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
cost is set based on the connection type.

*Step 2*. It then constructs a _+cost table+_ which caches all combinations of 
GPUs and the corresponding cost between them 

[jira] [Updated] (YARN-8821) GPU hierarchy/topology scheduling support

2019-02-18 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Description: 
h2. Background

GPU topology affects performance. There's been a discussion in YARN-7481, but 
we'd like to move the related discussions here.

And please note that YARN-8851 will provide a pluggable device framework which 
can support plugging in a custom scheduler. Based on that framework, the GPU 
plugin could have its own topology scheduler.
h2. Details of the proposed scheduling algorithm

The proposed patch has a topology algorithm implemented as below:
 *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
to build a hash map whose keys are all pairs of GPUs and whose values are the 
communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
cost is set based on the connection type.

*Step 2*. It then constructs a _+cost table+_ which caches all combinations of 
GPUs and the corresponding cost between them. The cost table is a map whose 
structure is like
{code:java}
{ 2=>{[0,1]=>2,..},
  3=>{[0,1,2]=>10,..},
  4=>{[0,1,2,3]=>18}}.
{code}
The key of the outer map is the count of GPUs; its value is a map whose key is 
a combination of GPUs and whose value is the calculated communication cost of 
that combination. The cost calculation sums the costs of all non-duplicate 
pairs of GPUs. For instance, the total cost of GPUs [0,1,2] is the sum of the 
costs "0 - 1", "0 - 2" and "1 - 2". Each pair's cost comes from the map built 
in step 1.

*Step 3*. After the cost table is built, when allocating GPUs based on 
topology, we provide two policies which a container can set through the 
environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
"SPREAD". "PACK" means it prefers faster GPU-GPU communication. "SPREAD" means 
it prefers faster CPU-GPU communication (since the GPUs then do not share the 
same bus to the CPU). The key difference between the two policies is the sort 
order of the inner map in the cost table. For instance, let's assume 2 GPUs 
are wanted. costTable.get(2) would return a map containing all combinations of 
two GPUs and their cost. If the policy is "PACK", we sort the map by cost in 
ascending order; the first entry will be the GPUs with the minimum GPU-GPU 
cost. If the policy is "SPREAD", we sort it in descending order and take the 
first entry, which has the highest GPU-GPU cost and hence the lowest CPU-GPU 
cost.
h2. Estimation of the algorithm

An initial analysis of the topology scheduling algorithm (using the PACK 
policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
(P3), has been done. The figure below shows the performance of the topology 
scheduling algorithm's allocation (PACK policy).

!GPUTopologyPerformance.png!

Some of the conclusions are:
 1. The topology between GPUs impacts performance dramatically. The best GPU 
combination can get a *5% to 185%* *performance gain* among the test cases 
with various factors including CNN model, batch size, GPU subset, etc. The 
scheduling algorithm should get close to this ideal.
 2. The "inception3" and "resnet50" networks seem not to be 
topology-sensitive. Topology scheduling can only potentially get a *6.8% to 
10%* speedup for them in the best cases.
 3. Our current version of the topology scheduling algorithm can achieve a 
*6.8% to 177.1%* *performance gain in the best cases. On average, it also 
outperforms the median performance (0.8% to 28.2%).*

*4. And the algorithm's allocations match the fastest GPUs needed by "vgg16" 
best*.
  
 In summary, the GPU topology scheduling algorithm is effective and can 
potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
30% on average.
 *That means up to about a 3X gain compared to a random GPU scheduling 
algorithm in a specific scenario*.
  
 The spreadsheets are here for your reference.
 
[https://docs.google.com/spreadsheets/d/1t1QgiSuyMY2u-9TtsTVpVhG3WYc46hoaqy3BuADPS14/edit?usp=sharing]

  was:
h2. Background

GPU topology affects performance. There's been a discussion in YARN-7481, but 
we'd like to move the related discussions here.

And please note that YARN-8851 will provide a pluggable device framework which 
can support plugging in a custom scheduler. Based on that framework, the GPU 
plugin could have its own topology scheduler.
h2. Details of the proposed scheduling algorithm

The proposed patch has a topology algorithm implemented as below:
 *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
to build a hash map whose keys are all pairs of GPUs and whose values are the 
communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
cost is set based on the connection type.

*Step 2*. It then constructs a _+cost table+_ which caches all combinations of 
GPUs and the corresponding cost between them 

[jira] [Updated] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-18 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Summary: [YARN-8851] GPU hierarchy/topology scheduling support based on 
pluggable device framework  (was: GPU hierarchy/topology scheduling support)

> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable 
> device framework
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, 
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch
>
>
> h2. Background
> GPU topology affects performance. There's been a discussion in YARN-7481, 
> but we'd like to move the related discussions here.
> And please note that YARN-8851 will provide a pluggable device framework 
> which can support plugging in a custom scheduler. Based on that framework, 
> the GPU plugin could have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch has a topology algorithm implemented as below:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are all pairs of GPUs and whose values are 
> the communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
> cost is set based on the connection type.
> *Step 2*. It then constructs a _+cost table+_ which caches all combinations 
> of GPUs and the corresponding cost between them. The cost table is a map 
> whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the outer map is the count of GPUs; its value is a map whose key 
> is a combination of GPUs and whose value is the calculated communication 
> cost of that combination. The cost calculation sums the costs of all 
> non-duplicate pairs of GPUs. For instance, the total cost of GPUs [0,1,2] is 
> the sum of the costs "0 - 1", "0 - 2" and "1 - 2". Each pair's cost comes 
> from the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies which a container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" means it prefers faster GPU-GPU communication. "SPREAD" 
> means it prefers faster CPU-GPU communication (since the GPUs then do not 
> share the same bus to the CPU). The key difference between the two policies 
> is the sort order of the inner map in the cost table. For instance, let's 
> assume 2 GPUs are wanted. costTable.get(2) would return a map containing all 
> combinations of two GPUs and their cost. If the policy is "PACK", we sort 
> the map by cost in ascending order; the first entry will be the GPUs with 
> the minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in 
> descending order and take the first entry, which has the highest GPU-GPU 
> cost and hence the lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK 
> policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
> (P3), has been done. The figure below shows the performance gain of the 
> topology scheduling algorithm's allocation (PACK policy).
> !GPUTopologyPerformance.png!  
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU 
> combination can get a *5% to 185%* *performance gain* among the test cases 
> with various factors including CNN model, batch size, GPU subset, etc. The 
> scheduling algorithm should get close to this ideal.
> 2. The "inception3" and "resnet50" networks seem not to be 
> topology-sensitive. Topology scheduling can only potentially get a *6.8% to 
> 10%* speedup for them in the best cases.
> 3. Our current version of the topology scheduling algorithm can achieve a 
> *6.8% to 177.1%* *performance gain in the best cases. On average, it also 
> outperforms the median performance (0.8% to 28.2%).*
> *4. And the algorithm's allocations match the fastest GPUs needed by "vgg16" 
> best*.
>  
> In summary, the GPU topology scheduling algorithm is effective and can 
> potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
> 30% on average.
>  *That means up to about a 3X gain compared to a random GPU scheduling 
> algorithm in a specific scenario*.
>  
> The spreadsheets are here for 

[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-24 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776288#comment-16776288
 ] 

Zhankun Tang commented on YARN-8821:


[~sunilg], [~Weiwei Yang], thanks for the review!

> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable 
> device framework
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, 
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch, 
> YARN-8821-trunk.010.patch
>
>
> h2. Background
> GPU topology affects performance. There's been a discussion in YARN-7481. But 
> we'd like to move related discussions here.
> And please note that YARN-8851 will provide a pluggable device framework 
> which supports plugging in a custom scheduler. Based on that framework, the 
> GPU plugin can have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch has a topology algorithm implemented as below:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are pairs of GPUs and whose values are the 
> communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
> cost is set based on the connection type.
> *Step 2*. It then constructs a _+cost table+_ which caches all combinations 
> of GPUs and the corresponding cost between them. The cost table is a map 
> whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the map is the count of GPUs; its value is a map whose key is a 
> combination of GPUs and whose value is the calculated communication cost of 
> that combination. The cost calculation algorithm sums the costs of all 
> non-duplicate pairs of GPUs. For instance, the total cost of GPUs [0,1,2] is 
> the sum of the costs "0 - 1", "0 - 2" and "1 - 2". Each pair cost comes from 
> the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies which a container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" prefers faster GPU-GPU communication. "SPREAD" prefers 
> faster CPU-GPU communication (since the GPUs then don't share the same bus 
> to the CPU). The key difference between the two policies is the sort order 
> of the inner map in the cost table. For instance, let's assume 2 GPUs are 
> wanted. costTable.get(2) returns a map containing all combinations of two 
> GPUs and their cost. If the policy is "PACK", we sort the map by cost in 
> ascending order; the first entry is the GPU combination with the minimum 
> GPU-GPU cost. If the policy is "SPREAD", we sort it in descending order and 
> take the first entry, which has the highest GPU-GPU cost and hence the 
> lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK 
> policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
> (P3), has been done. The figure below shows the performance gain of the 
> topology scheduling algorithm's allocation (PACK policy).
> !GPUTopologyPerformance.png!  
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU 
> combination can get a *5% to 185% performance gain* across the test cases 
> with various factors including CNN model, batch size, GPU subset, etc. The 
> scheduling algorithm should get close to this.
> 2. The "inception3" and "resnet50" networks seem not to be topology 
> sensitive. For them, topology scheduling can only potentially get *about 
> 6.8% to 10%* speedup in the best cases.
> 3. Our current version of the topology scheduling algorithm can achieve a 
> *6.8% to 177.1%* performance gain in the best cases. On average, it also 
> outperforms the median performance (*0.8% to 28.2%*).
> *4. The algorithm's allocations match the fastest GPUs needed by "vgg16" 
> best*.
>  
> In summary, the GPU topology scheduling algorithm is effective and can 
> potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
> 30% on average.
>  *This means up to about 3X compared to a random GPU scheduling algorithm in 
> a specific scenario*.
>  
> The spreadsheets are here for 

[jira] [Commented] (YARN-9331) [YARN-8851] Fix a bug that lacking cgroup initialization when bootstrap DeviceResourceHandlerImpl

2019-02-25 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777491#comment-16777491
 ] 

Zhankun Tang commented on YARN-9331:


[~cheersyang], Thanks a lot!

> [YARN-8851] Fix a bug that lacking cgroup initialization when bootstrap 
> DeviceResourceHandlerImpl
> -
>
> Key: YARN-9331
> URL: https://issues.apache.org/jira/browse/YARN-9331
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9331-trunk.001.patch
>
>
> YARN-9060 was a huge patch merge, and a cgroup initialization step was lost 
> by mistake when generating the patch. The local testing passed at that time 
> because "linux-container-executor.cgroups.mount" was false and the local 
> device cgroups were already set up.
> We need to add the initialization to avoid failures when 
> "linux-container-executor.cgroups.mount" is true.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9331) [YARN-8851] Fix a bug that lacking cgroup initialization when bootstrap DeviceResourceHandlerImpl

2019-02-25 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9331:
---
Summary: [YARN-8851] Fix a bug that lacking cgroup initialization when 
bootstrap DeviceResourceHandlerImpl  (was: Fix a bug that lacking cgroup 
initialization when bootstrap DeviceResourceHandlerImpl)

> [YARN-8851] Fix a bug that lacking cgroup initialization when bootstrap 
> DeviceResourceHandlerImpl
> -
>
> Key: YARN-9331
> URL: https://issues.apache.org/jira/browse/YARN-9331
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9331-trunk.001.patch
>
>
> YARN-9060 was a huge patch merge, and a cgroup initialization step was lost 
> by mistake when generating the patch. The local testing passed at that time 
> because "linux-container-executor.cgroups.mount" was false and the local 
> device cgroups were already set up.
> We need to add the initialization to avoid failures when 
> "linux-container-executor.cgroups.mount" is true.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8887) Support isolation in pluggable device framework

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-8887.

Resolution: Duplicate

Resolving this as a duplicate of YARN-9060.

> Support isolation in pluggable device framework
> ---
>
> Key: YARN-8887
> URL: https://issues.apache.org/jira/browse/YARN-8887
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> Device isolation needs a complete description in the API spec 
> (DeviceRuntimeSpec) and a translator in the adapter to convert the 
> requirements into uniform parameters passed to the native 
> container-executor. It should support both default and Docker containers.
> For default containers, we use a new devices module in container-executor 
> to isolate devices. For Docker containers, we depend on the current 
> DockerLinuxContainerRuntime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8889) Add well-defined interface in container-executor to support vendor plugins isolation request

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-8889.

Resolution: Duplicate

Resolving this as it was already implemented in YARN-9060.

> Add well-defined interface in container-executor to support vendor plugins 
> isolation request
> 
>
> Key: YARN-8889
> URL: https://issues.apache.org/jira/browse/YARN-8889
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> Because of the different container runtimes, the isolation request from a 
> vendor device plugin may be raised before container launch (cgroups 
> operations) or at container launch (Docker runtime).
> An easy-to-use interface in container-executor should be provided to 
> support the above requirements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9103) Fix the bug in DeviceMappingManager#getReleasingDevices

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-9103.

Resolution: Won't Fix

Resolve it as it is fixed in YARN-9060

> Fix the bug in DeviceMappingManager#getReleasingDevices
> ---
>
> Key: YARN-9103
> URL: https://issues.apache.org/jira/browse/YARN-9103
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> When one container is assigned multiple devices and is in the releasing 
> state, looping over the same containerId causes the releasing device count 
> to be summed multiple times. This is the same bug as the one mentioned in 
> YARN-9099.
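> A minimal sketch of the bug pattern and its fix (hypothetical names, not 
> the actual DeviceMappingManager code):
> {code:java}
> import java.util.*;
> 
> public class ReleasingCountSketch {
>   public static void main(String[] args) {
>     // device -> containerId currently holding it
>     Map<String, String> usedDevices = new HashMap<>();
>     usedDevices.put("gpu0", "c1");
>     usedDevices.put("gpu1", "c1"); // c1 holds two devices and is releasing
>     Set<String> releasing = new HashSet<>(Arrays.asList("c1"));
> 
>     // Buggy: iterates per device, so container c1 is visited twice and its
>     // device count (2) is summed twice, giving 4 instead of 2.
>     int buggy = 0;
>     for (String c : usedDevices.values()) {
>       if (releasing.contains(c)) {
>         buggy += Collections.frequency(usedDevices.values(), c);
>       }
>     }
> 
>     // Fixed: visit each distinct releasing container only once.
>     int fixed = 0;
>     for (String c : new HashSet<>(usedDevices.values())) {
>       if (releasing.contains(c)) {
>         fixed += Collections.frequency(usedDevices.values(), c);
>       }
>     }
>     System.out.println("buggy=" + buggy + ", fixed=" + fixed);
>   }
> }
> {code}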



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8888) Support device topology scheduling

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-.

Resolution: Won't Fix

Resolving this because the GPU topology algorithm is better implemented in the 
plugin for now.

An abstraction covering all device topologies is premature at this point.

See YARN-8821 for GPU topology scheduling.

> Support device topology scheduling
> --
>
> Key: YARN-
> URL: https://issues.apache.org/jira/browse/YARN-
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> An easy way for a vendor plugin to describe topology information should be 
> provided in the Device spec, and the topology information will be used in 
> the device shared local scheduler to boost performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-19 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771683#comment-16771683
 ] 

Zhankun Tang commented on YARN-8821:


The failed unit test seems unrelated to this patch.

> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable 
> device framework
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, 
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch
>
>
> h2. Background
> GPU topology affects performance. There's been a discussion in YARN-7481. But 
> we'd like to move related discussions here.
> And please note that YARN-8851 will provide a pluggable device framework 
> which supports plugging in a custom scheduler. Based on that framework, the 
> GPU plugin can have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch has a topology algorithm implemented as below:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are pairs of GPUs and whose values are the 
> communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
> cost is set based on the connection type.
> *Step 2*. It then constructs a _+cost table+_ which caches all combinations 
> of GPUs and the corresponding cost between them. The cost table is a map 
> whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the map is the count of GPUs; its value is a map whose key is a 
> combination of GPUs and whose value is the calculated communication cost of 
> that combination. The cost calculation algorithm sums the costs of all 
> non-duplicate pairs of GPUs. For instance, the total cost of GPUs [0,1,2] is 
> the sum of the costs "0 - 1", "0 - 2" and "1 - 2". Each pair cost comes from 
> the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies which a container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" prefers faster GPU-GPU communication. "SPREAD" prefers 
> faster CPU-GPU communication (since the GPUs then don't share the same bus 
> to the CPU). The key difference between the two policies is the sort order 
> of the inner map in the cost table. For instance, let's assume 2 GPUs are 
> wanted. costTable.get(2) returns a map containing all combinations of two 
> GPUs and their cost. If the policy is "PACK", we sort the map by cost in 
> ascending order; the first entry is the GPU combination with the minimum 
> GPU-GPU cost. If the policy is "SPREAD", we sort it in descending order and 
> take the first entry, which has the highest GPU-GPU cost and hence the 
> lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK 
> policy), based on performance tests on an AWS EC2 instance with 8 GPU cards 
> (P3), has been done. The figure below shows the performance gain of the 
> topology scheduling algorithm's allocation (PACK policy).
> !GPUTopologyPerformance.png!  
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU 
> combination can get a *5% to 185% performance gain* across the test cases 
> with various factors including CNN model, batch size, GPU subset, etc. The 
> scheduling algorithm should get close to this.
> 2. The "inception3" and "resnet50" networks seem not to be topology 
> sensitive. For them, topology scheduling can only potentially get *about 
> 6.8% to 10%* speedup in the best cases.
> 3. Our current version of the topology scheduling algorithm can achieve a 
> *6.8% to 177.1%* performance gain in the best cases. On average, it also 
> outperforms the median performance (*0.8% to 28.2%*).
> *4. The algorithm's allocations match the fastest GPUs needed by "vgg16" 
> best*.
>  
> In summary, the GPU topology scheduling algorithm is effective and can 
> potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
> 30% on average.
>  *This means up to about 3X compared to a random GPU scheduling algorithm in 
> a specific scenario*.
>  
> The spreadsheets are here for your reference.
>  
> 

[jira] [Resolved] (YARN-8883) Phase 1 - Provide an example of fake vendor plugin

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-8883.

Resolution: Duplicate

Resolving this because YARN-9060 includes an example Nvidia GPU plugin.

> Phase 1 - Provide an example of fake vendor plugin
> --
>
> Key: YARN-8883
> URL: https://issues.apache.org/jira/browse/YARN-8883
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8883-trunk.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9319) YARN-9060 does not compile

2019-02-21 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9319:
---
Attachment: YARN-9319-trunk.001.patch

> YARN-9060 does not compile
> --
>
> Key: YARN-9319
> URL: https://issues.apache.org/jira/browse/YARN-9319
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
> Environment:  RHEL 6.8, CMake 3.2.0, Java 8u151, gcc version 4.4.7 
> 20120313 (Red Hat 4.4.7-17) (GCC)
>Reporter: Wei-Chiu Chuang
>Priority: Blocker
> Attachments: YARN-9319-trunk.001.patch
>
>
> When I do: 
> mvn clean install -DskipTests -Pdist,native  -Dmaven.javadoc.skip=true
> It does not compile on my machine (RHEL 6.8, CMake 3.2.0, Java 8u151, gcc 
> version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC))
> {noformat}
> [WARNING] [ 54%] Built target test-container-executor
> [WARNING] Linking CXX static library libgtest.a
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -P 
> CMakeFiles/gtest.dir/cmake_clean_target.cmake
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_link_script 
> CMakeFiles/gtest.dir/link.txt --verbose=1
> [WARNING] /usr/bin/ar cq libgtest.a  
> CMakeFiles/gtest.dir/data/4/weichiu/hadoop/hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc.o
> [WARNING] /usr/bin/ranlib libgtest.a
> [WARNING] make[2]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_progress_report 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/CMakeFiles
>   26
> [WARNING] [ 54%] Built target gtest
> [WARNING] make[1]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] In file included from 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:27:
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/devices/devices-module.h:31:
>  error: redefinition of typedef 'update_cgroups_parameters_function'
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/fpga/fpga-module.h:31:
>  note: previous declaration of 'update_cgroups_parameters_function' was here
> [WARNING] make[2]: *** 
> [CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o]
>  Error 1
> [WARNING] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2
> [WARNING] make[1]: *** Waiting for unfinished jobs
> [WARNING] make: *** [all] Error 2
> {noformat}
> The code compiles once I revert YARN-9060.
> [~tangzhankun], [~sunilg] care to take a look?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9319) YARN-9060 does not compile

2019-02-21 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774212#comment-16774212
 ] 

Zhankun Tang commented on YARN-9319:


[~jojochuang], it's caused by inconsistent compiler behavior when handling a 
typedef of an existing name. I used a different name to avoid it. Could you 
double-confirm the patch?

[~sunil.gov...@gmail.com], could you please help review?

> YARN-9060 does not compile
> --
>
> Key: YARN-9319
> URL: https://issues.apache.org/jira/browse/YARN-9319
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
> Environment:  RHEL 6.8, CMake 3.2.0, Java 8u151, gcc version 4.4.7 
> 20120313 (Red Hat 4.4.7-17) (GCC)
>Reporter: Wei-Chiu Chuang
>Assignee: Zhankun Tang
>Priority: Blocker
> Attachments: YARN-9319-trunk.001.patch
>
>
> When I do: 
> mvn clean install -DskipTests -Pdist,native  -Dmaven.javadoc.skip=true
> It does not compile on my machine (RHEL 6.8, CMake 3.2.0, Java 8u151, gcc 
> version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC))
> {noformat}
> [WARNING] [ 54%] Built target test-container-executor
> [WARNING] Linking CXX static library libgtest.a
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -P 
> CMakeFiles/gtest.dir/cmake_clean_target.cmake
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_link_script 
> CMakeFiles/gtest.dir/link.txt --verbose=1
> [WARNING] /usr/bin/ar cq libgtest.a  
> CMakeFiles/gtest.dir/data/4/weichiu/hadoop/hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc.o
> [WARNING] /usr/bin/ranlib libgtest.a
> [WARNING] make[2]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_progress_report 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/CMakeFiles
>   26
> [WARNING] [ 54%] Built target gtest
> [WARNING] make[1]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] In file included from 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:27:
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/devices/devices-module.h:31:
>  error: redefinition of typedef 'update_cgroups_parameters_function'
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/fpga/fpga-module.h:31:
>  note: previous declaration of 'update_cgroups_parameters_function' was here
> [WARNING] make[2]: *** 
> [CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o]
>  Error 1
> [WARNING] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2
> [WARNING] make[1]: *** Waiting for unfinished jobs
> [WARNING] make: *** [all] Error 2
> {noformat}
> The code compiles once I revert YARN-9060.
> [~tangzhankun], [~sunilg] care to take a look?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9319) YARN-9060 does not compile

2019-02-21 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang reassigned YARN-9319:
--

Assignee: Zhankun Tang

> YARN-9060 does not compile
> --
>
> Key: YARN-9319
> URL: https://issues.apache.org/jira/browse/YARN-9319
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
> Environment:  RHEL 6.8, CMake 3.2.0, Java 8u151, gcc version 4.4.7 
> 20120313 (Red Hat 4.4.7-17) (GCC)
>Reporter: Wei-Chiu Chuang
>Assignee: Zhankun Tang
>Priority: Blocker
> Attachments: YARN-9319-trunk.001.patch
>
>
> When I do: 
> mvn clean install -DskipTests -Pdist,native  -Dmaven.javadoc.skip=true
> It does not compile on my machine (RHEL 6.8, CMake 3.2.0, Java 8u151, gcc 
> version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC))
> {noformat}
> [WARNING] [ 54%] Built target test-container-executor
> [WARNING] Linking CXX static library libgtest.a
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -P 
> CMakeFiles/gtest.dir/cmake_clean_target.cmake
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_link_script 
> CMakeFiles/gtest.dir/link.txt --verbose=1
> [WARNING] /usr/bin/ar cq libgtest.a  
> CMakeFiles/gtest.dir/data/4/weichiu/hadoop/hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc.o
> [WARNING] /usr/bin/ranlib libgtest.a
> [WARNING] make[2]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_progress_report 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/CMakeFiles
>   26
> [WARNING] [ 54%] Built target gtest
> [WARNING] make[1]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] In file included from 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:27:
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/devices/devices-module.h:31:
>  error: redefinition of typedef 'update_cgroups_parameters_function'
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/fpga/fpga-module.h:31:
>  note: previous declaration of 'update_cgroups_parameters_function' was here
> [WARNING] make[2]: *** 
> [CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o]
>  Error 1
> [WARNING] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2
> [WARNING] make[1]: *** Waiting for unfinished jobs
> [WARNING] make: *** [all] Error 2
> {noformat}
> The code compiles once I revert YARN-9060.
> [~tangzhankun], [~sunilg] care to take a look?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9319) YARN-9060 does not compile

2019-02-21 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774212#comment-16774212
 ] 

Zhankun Tang edited comment on YARN-9319 at 2/21/19 3:52 PM:
-

[~jojochuang] , it's caused by a compiler inconsistent behavior when handling 
typedef an existing name. I use a different name to avoid it. Could you do a 
double-confirm for the patch?

[~sunil.gov...@gmail.com] , could you please help to review?


was (Author: tangzhankun):
[~jojochuang] , it's caused by a compiler inconsistent behavior when handling 
typedef an existing name. I use a different name to avoid it. Could you do a 
double-confirm for the patch?

[~sunil.gov...@gmail.com] , could you please help review?

> YARN-9060 does not compile
> --
>
> Key: YARN-9319
> URL: https://issues.apache.org/jira/browse/YARN-9319
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
> Environment:  RHEL 6.8, CMake 3.2.0, Java 8u151, gcc version 4.4.7 
> 20120313 (Red Hat 4.4.7-17) (GCC)
>Reporter: Wei-Chiu Chuang
>Assignee: Zhankun Tang
>Priority: Blocker
> Attachments: YARN-9319-trunk.001.patch
>
>
> When I do: 
> mvn clean install -DskipTests -Pdist,native  -Dmaven.javadoc.skip=true
> It does not compile on my machine (RHEL 6.8, CMake 3.2.0, Java 8u151, gcc 
> version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC))
> {noformat}
> [WARNING] [ 54%] Built target test-container-executor
> [WARNING] Linking CXX static library libgtest.a
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -P 
> CMakeFiles/gtest.dir/cmake_clean_target.cmake
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_link_script 
> CMakeFiles/gtest.dir/link.txt --verbose=1
> [WARNING] /usr/bin/ar cq libgtest.a  
> CMakeFiles/gtest.dir/data/4/weichiu/hadoop/hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc.o
> [WARNING] /usr/bin/ranlib libgtest.a
> [WARNING] make[2]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_progress_report 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/CMakeFiles
>   26
> [WARNING] [ 54%] Built target gtest
> [WARNING] make[1]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] In file included from 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:27:
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/devices/devices-module.h:31:
>  error: redefinition of typedef 'update_cgroups_parameters_function'
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/fpga/fpga-module.h:31:
>  note: previous declaration of 'update_cgroups_parameters_function' was here
> [WARNING] make[2]: *** 
> [CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o]
>  Error 1
> [WARNING] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2
> [WARNING] make[1]: *** Waiting for unfinished jobs
> [WARNING] make: *** [all] Error 2
> {noformat}
> The code compiles once I revert YARN-9060.
> [~tangzhankun], [~sunilg] care to take a look?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9319) YARN-9060 does not compile

2019-02-21 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774212#comment-16774212
 ] 

Zhankun Tang edited comment on YARN-9319 at 2/21/19 3:52 PM:
-

[~jojochuang] , it's caused by a compiler inconsistent behavior when handling 
typedef an existing name. I use a different name to avoid it. Could you do a 
double-confirm for the patch?

[~sunil.gov...@gmail.com] , could you please help review?


was (Author: tangzhankun):
[~jojochuang] , it's caused by a compiler inconsistent behavior when handling 
typedef an existing name. I use a different name to avoid it. Could you do a 
double-confirm for the path?

[~sunil.gov...@gmail.com] , could you please help review?

> YARN-9060 does not compile
> --
>
> Key: YARN-9319
> URL: https://issues.apache.org/jira/browse/YARN-9319
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
> Environment:  RHEL 6.8, CMake 3.2.0, Java 8u151, gcc version 4.4.7 
> 20120313 (Red Hat 4.4.7-17) (GCC)
>Reporter: Wei-Chiu Chuang
>Assignee: Zhankun Tang
>Priority: Blocker
> Attachments: YARN-9319-trunk.001.patch
>
>
> When I do: 
> mvn clean install -DskipTests -Pdist,native  -Dmaven.javadoc.skip=true
> It does not compile on my machine (RHEL 6.8, CMake 3.2.0, Java 8u151, gcc 
> version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC))
> {noformat}
> [WARNING] [ 54%] Built target test-container-executor
> [WARNING] Linking CXX static library libgtest.a
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -P 
> CMakeFiles/gtest.dir/cmake_clean_target.cmake
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_link_script 
> CMakeFiles/gtest.dir/link.txt --verbose=1
> [WARNING] /usr/bin/ar cq libgtest.a  
> CMakeFiles/gtest.dir/data/4/weichiu/hadoop/hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc.o
> [WARNING] /usr/bin/ranlib libgtest.a
> [WARNING] make[2]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_progress_report 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/CMakeFiles
>   26
> [WARNING] [ 54%] Built target gtest
> [WARNING] make[1]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] In file included from 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:27:
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/devices/devices-module.h:31:
>  error: redefinition of typedef 'update_cgroups_parameters_function'
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/fpga/fpga-module.h:31:
>  note: previous declaration of 'update_cgroups_parameters_function' was here
> [WARNING] make[2]: *** 
> [CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o]
>  Error 1
> [WARNING] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2
> [WARNING] make[1]: *** Waiting for unfinished jobs
> [WARNING] make: *** [all] Error 2
> {noformat}
> The code compiles once I revert YARN-9060.
> [~tangzhankun], [~sunilg] care to take a look?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8890) Port existing GPU module into pluggable device framework

2019-03-06 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-8890.

Resolution: Duplicate

Since we have a sample Nvidia GPU plugin merged in YARN-9060, there is no need 
to do this again.

> Port existing GPU module into pluggable device framework
> 
>
> Key: YARN-8890
> URL: https://issues.apache.org/jira/browse/YARN-8890
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> Once the pluggable device framework is mature, we can port the existing 
> GPU-related code into this new framework.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8891) Documentation of the pluggable device framework

2019-02-22 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775067#comment-16775067
 ] 

Zhankun Tang commented on YARN-8891:


[~sunilg], the Jenkins result is OK too. Could you help merge it? Thanks

> Documentation of the pluggable device framework
> ---
>
> Key: YARN-8891
> URL: https://issues.apache.org/jira/browse/YARN-8891
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8891-trunk.001.patch, YARN-8891-trunk.002.patch, 
> YARN-8891-trunk.003.patch, YARN-8891-trunk.004.patch, 
> YARN-8891-trunk.005.patch, YARN-8891-trunk.006.patch, 
> YARN-8891-trunk.007.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8891) Documentation of the pluggable device framework

2019-02-21 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774833#comment-16774833
 ] 

Zhankun Tang commented on YARN-8891:


[~sunilg], all great suggestions. Thanks a lot! Fixed all the above points.

> Documentation of the pluggable device framework
> ---
>
> Key: YARN-8891
> URL: https://issues.apache.org/jira/browse/YARN-8891
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8891-trunk.001.patch, YARN-8891-trunk.002.patch, 
> YARN-8891-trunk.003.patch, YARN-8891-trunk.004.patch, 
> YARN-8891-trunk.005.patch, YARN-8891-trunk.006.patch, 
> YARN-8891-trunk.007.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8891) Documentation of the pluggable device framework

2019-02-21 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8891:
---
Attachment: YARN-8891-trunk.007.patch

> Documentation of the pluggable device framework
> ---
>
> Key: YARN-8891
> URL: https://issues.apache.org/jira/browse/YARN-8891
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8891-trunk.001.patch, YARN-8891-trunk.002.patch, 
> YARN-8891-trunk.003.patch, YARN-8891-trunk.004.patch, 
> YARN-8891-trunk.005.patch, YARN-8891-trunk.006.patch, 
> YARN-8891-trunk.007.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9121) Users of GpuDiscoverer.getInstance() are not possible to test as instance is a static field

2019-02-24 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776499#comment-16776499
 ] 

Zhankun Tang commented on YARN-9121:


[~snemeth], thanks for the patch.

Looks good to me. +1

> Users of GpuDiscoverer.getInstance() are not possible to test as instance is 
> a static field
> ---
>
> Key: YARN-9121
> URL: https://issues.apache.org/jira/browse/YARN-9121
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-9121.001.patch, YARN-9121.002.patch, 
> YARN-9121.003.patch
>
>
> The clients of GpuDiscoverer are very hard to test as they call 
> GpuDiscoverer.getInstance() internally.
> For example, writing tests for 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin#getNMResourceInfo
>  is quite hard: the GpuDeviceInformation returned by GpuDiscoverer is not 
> interchangeable because GpuDiscoverer is not mockable, since we cannot 
> inject it in tests.
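> A minimal sketch of the direction this enables (hypothetical names, not the 
> actual patch): constructor injection in place of the static getInstance() 
> lookup, so a test can pass a mock or stub.
> {code:java}
> // Hypothetical sketch: the collaborator becomes an injectable dependency.
> interface GpuInfoSource {
>   String getGpuDeviceInformation();
> }
> 
> class GpuResourcePluginSketch {
>   private final GpuInfoSource discoverer;
> 
>   // Production wires in the real discoverer; tests pass a stub or mock.
>   GpuResourcePluginSketch(GpuInfoSource discoverer) {
>     this.discoverer = discoverer;
>   }
> 
>   String getNMResourceInfo() {
>     return this.discoverer.getGpuDeviceInformation();
>   }
> }
> 
> public class InjectionSketch {
>   public static void main(String[] args) {
>     // A trivial stub standing in for a mock in a unit test.
>     GpuResourcePluginSketch plugin =
>         new GpuResourcePluginSketch(() -> "stubbed GPU info");
>     System.out.println(plugin.getNMResourceInfo());
>   }
> }
> {code}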



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9331) Fix a bug that lacking cgroup initialization when bootstrap DeviceResourceHandlerImpl

2019-02-25 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-9331:
---
Attachment: YARN-9331-trunk.001.patch

> Fix a bug that lacking cgroup initialization when bootstrap 
> DeviceResourceHandlerImpl
> -
>
> Key: YARN-9331
> URL: https://issues.apache.org/jira/browse/YARN-9331
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-9331-trunk.001.patch
>
>
> YARN-9060 was a huge patch merge, and a cgroup initialization step was lost 
> by mistake when generating the patch. The local testing passed at that time 
> because "linux-container-executor.cgroups.mount" was false and the local 
> device cgroups were already set up.
> We need to add the initialization to avoid failures when 
> "linux-container-executor.cgroups.mount" is true.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9331) Fix a bug that lacking cgroup initialization when bootstrap DeviceResourceHandlerImpl

2019-02-25 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-9331:
--

 Summary: Fix a bug that lacking cgroup initialization when 
bootstrap DeviceResourceHandlerImpl
 Key: YARN-9331
 URL: https://issues.apache.org/jira/browse/YARN-9331
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


YARN-9060 was a huge patch merge, and a cgroup initialization step was lost by 
mistake when generating the patch. The local testing passed at that time 
because "linux-container-executor.cgroups.mount" was false and the local 
device cgroups were already set up.

We need to add the initialization to avoid failures when 
"linux-container-executor.cgroups.mount" is true.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9156) [YARN-8851] Improve debug message in device plugin method compatibility check of ResourcePluginManager

2019-02-21 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774684#comment-16774684
 ] 

Zhankun Tang commented on YARN-9156:


[~cheersyang], could you please help review this? Thanks.

> [YARN-8851] Improve debug message in device plugin method compatibility check 
> of ResourcePluginManager
> --
>
> Key: YARN-9156
> URL: https://issues.apache.org/jira/browse/YARN-9156
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Trivial
> Attachments: YARN-9156-trunk.001.patch
>
>
> {code:java}
> LOG.debug("Method {} found in class {}",
>  actualClass.getSimpleName(),
>  m.getName());{code}
> should be
> {code:java}
> LOG.debug("Method {} found in class {}",
> m.getName(), actualClass.getSimpleName());
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


