[jira] [Commented] (YARN-7556) Fair scheduler configuration should allow resource types in the minResources and maxResources properties

2018-07-05 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-7556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16533975#comment-16533975 ] Wangda Tan commented on YARN-7556: -- [~haibochen], [~templedf], [~snemeth], I was thinking to post some

[jira] [Updated] (YARN-8489) Need to support pluggable termination policy for native services

2018-07-03 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8489: - Summary: Need to support pluggable termination policy for native services (was: Need to support customer

[jira] [Updated] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-07-03 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8193: - Fix Version/s: (was: 2.9.0) > YARN RM hangs abruptly (stops allocating resources) when running

[jira] [Updated] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-07-03 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8193: - Fix Version/s: 3.2.0 > YARN RM hangs abruptly (stops allocating resources) when running successive >

[jira] [Commented] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-07-03 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16531749#comment-16531749 ] Wangda Tan commented on YARN-8193: -- [~elgoiri], I didn't see this patch went into branch-2.9. just

[jira] [Updated] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-07-03 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8193: - Fix Version/s: (was: 3.2.0) > YARN RM hangs abruptly (stops allocating resources) when running

[jira] [Created] (YARN-8489) Need to support customer termination policy for native services

2018-07-02 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8489: Summary: Need to support customer termination policy for native services Key: YARN-8489 URL: https://issues.apache.org/jira/browse/YARN-8489 Project: Hadoop YARN

[jira] [Commented] (YARN-8489) Need to support customer termination policy for native services

2018-07-02 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530825#comment-16530825 ] Wangda Tan commented on YARN-8489: -- cc: [~gsaha], [~csingh], [~billie.rinaldi], [~eyang] > Need to

[jira] [Updated] (YARN-8488) Need to add "SUCCEED" state to YARN service

2018-07-02 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8488: - Target Version/s: 3.2.0 > Need to add "SUCCEED" state to YARN service >

[jira] [Commented] (YARN-8488) Need to add "SUCCEED" state to YARN service

2018-07-02 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530821#comment-16530821 ] Wangda Tan commented on YARN-8488: -- cc: [~gsaha], [~csingh], [~billie.rinaldi], [~eyang] > Need to add

[jira] [Created] (YARN-8488) Need to add "SUCCEED" state to YARN service

2018-07-02 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8488: Summary: Need to add "SUCCEED" state to YARN service Key: YARN-8488 URL: https://issues.apache.org/jira/browse/YARN-8488 Project: Hadoop YARN Issue Type: Task

[jira] [Updated] (YARN-8488) Need to add "SUCCEED" state to YARN service

2018-07-02 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8488: - Component/s: yarn-native-services > Need to add "SUCCEED" state to YARN service >

[jira] [Commented] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-07-02 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530233#comment-16530233 ] Wangda Tan commented on YARN-8459: -- Attached patch (004) which moved re-reservation to debug log, and

[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-07-02 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8459: - Attachment: YARN-8459.004.patch > Improve logs of Capacity Scheduler to better debug invalid states >

[jira] [Commented] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2018-06-30 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528957#comment-16528957 ] Wangda Tan commented on YARN-8193: -- [~elgoiri], Jenkins will be triggered after patch submitted. > YARN

[jira] [Commented] (YARN-8471) YARN RM hangs and stops allocating resources when applications successively running

2018-06-29 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528545#comment-16528545 ] Wangda Tan commented on YARN-8471: -- [~jutia], if this is mostly same as YARN-8193, could u reopen

[jira] [Resolved] (YARN-8478) The capacity scheduler logs too frequently seriously affecting performance

2018-06-29 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-8478. -- Resolution: Duplicate > The capacity scheduler logs too frequently seriously affecting performance >

[jira] [Commented] (YARN-8479) The capacity scheduler logs too frequently seriously affecting performance

2018-06-29 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528543#comment-16528543 ] Wangda Tan commented on YARN-8479: -- Thanks [~daemon], [~cheersyang], Basically, inside scheduling

[jira] [Commented] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-29 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528540#comment-16528540 ] Wangda Tan commented on YARN-8459: -- Thanks [~bibinchundatt], Can we move this to the YARN-8471? There're

[jira] [Commented] (YARN-8453) Additional Unit tests to verify queue limit and max-limit with multiple resource types

2018-06-28 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526621#comment-16526621 ] Wangda Tan commented on YARN-8453: -- +1 to the patch, thanks [~sunilg]. > Additional Unit tests to

[jira] [Updated] (YARN-8453) Additional Unit tests to verify queue limit and max-limit with multiple resource types

2018-06-27 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8453: - Priority: Major (was: Blocker) > Additional Unit tests to verify queue limit and max-limit with

[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8459: - Priority: Major (was: Critical) > Improve logs of Capacity Scheduler to better debug invalid states >

[jira] [Commented] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525519#comment-16525519 ] Wangda Tan commented on YARN-8459: -- [~cheersyang], I come back to check the logic, this should not

[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8459: - Description: Improve logs in CS to better debug invalid states (was: Improve logs in CS to better ) >

[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8459: - Attachment: YARN-8459.003.patch > Improve logs of Capacity Scheduler to better debug invalid states >

[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8459: - Description: Improve logs in CS to better (was: Thanks [~gopalv] for reporting this issue. In async

[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8459: - Summary: Improve logs of Capacity Scheduler to better debug invalid states (was: Capacity Scheduler

[jira] [Updated] (YARN-8459) Improve logs of Capacity Scheduler to better debug invalid states

2018-06-27 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8459: - Priority: Critical (was: Blocker) > Improve logs of Capacity Scheduler to better debug invalid states >

[jira] [Commented] (YARN-8379) Add an option to allow Capacity Scheduler preemption to balance satisfied queues

2018-06-27 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525328#comment-16525328 ] Wangda Tan commented on YARN-8379: -- bq. we could definitely make a method inside

[jira] [Commented] (YARN-8379) Add an option to allow Capacity Scheduler preemption to balance satisfied queues

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524583#comment-16524583 ] Wangda Tan commented on YARN-8379: -- [~Zian Chen], Thanks for updating the patch, Few comments: 1)

[jira] [Updated] (YARN-8464) Async scheduling thread could be interrupted when there are no NodeManagers in cluster

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8464: - Fix Version/s: 3.2.0 > Async scheduling thread could be interrupted when there are no NodeManagers > in

[jira] [Commented] (YARN-8466) Add Chaos Monkey unit test framework for feature validation in scale

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524485#comment-16524485 ] Wangda Tan commented on YARN-8466: -- Thanks [~cheersyang], actually this JIRA is inspired by the

[jira] [Commented] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524472#comment-16524472 ] Wangda Tan commented on YARN-8459: -- Thanks [~sunilg], Addressed #1. For #2, it is required since we

[jira] [Updated] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8459: - Attachment: YARN-8459.002.patch > Capacity Scheduler should properly handle container allocation on

[jira] [Updated] (YARN-8466) Add Chaos Monkey unit test framework for feature validation in scale

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8466: - Summary: Add Chaos Monkey unit test framework for feature validation in scale (was: Add Chaos Monkey

[jira] [Commented] (YARN-8466) Add Chaos Monkey unit test framework for validation in scale

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524411#comment-16524411 ] Wangda Tan commented on YARN-8466: -- And btw: this is an interesting work, but I may not have bandwidth to

[jira] [Updated] (YARN-8466) Add Chaos Monkey unit test framework for validation in scale

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8466: - Attachment: YARN-8466.poc.001.patch > Add Chaos Monkey unit test framework for validation in scale >

[jira] [Commented] (YARN-8466) Add Chaos Monkey unit test framework for validation in scale

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524409#comment-16524409 ] Wangda Tan commented on YARN-8466: -- Added a prototype which includes example chaos monkey tests for

[jira] [Created] (YARN-8466) Add Chaos Monkey unit test framework for validation in scale

2018-06-26 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8466: Summary: Add Chaos Monkey unit test framework for validation in scale Key: YARN-8466 URL: https://issues.apache.org/jira/browse/YARN-8466 Project: Hadoop YARN

[jira] [Commented] (YARN-8464) Async scheduling thread could be interrupted when there are no NodeManagers in cluster

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524322#comment-16524322 ] Wangda Tan commented on YARN-8464: -- Patch LGTM, thanks [~sunilg], will commit today if no objections. >

[jira] [Updated] (YARN-8464) Async scheduling thread could be interrupted when there are no NodeManagers in cluster

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8464: - Priority: Blocker (was: Critical) > Async scheduling thread could be interrupted when there are no

[jira] [Commented] (YARN-8464) Application does not get to Running state even with available resources on node managers when async scheduling is enabled

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524309#comment-16524309 ] Wangda Tan commented on YARN-8464: -- [~sunilg], mind to update the desc/title to the root cause? {code}

[jira] [Commented] (YARN-1013) CS should watch resource utilization of containers and allocate speculative containers if appropriate

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524225#comment-16524225 ] Wangda Tan commented on YARN-1013: -- Thanks [~haibochen] for explanations, bq. we are trying to just

[jira] [Commented] (YARN-1013) CS should watch resource utilization of containers and allocate speculative containers if appropriate

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524101#comment-16524101 ] Wangda Tan commented on YARN-1013: -- Just took a very quick look at YARN-1015. IIUC, scheduler allocates O

[jira] [Commented] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524057#comment-16524057 ] Wangda Tan commented on YARN-8459: -- [~cheersyang], According to our current locking design of

[jira] [Commented] (YARN-8462) Resource Manager shutdown with FATAL Exception

2018-06-26 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524042#comment-16524042 ] Wangda Tan commented on YARN-8462: -- [~jlowe], it seems this issue is fixed by YARN-8193 already. >

[jira] [Commented] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-25 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523208#comment-16523208 ] Wangda Tan commented on YARN-8423: -- +1, thanks [~sunilg], could u create a JIRA to add tests? Let's get

[jira] [Commented] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-25 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522972#comment-16522972 ] Wangda Tan commented on YARN-8459: -- Attached ver.1 patch to run Jenkins, I felt it might be not

[jira] [Updated] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-25 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8459: - Attachment: YARN-8459.001.patch > Capacity Scheduler should properly handle container allocation on

[jira] [Created] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-25 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8459: Summary: Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler Key: YARN-8459 URL:

[jira] [Updated] (YARN-8459) Capacity Scheduler should properly handle container allocation on app/node when app/node being removed by scheduler

2018-06-25 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8459: - Affects Version/s: 3.1.0 Target Version/s: 3.1.1 Priority: Blocker (was: Major)

[jira] [Updated] (YARN-8453) Allocation to a queue is dishonored if one resource is at the limit

2018-06-25 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8453: - Target Version/s: 3.1.1, 3.0.4 Priority: Blocker (was: Major) > Allocation to a queue is

[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-25 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522897#comment-16522897 ] Wangda Tan commented on YARN-8220: -- Attached ver.4 patch, removed duplicated contents inside Dockerfile

[jira] [Updated] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-25 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8220: - Attachment: YARN-8220.004.patch > Running Tensorflow on YARN with GPU and Docker - Examples >

[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-25 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522855#comment-16522855 ] Wangda Tan commented on YARN-8220: -- Attached ver.3 patch, added several fixes to submit-tf-job.py helper

[jira] [Updated] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-25 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8220: - Attachment: YARN-8220.003.patch > Running Tensorflow on YARN with GPU and Docker - Examples >

[jira] [Commented] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-22 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520968#comment-16520968 ] Wangda Tan commented on YARN-8423: -- Thanks [~sunilg], Overall looks good, except: {code:java} 266

[jira] [Commented] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-14 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512889#comment-16512889 ] Wangda Tan commented on YARN-8423: -- Thanks [~shaneku...@gmail.com], I saw this error happens outside of

[jira] [Assigned] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-14 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-8423: Assignee: Sunil Govindan (was: Wangda Tan) > GPU does not get released even though the

[jira] [Commented] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-14 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512022#comment-16512022 ] Wangda Tan commented on YARN-8423: -- A possible simple fix to workaround the issue is to mark "releasing

[jira] [Updated] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-14 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8423: - Target Version/s: 3.1.1 > GPU does not get released even though the application gets killed. >

[jira] [Updated] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-14 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8423: - Attachment: kill-container-nm.log > GPU does not get released even though the application gets killed. >

[jira] [Commented] (YARN-8423) GPU does not get released even though the application gets killed.

2018-06-14 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512019#comment-16512019 ] Wangda Tan commented on YARN-8423: -- Thanks [~ssath...@hortonworks.com] for filing the issue. I took a

[jira] [Updated] (YARN-8415) TimelineWebServices.getEntity should throw a ForbiddenException(403) instead of 404 when ACL checks fail

2018-06-12 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8415: - Reporter: Sumana Sathish (was: Suma Shivaprasad) > TimelineWebServices.getEntity should throw a

[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-11 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509094#comment-16509094 ] Wangda Tan commented on YARN-8220: -- Discussed with [~eyang] about this and did some tests: Currently,

[jira] [Created] (YARN-8417) Should skip passing HDFS_HOME, HADOOP_CONF_DIR, JAVA_HOME, etc. to Docker container.

2018-06-11 Thread Wangda Tan (JIRA)
Wangda Tan created YARN-8417: Summary: Should skip passing HDFS_HOME, HADOOP_CONF_DIR, JAVA_HOME, etc. to Docker container. Key: YARN-8417 URL: https://issues.apache.org/jira/browse/YARN-8417 Project:

[jira] [Commented] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery

2018-06-11 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508958#comment-16508958 ] Wangda Tan commented on YARN-8242: -- Bulk update on non-blocker issues which are targeted to 3.1.1: If

[jira] [Commented] (YARN-8414) Nodemanager crashes soon if ATSv2 HBase is either down or absent

2018-06-11 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508954#comment-16508954 ] Wangda Tan commented on YARN-8414: -- Bulk update on non-blocker issues which are targeted to 3.1.1: If

[jira] [Updated] (YARN-8257) Native service should automatically adding escapes for environment/launch cmd before sending to YARN

2018-06-11 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8257: - Target Version/s: 3.1.2 > Native service should automatically adding escapes for environment/launch cmd

[jira] [Updated] (YARN-8257) Native service should automatically adding escapes for environment/launch cmd before sending to YARN

2018-06-11 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8257: - Target Version/s: (was: 3.1.1) > Native service should automatically adding escapes for

[jira] [Commented] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2018-06-11 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508952#comment-16508952 ] Wangda Tan commented on YARN-8234: -- Bulk update on non-blocker issues which are targeted to 3.1.1: If

[jira] [Commented] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery

2018-06-11 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508948#comment-16508948 ] Wangda Tan commented on YARN-8242: -- Given there's no more progress on this Jira, and this is not a

[jira] [Updated] (YARN-8242) YARN NM: OOM error while reading back the state store on recovery

2018-06-11 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8242: - Priority: Critical (was: Blocker) > YARN NM: OOM error while reading back the state store on recovery >

[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-10 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507553#comment-16507553 ] Wangda Tan commented on YARN-8220: -- [~eyang], Fair enough, could u help to give some examples of how to

[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-10 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507523#comment-16507523 ] Wangda Tan commented on YARN-8220: -- Attached ver.2 patch, fixed jenkins reported warnings. Addressed the

[jira] [Updated] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-10 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8220: - Attachment: YARN-8220.002.patch > Running Tensorflow on YARN with GPU and Docker - Examples >

[jira] [Commented] (YARN-8394) Improve data locality documentation for Capacity Scheduler

2018-06-06 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503721#comment-16503721 ] Wangda Tan commented on YARN-8394: -- +1, thanks [~cheersyang] for the patch. > Improve data locality

[jira] [Commented] (YARN-5139) [Umbrella] Move YARN scheduler towards global scheduler

2018-06-05 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501947#comment-16501947 ] Wangda Tan commented on YARN-5139: -- [~zhuqi], It is committed to 2.9.0 and after. Welcome to help :), not

[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-02 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499130#comment-16499130 ] Wangda Tan commented on YARN-8220: -- Reopened Jira to trigger Jenkins. > Running Tensorflow on YARN with

[jira] [Reopened] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-02 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reopened YARN-8220: -- > Running Tensorflow on YARN with GPU and Docker - Examples >

[jira] [Resolved] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-02 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-8220. -- Resolution: Later > Running Tensorflow on YARN with GPU and Docker - Examples >

[jira] [Updated] (YARN-8349) Remove YARN registry entries when a service is killed by the RM

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8349: - Priority: Critical (was: Major) > Remove YARN registry entries when a service is killed by the RM >

[jira] [Updated] (YARN-8372) Distributed shell app master should not release containers when shutdown if keep-container is true

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8372: - Priority: Critical (was: Major) > Distributed shell app master should not release containers when

[jira] [Updated] (YARN-8372) Distributed shell app master should not release containers when shutdown if keep-container is true

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8372: - Summary: Distributed shell app master should not release containers when shutdown if keep-container is

[jira] [Updated] (YARN-7962) Race Condition When Stopping DelegationTokenRenewer causes RM crash during failover

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-7962: - Summary: Race Condition When Stopping DelegationTokenRenewer causes RM crash during failover (was: Race

[jira] [Assigned] (YARN-7962) Race Condition When Stopping DelegationTokenRenewer

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-7962: Assignee: BELUGA BEHR > Race Condition When Stopping DelegationTokenRenewer >

[jira] [Updated] (YARN-8384) stdout.txt, stderr.txt logs of a launched docker container is coming with primary group of submit user instead of hadoop

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8384: - Description: When {{yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users}} is set to

[jira] [Updated] (YARN-8384) stdout.txt, stderr.txt logs of a launched docker container is coming with primary group of submit user instead of hadoop

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8384: - Summary: stdout.txt, stderr.txt logs of a launched docker container is coming with primary group of

[jira] [Commented] (YARN-8372) ApplicationAttemptNotFoundException should be handled correctly by Distributed Shell App Master

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498548#comment-16498548 ] Wangda Tan commented on YARN-8372: -- Patch LGTM, +1. Will commit today if no objections. >

[jira] [Commented] (YARN-8349) Remove YARN registry entries when a service is killed by the RM

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498543#comment-16498543 ] Wangda Tan commented on YARN-8349: -- Thank [~billie.rinaldi] for the patch, +1. Will commit by today if no

[jira] [Updated] (YARN-8379) Add an option to allow Capacity Scheduler preemption to balance satisfied queues

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8379: - Description: Existing capacity scheduler only supports preemption for an underutilized queue to reach

[jira] [Updated] (YARN-8379) Add an option to allow Capacity Scheduler preemption to balance satisfied queues

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8379: - Description: Existing capacity scheduler only supports preemption for an underutilized queue to reach

[jira] [Commented] (YARN-8220) Running Tensorflow on YARN with GPU and Docker - Examples

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498270#comment-16498270 ] Wangda Tan commented on YARN-8220: -- Thanks [~eyang] for your comments, For your comments: bq. 1. Avoid

[jira] [Commented] (YARN-7962) Race Condition When Stopping DelegationTokenRenewer

2018-06-01 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16498246#comment-16498246 ] Wangda Tan commented on YARN-7962: -- Thanks [~billie.rinaldi], to me, differences between ver.6 and ver.7

[jira] [Updated] (YARN-8220) Tensorflow yarn spec file to add to native service examples

2018-05-31 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8220: - Description: Tensorflow could be run on YARN and could leverage YARN's distributed features. This spec

[jira] [Commented] (YARN-8384) stdout, stderr logs of a Native Service container is coming with group as nobody

2018-05-31 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497406#comment-16497406 ] Wangda Tan commented on YARN-8384: -- Thanks [~eyang], will commit the patch by tomorrow if no objections.

[jira] [Commented] (YARN-7962) Race Condition When Stopping DelegationTokenRenewer

2018-05-31 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-7962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497368#comment-16497368 ] Wangda Tan commented on YARN-7962: -- The failed tests happened in other JIRAs as well:

[jira] [Commented] (YARN-8349) Remove YARN registry entries when a service is killed by the RM

2018-05-31 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497362#comment-16497362 ] Wangda Tan commented on YARN-8349: -- Gotcha, make sense to me, thanks [~billie.rinaldi]! > Remove YARN

[jira] [Commented] (YARN-8384) stdout, stderr logs of a Native Service container is coming with group as nobody

2018-05-31 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497358#comment-16497358 ] Wangda Tan commented on YARN-8384: -- Thanks [~eyang] for working on this patch. Discussed with [~eyang],

[jira] [Updated] (YARN-8384) stdout, stderr logs of a Native Service container is coming with group as nobody

2018-05-31 Thread Wangda Tan (JIRA)
[ https://issues.apache.org/jira/browse/YARN-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8384: - Target Version/s: 3.2.0, 3.1.1 Priority: Critical (was: Blocker) > stdout, stderr logs of a
