[ https://issues.apache.org/jira/browse/SLIDER-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860598#comment-15860598 ]
Gour Saha commented on SLIDER-1194:
-----------------------------------

[~sseth] ApplicationDiagnostics now has an attribute _*recentFailedContainers*_, which is an array of container IDs. An example is below.

Note that there are still some failure scenarios in which YARN sends an empty string as the diagnostics message. Currently I populate "Container failure info not available from Yarn" explicitly when I see an empty string. I will file the corresponding YARN bugs for these scenarios.

{code}
{
  "finalStatus": "FAILED",
  "finalMessage": "Unstable Application Instance : - failed with component LLAP failed 'recently' 6 times (6 in startup); threshold is 5 - last failure: Failure container_e3376_1485898199590_0152_01_000005 on host cn007.example.com (0): http://cn007.example.com:19888/jobhistory/logs/cn007.example.com:45454/container_e3376_1485898199590_0152_01_000005/ctx/root",
  "recentFailedContainers": [
    "container_e3376_1485898199590_0152_01_000005",
    "container_e3376_1485898199590_0152_01_000007",
    "container_e3376_1485898199590_0152_01_000008",
    "container_e3376_1485898199590_0152_01_000012",
    "container_e3376_1485898199590_0152_01_000002",
    "container_e3376_1485898199590_0152_01_000011"
  ],
  "containers": [
    {
      "containerId": "container_e3376_1485898199590_0152_01_000006",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Application stop triggered",
      "createTime": 1486694235773,
      "startTime": 1486694235871,
      "completionTime": 1486694294989,
      "host": "cn005.example.com",
      "hostURL": "http://cn005.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn005.example.com:45454/container_e3376_1485898199590_0152_01_000006/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000017",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Application stop triggered",
      "createTime": 1486694288833,
      "startTime": 1486694288990,
      "completionTime": 1486694294989,
      "host": "cn006.example.com",
      "hostURL": "http://cn006.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn006.example.com:45454/container_e3376_1485898199590_0152_01_000017/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000007",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Container failure info not available from Yarn",
      "createTime": 1486694235773,
      "startTime": 1486694236259,
      "completionTime": 1486694287125,
      "host": "cn005.example.com",
      "hostURL": "http://cn005.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn005.example.com:45454/container_e3376_1485898199590_0152_01_000007/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000018",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Application stop triggered",
      "createTime": 1486694288832,
      "startTime": 1486694289107,
      "completionTime": 1486694294989,
      "host": "cn009.example.com",
      "hostURL": "http://cn009.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn009.example.com:45454/container_e3376_1485898199590_0152_01_000018/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000008",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Container failure info not available from Yarn",
      "createTime": 1486694235773,
      "startTime": 1486694236042,
      "completionTime": 1486694286803,
      "host": "cn006.example.com",
      "hostURL": "http://cn006.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn006.example.com:45454/container_e3376_1485898199590_0152_01_000008/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000009",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Application stop triggered",
      "createTime": 1486694235773,
      "startTime": 1486694236150,
      "completionTime": 1486694294989,
      "host": "cn006.example.com",
      "hostURL": "http://cn006.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn006.example.com:45454/container_e3376_1485898199590_0152_01_000009/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000002",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Container failure info not available from Yarn",
      "createTime": 1486694235761,
      "startTime": 1486694236950,
      "completionTime": 1486694287451,
      "host": "cn008.example.com",
      "hostURL": "http://cn008.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn008.example.com:45454/container_e3376_1485898199590_0152_01_000002/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000003",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Application stop triggered",
      "createTime": 1486694235773,
      "startTime": 1486694236725,
      "completionTime": 1486694294989,
      "host": "cn008.example.com",
      "hostURL": "http://cn008.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn008.example.com:45454/container_e3376_1485898199590_0152_01_000003/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000014",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Application stop triggered",
      "createTime": 1486694288830,
      "startTime": 1486694289240,
      "completionTime": 1486694294989,
      "host": "cn005.example.com",
      "hostURL": "http://cn005.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn005.example.com:45454/container_e3376_1485898199590_0152_01_000014/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000004",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Application stop triggered",
      "createTime": 1486694235773,
      "startTime": 1486694236392,
      "completionTime": 1486694294989,
      "host": "cn007.example.com",
      "hostURL": "http://cn007.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn007.example.com:45454/container_e3376_1485898199590_0152_01_000004/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000015",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Application stop triggered",
      "createTime": 1486694288830,
      "startTime": 1486694288849,
      "completionTime": 1486694294989,
      "host": "cn007.example.com",
      "hostURL": "http://cn007.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn007.example.com:45454/container_e3376_1485898199590_0152_01_000015/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000005",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Container failure info not available from Yarn",
      "createTime": 1486694235773,
      "startTime": 1486694236509,
      "completionTime": 1486694294818,
      "host": "cn007.example.com",
      "hostURL": "http://cn007.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn007.example.com:45454/container_e3376_1485898199590_0152_01_000005/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000016",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Application stop triggered",
      "createTime": 1486694288832,
      "startTime": 1486694289349,
      "completionTime": 1486694294989,
      "host": "cn008.example.com",
      "hostURL": "http://cn008.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn008.example.com:45454/container_e3376_1485898199590_0152_01_000016/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000010",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Application stop triggered",
      "createTime": 1486694235773,
      "startTime": 1486694236617,
      "completionTime": 1486694294989,
      "host": "cn009.example.com",
      "hostURL": "http://cn009.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn009.example.com:45454/container_e3376_1485898199590_0152_01_000010/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000011",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Container failure info not available from Yarn",
      "createTime": 1486694235773,
      "startTime": 1486694236834,
      "completionTime": 1486694287010,
      "host": "cn009.example.com",
      "hostURL": "http://cn009.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn009.example.com:45454/container_e3376_1485898199590_0152_01_000011/ctx/root"
    },
    {
      "containerId": "container_e3376_1485898199590_0152_01_000012",
      "component": "LLAP",
      "state": 4,
      "exitCode": 0,
      "diagnostics": "Container failure info not available from Yarn",
      "createTime": 1486694237258,
      "startTime": 1486694237266,
      "completionTime": 1486694287309,
      "host": "cn008.example.com",
      "hostURL": "http://cn008.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn008.example.com:45454/container_e3376_1485898199590_0152_01_000012/ctx/root"
    }
  ]
}
{code}

> If an app fails due to "Too many recent failures" - provide the list of containers which counted towards this
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SLIDER-1194
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1194
>             Project: Slider
>          Issue Type: Sub-task
>          Components: appmaster, client
>            Reporter: Siddharth Seth
>            Priority: Critical
>             Fix For: Slider 1.0.0
>
>
> All containers is useful, but can start getting really large over time. If an
> app fails due to too many recent failures - having those containers available
> in a separate list will be very useful

--
This message was sent by Atlassian JIRA (v6.3.15#6346)