[ 
https://issues.apache.org/jira/browse/SLIDER-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860598#comment-15860598
 ] 

Gour Saha commented on SLIDER-1194:
-----------------------------------

[~sseth] ApplicationDiagnostics now has an attribute _*recentFailedContainers*_, 
which is an array of container IDs. An example follows. Note that there are 
still some failure scenarios in which YARN sends an empty string as the 
diagnostics message. For now, I explicitly populate "Container failure info not 
available from Yarn" whenever I see an empty string. I will file the 
corresponding YARN bugs for these scenarios.

{code}
{
  "finalStatus": "FAILED", 
  "finalMessage": "Unstable Application Instance : - failed with component LLAP failed 'recently' 6 times (6 in startup); threshold is 5 - last failure: Failure container_e3376_1485898199590_0152_01_000005 on host cn007.example.com (0): http://cn007.example.com:19888/jobhistory/logs/cn007.example.com:45454/container_e3376_1485898199590_0152_01_000005/ctx/root",
  "recentFailedContainers": [
    "container_e3376_1485898199590_0152_01_000005", 
    "container_e3376_1485898199590_0152_01_000007", 
    "container_e3376_1485898199590_0152_01_000008", 
    "container_e3376_1485898199590_0152_01_000012", 
    "container_e3376_1485898199590_0152_01_000002", 
    "container_e3376_1485898199590_0152_01_000011"
  ],
  "containers": [
    {
      "containerId": "container_e3376_1485898199590_0152_01_000006", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Application stop triggered", 
      "createTime": 1486694235773, 
      "startTime": 1486694235871, 
      "completionTime": 1486694294989, 
      "host": "cn005.example.com", 
      "hostURL": "http://cn005.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn005.example.com:45454/container_e3376_1485898199590_0152_01_000006/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000017", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Application stop triggered", 
      "createTime": 1486694288833, 
      "startTime": 1486694288990, 
      "completionTime": 1486694294989, 
      "host": "cn006.example.com", 
      "hostURL": "http://cn006.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn006.example.com:45454/container_e3376_1485898199590_0152_01_000017/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000007", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Container failure info not available from Yarn", 
      "createTime": 1486694235773, 
      "startTime": 1486694236259, 
      "completionTime": 1486694287125, 
      "host": "cn005.example.com", 
      "hostURL": "http://cn005.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn005.example.com:45454/container_e3376_1485898199590_0152_01_000007/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000018", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Application stop triggered", 
      "createTime": 1486694288832, 
      "startTime": 1486694289107, 
      "completionTime": 1486694294989, 
      "host": "cn009.example.com", 
      "hostURL": "http://cn009.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn009.example.com:45454/container_e3376_1485898199590_0152_01_000018/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000008", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Container failure info not available from Yarn", 
      "createTime": 1486694235773, 
      "startTime": 1486694236042, 
      "completionTime": 1486694286803, 
      "host": "cn006.example.com", 
      "hostURL": "http://cn006.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn006.example.com:45454/container_e3376_1485898199590_0152_01_000008/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000009", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Application stop triggered", 
      "createTime": 1486694235773, 
      "startTime": 1486694236150, 
      "completionTime": 1486694294989, 
      "host": "cn006.example.com", 
      "hostURL": "http://cn006.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn006.example.com:45454/container_e3376_1485898199590_0152_01_000009/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000002", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Container failure info not available from Yarn", 
      "createTime": 1486694235761, 
      "startTime": 1486694236950, 
      "completionTime": 1486694287451, 
      "host": "cn008.example.com", 
      "hostURL": "http://cn008.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn008.example.com:45454/container_e3376_1485898199590_0152_01_000002/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000003", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Application stop triggered", 
      "createTime": 1486694235773, 
      "startTime": 1486694236725, 
      "completionTime": 1486694294989, 
      "host": "cn008.example.com", 
      "hostURL": "http://cn008.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn008.example.com:45454/container_e3376_1485898199590_0152_01_000003/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000014", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Application stop triggered", 
      "createTime": 1486694288830, 
      "startTime": 1486694289240, 
      "completionTime": 1486694294989, 
      "host": "cn005.example.com", 
      "hostURL": "http://cn005.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn005.example.com:45454/container_e3376_1485898199590_0152_01_000014/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000004", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Application stop triggered", 
      "createTime": 1486694235773, 
      "startTime": 1486694236392, 
      "completionTime": 1486694294989, 
      "host": "cn007.example.com", 
      "hostURL": "http://cn007.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn007.example.com:45454/container_e3376_1485898199590_0152_01_000004/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000015", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Application stop triggered", 
      "createTime": 1486694288830, 
      "startTime": 1486694288849, 
      "completionTime": 1486694294989, 
      "host": "cn007.example.com", 
      "hostURL": "http://cn007.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn007.example.com:45454/container_e3376_1485898199590_0152_01_000015/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000005", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Container failure info not available from Yarn", 
      "createTime": 1486694235773, 
      "startTime": 1486694236509, 
      "completionTime": 1486694294818, 
      "host": "cn007.example.com", 
      "hostURL": "http://cn007.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn007.example.com:45454/container_e3376_1485898199590_0152_01_000005/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000016", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Application stop triggered", 
      "createTime": 1486694288832, 
      "startTime": 1486694289349, 
      "completionTime": 1486694294989, 
      "host": "cn008.example.com", 
      "hostURL": "http://cn008.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn008.example.com:45454/container_e3376_1485898199590_0152_01_000016/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000010", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Application stop triggered", 
      "createTime": 1486694235773, 
      "startTime": 1486694236617, 
      "completionTime": 1486694294989, 
      "host": "cn009.example.com", 
      "hostURL": "http://cn009.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn009.example.com:45454/container_e3376_1485898199590_0152_01_000010/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000011", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Container failure info not available from Yarn", 
      "createTime": 1486694235773, 
      "startTime": 1486694236834, 
      "completionTime": 1486694287010, 
      "host": "cn009.example.com", 
      "hostURL": "http://cn009.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn009.example.com:45454/container_e3376_1485898199590_0152_01_000011/ctx/root"
    }, 
    {
      "containerId": "container_e3376_1485898199590_0152_01_000012", 
      "component": "LLAP", 
      "state": 4, 
      "exitCode": 0, 
      "diagnostics": "Container failure info not available from Yarn", 
      "createTime": 1486694237258, 
      "startTime": 1486694237266, 
      "completionTime": 1486694287309, 
      "host": "cn008.example.com", 
      "hostURL": "http://cn008.example.com:8042",
      "logLink": "http://cn007.example.com:19888/jobhistory/logs/cn008.example.com:45454/container_e3376_1485898199590_0152_01_000012/ctx/root"
    }
  ] 
}
{code}
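
A client could use the new attribute roughly as follows. This is only a minimal sketch in Python, not Slider code: it joins _recentFailedContainers_ against _containers_ by _containerId_, and mirrors the empty-string substitution described above. The function names ({{recent_failures}}, {{diagnostics_or_placeholder}}) and the trimmed sample are hypothetical, and the empty diagnostics value in the sample is made up to exercise the placeholder path.

```python
import json

# Placeholder text quoted in the comment above; substituted for an empty message.
PLACEHOLDER = "Container failure info not available from Yarn"


def diagnostics_or_placeholder(message):
    """Return the diagnostics message, or the placeholder if it is empty."""
    return message if message else PLACEHOLDER


def recent_failures(diagnostics):
    """Join recentFailedContainers against containers by containerId.

    Returns one dict per recently failed container, in the order the ids
    appear in recentFailedContainers. Ids with no matching entry are skipped.
    """
    by_id = {c["containerId"]: c for c in diagnostics.get("containers", [])}
    result = []
    for container_id in diagnostics.get("recentFailedContainers", []):
        entry = by_id.get(container_id)
        if entry is None:
            continue
        result.append({
            "containerId": container_id,
            "host": entry.get("host"),
            "diagnostics": diagnostics_or_placeholder(entry.get("diagnostics", "")),
            "logLink": entry.get("logLink"),
        })
    return result


# Trimmed-down, hypothetical sample in the same shape as the output above.
SAMPLE = json.loads("""
{
  "finalStatus": "FAILED",
  "recentFailedContainers": ["container_e3376_1485898199590_0152_01_000005"],
  "containers": [
    {"containerId": "container_e3376_1485898199590_0152_01_000006",
     "host": "cn005.example.com", "diagnostics": "Application stop triggered"},
    {"containerId": "container_e3376_1485898199590_0152_01_000005",
     "host": "cn007.example.com", "diagnostics": ""}
  ]
}
""")

if __name__ == "__main__":
    for failure in recent_failures(SAMPLE):
        print(failure["containerId"], failure["host"], failure["diagnostics"])
```

Indexing _containers_ by id once keeps the lookup cheap even when the full container list grows large, which is the case the separate list is meant to help with.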

> If an app fails due to "Too many recent failures" - provide the list of 
> containers which counted towards this
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SLIDER-1194
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1194
>             Project: Slider
>          Issue Type: Sub-task
>          Components: appmaster, client
>            Reporter: Siddharth Seth
>            Priority: Critical
>             Fix For: Slider 1.0.0
>
>
> The list of all containers is useful, but it can grow very large over time. 
> If an app fails due to too many recent failures, having those containers 
> available in a separate list would be very useful.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
