James Peach created MESOS-7963:
----------------------------------

             Summary: Task groups can lose the container limitation status.
                 Key: MESOS-7963
                 URL: https://issues.apache.org/jira/browse/MESOS-7963
             Project: Mesos
          Issue Type: Bug
          Components: containerization, executor
            Reporter: James Peach

If you run a single task in a task group and that task fails with a container limitation, that status update can be lost and only the executor failure will be reported to the framework.

{noformat}
exec /opt/mesos/bin/mesos-execute --content_type=json --master=jpeach.apple.com:5050 '--task_group={
"tasks": [
    {
        "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a",
        "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"},
        "agent_id": {"value" : ""},
        "resources": [{
            "name": "cpus",
            "type": "SCALAR",
            "scalar": {
                "value": 0.2
            }
        }, {
            "name": "mem",
            "type": "SCALAR",
            "scalar": {
                "value": 32
            }
        }, {
            "name": "disk",
            "type": "SCALAR",
            "scalar": {
                "value": 2
            }
        }
        ],
        "command": {
            "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M count=64 ; sleep 10000"
        }
    }
]
}'
I0911 11:48:01.480689 7340 scheduler.cpp:184] Version: 1.5.0
I0911 11:48:01.488868 7339 scheduler.cpp:470] New master detected at master@17.228.224.108:5050
Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
Received status update TASK_RUNNING for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
  source: SOURCE_EXECUTOR
Received status update TASK_FAILED for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
  message: 'Command terminated with signal Killed'
  source: SOURCE_EXECUTOR
{noformat}

However, the agent logs show that this failed with a memory limitation:

{noformat}
I0911 11:48:02.235818 7012 http.cpp:532] Processing call WAIT_NESTED_CONTAINER
I0911 11:48:02.236395 7013 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
I0911 11:48:02.237083 7016 slave.cpp:4875] Forwarding the update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050
I0911 11:48:02.283661 7007 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
I0911 11:48:04.771455 7014 memory.cpp:516] OOM detected for container 474388fe-43c3-4372-b903-eaca22740996
I0911 11:48:04.776445 7014 memory.cpp:556] Memory limit exceeded: Requested: 64MB Maximum Used: 64MB
...
I0911 11:48:04.776943 7012 containerizer.cpp:2681] Container 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be terminated
{noformat}

The following {{mesos-execute}} task will show the container limitation correctly:

{noformat}
exec /opt/mesos/bin/mesos-execute --content_type=json --master=jpeach.apple.com:5050 '--task_group={
"tasks": [
    {
        "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211",
        "task_id": {"value" : "1372b2e2-c501-4e80-bcbd-1a5c5194e206"},
        "agent_id": {"value" : ""},
        "resources": [{
            "name": "cpus",
            "type": "SCALAR",
            "scalar": {
                "value": 0.2
            }
        }, {
            "name": "mem",
            "type": "SCALAR",
            "scalar": {
                "value": 32
            }
        }],
        "command": {
            "value": "sleep 600"
        }
    }, {
        "name": "7247643c-5e4d-4b01-9839-e38db49f7f4d",
        "task_id": {"value" : "a7571608-3a53-4971-a187-41ed8be183ba"},
        "agent_id": {"value" : ""},
        "resources": [{
            "name": "cpus",
            "type": "SCALAR",
            "scalar": {
                "value": 0.2
            }
        }, {
            "name": "mem",
            "type": "SCALAR",
            "scalar": {
                "value": 32
            }
        }, {
            "name": "disk",
            "type": "SCALAR",
            "scalar": {
                "value": 2
            }
        }
        ],
        "command": {
            "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M count=64 ; sleep 10000"
        }
    }
]
}'
I0911 12:29:17.772161 7655 scheduler.cpp:184] Version: 1.5.0
I0911 12:29:17.780640 7661 scheduler.cpp:470] New master detected at master@17.228.224.108:5050
Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0011
Submitted task group with tasks [ 1372b2e2-c501-4e80-bcbd-1a5c5194e206, a7571608-3a53-4971-a187-41ed8be183ba ] to agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
Received status update TASK_RUNNING for task '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
  source: SOURCE_EXECUTOR
Received status update TASK_RUNNING for task 'a7571608-3a53-4971-a187-41ed8be183ba'
  source: SOURCE_EXECUTOR
Received status update TASK_FAILED for task '1372b2e2-c501-4e80-bcbd-1a5c5194e206'
  message: 'Command terminated with signal Killed'
  source: SOURCE_EXECUTOR
Received status update TASK_FAILED for task 'a7571608-3a53-4971-a187-41ed8be183ba'
  message: 'Disk usage (65556KB) exceeds quota (34MB)'
  source: SOURCE_AGENT
  reason: REASON_CONTAINER_LIMITATION_DISK
{noformat}
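When reproducing this, the symptom could be checked mechanically by scanning the {{mesos-execute}} output for {{TASK_FAILED}} updates that carry no {{REASON_CONTAINER_LIMITATION_*}} reason. The following is only a rough sketch: the function names are hypothetical, and it assumes the output keeps the "Received status update ... for task '...'" wording shown in the transcripts above.

```python
import re

# Patterns are assumptions based on the mesos-execute output shown in this report.
UPDATE_RE = re.compile(r"Received status update (\w+) for task '([^']+)'")
REASON_RE = re.compile(r"reason:\s*(REASON_\w+)")


def parse_updates(output):
    """Return (task_id, state, reason_or_None) for each status update in the text."""
    matches = list(UPDATE_RE.finditer(output))
    updates = []
    for i, match in enumerate(matches):
        # The fields of an update run until the next update (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(output)
        reason = REASON_RE.search(output, match.end(), end)
        updates.append(
            (match.group(2), match.group(1), reason.group(1) if reason else None)
        )
    return updates


def failures_without_limitation(output):
    """List task IDs whose TASK_FAILED update has no container limitation reason."""
    return [
        task_id
        for task_id, state, reason in parse_updates(output)
        if state == "TASK_FAILED"
        and not (reason or "").startswith("REASON_CONTAINER_LIMITATION")
    ]
```

For the single-task run above this would list task 2866368d-7279-4657-b8eb-bf1d968e8ebf, which is the lost limitation. For the two-task run it would list only 1372b2e2-c501-4e80-bcbd-1a5c5194e206, the task that did not itself exceed a limit, so a hit is a hint to compare against the agent log rather than proof of the bug.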