[ https://issues.apache.org/jira/browse/MESOS-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chun-Hung Hsiao updated MESOS-7939:
-----------------------------------
Summary: Early disk usage check for garbage collection during recovery (was: Early disk usage check for garbage collection)

Early disk usage check for garbage collection during recovery
--------------------------------------------------------------

Key: MESOS-7939
URL: https://issues.apache.org/jira/browse/MESOS-7939
Project: Mesos
Issue Type: Bug
Components: agent
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao
Priority: Critical
Fix For: 1.4.1

Currently the default value of `disk_watch_interval` is 1 minute. This is not frequent enough and can lead to the following scenario:

1. The disk usage was checked and there was not enough headroom:
{noformat}
I0901 17:54:33.000000 25510 slave.cpp:5896] Current disk usage 99.87%. Max allowed age: 0ns
{noformat}
However, nothing was pruned, because no container had been scheduled for GC yet.

2. A task completed. The task contained many nested containers, each of which used a lot of disk space. Note that the Mesos agent has no way to schedule individual nested containers for GC, since nested containers are not necessarily tied to tasks. Only when the top-level container completed was it scheduled for GC, at which point its nested containers would be GC'ed along with it (the scheduling/pruning mechanism is sketched below):
{noformat}
I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc' for gc 1.99999466483852days in the future
I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e' for gc 1.99999466405037days in the future
I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e/runs/5e70adb1-939e-4d0f-a513-0f77704620bc' for gc 1.9999946635763days in the future
I0901 17:54:44.000000 25510 gc.cpp:59] Scheduling '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__81953586-6f33-4abf-921d-2bba0481836e' for gc 1.99999466324148days in the future
{noformat}
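For context, here is a minimal, hypothetical sketch of the mechanism described in steps 1 and 2: completed top-level sandboxes get scheduled for removal after a delay, and the periodic disk usage check maps the current usage to a "max allowed age" and prunes anything older. The names (`scheduleForGC`, `checkDiskUsage`, `ScheduledPath`) and the exact formula are illustrative stand-ins, not the actual gc.cpp/slave.cpp code; the real agent drives this from the `disk_watch_interval`, `gc_delay`, and `gc_disk_headroom` flags.
{noformat}
#include <algorithm>
#include <chrono>
#include <iostream>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;
using Duration = std::chrono::seconds;

struct ScheduledPath {
  std::string path;
  Clock::time_point scheduledAt;  // When the sandbox was scheduled for GC.
};

std::vector<ScheduledPath> scheduled;

// Called when a top-level container completes. Nested container directories
// live underneath the top-level sandbox, so they are removed along with it.
void scheduleForGC(const std::string& path, Duration delay) {
  scheduled.push_back({path, Clock::now()});
  std::cout << "Scheduling '" << path << "' for gc in "
            << delay.count() << "s" << std::endl;
}

// Runs once per `disk_watch_interval`. The fuller the disk, the shorter the
// max allowed age, down to zero when there is no headroom left.
void checkDiskUsage(double usage, Duration gcDelay) {
  const Duration maxAllowedAge{
      static_cast<long long>(gcDelay.count() * std::max(0.0, 1.0 - usage))};
  std::cout << "Current disk usage " << usage * 100
            << "%. Max allowed age: " << maxAllowedAge.count() << "s" << std::endl;

  for (const ScheduledPath& p : scheduled) {
    if (Clock::now() - p.scheduledAt > maxAllowedAge) {
      // The real agent would delete the directory here.
      std::cout << "Pruning '" << p.path << "'" << std::endl;
    }
  }
}

int main() {
  scheduleForGC("/var/lib/mesos/slave/slaves/S5/frameworks/F/executors/E/runs/R",
                Duration{2 * 24 * 3600});  // Roughly 2 days, as in the log above.

  // At 99.87% usage the max allowed age collapses to ~0, so the sandbox
  // scheduled above is pruned as soon as this check runs -- but the check
  // itself only runs once per `disk_watch_interval`.
  checkDiskUsage(0.9987, Duration{2 * 24 * 3600});
}
{noformat}
The point of the sketch is that pruning is only triggered from inside the periodic check, which is what makes the timing in step 3 below matter.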
3. Since the next disk usage check was still about 40 seconds away, no GC was performed even though the disk was full. As a result, the Mesos agent failed to checkpoint the task status:
{noformat}
I0901 17:54:49.000000 25513 status_update_manager.cpp:323] Received status update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
F0901 17:54:49.000000 25513 slave.cpp:4748] CHECK_READY(future): is FAILED: Failed to open '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates' for status updates: No space left on device
Failed to handle status update TASK_FAILED (UUID: bf24c3da-db23-4c82-a09f-a3b859e8cad4) for task node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
{noformat}

4. When the agent restarted, it tried to checkpoint the task status again. However, since the first disk usage check was scheduled 1 minute after startup, the agent failed before GC kicked in, falling into a restart failure loop:
{noformat}
F0901 17:55:06.000000 31114 slave.cpp:4748] CHECK_READY(future): is FAILED: Failed to open '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S5/frameworks/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005/executors/node__4ae69c7c-e32e-41d2-a485-88145a3e385c/runs/602befac-3ff5-44d7-acac-aeebdc0e4666/tasks/node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84/task.updates' for status updates: No space left on device
Failed to handle status update TASK_FAILED (UUID: fb9c3951-9a93-4925-a7f0-9ba7e38d2398) for task node-0-server__e5e468a3-b2ee-42ee-80e8-edc19a3aef84 of framework 9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-0005
{noformat}

We should kick in GC early, so the agent can recover from this state (a rough sketch of that idea follows below).

Related ticket: MESOS-7031
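A minimal sketch of what "kick in GC early" could look like, under the assumption that the agent runs one disk usage check immediately once recovery completes instead of waiting for the first `disk_watch_interval` tick. The function names (`recover`, `checkDiskUsageAndPrune`) are made up for illustration and do not correspond to the actual agent code or to the eventual patch:
{noformat}
#include <chrono>
#include <iostream>
#include <thread>

using Duration = std::chrono::seconds;

// Stand-ins for the real agent routines; names are illustrative only.
void recover() { std::cout << "Recovering checkpointed state" << std::endl; }

void checkDiskUsageAndPrune() {
  std::cout << "Checking disk usage and pruning GC-scheduled sandboxes" << std::endl;
}

int main() {
  const Duration diskWatchInterval{60};  // --disk_watch_interval default: 1 minute.

  recover();

  // Proposed behavior: run one check right after recovery, so a full disk can
  // be garbage collected before the agent re-attempts to checkpoint the
  // pending status update and crashes again.
  checkDiskUsageAndPrune();

  // Regular periodic check (simplified to a blocking loop here; the real
  // agent uses asynchronous delayed dispatches instead).
  for (;;) {
    std::this_thread::sleep_for(diskWatchInterval);
    checkDiskUsageAndPrune();
  }
}
{noformat}
With the early check in place, the restart failure loop in step 4 is broken: on the first restart after the crash, the scheduled sandboxes are pruned before the status update is re-checkpointed.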