Joseph Wu created MESOS-9977:
--------------------------------

             Summary: Agent does not check for immutable files while removing 
persistent volumes (and possibly in other GC operations)
                 Key: MESOS-9977
                 URL: https://issues.apache.org/jira/browse/MESOS-9977
             Project: Mesos
          Issue Type: Bug
          Components: agent
    Affects Versions: 1.9.0, 1.8.1, 1.7.2, 1.6.2
            Reporter: Joseph Wu


We observed an exit/crash loop on an agent originating from deleting a 
persistent volume:
{code}
slave.cpp:4557] Deleting persistent volume '<UUID>' at 
'/path/to/mesos/slave/volumes/roles/my-role/<UUID>'
{code}

This persistent volume contained one or more files marked as 
{{immutable}}.

When the agent tried to delete this persistent volume via {{os::rmdir(...)}}, 
it encountered the immutable file(s) and exited with:
{code}
slave.cpp:4423] EXIT with status 1: Failed to sync checkpointed resources: 
Failed to remove persistent volume '<UUID>' at 
'/path/to/mesos/slave/volumes/roles/my-role/<UUID>': Operation not permitted
{code}

The agent was then unable to start up again because, during recovery, it 
attempted to delete the same persistent volume and failed in the same way.

Manually removing the immutable attribute from files within the persistent 
volume allows the agent to recover:
{code}
chattr -R -i /path/to/mesos/slave/volumes/roles/my-role/<UUID>
{code}
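
For reference, a minimal sketch of what a programmatic equivalent of 
{{chattr -i}} for a single path could look like on Linux, using the 
{{FS_IOC_GETFLAGS}} / {{FS_IOC_SETFLAGS}} ioctls. The helper names 
{{withoutImmutable}} and {{clearImmutable}} are hypothetical and are not 
existing Mesos or stout functions. This is only an illustration under those 
assumptions, not a proposed patch. Clearing the flag requires 
{{CAP_LINUX_IMMUTABLE}}, and not every filesystem supports these ioctls.

```cpp
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <string>

// Hypothetical helper (not part of Mesos or stout): returns `flags` with the
// immutable bit cleared and every other inode flag preserved.
inline int withoutImmutable(int flags)
{
  return flags & ~FS_IMMUTABLE_FL;
}

// Hypothetical helper: clears the immutable attribute on a single file or
// directory, roughly like `chattr -i <path>` (no recursion). Symlinks are not
// followed, so a symlink path fails with ELOOP instead of touching its target.
// Returns true if the flag is not set afterwards; returns false with `errno`
// set if the path cannot be opened or the filesystem rejects the ioctl.
bool clearImmutable(const std::string& path)
{
  int fd = ::open(
      path.c_str(), O_RDONLY | O_NONBLOCK | O_NOFOLLOW | O_CLOEXEC);
  if (fd < 0) {
    return false;
  }

  // The kernel reads and writes an `int` here, as e2fsprogs' chattr does.
  int flags = 0;
  bool ok = ::ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0;

  if (ok && (flags & FS_IMMUTABLE_FL) != 0) {
    flags = withoutImmutable(flags);
    assert((flags & FS_IMMUTABLE_FL) == 0);

    // Needs CAP_LINUX_IMMUTABLE; fails with EPERM otherwise.
    ok = ::ioctl(fd, FS_IOC_SETFLAGS, &flags) == 0;
  }

  // Preserve the errno from the failing call across close().
  int saved = errno;
  ::close(fd);
  errno = saved;

  return ok;
}
```

A caller would have to apply such a helper to every entry under the volume 
(the {{-R}} in the command above) before retrying the removal.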

Immutable attributes can be introduced by any task running on the agent: a 
task with sufficient permissions can simply call {{chattr +i ...}}.  This 
attribute could also affect sandbox GC, which uses {{os::rmdir}} for cleanup 
as well.  However, sandbox GC tends to warn rather than exit on failure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
