[jira] [Commented] (MESOS-9875) Mesos did not respond correctly when operations should fail

Greg Mann (JIRA) Tue, 06 Aug 2019 09:13:09 -0700


    [ 
https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901216#comment-16901216
 ]


Greg Mann commented on MESOS-9875:
----------------------------------

I'm trying to figure out how to address this with the current information that 
we checkpoint on the agent. The old-style checkpointing on the agent went like 
this:
1) Checkpoint resources to a "target file"
2) Sync checkpointed resources to disk, which creates persistent volumes
3) If #2 succeeds, move the "target file" to the actual checkpoint location
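
As a rough illustration of the three steps above, here is a minimal Python sketch. It is not the agent's actual C++ code: the function name `checkpoint_two_phase`, the `sync_to_disk` callback, and the `.target` file suffix are all hypothetical, and the resources are treated as an opaque string.

```python
import os


def checkpoint_two_phase(resources, checkpoint_path, sync_to_disk):
    """Hypothetical sketch of the old-style two-phase checkpoint."""
    # Assumed naming: the "target file" lives next to the real checkpoint.
    target_path = checkpoint_path + ".target"

    # Step 1: checkpoint the desired resources to the "target file".
    with open(target_path, "w") as f:
        f.write(resources)
        f.flush()
        os.fsync(f.fileno())

    # Step 2: apply the resources on disk (for example, create persistent
    # volume directories). If this raises, the target file is left behind
    # and the real checkpoint is not modified.
    sync_to_disk(resources)

    # Step 3: step 2 succeeded, so move the target file into the actual
    # checkpoint location.
    os.replace(target_path, checkpoint_path)
```

If the callback passed as `sync_to_disk` fails, for instance because a volume directory cannot be created, the exception propagates before step 3 and the previously committed checkpoint stays as it was.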

When implementing operation feedback, we thought we could get away without this 
two-phase checkpointing, since we now have the operation feedback streams which 
we can use as another source of information. When recovering, the agent has 
logic that inspects both the checkpointed resources/operations and the 
operation streams checkpointed by the operation status update manager in 
order to recover properly.

It's possible that we could use the old-style checkpointed resource files in 
order to accomplish recovery now (we still write those to disk to enable agent 
downgrades), but I'm worried that this will be confusing. But perhaps it's 
already confusing :)

I'll try to have a patch up by EOD with a solution for you to look at.

> Mesos did not respond correctly when operations should fail
> -----------------------------------------------------------
>
>                 Key: MESOS-9875
>                 URL: https://issues.apache.org/jira/browse/MESOS-9875
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Yifan Xing
>            Assignee: Greg Mann
>            Priority: Major
>              Labels: foundations, mesosphere
>         Attachments: Screen Shot 2019-06-27 at 15.07.20.png
>
>
> To test persistent volumes with {{OPERATION_FAILED/ERROR}} feedback, we 
> SSHed into the Mesos agent and made it unable to create subdirectories in 
> {{/srv/mesos/work/volumes}}. However, Mesos did not send any operation-failed 
> response. Instead, we received {{OPERATION_FINISHED}} feedback.
> Steps to recreate the issue:
> 1. SSH into a magent.
> 2. Make it impossible to create a persistent volume (we expect the agent to 
> crash and reregister, and the master to realize that the operation is 
> {{OPERATION_DROPPED}}):
>  * cd /srv/mesos/work (if the volumes directory doesn't exist, mkdir /srv/mesos/work/volumes)
>  * chattr -RV +i volumes (then no subdirectories can be created)
> 3. Launch a service with persistent volumes, constrained to use only 
> the magent modified above.
>  
>  
> Scheduler logs showing that `OPERATION_FINISHED` was received:
> (Also see screenshot)
>  
> 2019-06-27 21:57:11.879 [12768651|rdar://12768651] 
> [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored 
> operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and 
> feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on 
> serviceID=yifan-badagents-1
>  
> * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: 
> REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch 
> container: Failed to change the ownership of the persistent volume at 
> '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' 
> with uid 264 and gid 264: No such file or directory



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
