[ 
https://issues.apache.org/jira/browse/MESOS-9875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16900866#comment-16900866
 ] 

Greg Mann edited comment on MESOS-9875 at 8/6/19 12:31 PM:
-----------------------------------------------------------

It looks like the {{OPERATION_FINISHED}} update should only be sent after the 
agent fails over and recovers its checkpointed operations. We need to make sure 
that if the agent's call to {{syncCheckpointedResources()}} fails (this is the 
function that actually creates the persistent volume), then the operation is 
not recovered by the agent in state {{OPERATION_FINISHED}}. Currently, it 
looks like the agent fails to create the persistent volume, crashes, and then 
recovers the operation in state {{OPERATION_FINISHED}} and sends the update.


was (Author: greggomann):
[~jamespeach] could you tell me what Mesos SHA this was observed on? Looking at 
the code, I'm having trouble identifying how this would happen, since we don't 
send the operation feedback until after the operation has been committed to 
disk; feedback is sent to the master via the final call to 
{{operationStatusUpdateManager.update(update);}} in {{Slave::applyOperation()}}.

> Mesos did not respond correctly when operations should fail
> -----------------------------------------------------------
>
>                 Key: MESOS-9875
>                 URL: https://issues.apache.org/jira/browse/MESOS-9875
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>            Reporter: Yifan Xing
>            Assignee: Greg Mann
>            Priority: Major
>              Labels: foundations, mesosphere
>         Attachments: Screen Shot 2019-06-27 at 15.07.20.png
>
>
> To test persistent volumes with {{OPERATION_FAILED/ERROR}} feedback, we 
> sshed into the mesos-agent and made it unable to create subdirectories in 
> {{/srv/mesos/work/volumes}}. However, Mesos did not return any 
> operation-failed response; instead, we received {{OPERATION_FINISHED}} feedback.
> Steps to recreate the issue:
> 1. Ssh into a mesos-agent.
>  2. Make it impossible to create a persistent volume (we expect the agent to 
> crash and reregister, and the master to realize that the operation is 
> {{OPERATION_DROPPED}}):
>  * cd /srv/mesos/work (if it doesn't exist, mkdir /srv/mesos/work/volumes)
>  * chattr -RV +i volumes (after this, no subdirectories can be created)
> 3. Launch a service with persistent volumes, constrained to use only the 
> mesos-agent modified above.
>  
>  
> Scheduler logs on receiving {{OPERATION_FINISHED}}:
> (Also see screenshot)
>  
> 2019-06-27 21:57:11.879 [12768651|rdar://12768651] 
> [Jarvis-mesos-dispatcher-105] INFO c.a.j.s.ServicePodInstance - Stored 
> operation=4g3k02s1gjb0q_5f912b59-a32d-462c-9c46-8401eba4d2c1 and 
> feedback=OPERATION_FINISHED in podInstanceID=4g3k02s1gjb0q on 
> serviceID=yifan-badagents-1
>  
> * 2019-06-27 21:55:23: task reached state TASK_FAILED for mesos reason: 
> REASON_CONTAINER_LAUNCH_FAILED with mesos message: Failed to launch 
> container: Failed to change the ownership of the persistent volume at 
> '/srv/mesos/work/volumes/roles/test-2/19b564e8-3a90-4f2f-981d-b3dd2a5d9f90' 
> with uid 264 and gid 264: No such file or directory



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
