[ https://issues.apache.org/jira/browse/MESOS-8756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gavin updated MESOS-8756:
-------------------------
    Comment: was deleted

(was: www.rtat.net)

> Missing reasons for early task failures
> ---------------------------------------
>
>                 Key: MESOS-8756
>                 URL: https://issues.apache.org/jira/browse/MESOS-8756
>             Project: Mesos
>          Issue Type: Bug
>          Components: executor, master, scheduler api
>    Affects Versions: 1.6.0
>            Reporter: A. Dukhovniy
>            Priority: Major
>              Labels: integration, observability
>
> Some early task failures are not propagated to the framework. Here is an
> example of a Marathon pod (Mesos containerizer) definition with *a
> non-existent image*:
> {code:java}
> {
>   "id": "/fail",
>   "containers": [
>     {
>       "name": "container-1",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 128
>       },
>       "image": {
>         "id": "non-existing-image-56789",
>         "kind": "DOCKER"
>       }
>     }
>   ],
>   "scaling": {
>     "instances": 1,
>     "kind": "fixed"
>   },
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "volumes": [],
>   "fetch": [],
>   "scheduling": {
>     "placement": {
>       "constraints": []
>     }
>   }
> }
> {code}
> Here, the status update the framework receives is {{TASK_FAILED (Executor
> terminated)}}.
> Here is another example, where *a non-existent artifact* is being fetched:
> {code:java}
> {
>   "id": "/fail2",
>   "containers": [
>     {
>       "name": "container-1",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 128
>       },
>       "image": {
>         "id": "nginx",
>         "kind": "DOCKER",
>         "forcePull": false
>       },
>       "artifacts": [
>         {
>           "uri": "http://example.com/smth-non-existing-12345.tar.gz"
>         }
>       ]
>     }
>   ],
>   "scaling": {
>     "instances": 1,
>     "kind": "fixed"
>   },
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "volumes": [],
>   "fetch": [],
>   "scheduling": {
>     "placement": {
>       "constraints": []
>     }
>   }
> }
> {code}
> which results in the same status update as above.
> This is not an exhaustive list of such cases; I'm sure there are more
> failures along the fork-chain that are not properly propagated.
> Frameworks (and their users) should always receive meaningful task failure
> reasons, no matter where those failures happened. Otherwise, the only way to
> find out what happened is to grep agent logs.
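>
> For illustration only (not part of the original report): below is a minimal
> sketch of how a framework built on the classic Java bindings
> ({{org.apache.mesos}}) might log whatever diagnostics arrive with a terminal
> status update; the class and method names are hypothetical. For the failures
> above, all the framework can print here is the generic "Executor terminated"
> detail, with nothing about the missing image or artifact, which is exactly
> the gap this ticket describes.
> {code:java}
> import org.apache.mesos.Protos.TaskStatus;
>
> public final class StatusUpdateLogger {
>
>   // Intended to be called from Scheduler#statusUpdate(SchedulerDriver, TaskStatus).
>   // Prints the reason/message fields of terminal failures, i.e. all the
>   // diagnostic detail the framework currently gets from Mesos.
>   public static void logTerminalFailure(TaskStatus status) {
>     switch (status.getState()) {
>       case TASK_FAILED:
>       case TASK_ERROR:
>       case TASK_LOST:
>         System.err.printf(
>             "task=%s state=%s reason=%s message='%s'%n",
>             status.getTaskId().getValue(),
>             status.getState(),
>             status.hasReason() ? status.getReason() : "<unset>",
>             status.hasMessage() ? status.getMessage() : "");
>         break;
>       default:
>         // Non-terminal updates carry no failure diagnostics.
>         break;
>     }
>   }
> }
> {code}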



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
