[
https://issues.apache.org/jira/browse/MESOS-8756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gavin updated MESOS-8756:
-------------------------
Comment: was deleted
(was: www.rtat.net)
> Missing reasons for early task failures
> ---------------------------------------
>
> Key: MESOS-8756
> URL: https://issues.apache.org/jira/browse/MESOS-8756
> Project: Mesos
> Issue Type: Bug
> Components: executor, master, scheduler api
> Affects Versions: 1.6.0
> Reporter: A. Dukhovniy
> Priority: Major
> Labels: integration, observability
>
> Some early task failures are not propagated to the framework. Here is an
> example of a Marathon pod (Mesos containerizer) definition with *a
> non-existent image*:
> {code:java}
> {
>   "id": "/fail",
>   "containers": [
>     {
>       "name": "container-1",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 128
>       },
>       "image": {
>         "id": "non-existing-image-56789",
>         "kind": "DOCKER"
>       }
>     }
>   ],
>   "scaling": {
>     "instances": 1,
>     "kind": "fixed"
>   },
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "volumes": [],
>   "fetch": [],
>   "scheduling": {
>     "placement": {
>       "constraints": []
>     }
>   }
> }
> {code}
> Here, the status update the framework receives is {{TASK_FAILED (Executor
> terminated)}}.
> Here is another example, where *a non-existent artifact* is fetched:
> {code:java}
> {
>   "id": "/fail2",
>   "containers": [
>     {
>       "name": "container-1",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 128
>       },
>       "image": {
>         "id": "nginx",
>         "kind": "DOCKER",
>         "forcePull": false
>       },
>       "artifacts": [
>         {
>           "uri": "http://example.com/smth-non-existing-12345.tar.gz"
>         }
>       ]
>     }
>   ],
>   "scaling": {
>     "instances": 1,
>     "kind": "fixed"
>   },
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "volumes": [],
>   "fetch": [],
>   "scheduling": {
>     "placement": {
>       "constraints": []
>     }
>   }
> }
> {code}
> which results in the same status update as above.
> This is not an exhaustive list of such cases; there are likely more
> failures along the fork/exec chain that are not properly propagated.
> Frameworks (and their users) should always receive meaningful task failure
> reasons, no matter where those failures happen. Otherwise, the only way to
> find out what happened is to grep the agent logs.
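To illustrate the reporter's point, here is a minimal sketch of what a framework has to work with today. The field names ({{state}}, {{reason}}, {{message}}) mirror the Mesos {{TaskStatus}} protobuf, but the dict literals below are mock data for illustration, not output captured from a real cluster, and the "desirable" update at the end is hypothetical:

```python
# Sketch: summarizing a TaskStatus-like update as a framework might.
# The dict is a mock mimicking the shape of the Mesos TaskStatus protobuf
# (state, reason, message); it is not real scheduler API output.

def describe_failure(status: dict) -> str:
    """Build a human-readable failure summary from a TaskStatus-like dict."""
    parts = [status.get("state", "TASK_UNKNOWN")]
    if status.get("reason"):
        parts.append(status["reason"])
    if status.get("message"):
        parts.append(status["message"])
    return " / ".join(parts)

# Both repro cases above surface only this generic update, whether the
# image was missing or the artifact fetch failed:
generic_update = {
    "state": "TASK_FAILED",
    "reason": "REASON_EXECUTOR_TERMINATED",
    "message": "Executor terminated",
}

print(describe_failure(generic_update))

# A more useful update would carry the underlying cause instead, e.g.
# (hypothetical example, not an actual Mesos message):
#   {"state": "TASK_FAILED",
#    "reason": "REASON_CONTAINER_LAUNCH_FAILED",
#    "message": "Failed to fetch artifact: ..."}
```

With the generic update, the two distinct root causes are indistinguishable from the framework's side, which is exactly why users end up grepping agent logs.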
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)