Ayub Pathan created YUNIKORN-575: ------------------------------------ Summary: Regression: Post restart, Yunikorn tries to recover completed apps and schedules placeholder pods. Key: YUNIKORN-575 URL: https://issues.apache.org/jira/browse/YUNIKORN-575 Project: Apache YuniKorn Issue Type: Bug Components: shim - kubernetes Reporter: Ayub Pathan Attachments: Screen Shot 2021-03-15 at 9.27.10 PM.png, yk_recover.log
* Post restart, YK tries to recover the completed apps and schedules placeholder pods(even though the real pods are in completed state), which may not be needed. This leads to resource mismanagement. {noformat} gang-app-timeout-1006-5jqqk 0/1 Completed 0 69m gang-app-timeout-1007-tw44t 0/1 Completed 0 66m gang-app-timeout-1008-dmzc4 0/1 Completed 0 64m gang-app-timeout-1008-dwxgq 0/1 Completed 0 64m gang-app-timeout-1008-sl2x9 0/1 Completed 0 64m tg-timeout-1006-gang-app-timeout-1006-0 1/1 Running 0 60s tg-timeout-1006-gang-app-timeout-1006-1 1/1 Running 0 60s tg-timeout-1006-gang-app-timeout-1006-2 1/1 Running 0 60s tg-timeout-1007-gang-app-timeout-1007-0 1/1 Running 0 60s tg-timeout-1007-gang-app-timeout-1007-1 1/1 Running 0 60s tg-timeout-1007-gang-app-timeout-1007-2 0/1 Pending 0 60s tg-timeout-1008-gang-app-timeout-1008-0 1/1 Running 0 60s tg-timeout-1008-gang-app-timeout-1008-1 1/1 Running 0 60s tg-timeout-1008-gang-app-timeout-1008-2 1/1 Running 0 60s {noformat} * *All the completed apps are marked as failed, post restart and the allocations are not released. This could be a resource leak post restart.* {noformat} [ { "allocations": null, "applicationID": "gang-app-timeout-1009", "applicationState": "Accepted", "partition": "[mycluster]default", "queueName": "root.fifo", "submissionTime": 1615868052062417676, "usedResource": "[]" }, { "allocations": null, "applicationID": "gang-app-timeout-1011", "applicationState": "Accepted", "partition": "[mycluster]default", "queueName": "root.fifo", "submissionTime": 1615868052062788287, "usedResource": "[]" }, { "allocations": null, "applicationID": "gang-app-timeout-1010", "applicationState": "Accepted", "partition": "[mycluster]default", "queueName": "root.fifo", "submissionTime": 1615868052057156621, "usedResource": "[]" }, { "allocations": null, "applicationID": "gang-app-timeout-1003", "applicationState": "Accepted", "partition": "[mycluster]default", "queueName": "root.fifo", "submissionTime": 1615868052062023562, "usedResource": "[]" }, { "allocations": [ { "allocationKey": "0a761a05-4b00-4e34-a54d-22411007553a", "allocationTags": { "kubernetes.io/label/applicationId": "gang-app-timeout-1008", "kubernetes.io/label/queue": "fifo", "kubernetes.io/meta/namespace": "fifo", "kubernetes.io/meta/podName": "tg-timeout-1008-gang-app-timeout-1008-0" }, "applicationId": "gang-app-timeout-1008", "nodeId": "ip-10-192-131-213.ca-central-1.compute.internal", "partition": "default", "priority": "0", "queueName": "root.fifo", "resource": "[memory:300 vcore:300]", "uuid": "9704811c-422d-4efa-bb42-ab565fb5f16b" }, { "allocationKey": "2505258b-3358-4143-b2a2-9084ffa0977b", "allocationTags": { "kubernetes.io/label/applicationId": "gang-app-timeout-1008", "kubernetes.io/label/queue": "fifo", "kubernetes.io/meta/namespace": "fifo", "kubernetes.io/meta/podName": "tg-timeout-1008-gang-app-timeout-1008-1" }, "applicationId": "gang-app-timeout-1008", "nodeId": "ip-10-192-131-213.ca-central-1.compute.internal", "partition": "default", "priority": "0", "queueName": "root.fifo", "resource": "[memory:300 vcore:300]", "uuid": "e0ff467d-ec18-4d5b-b981-861835f1604a" }, { "allocationKey": "29dbfaec-7632-4bff-b4ea-e313521497f1", "allocationTags": { "kubernetes.io/label/applicationId": "gang-app-timeout-1008", "kubernetes.io/label/queue": "fifo", "kubernetes.io/meta/namespace": "fifo", "kubernetes.io/meta/podName": "tg-timeout-1008-gang-app-timeout-1008-2" }, "applicationId": "gang-app-timeout-1008", "nodeId": "ip-10-192-142-84.ca-central-1.compute.internal", "partition": "default", "priority": "0", "queueName": "root.fifo", "resource": "[memory:300 vcore:300]", "uuid": "6723d3ac-c7c8-4935-bb23-3b443909a252" } ], "applicationID": "gang-app-timeout-1008", "applicationState": "Failed", "partition": "[mycluster]default", "queueName": "root.fifo", "submissionTime": 1615868050004448061, "usedResource": "[]" }, { "allocations": [ { "allocationKey": "05d87d17-a6dc-4bc0-b495-c76f1cd0a3cb", "allocationTags": { "kubernetes.io/label/applicationId": "gang-app-timeout-1007", "kubernetes.io/label/queue": "fifo", "kubernetes.io/meta/namespace": "fifo", "kubernetes.io/meta/podName": "tg-timeout-1007-gang-app-timeout-1007-0" }, "applicationId": "gang-app-timeout-1007", "nodeId": "ip-10-192-131-213.ca-central-1.compute.internal", "partition": "default", "priority": "0", "queueName": "root.fifo", "resource": "[memory:300 vcore:300]", "uuid": "67401008-61b0-4957-8361-6d0e8917c21f" }, { "allocationKey": "1af95692-0186-44fe-b712-30edb51b85c2", "allocationTags": { "kubernetes.io/label/applicationId": "gang-app-timeout-1007", "kubernetes.io/label/queue": "fifo", "kubernetes.io/meta/namespace": "fifo", "kubernetes.io/meta/podName": "tg-timeout-1007-gang-app-timeout-1007-1" }, "applicationId": "gang-app-timeout-1007", "nodeId": "ip-10-192-142-84.ca-central-1.compute.internal", "partition": "default", "priority": "0", "queueName": "root.fifo", "resource": "[memory:300 vcore:300]", "uuid": "5d1f129e-3e40-4103-b2e6-53daf408465f" } ], "applicationID": "gang-app-timeout-1007", "applicationState": "Failed", "partition": "[mycluster]default", "queueName": "root.fifo", "submissionTime": 1615868050004840460, "usedResource": "[]" }, { "allocations": [ { "allocationKey": "8524d2ab-a591-4fca-8a5f-3847e8d173ab", "allocationTags": { "kubernetes.io/label/applicationId": "gang-app-timeout-1006", "kubernetes.io/label/queue": "fifo", "kubernetes.io/meta/namespace": "fifo", "kubernetes.io/meta/podName": "tg-timeout-1006-gang-app-timeout-1006-1" }, "applicationId": "gang-app-timeout-1006", "nodeId": "ip-10-192-142-84.ca-central-1.compute.internal", "partition": "default", "priority": "0", "queueName": "root.fifo", "resource": "[memory:300 vcore:300]", "uuid": "909735f0-607b-4799-bf4c-8b45f59c174b" }, { "allocationKey": "b33078a1-aac6-4217-afd5-3c80248782dd", "allocationTags": { "kubernetes.io/label/applicationId": "gang-app-timeout-1006", "kubernetes.io/label/queue": "fifo", "kubernetes.io/meta/namespace": "fifo", "kubernetes.io/meta/podName": "tg-timeout-1006-gang-app-timeout-1006-2" }, "applicationId": "gang-app-timeout-1006", "nodeId": "ip-10-192-131-213.ca-central-1.compute.internal", "partition": "default", "priority": "0", "queueName": "root.fifo", "resource": "[memory:300 vcore:300]", "uuid": "80f04647-ada2-4851-9361-d6bcb5c18c65" }, { "allocationKey": "e7aa1b09-fac8-43bf-aae9-48215086ae36", "allocationTags": { "kubernetes.io/label/applicationId": "gang-app-timeout-1006", "kubernetes.io/label/queue": "fifo", "kubernetes.io/meta/namespace": "fifo", "kubernetes.io/meta/podName": "tg-timeout-1006-gang-app-timeout-1006-0" }, "applicationId": "gang-app-timeout-1006", "nodeId": "ip-10-192-142-84.ca-central-1.compute.internal", "partition": "default", "priority": "0", "queueName": "root.fifo", "resource": "[memory:300 vcore:300]", "uuid": "f6172318-7e4a-4252-8bf5-8346de4a4d48" } ], "applicationID": "gang-app-timeout-1006", "applicationState": "Failed", "partition": "[mycluster]default", "queueName": "root.fifo", "submissionTime": 1615868050003595376, "usedResource": "[]" } ] {noformat} YK UI snapshot showing apps marked as failed. !image-2021-03-15-21-37-56-129.png|thumbnail! Attached log. [^yk_recover.log] -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org For additional commands, e-mail: issues-h...@yunikorn.apache.org