[ https://issues.apache.org/jira/browse/YUNIKORN-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791474#comment-17791474 ]
Qi Zhu commented on YUNIKORN-1706:
----------------------------------

Agreed, [~wilfreds]. I will try to reproduce this after the rework by Craig.

> We should clean up failed apps in shim side
> -------------------------------------------
>
>                 Key: YUNIKORN-1706
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1706
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Wei Huang
>            Assignee: Qi Zhu
>            Priority: Major
>              Labels: pull-request-available
>
> I'm running a local dev env (*make run_plugin*) based on 1.2.0, with no
> admission controller configured. Additionally, I configured a configmap in
> the default namespace:
> {code:bash}
> apiVersion: v1
> data:
>   queues.yaml: |
>     partitions:
>     - name: default
>       nodesortpolicy:
>         type: binpacking
>       queues:
>       - name: root
>         submitacl: '*'
>         queues:
>         - name: app1
>           submitacl: '*'
>           properties:
>             application.sort.policy: fifo
>           resources:
>             max:
>               {memory: 200G, vcore: 1}
> kind: ConfigMap
> metadata:
>   name: yunikorn-configs
> {code}
> Then I created a Pod with the following config:
> {code:bash}
> kind: Pod
> apiVersion: v1
> metadata:
>   name: pod-1
>   labels:
>     applicationId: "app1"
> spec:
>   schedulerName: yunikorn
>   containers:
>   - name: pause
>     image: registry.k8s.io/pause:3.6
>     resources:
>       requests:
>         cpu: 1
> {code}
> The pod cannot be scheduled and ends up with the status
> {*}ApplicationRejected{*}. I observed the following log in the shim:
> {code:bash}
> 2023-04-21T16:34:42.354-0700 INFO cache/context.go:741 app added {"appID": "app1"}
> 2023-04-21T16:34:42.354-0700 INFO cache/context.go:831 task added {"appID": "app1", "taskID": "d643a5ad-c93b-4d99-8eac-9418fbac18b0", "taskState": "New"}
> 2023-04-21T16:34:42.355-0700 INFO cache/context.go:841 app request originating pod added {"appID": "app1", "original task": "d643a5ad-c93b-4d99-8eac-9418fbac18b0"}
> I0421 16:34:42.355111 46423 factory.go:344] "Unable to schedule pod; no fit; waiting" pod="default/pod-1" err="0/1 nodes are available: 1 Pod is not ready for scheduling."
> 2023-04-21T16:34:42.689-0700 INFO cache/application.go:413 handle app submission {"app": "applicationID: app1, queue: root.sandbox, partition: default, totalNumOfTasks: 1, currentState: Submitted", "clusterID": "mycluster"}
> 2023-04-21T16:34:42.692-0700 INFO objects/application_state.go:132 Application state transition {"appID": "app1", "source": "New", "destination": "Rejected", "event": "rejectApplication"}
> 2023-04-21T16:34:42.692-0700 ERROR scheduler/context.go:540 Failed to add application to partition (placement rejected) {"applicationID": "app1", "partitionName": "[mycluster]default", "error": "application 'app1' rejected, cannot create queue 'root.sandbox' without placement rules"}
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateApplicationEvent
>     /Users/weih/go/src/github.pie.apple.com/apache/yunikorn-k8shim/vendor/github.com/apache/yunikorn-core/pkg/scheduler/context.go:540
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent
>     /Users/weih/go/src/github.pie.apple.com/apache/yunikorn-k8shim/vendor/github.com/apache/yunikorn-core/pkg/scheduler/scheduler.go:113
> 2023-04-21T16:34:42.693-0700 INFO cache/application.go:565 app is rejected by scheduler {"appID": "app1"}
> 2023-04-21T16:34:42.693-0700 INFO cache/application.go:598 failApplication reason {"applicationID": "app1", "errMsg": "ApplicationRejected: application 'app1' rejected, cannot create queue 'root.sandbox' without placement rules"}
> 2023-04-21T16:34:42.694-0700 INFO cache/application.go:585 setting pod to failed {"podName": "pod-1"}
> 2023-04-21T16:34:42.712-0700 INFO general/general.go:179 task completes {"appType": "general", "namespace": "default", "podName": "pod-1", "podUID": "d643a5ad-c93b-4d99-8eac-9418fbac18b0", "podStatus": "Failed"}
> 2023-04-21T16:34:42.714-0700 INFO client/kubeclient.go:246 Successfully updated pod status {"namespace": "default", "podName": "pod-1", "newStatus": "&PodStatus{Phase:Failed,Conditions:[]PodCondition{},Message: application 'app1' rejected, cannot create queue 'root.sandbox' without placement rules,Reason:ApplicationRejected,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[]ContainerStatus{},QOSClass:,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},}"}
> 2023-04-21T16:34:42.714-0700 INFO cache/application.go:590 new pod status {"status": "Failed"}
> 2023-04-21T16:34:42.714-0700 INFO cache/task.go:543 releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 0}
> 2023-04-21T16:34:42.714-0700 INFO cache/placeholder_manager.go:115 start to clean up app placeholders {"appID": "app1"}
> 2023-04-21T16:34:42.714-0700 INFO cache/placeholder_manager.go:128 finished cleaning up app placeholders {"appID": "app1"}
> 2023-04-21T16:34:42.714-0700 INFO scheduler/partition.go:1343 Invalid ask release requested by shim {"appID": "app1", "ask": "d643a5ad-c93b-4d99-8eac-9418fbac18b0", "terminationType": "UNKNOWN_TERMINATION_TYPE"}
> 2023-04-21T16:34:42.714-0700 INFO cache/task_state.go:372 object transition {"object": {}, "source": "New", "destination": "Completed", "event": "CompleteTask"}
> {code}
> Then I deleted the pod, and the log showed:
> {code:bash}
> 2023-04-21T16:35:09.598-0700 INFO general/general.go:213 delete pod {"appType": "general", "namespace": "default", "podName": "pod-1", "podUID": "d643a5ad-c93b-4d99-8eac-9418fbac18b0"}
> 2023-04-21T16:35:09.598-0700 WARN cache/task.go:528 task allocation UUID is empty, sending this release request to yunikorn-core could cause all allocations of this app get released. skip this request, this may cause some resource leak. check the logs for more info! {"applicationID": "app1", "taskID": "d643a5ad-c93b-4d99-8eac-9418fbac18b0", "taskAlias": "default/pod-1", "allocationUUID": "", "task": "Completed"}
> {code}
> I then recreated the same pod, only adding the queue label:
> {code:bash}
> queue: root.app1
> {code}
> The pod is still unschedulable and stays in that status indefinitely. The
> only way to make it schedulable is to restart the shim.
> Is this a bug?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org