huangzhir created YUNIKORN-2682: ----------------------------------- Summary: YuniKorn Gang Scheduling Issue: Executors Failing to Start When Running Multiple Applications Key: YUNIKORN-2682 URL: https://issues.apache.org/jira/browse/YUNIKORN-2682 Project: Apache YuniKorn Issue Type: Bug Components: core - scheduler Affects Versions: 1.3.0 Reporter: huangzhir Attachments: image-2024-06-19-00-02-53-178.png, image-2024-06-19-00-03-09-703.png
h3. Description: While using YuniKorn's gang scheduling, we encountered a situation where the scheduling process appears to succeed, but in reality, there is a problem. When submitting two applications simultaneously, only the driver pods are successfully running, and the executor pods fail to start due to insufficient resources. The following error is observed in the scheduler logs: {code:java} 2024-06-18T15:15:27.933Z ERROR cache/placeholder_manager.go:99 failed to create placeholder pod {"error": "pods \"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed quota: spark-quota: must specify limits.cpu,limits.memory"} github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99 github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1 github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542 {code} h2. Environment: * YuniKorn version: 1.3.0 * Kubernetes version: 1.21.3 * Spark version: 3.2.2 h2. *resource-quota.yaml* {code:java} apiVersion: v1 kind: ResourceQuota metadata: name: spark-quota namespace: spark spec: hard: requests.cpu: "5" requests.memory: "5Gi" limits.cpu: "5" limits.memory: "5Gi" {code} h2. yunikorn-configs.yaml apiVersion: v1 kind: ConfigMap metadata: name: yunikorn-configs namespace: yunikorn data: log.level: "-1" log.admission.level: "-1" log.core.config.level: "-1" queues.yaml: | partitions: - name: default placementrules: - name: tag value: namespace create: true queues: - name: root submitacl: '*' properties: application.sort.policy: fifo placeholderTimeoutInSeconds: 60 schedulingStyle: Hard queues: - name: spark properties: application.sort.policy: fifo placeholderTimeoutInSeconds: 60 schedulingStyle: Hard resources: guaranteed: vcore: 5 memory: 5Gi max: vcore: 5 memory: 5Gi h2. Spark-submit command {code:java} ./bin/spark-submit \ --master k8s://https://10.10.10.10:6443 \ --deploy-mode cluster \ --name spark-pi \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=sparksa \ --conf spark.kubernetes.namespace=spark \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=1 \ --conf spark.executor.cores=1 \ --conf spark.executor.memory=1500m \ --conf spark.driver.cores=1 \ --conf spark.driver.memory=1500m \ --conf spark.kubernetes.driver.limit.cores=1 \ --conf spark.kubernetes.driver.limit.memory=2G \ --conf spark.kubernetes.executor.limit.cores=1 \ --conf spark.kubernetes.executor.limit.memory=2G \ --conf spark.kubernetes.driver.label.app=spark \ --conf spark.kubernetes.executor.label.app=spark \ --conf spark.kubernetes.container.image=apache/spark:v3.3.2 \ --conf spark.kubernetes.scheduler.name=yunikorn \ --conf spark.kubernetes.driver.label.queue=root.spark \ --conf spark.kubernetes.executor.label.queue=root.spark \ --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/app-id={{APP_ID}} \ --conf spark.kubernetes.executor.annotation.yunikorn.apache.org/app-id={{APP_ID}} \ --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/task-group-name=spark-driver \ --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/task-groups='[{"name": "spark-driver", "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"},"nodeSelector": {"app": "spark"} }, {"name": "spark-executor", "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"},"nodeSelector": {"app": "spark"} }]' \ --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/schedulingPolicyParameters='placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard' \ --conf spark.kubernetes.executor.annotation.yunikorn.apache.org/task-group-name=spark-executor \ local:///opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar \ 3000 {code} h2. scheduler logs {code:java} 2024-06-18T15:15:26.201Z DEBUG general/general.go:141 pod added {"appType": "general", "Name": "spark-pi-f4f19b902beac663-driver", "Namespace": "spark"} 2024-06-18T15:15:26.201Z DEBUG utils/utils.go:305 Unable to retrieve user name from pod labels. Empty user label {"userLabel": "yunikorn.apache.org/username"} 2024-06-18T15:15:26.201Z DEBUG cache/context.go:737 AddApplication {"Request": {"Metadata":{"ApplicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","QueueName":"root.spark","User":"nobody","Tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\": \"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\": \"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }]"},"Groups":null,"TaskGroups":[{"name":"spark-driver","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}},{"name":"spark-executor","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}}],"OwnerReferences":[{"apiVersion":"v1","kind":"Pod","name":"spark-pi-f4f19b902beac663-driver","uid":"4fb897dd-3af5-4799-a09d-640b5222ba3a","controller":false,"blockOwnerDeletion":true}],"SchedulingPolicyParameters":{},"CreationTime":0}}} 2024-06-18T15:15:26.201Z DEBUG cache/context.go:746 app namespace info {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "namespace": "spark"} 2024-06-18T15:15:26.201Z INFO cache/context.go:773 app added {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"} 2024-06-18T15:15:26.201Z DEBUG cache/context.go:841 AddTask {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID": "4fb897dd-3af5-4799-a09d-640b5222ba3a"} 2024-06-18T15:15:26.201Z DEBUG cache/context.go:233 adding pod to cache {"podName": "spark-pi-f4f19b902beac663-driver"} 2024-06-18T15:15:26.201Z INFO cache/context.go:863 task added {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskState": "New"} 2024-06-18T15:15:26.201Z INFO cache/context.go:873 app request originating pod added {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "original task": "4fb897dd-3af5-4799-a09d-640b5222ba3a"} 2024-06-18T15:15:26.201Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (AddPod.Pre) {"nodes": 3, "pods": 55, "assumed": 0, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": {"Pending":2,"Running":53}} 2024-06-18T15:15:26.201Z DEBUG external/scheduler_cache.go:411 Putting pod in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a"} 2024-06-18T15:15:26.201Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (AddPod.Post) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": {"Pending":3,"Running":53}} 2024-06-18T15:15:26.277Z DEBUG scheduler/scheduler.go:157 inspect outstanding requests 2024-06-18T15:15:26.924Z DEBUG cache/application_state.go:508 shim app state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "New", "destination": "Submitted", "event": "SubmitApplication"} 2024-06-18T15:15:26.924Z INFO cache/application.go:424 handle app submission {"app": "applicationID: spark-8e410a4c5ce44da2aa85ba835257a1e9, queue: root.spark, partition: default, totalNumOfTasks: 1, currentState: Submitted", "clusterID": "mycluster"} 2024-06-18T15:15:26.924Z DEBUG scheduler/scheduler.go:96 enqueued event {"eventType": "*rmevent.RMUpdateApplicationEvent", "event": {"Request":{"new":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","queueName":"root.spark","partitionName":"[mycluster]default","ugi":{"user":"nobody"},"tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\": \"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\": \"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", \"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }]"},"executionTimeoutMilliSeconds":60000,"placeholderAsk":{"resources":{"memory":{"value":4294967296},"pods":{"value":2},"vcore":{"value":2000}}},"gangSchedulingStyle":"Hard"}],"rmID":"mycluster"}}, "currentQueueSize": 0} 2024-06-18T15:15:26.924Z DEBUG placement/placement.go:145 Executing rule for placing application {"ruleName": "tag", "application": "spark-8e410a4c5ce44da2aa85ba835257a1e9"} 2024-06-18T15:15:26.924Z DEBUG placement/tag_rule.go:106 Tag rule intermediate result {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue": "root.spark"} 2024-06-18T15:15:26.924Z INFO placement/tag_rule.go:115 Tag rule application placed {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue": "root.spark"} 2024-06-18T15:15:26.924Z DEBUG placement/placement.go:204 Rule result for placing application {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queueName": "root.spark"} 2024-06-18T15:15:26.925Z INFO scheduler/context.go:549 Added application to partition {"applicationID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "partitionName": "[mycluster]default", "requested queue": "root.spark", "placed queue": "root.spark"} 2024-06-18T15:15:26.925Z DEBUG rmproxy/rmproxy.go:59 enqueue event {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": {"RmID":"mycluster","AcceptedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9"}],"RejectedApplications":[],"UpdatedApplications":null}, "currentQueueSize": 0} 2024-06-18T15:15:26.925Z DEBUG callback/scheduler_callback.go:108 UpdateApplication callback received {"UpdateApplicationResponse": "accepted:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > "} 2024-06-18T15:15:26.925Z DEBUG callback/scheduler_callback.go:114 callback: response to accepted application {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"} 2024-06-18T15:15:26.925Z INFO callback/scheduler_callback.go:118 Accepting app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"} 2024-06-18T15:15:26.925Z DEBUG cache/application_state.go:508 shim app state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "Submitted", "destination": "Accepted", "event": "AcceptApplication"} 2024-06-18T15:15:27.277Z DEBUG scheduler/scheduler.go:157 inspect outstanding requests 2024-06-18T15:15:27.925Z DEBUG cache/application.go:516 postAppAccepted on cached app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "numTaskGroups": 2, "numAllocatedTasks": 0} 2024-06-18T15:15:27.925Z INFO cache/application.go:526 app has taskGroups defined, trying to reserve resources for gang members {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"} 2024-06-18T15:15:27.925Z DEBUG cache/application_state.go:508 shim app state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "Accepted", "destination": "Reserving", "event": "TryReserve"} 2024-06-18T15:15:27.933Z ERROR cache/placeholder_manager.go:99 failed to create placeholder pod {"error": "pods \"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed quota: spark-quota: must specify limits.cpu,limits.memory"} github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99 github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1 github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542 2024-06-18T15:15:27.933Z INFO cache/placeholder_manager.go:115 start to clean up app placeholders {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"} 2024-06-18T15:15:27.933Z INFO cache/placeholder_manager.go:128 finished cleaning up app placeholders {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"} 2024-06-18T15:15:27.933Z DEBUG cache/application_state.go:508 shim app state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "Reserving", "destination": "Running", "event": "RunApplication"} 2024-06-18T15:15:28.278Z DEBUG scheduler/scheduler.go:157 inspect outstanding requests 2024-06-18T15:15:28.867Z DEBUG scheduler/partition_manager.go:83 time consumed for queue cleaner {"duration": "6.41µs"} 2024-06-18T15:15:28.925Z INFO cache/task_state.go:380 Task state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": "spark/spark-pi-f4f19b902beac663-driver", "source": "New", "destination": "Pending", "event": "InitTask"} 2024-06-18T15:15:28.926Z INFO cache/task_state.go:380 Task state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": "spark/spark-pi-f4f19b902beac663-driver", "source": "Pending", "destination": "Scheduling", "event": "SubmitTask"} 2024-06-18T15:15:28.926Z DEBUG cache/task.go:275 scheduling pod {"podName": "spark-pi-f4f19b902beac663-driver"} 2024-06-18T15:15:28.926Z DEBUG cache/task.go:294 send update request {"request": "asks:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\" applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" resourceAsk:<resources:<key:\"memory\" value:<value:1975517184 > > resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\" value:<value:1000 > > > maxAllocations:1 tags:<key:\"kubernetes.io/label/app\" value:\"spark\" > tags:<key:\"kubernetes.io/label/queue\" value:\"root.spark\" > tags:<key:\"kubernetes.io/label/spark-app-name\" value:\"spark-pi\" > tags:<key:\"kubernetes.io/label/spark-app-selector\" value:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > tags:<key:\"kubernetes.io/label/spark-role\" value:\"driver\" > tags:<key:\"kubernetes.io/label/spark-version\" value:\"3.3.2\" > tags:<key:\"kubernetes.io/meta/namespace\" value:\"spark\" > tags:<key:\"kubernetes.io/meta/podName\" value:\"spark-pi-f4f19b902beac663-driver\" > taskGroupName:\"spark-driver\" Originator:true preemptionPolicy:<allowPreemptSelf:true allowPreemptOther:true > > rmID:\"mycluster\" "} 2024-06-18T15:15:28.926Z DEBUG scheduler/scheduler.go:96 enqueued event {"eventType": "*rmevent.RMUpdateAllocationEvent", "event": {"Request":{"asks":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","partitionName":"[mycluster]default","resourceAsk":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"maxAllocations":1,"tags":{"kubernetes.io/label/app":"spark","kubernetes.io/label/queue":"root.spark","kubernetes.io/label/spark-app-name":"spark-pi","kubernetes.io/label/spark-app-selector":"spark-8e410a4c5ce44da2aa85ba835257a1e9","kubernetes.io/label/spark-role":"driver","kubernetes.io/label/spark-version":"3.3.2","kubernetes.io/meta/namespace":"spark","kubernetes.io/meta/podName":"spark-pi-f4f19b902beac663-driver"},"taskGroupName":"spark-driver","Originator":true,"preemptionPolicy":{"allowPreemptSelf":true,"allowPreemptOther":true}}],"rmID":"mycluster"}}, "currentQueueSize": 0} 2024-06-18T15:15:28.926Z INFO objects/application_state.go:133 Application state transition {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "New", "destination": "Accepted", "event": "runApplication"} 2024-06-18T15:15:28.926Z DEBUG rmproxy/rmproxy.go:59 enqueue event {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Accepted","stateTransitionTimestamp":1718723728926391633,"message":"runApplication"}]}, "currentQueueSize": 0} 2024-06-18T15:15:28.926Z INFO objects/application.go:668 ask added successfully to application {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "ask": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "placeholder": false, "pendingDelta": "map[memory:1975517184 pods:1 vcore:1000]"} 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:108 UpdateApplication callback received {"UpdateApplicationResponse": "updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" state:\"Accepted\" stateTransitionTimestamp:1718723728926391633 message:\"runApplication\" > "} 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:137 status update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "new status": "Accepted"} 2024-06-18T15:15:28.926Z DEBUG objects/application.go:339 Application state timer initiated {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "state": "Starting", "timeout": "5m0s"} 2024-06-18T15:15:28.926Z INFO objects/application_state.go:133 Application state transition {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "Accepted", "destination": "Starting", "event": "runApplication"} 2024-06-18T15:15:28.926Z DEBUG rmproxy/rmproxy.go:59 enqueue event {"eventType": "*rmevent.RMApplicationUpdateEvent", "event": {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Starting","stateTransitionTimestamp":1718723728926635454,"message":"runApplication"}]}, "currentQueueSize": 0} 2024-06-18T15:15:28.926Z DEBUG ugm/manager.go:63 Increasing resource usage {"user": "nobody", "queue path": "root.spark", "application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 pods:1 vcore:1000]"} 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue tracker object for queue {"queue": "root"} 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource usage {"queue path": "root.spark", "application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 pods:1 vcore:1000]"} 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue tracker object for queue {"queue": "spark"} 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource usage {"queue path": "spark", "application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 pods:1 vcore:1000]"} 2024-06-18T15:15:28.926Z DEBUG ugm/manager.go:257 Group tracker does not exist. Creating group tracker object and linking the same with application {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue path": "root.spark", "user": "nobody", "group": "nobody"} 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue tracker object for queue {"queue": "root"} 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource usage {"queue path": "root.spark", "application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 pods:1 vcore:1000]"} 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:45 Creating queue tracker object for queue {"queue": "spark"} 2024-06-18T15:15:28.926Z DEBUG ugm/queue_tracker.go:57 Increasing resource usage {"queue path": "spark", "application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 pods:1 vcore:1000]"} 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:108 UpdateApplication callback received {"UpdateApplicationResponse": "updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" state:\"Starting\" stateTransitionTimestamp:1718723728926635454 message:\"runApplication\" > "} 2024-06-18T15:15:28.926Z DEBUG objects/queue.go:1239 allocation found on queue {"queueName": "root.spark", "appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "allocation": "applicationID=spark-8e410a4c5ce44da2aa85ba835257a1e9, uuid=3e42c4e0-0fc3-466c-a524-444b2eec700b, allocationKey=4fb897dd-3af5-4799-a09d-640b5222ba3a, Node=10.10.10.66, result=Allocated"} 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:137 status update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "new status": "Starting"} 2024-06-18T15:15:28.926Z INFO scheduler/partition.go:888 scheduler allocation processed {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "allocationKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "uuid": "3e42c4e0-0fc3-466c-a524-444b2eec700b", "allocatedResource": "map[memory:1975517184 pods:1 vcore:1000]", "placeholder": false, "targetNode": "10.10.10.66"} 2024-06-18T15:15:28.926Z DEBUG rmproxy/rmproxy.go:59 enqueue event {"eventType": "*rmevent.RMNewAllocationsEvent", "event": {"RmID":"mycluster","Allocations":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","UUID":"3e42c4e0-0fc3-466c-a524-444b2eec700b","resourcePerAlloc":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"nodeID":"10.10.10.66","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","taskGroupName":"spark-driver"}]}, "currentQueueSize": 0} 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:48 UpdateAllocation callback received {"UpdateAllocationResponse": "new:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\" UUID:\"3e42c4e0-0fc3-466c-a524-444b2eec700b\" resourcePerAlloc:<resources:<key:\"memory\" value:<value:1975517184 > > resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\" value:<value:1000 > > > nodeID:\"10.10.10.66\" applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" taskGroupName:\"spark-driver\" > "} 2024-06-18T15:15:28.926Z DEBUG callback/scheduler_callback.go:53 callback: response to new allocation {"allocationKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "UUID": "3e42c4e0-0fc3-466c-a524-444b2eec700b", "applicationID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "nodeID": "10.10.10.66"} 2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (AssumePod.Pre) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": {"Pending":3,"Running":53}} 2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:476 Adding assumed pod to cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "node": "10.10.10.66", "allBound": true} 2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:411 Putting pod in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a"} 2024-06-18T15:15:28.926Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (AssumePod.Post) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}} 2024-06-18T15:15:28.926Z DEBUG scheduler/context.go:853 Successfully synced shim on new allocation. response: no. of allocations: 1 2024-06-18T15:15:28.926Z INFO cache/task_state.go:380 Task state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": "spark/spark-pi-f4f19b902beac663-driver", "source": "Scheduling", "destination": "Allocated", "event": "TaskAllocated"} 2024-06-18T15:15:28.926Z DEBUG cache/task.go:349 bind pod volumes {"podName": "spark-pi-f4f19b902beac663-driver", "podUID": "4fb897dd-3af5-4799-a09d-640b5222ba3a"} 2024-06-18T15:15:28.926Z INFO cache/context.go:499 Binding Pod Volumes skipped: all volumes already bound {"podName": "spark-pi-f4f19b902beac663-driver"} 2024-06-18T15:15:28.926Z DEBUG cache/task.go:362 bind pod {"podName": "spark-pi-f4f19b902beac663-driver", "podUID": "4fb897dd-3af5-4799-a09d-640b5222ba3a"} 2024-06-18T15:15:28.926Z INFO client/kubeclient.go:112 bind pod to node {"podName": "spark-pi-f4f19b902beac663-driver", "podUID": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "nodeID": "10.10.10.66"} 2024-06-18T15:15:28.933Z INFO cache/task.go:375 successfully bound pod {"podName": "spark-pi-f4f19b902beac663-driver"} 2024-06-18T15:15:28.934Z INFO cache/task_state.go:380 Task state transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": "spark/spark-pi-f4f19b902beac663-driver", "source": "Allocated", "destination": "Bound", "event": "TaskBound"} 2024-06-18T15:15:28.934Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}} 2024-06-18T15:15:28.934Z DEBUG external/scheduler_cache.go:411 Putting pod in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a"} 2024-06-18T15:15:28.934Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}} 2024-06-18T15:15:28.946Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}} 2024-06-18T15:15:28.946Z DEBUG external/scheduler_cache.go:411 Putting pod in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a"} 2024-06-18T15:15:28.946Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}} 2024-06-18T15:15:29.278Z DEBUG scheduler/scheduler.go:157 inspect outstanding requests 2024-06-18T15:15:29.906Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}} 2024-06-18T15:15:29.906Z DEBUG external/scheduler_cache.go:411 Putting pod in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a"} 2024-06-18T15:15:29.906Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}} 2024-06-18T15:15:30.113Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":3,"Running":53}} 2024-06-18T15:15:30.113Z DEBUG external/scheduler_cache.go:411 Putting pod in cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a"} 2024-06-18T15:15:30.113Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdatePod.Post) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":2,"Running":54}} 2024-06-18T15:15:30.279Z DEBUG scheduler/scheduler.go:157 inspect outstanding requests 2024-06-18T15:15:31.279Z DEBUG scheduler/scheduler.go:157 inspect outstanding requests 2024-06-18T15:15:32.280Z DEBUG scheduler/scheduler.go:157 inspect outstanding requests 2024-06-18T15:15:32.395Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdateNode.Pre) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":2,"Running":54}} 2024-06-18T15:15:32.395Z DEBUG external/scheduler_cache.go:179 Updating node in cache {"nodeName": "10.10.10.66"} 2024-06-18T15:15:32.395Z DEBUG external/scheduler_cache.go:558 Scheduler cache state (UpdateNode.Post) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": {"Pending":2,"Running":54}} 2024-06-18T15:15:32.395Z DEBUG cache/node.go:109 set node capacity {"nodeID": "10.10.10.66", "capacity": "resources:<key:\"ephemeral-storage\" value:<value:478652923105 > > resources:<key:\"hugepages-1Gi\" value:<> > resources:<key:\"hugepages-2Mi\" value:<> > resources:<key:\"kubernetes.io/batch-cpu\" value:<value:5167 > > resources:<key:\"kubernetes.io/batch-memory\" value:<value:34437364710 > > resources:<key:\"kubernetes.io/mid-cpu\" value:<value:3156 > > resources:<key:\"kubernetes.io/mid-memory\" value:<value:1442139746 > > resources:<key:\"memory\" value:<value:64713940992 > > resources:<key:\"pods\" value:<value:110 > > resources:<key:\"vcore\" value:<value:14000 > > "} 2024-06-18T15:15:32.396Z INFO cache/nodes.go:179 Node's ready status flag {"Node name": "10.10.10.66", "ready": true} 2024-06-18T15:15:32.396Z INFO cache/nodes.go:184 report updated nodes to scheduler {"request": {"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}} 2024-06-18T15:15:32.396Z DEBUG scheduler/scheduler.go:96 enqueued event {"eventType": "*rmevent.RMUpdateNodeEvent", "event": {"Request":{"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true","si/node-partition":"[mycluster]default"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}}, "currentQueueSize": 0} 2024-06-18T15:15:32.396Z INFO objects/queue.go:1190 updating root queue max resources {"current max": "map[ephemeral-storage:1435958769315 hugepages-1Gi:0 hugepages-2Mi:0 kubernetes.io/batch-cpu:19610 kubernetes.io/batch-memory:109317941106 kubernetes.io/mid-cpu:9574 kubernetes.io/mid-memory:5209251314 memory:194141839360 pods:330 vcore:42000]", "new max": "map[ephemeral-storage:1435958769315 hugepages-1Gi:0 hugepages-2Mi:0 kubernetes.io/batch-cpu:19309 kubernetes.io/batch-memory:109530403859 kubernetes.io/mid-cpu:10574 kubernetes.io/mid-memory:6651391060 memory:194141839360 pods:330 vcore:42000]"} 2024-06-18T15:15:32.396Z DEBUG objects/node.go:182 skip updating occupiedResource, not changed {code} h2. Spark Pod status kubectl get pod -n spark|grep Running !image-2024-06-19-00-02-53-178.png! kubectl describe pod spark-pi-xxxxxxxx-driver -n spark !image-2024-06-19-00-03-09-703.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org For additional commands, e-mail: dev-h...@yunikorn.apache.org