huangzhir created YUNIKORN-2682:
-----------------------------------

             Summary: YuniKorn Gang Scheduling Issue: Executors Failing to 
Start When Running Multiple Applications
                 Key: YUNIKORN-2682
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2682
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
    Affects Versions: 1.3.0
            Reporter: huangzhir
         Attachments: image-2024-06-19-00-02-53-178.png, 
image-2024-06-19-00-03-09-703.png

h3. Description:

While using YuniKorn's gang scheduling, we encountered a situation where scheduling appears to succeed but does not actually work. When two applications are submitted at the same time, only the driver pods end up running; the executor pods never start because there are not enough resources for them. In addition, the scheduler fails to create the gang placeholder pods: the spark namespace ResourceQuota requires limits.cpu and limits.memory, and the placeholder pods do not set them. The following error is observed in the scheduler logs:
{code:java}
2024-06-18T15:15:27.933Z ERROR cache/placeholder_manager.go:99 failed to create 
placeholder pod {"error": "pods 
\"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed quota: 
spark-quota: must specify limits.cpu,limits.memory"}
github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
  github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
  github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542 {code}
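The rejection comes from the Kubernetes API server: because spark-quota (below) sets limits.cpu and limits.memory, every pod created in the spark namespace must declare CPU and memory limits, and the placeholder pods created by the shim apparently do not. For reference only, a namespace LimitRange like the sketch below (name and values are ours, not part of the original setup) would inject default limits into pods that omit them, which is how we would expect the quota check to pass for the placeholders:
{code:yaml}
# Hypothetical LimitRange (illustration only, not part of the reproduction):
# containers in the "spark" namespace that omit resources get these defaults,
# so the ResourceQuota's limits.cpu / limits.memory requirement can be met.
apiVersion: v1
kind: LimitRange
metadata:
  name: spark-default-limits
  namespace: spark
spec:
  limits:
    - type: Container
      default:           # applied as limits when a container does not set them
        cpu: "1"
        memory: 2Gi
      defaultRequest:    # applied as requests when a container does not set them
        cpu: "1"
        memory: 2Gi
{code}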
h2. Environment:
 * YuniKorn version: 1.3.0
 * Kubernetes version: 1.21.3
 * Spark version: 3.2.2

 
h2. resource-quota.yaml
{code:yaml}
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-quota
  namespace: spark
spec:
  hard:
    requests.cpu: "5"
    requests.memory: "5Gi"
    limits.cpu: "5"
    limits.memory: "5Gi" {code}
h2. yunikorn-configs.yaml
{code:yaml}
apiVersion: v1
kind: ConfigMap
metadata:
  name: yunikorn-configs
  namespace: yunikorn
data:
  log.level: "-1"
  log.admission.level: "-1"
  log.core.config.level: "-1"
  queues.yaml: |
    partitions:
      - name: default
        placementrules:
          - name: tag
            value: namespace
            create: true
        queues:
          - name: root
            submitacl: '*'
            properties:
              application.sort.policy: fifo
              placeholderTimeoutInSeconds: 60
              schedulingStyle: Hard
            queues:
              - name: spark
                properties:
                  application.sort.policy: fifo
                  placeholderTimeoutInSeconds: 60
                  schedulingStyle: Hard
                resources:
                  guaranteed:
                    vcore: 5
                    memory: 5Gi
                  max:
                    vcore: 5
                    memory: 5Gi
{code}
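The root.spark queue caps (max 5 vcore / 5Gi) mirror the namespace ResourceQuota. If it helps when reproducing, the effective queue configuration can be read back from the scheduler's REST API; the command below assumes the default REST port 9080 and a port-forward to the yunikorn-service, so adjust for your deployment:
{code:bash}
# Assumes: kubectl -n yunikorn port-forward svc/yunikorn-service 9080:9080
curl -s http://localhost:9080/ws/v1/partition/default/queues
{code}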
h2. Spark-submit command
{code:bash}
./bin/spark-submit \
  --master k8s://https://10.10.10.10:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=sparksa \
  --conf spark.kubernetes.namespace=spark \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=1 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=1500m \
  --conf spark.driver.cores=1 \
  --conf spark.driver.memory=1500m \
  --conf spark.kubernetes.driver.limit.cores=1 \
  --conf spark.kubernetes.driver.limit.memory=2G \
  --conf spark.kubernetes.executor.limit.cores=1 \
  --conf spark.kubernetes.executor.limit.memory=2G \
  --conf spark.kubernetes.driver.label.app=spark \
  --conf spark.kubernetes.executor.label.app=spark \
  --conf spark.kubernetes.container.image=apache/spark:v3.3.2 \
  --conf spark.kubernetes.scheduler.name=yunikorn \
  --conf spark.kubernetes.driver.label.queue=root.spark \
  --conf spark.kubernetes.executor.label.queue=root.spark \
  --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
  --conf spark.kubernetes.executor.annotation.yunikorn.apache.org/app-id={{APP_ID}} \
  --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/task-group-name=spark-driver \
  --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/task-groups='[{"name": "spark-driver", "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"},"nodeSelector": {"app": "spark"} }, {"name": "spark-executor", "minMember": 1, "minResource": {"cpu": "1", "memory": "2Gi"},"nodeSelector": {"app": "spark"} }]' \
  --conf spark.kubernetes.driver.annotation.yunikorn.apache.org/schedulingPolicyParameters='placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard' \
  --conf spark.kubernetes.executor.annotation.yunikorn.apache.org/task-group-name=spark-executor \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar \
  3000
{code}
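For reference, each submission's gang ask derived from the task-groups annotation above (and matching the placeholderAsk logged below: vcore=2000, memory=4294967296) works out as follows; the comparison against the queue/quota caps is our own arithmetic, not scheduler output:
{code}
per application (placeholders):
  spark-driver   : 1 x (1 vcore, 2Gi) = 1 vcore, 2Gi
  spark-executor : 1 x (1 vcore, 2Gi) = 1 vcore, 2Gi
  gang ask       : 2 vcore, 4Gi

two applications submitted together:
  2 x (2 vcore, 4Gi) = 4 vcore, 8Gi
  8Gi memory > 5Gi (root.spark max and spark-quota hard limit)
{code}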
 
h2. Scheduler logs
{code:java}
2024-06-18T15:15:26.201Z  DEBUG general/general.go:141  pod added {"appType": 
"general", "Name": "spark-pi-f4f19b902beac663-driver", "Namespace": "spark"}
2024-06-18T15:15:26.201Z  DEBUG utils/utils.go:305  Unable to retrieve user 
name from pod labels. Empty user label  {"userLabel": 
"yunikorn.apache.org/username"}
2024-06-18T15:15:26.201Z  DEBUG cache/context.go:737  AddApplication  
{"Request": 
{"Metadata":{"ApplicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","QueueName":"root.spark","User":"nobody","Tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60
 gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\": 
\"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", 
\"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\": 
\"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", 
\"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} 
}]"},"Groups":null,"TaskGroups":[{"name":"spark-driver","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}},{"name":"spark-executor","minMember":1,"minResource":{"cpu":"1","memory":"2Gi"},"nodeSelector":{"app":"spark"}}],"OwnerReferences":[{"apiVersion":"v1","kind":"Pod","name":"spark-pi-f4f19b902beac663-driver","uid":"4fb897dd-3af5-4799-a09d-640b5222ba3a","controller":false,"blockOwnerDeletion":true}],"SchedulingPolicyParameters":{},"CreationTime":0}}}
2024-06-18T15:15:26.201Z  DEBUG cache/context.go:746  app namespace info  
{"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "namespace": "spark"}
2024-06-18T15:15:26.201Z  INFO  cache/context.go:773  app added {"appID": 
"spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:26.201Z  DEBUG cache/context.go:841  AddTask {"appID": 
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:26.201Z  DEBUG cache/context.go:233  adding pod to cache 
{"podName": "spark-pi-f4f19b902beac663-driver"}
2024-06-18T15:15:26.201Z  INFO  cache/context.go:863  task added  {"appID": 
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "taskID": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskState": "New"}
2024-06-18T15:15:26.201Z  INFO  cache/context.go:873  app request originating 
pod added {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "original task": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:26.201Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (AddPod.Pre)  {"nodes": 3, "pods": 55, "assumed": 0, "pendingAllocs": 0, 
"inProgressAllocs": 0, "podsAssigned": 55, "phases": {"Pending":2,"Running":53}}
2024-06-18T15:15:26.201Z  DEBUG external/scheduler_cache.go:411 Putting pod in 
cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:26.201Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (AddPod.Post) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 0, 
"inProgressAllocs": 0, "podsAssigned": 55, "phases": {"Pending":3,"Running":53}}
2024-06-18T15:15:26.277Z  DEBUG scheduler/scheduler.go:157  inspect outstanding 
requests
2024-06-18T15:15:26.924Z  DEBUG cache/application_state.go:508  shim app state 
transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": "New", 
"destination": "Submitted", "event": "SubmitApplication"}
2024-06-18T15:15:26.924Z  INFO  cache/application.go:424  handle app submission 
{"app": "applicationID: spark-8e410a4c5ce44da2aa85ba835257a1e9, queue: 
root.spark, partition: default, totalNumOfTasks: 1, currentState: Submitted", 
"clusterID": "mycluster"}
2024-06-18T15:15:26.924Z  DEBUG scheduler/scheduler.go:96 enqueued event  
{"eventType": "*rmevent.RMUpdateApplicationEvent", "event": 
{"Request":{"new":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","queueName":"root.spark","partitionName":"[mycluster]default","ugi":{"user":"nobody"},"tags":{"namespace":"spark","yunikorn.apache.org/schedulingPolicyParameters":"placeholderTimeoutInSeconds=60
 gangSchedulingStyle=Hard","yunikorn.apache.org/task-groups":"[{\"name\": 
\"spark-driver\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", 
\"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} }, {\"name\": 
\"spark-executor\", \"minMember\": 1, \"minResource\": {\"cpu\": \"1\", 
\"memory\": \"2Gi\"},\"nodeSelector\": {\"app\": \"spark\"} 
}]"},"executionTimeoutMilliSeconds":60000,"placeholderAsk":{"resources":{"memory":{"value":4294967296},"pods":{"value":2},"vcore":{"value":2000}}},"gangSchedulingStyle":"Hard"}],"rmID":"mycluster"}},
 "currentQueueSize": 0}
2024-06-18T15:15:26.924Z  DEBUG placement/placement.go:145  Executing rule for 
placing application  {"ruleName": "tag", "application": 
"spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:26.924Z  DEBUG placement/tag_rule.go:106 Tag rule intermediate 
result  {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue": 
"root.spark"}
2024-06-18T15:15:26.924Z  INFO  placement/tag_rule.go:115 Tag rule application 
placed {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue": 
"root.spark"}
2024-06-18T15:15:26.924Z  DEBUG placement/placement.go:204  Rule result for 
placing application {"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
"queueName": "root.spark"}
2024-06-18T15:15:26.925Z  INFO  scheduler/context.go:549  Added application to 
partition  {"applicationID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
"partitionName": "[mycluster]default", "requested queue": "root.spark", "placed 
queue": "root.spark"}
2024-06-18T15:15:26.925Z  DEBUG rmproxy/rmproxy.go:59 enqueue event 
{"eventType": "*rmevent.RMApplicationUpdateEvent", "event": 
{"RmID":"mycluster","AcceptedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9"}],"RejectedApplications":[],"UpdatedApplications":null},
 "currentQueueSize": 0}
2024-06-18T15:15:26.925Z  DEBUG callback/scheduler_callback.go:108  
UpdateApplication callback received {"UpdateApplicationResponse": 
"accepted:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > "}
2024-06-18T15:15:26.925Z  DEBUG callback/scheduler_callback.go:114  callback: 
response to accepted application  {"appID": 
"spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:26.925Z  INFO  callback/scheduler_callback.go:118  Accepting 
app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:26.925Z  DEBUG cache/application_state.go:508  shim app state 
transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
"Submitted", "destination": "Accepted", "event": "AcceptApplication"}
2024-06-18T15:15:27.277Z  DEBUG scheduler/scheduler.go:157  inspect outstanding 
requests
2024-06-18T15:15:27.925Z  DEBUG cache/application.go:516  postAppAccepted on 
cached app {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "numTaskGroups": 
2, "numAllocatedTasks": 0}
2024-06-18T15:15:27.925Z  INFO  cache/application.go:526  app has taskGroups 
defined, trying to reserve resources for gang members  {"appID": 
"spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:27.925Z  DEBUG cache/application_state.go:508  shim app state 
transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
"Accepted", "destination": "Reserving", "event": "TryReserve"}
2024-06-18T15:15:27.933Z  ERROR cache/placeholder_manager.go:99 failed to 
create placeholder pod  {"error": "pods 
\"tg-spark-driver-spark-8e410a4c5ce44da2aa85ba-0\" is forbidden: failed quota: 
spark-quota: must specify limits.cpu,limits.memory"}
github.com/apache/yunikorn-k8shim/pkg/cache.(*PlaceholderManager).createAppPlaceholders
  github.com/apache/yunikorn-k8shim/pkg/cache/placeholder_manager.go:99
github.com/apache/yunikorn-k8shim/pkg/cache.(*Application).onReserving.func1
  github.com/apache/yunikorn-k8shim/pkg/cache/application.go:542
2024-06-18T15:15:27.933Z  INFO  cache/placeholder_manager.go:115  start to 
clean up app placeholders  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:27.933Z  INFO  cache/placeholder_manager.go:128  finished 
cleaning up app placeholders {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9"}
2024-06-18T15:15:27.933Z  DEBUG cache/application_state.go:508  shim app state 
transition {"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
"Reserving", "destination": "Running", "event": "RunApplication"}
2024-06-18T15:15:28.278Z  DEBUG scheduler/scheduler.go:157  inspect outstanding 
requests
2024-06-18T15:15:28.867Z  DEBUG scheduler/partition_manager.go:83 time consumed 
for queue cleaner {"duration": "6.41µs"}
2024-06-18T15:15:28.925Z  INFO  cache/task_state.go:380 Task state transition 
{"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": 
"spark/spark-pi-f4f19b902beac663-driver", "source": "New", "destination": 
"Pending", "event": "InitTask"}
2024-06-18T15:15:28.926Z  INFO  cache/task_state.go:380 Task state transition 
{"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": 
"spark/spark-pi-f4f19b902beac663-driver", "source": "Pending", "destination": 
"Scheduling", "event": "SubmitTask"}
2024-06-18T15:15:28.926Z  DEBUG cache/task.go:275 scheduling pod  {"podName": 
"spark-pi-f4f19b902beac663-driver"}
2024-06-18T15:15:28.926Z  DEBUG cache/task.go:294 send update request 
{"request": "asks:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\" 
applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" 
resourceAsk:<resources:<key:\"memory\" value:<value:1975517184 > > 
resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\" 
value:<value:1000 > > > maxAllocations:1 tags:<key:\"kubernetes.io/label/app\" 
value:\"spark\" > tags:<key:\"kubernetes.io/label/queue\" value:\"root.spark\" 
> tags:<key:\"kubernetes.io/label/spark-app-name\" value:\"spark-pi\" > 
tags:<key:\"kubernetes.io/label/spark-app-selector\" 
value:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" > 
tags:<key:\"kubernetes.io/label/spark-role\" value:\"driver\" > 
tags:<key:\"kubernetes.io/label/spark-version\" value:\"3.3.2\" > 
tags:<key:\"kubernetes.io/meta/namespace\" value:\"spark\" > 
tags:<key:\"kubernetes.io/meta/podName\" 
value:\"spark-pi-f4f19b902beac663-driver\" > taskGroupName:\"spark-driver\" 
Originator:true preemptionPolicy:<allowPreemptSelf:true allowPreemptOther:true 
> > rmID:\"mycluster\" "}
2024-06-18T15:15:28.926Z  DEBUG scheduler/scheduler.go:96 enqueued event  
{"eventType": "*rmevent.RMUpdateAllocationEvent", "event": 
{"Request":{"asks":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","partitionName":"[mycluster]default","resourceAsk":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"maxAllocations":1,"tags":{"kubernetes.io/label/app":"spark","kubernetes.io/label/queue":"root.spark","kubernetes.io/label/spark-app-name":"spark-pi","kubernetes.io/label/spark-app-selector":"spark-8e410a4c5ce44da2aa85ba835257a1e9","kubernetes.io/label/spark-role":"driver","kubernetes.io/label/spark-version":"3.3.2","kubernetes.io/meta/namespace":"spark","kubernetes.io/meta/podName":"spark-pi-f4f19b902beac663-driver"},"taskGroupName":"spark-driver","Originator":true,"preemptionPolicy":{"allowPreemptSelf":true,"allowPreemptOther":true}}],"rmID":"mycluster"}},
 "currentQueueSize": 0}
2024-06-18T15:15:28.926Z  INFO  objects/application_state.go:133  Application 
state transition  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
"New", "destination": "Accepted", "event": "runApplication"}
2024-06-18T15:15:28.926Z  DEBUG rmproxy/rmproxy.go:59 enqueue event 
{"eventType": "*rmevent.RMApplicationUpdateEvent", "event": 
{"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Accepted","stateTransitionTimestamp":1718723728926391633,"message":"runApplication"}]},
 "currentQueueSize": 0}
2024-06-18T15:15:28.926Z  INFO  objects/application.go:668  ask added 
successfully to application {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
"ask": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "placeholder": false, 
"pendingDelta": "map[memory:1975517184 pods:1 vcore:1000]"}
2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:108  
UpdateApplication callback received {"UpdateApplicationResponse": 
"updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" 
state:\"Accepted\" stateTransitionTimestamp:1718723728926391633 
message:\"runApplication\" > "}
2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:137  status 
update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
"new status": "Accepted"}
2024-06-18T15:15:28.926Z  DEBUG objects/application.go:339  Application state 
timer initiated {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "state": 
"Starting", "timeout": "5m0s"}
2024-06-18T15:15:28.926Z  INFO  objects/application_state.go:133  Application 
state transition  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "source": 
"Accepted", "destination": "Starting", "event": "runApplication"}
2024-06-18T15:15:28.926Z  DEBUG rmproxy/rmproxy.go:59 enqueue event 
{"eventType": "*rmevent.RMApplicationUpdateEvent", "event": 
{"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","state":"Starting","stateTransitionTimestamp":1718723728926635454,"message":"runApplication"}]},
 "currentQueueSize": 0}
2024-06-18T15:15:28.926Z  DEBUG ugm/manager.go:63 Increasing resource usage 
{"user": "nobody", "queue path": "root.spark", "application": 
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
pods:1 vcore:1000]"}
2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue tracker 
object for queue {"queue": "root"}
2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource 
usage {"queue path": "root.spark", "application": 
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
pods:1 vcore:1000]"}
2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue tracker 
object for queue {"queue": "spark"}
2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource 
usage {"queue path": "spark", "application": 
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
pods:1 vcore:1000]"}
2024-06-18T15:15:28.926Z  DEBUG ugm/manager.go:257  Group tracker does not 
exist. Creating group tracker object and linking the same with application 
{"application": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "queue path": 
"root.spark", "user": "nobody", "group": "nobody"}
2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue tracker 
object for queue {"queue": "root"}
2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource 
usage {"queue path": "root.spark", "application": 
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
pods:1 vcore:1000]"}
2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:45 Creating queue tracker 
object for queue {"queue": "spark"}
2024-06-18T15:15:28.926Z  DEBUG ugm/queue_tracker.go:57 Increasing resource 
usage {"queue path": "spark", "application": 
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "resource": "map[memory:1975517184 
pods:1 vcore:1000]"}
2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:108  
UpdateApplication callback received {"UpdateApplicationResponse": 
"updated:<applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" 
state:\"Starting\" stateTransitionTimestamp:1718723728926635454 
message:\"runApplication\" > "}
2024-06-18T15:15:28.926Z  DEBUG objects/queue.go:1239 allocation found on queue 
{"queueName": "root.spark", "appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
"allocation": "applicationID=spark-8e410a4c5ce44da2aa85ba835257a1e9, 
uuid=3e42c4e0-0fc3-466c-a524-444b2eec700b, 
allocationKey=4fb897dd-3af5-4799-a09d-640b5222ba3a, Node=10.10.10.66, 
result=Allocated"}
2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:137  status 
update callback received {"appId": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
"new status": "Starting"}
2024-06-18T15:15:28.926Z  INFO  scheduler/partition.go:888  scheduler 
allocation processed  {"appID": "spark-8e410a4c5ce44da2aa85ba835257a1e9", 
"allocationKey": "4fb897dd-3af5-4799-a09d-640b5222ba3a", "uuid": 
"3e42c4e0-0fc3-466c-a524-444b2eec700b", "allocatedResource": 
"map[memory:1975517184 pods:1 vcore:1000]", "placeholder": false, "targetNode": 
"10.10.10.66"}
2024-06-18T15:15:28.926Z  DEBUG rmproxy/rmproxy.go:59 enqueue event 
{"eventType": "*rmevent.RMNewAllocationsEvent", "event": 
{"RmID":"mycluster","Allocations":[{"allocationKey":"4fb897dd-3af5-4799-a09d-640b5222ba3a","UUID":"3e42c4e0-0fc3-466c-a524-444b2eec700b","resourcePerAlloc":{"resources":{"memory":{"value":1975517184},"pods":{"value":1},"vcore":{"value":1000}}},"nodeID":"10.10.10.66","applicationID":"spark-8e410a4c5ce44da2aa85ba835257a1e9","taskGroupName":"spark-driver"}]},
 "currentQueueSize": 0}
2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:48 
UpdateAllocation callback received  {"UpdateAllocationResponse": 
"new:<allocationKey:\"4fb897dd-3af5-4799-a09d-640b5222ba3a\" 
UUID:\"3e42c4e0-0fc3-466c-a524-444b2eec700b\" 
resourcePerAlloc:<resources:<key:\"memory\" value:<value:1975517184 > > 
resources:<key:\"pods\" value:<value:1 > > resources:<key:\"vcore\" 
value:<value:1000 > > > nodeID:\"10.10.10.66\" 
applicationID:\"spark-8e410a4c5ce44da2aa85ba835257a1e9\" 
taskGroupName:\"spark-driver\" > "}
2024-06-18T15:15:28.926Z  DEBUG callback/scheduler_callback.go:53 callback: 
response to new allocation  {"allocationKey": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "UUID": 
"3e42c4e0-0fc3-466c-a524-444b2eec700b", "applicationID": 
"spark-8e410a4c5ce44da2aa85ba835257a1e9", "nodeID": "10.10.10.66"}
2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (AssumePod.Pre) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 
0, "inProgressAllocs": 0, "podsAssigned": 55, "phases": 
{"Pending":3,"Running":53}}
2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:476 Adding assumed 
pod to cache {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "node": "10.10.10.66", "allBound": true}
2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:411 Putting pod in 
cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:28.926Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (AssumePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
{"Pending":3,"Running":53}}
2024-06-18T15:15:28.926Z  DEBUG scheduler/context.go:853  Successfully synced 
shim on new allocation. response: no. of allocations: 1
2024-06-18T15:15:28.926Z  INFO  cache/task_state.go:380 Task state transition 
{"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": 
"spark/spark-pi-f4f19b902beac663-driver", "source": "Scheduling", 
"destination": "Allocated", "event": "TaskAllocated"}
2024-06-18T15:15:28.926Z  DEBUG cache/task.go:349 bind pod volumes  {"podName": 
"spark-pi-f4f19b902beac663-driver", "podUID": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:28.926Z  INFO  cache/context.go:499  Binding Pod Volumes 
skipped: all volumes already bound  {"podName": 
"spark-pi-f4f19b902beac663-driver"}
2024-06-18T15:15:28.926Z  DEBUG cache/task.go:362 bind pod  {"podName": 
"spark-pi-f4f19b902beac663-driver", "podUID": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:28.926Z  INFO  client/kubeclient.go:112  bind pod to node  
{"podName": "spark-pi-f4f19b902beac663-driver", "podUID": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "nodeID": "10.10.10.66"}
2024-06-18T15:15:28.933Z  INFO  cache/task.go:375 successfully bound pod  
{"podName": "spark-pi-f4f19b902beac663-driver"}
2024-06-18T15:15:28.934Z  INFO  cache/task_state.go:380 Task state transition 
{"app": "spark-8e410a4c5ce44da2aa85ba835257a1e9", "task": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a", "taskAlias": 
"spark/spark-pi-f4f19b902beac663-driver", "source": "Allocated", "destination": 
"Bound", "event": "TaskBound"}
2024-06-18T15:15:28.934Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
{"Pending":3,"Running":53}}
2024-06-18T15:15:28.934Z  DEBUG external/scheduler_cache.go:411 Putting pod in 
cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:28.934Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
{"Pending":3,"Running":53}}
2024-06-18T15:15:28.946Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
{"Pending":3,"Running":53}}
2024-06-18T15:15:28.946Z  DEBUG external/scheduler_cache.go:411 Putting pod in 
cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:28.946Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
{"Pending":3,"Running":53}}
2024-06-18T15:15:29.278Z  DEBUG scheduler/scheduler.go:157  inspect outstanding 
requests
2024-06-18T15:15:29.906Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
{"Pending":3,"Running":53}}
2024-06-18T15:15:29.906Z  DEBUG external/scheduler_cache.go:411 Putting pod in 
cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:29.906Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
{"Pending":3,"Running":53}}
2024-06-18T15:15:30.113Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (UpdatePod.Pre) {"nodes": 3, "pods": 56, "assumed": 1, "pendingAllocs": 
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
{"Pending":3,"Running":53}}
2024-06-18T15:15:30.113Z  DEBUG external/scheduler_cache.go:411 Putting pod in 
cache  {"podName": "spark-pi-f4f19b902beac663-driver", "podKey": 
"4fb897dd-3af5-4799-a09d-640b5222ba3a"}
2024-06-18T15:15:30.113Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (UpdatePod.Post)  {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
{"Pending":2,"Running":54}}
2024-06-18T15:15:30.279Z  DEBUG scheduler/scheduler.go:157  inspect outstanding 
requests
2024-06-18T15:15:31.279Z  DEBUG scheduler/scheduler.go:157  inspect outstanding 
requests
2024-06-18T15:15:32.280Z  DEBUG scheduler/scheduler.go:157  inspect outstanding 
requests
2024-06-18T15:15:32.395Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (UpdateNode.Pre)  {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
{"Pending":2,"Running":54}}
2024-06-18T15:15:32.395Z  DEBUG external/scheduler_cache.go:179 Updating node 
in cache  {"nodeName": "10.10.10.66"}
2024-06-18T15:15:32.395Z  DEBUG external/scheduler_cache.go:558 Scheduler cache 
state (UpdateNode.Post) {"nodes": 3, "pods": 56, "assumed": 0, "pendingAllocs": 
0, "inProgressAllocs": 0, "podsAssigned": 56, "phases": 
{"Pending":2,"Running":54}}
2024-06-18T15:15:32.395Z  DEBUG cache/node.go:109 set node capacity {"nodeID": 
"10.10.10.66", "capacity": "resources:<key:\"ephemeral-storage\" 
value:<value:478652923105 > > resources:<key:\"hugepages-1Gi\" value:<> > 
resources:<key:\"hugepages-2Mi\" value:<> > 
resources:<key:\"kubernetes.io/batch-cpu\" value:<value:5167 > > 
resources:<key:\"kubernetes.io/batch-memory\" value:<value:34437364710 > > 
resources:<key:\"kubernetes.io/mid-cpu\" value:<value:3156 > > 
resources:<key:\"kubernetes.io/mid-memory\" value:<value:1442139746 > > 
resources:<key:\"memory\" value:<value:64713940992 > > resources:<key:\"pods\" 
value:<value:110 > > resources:<key:\"vcore\" value:<value:14000 > > "}
2024-06-18T15:15:32.396Z  INFO  cache/nodes.go:179  Node's ready status flag  
{"Node name": "10.10.10.66", "ready": true}
2024-06-18T15:15:32.396Z  INFO  cache/nodes.go:184  report updated nodes to 
scheduler {"request": 
{"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}}
2024-06-18T15:15:32.396Z  DEBUG scheduler/scheduler.go:96 enqueued event  
{"eventType": "*rmevent.RMUpdateNodeEvent", "event": 
{"Request":{"nodes":[{"nodeID":"10.10.10.66","action":2,"attributes":{"ready":"true","si/node-partition":"[mycluster]default"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":478652923105},"hugepages-1Gi":{},"hugepages-2Mi":{},"kubernetes.io/batch-cpu":{"value":5167},"kubernetes.io/batch-memory":{"value":34437364710},"kubernetes.io/mid-cpu":{"value":3156},"kubernetes.io/mid-memory":{"value":1442139746},"memory":{"value":64713940992},"pods":{"value":110},"vcore":{"value":14000}}},"occupiedResource":{"resources":{"memory":{"value":4229955584},"pods":{"value":16},"vcore":{"value":4912}}}}],"rmID":"mycluster"}},
 "currentQueueSize": 0}
2024-06-18T15:15:32.396Z  INFO  objects/queue.go:1190 updating root queue max 
resources {"current max": "map[ephemeral-storage:1435958769315 hugepages-1Gi:0 
hugepages-2Mi:0 kubernetes.io/batch-cpu:19610 
kubernetes.io/batch-memory:109317941106 kubernetes.io/mid-cpu:9574 
kubernetes.io/mid-memory:5209251314 memory:194141839360 pods:330 vcore:42000]", 
"new max": "map[ephemeral-storage:1435958769315 hugepages-1Gi:0 hugepages-2Mi:0 
kubernetes.io/batch-cpu:19309 kubernetes.io/batch-memory:109530403859 
kubernetes.io/mid-cpu:10574 kubernetes.io/mid-memory:6651391060 
memory:194141839360 pods:330 vcore:42000]"}
2024-06-18T15:15:32.396Z  DEBUG objects/node.go:182 skip updating 
occupiedResource, not changed {code}
 

 
h2. Spark Pod status
{code:bash}
kubectl get pod -n spark | grep Running
{code}
!image-2024-06-19-00-02-53-178.png!
{code:bash}
kubectl describe pod spark-pi-xxxxxxxx-driver -n spark
{code}
!image-2024-06-19-00-03-09-703.png!
 
 

 


