[ https://issues.apache.org/jira/browse/YUNIKORN-2645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg updated YUNIKORN-2645:
--------------------------------------------
    Issue Type: New Feature  (was: Bug)

> parent queue exceeds maximum resource
> -------------------------------------
>
>                 Key: YUNIKORN-2645
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2645
>             Project: Apache YuniKorn
>          Issue Type: New Feature
>          Components: core - scheduler
>    Affects Versions: 1.5.1
>            Reporter: Dmitry
>            Priority: Major
>         Attachments: yunikorn-logs.txt.gz
>
>
> We had a node broken in the cluster - kubernetes was creating pods which were
> immediately failing with "OutOfGPU" state. The node had 1000+ pods on it.
> The scheduler panicked with the log attached and was not scheduling any other
> pods.
> The config:
> {code:yaml}
> apiVersion: v1
> data:
>   admissionController.filtering.bypassNamespaces: ^kube-system$,^rook$,^rook-east$,^rook-central$,^rook-pacific$,^rook-south-east$,^rook-system$
>   queues.yaml: |
>     partitions:
>       - name: default
>         placementrules:
>           - name: fixed
>             value: root.scavenging.osg
>             create: true
>             filter:
>               type: allow
>               users:
>                 - system:serviceaccount:osg-ligo:prp-htcondor-provisioner
>                 - system:serviceaccount:osg-opportunistic:prp-htcondor-provisioner
>                 - system:serviceaccount:osg-icecube:prp-htcondor-provisioner
>           - name: tag
>             value: namespace
>             create: true
>             parent:
>               name: tag
>               value: namespace.parentqueue
>           - name: tag
>             value: namespace
>             create: true
>             parent:
>               name: fixed
>               value: general
>         nodesortpolicy:
>           type: fair
>           resourceweights:
>             vcore: 1.0
>             memory: 1.0
>             nvidia.com/gpu: 4.0
>         queues:
>           - name: root
>             submitacl: '*'
>             properties:
>               application.sort.policy: fair
>             queues:
>               - name: system
>                 parent: true
>                 properties:
>                   preemption.policy: disabled
>               - name: general
>                 parent: true
>                 childtemplate:
>                   properties:
>                     application.sort.policy: fair
>                   resources:
>                     guaranteed:
>                       vcore: 100
>                       memory: 1Ti
>                       nvidia.com/gpu: 8
>                     max:
>                       vcore: 4000
>                       memory: 15Ti
>                       nvidia.com/gpu: 200
>               - name: scavenging
>                 parent: true
>                 childtemplate:
>                   resources:
>                     guaranteed:
>                       vcore: 1
>                       memory: 1G
>                       nvidia.com/gpu: 1
>                   properties:
>                     priority.offset: "-10"
>               - name: interactive
>                 parent: true
>                 childtemplate:
>                   resources:
>                     guaranteed:
>                       vcore: 1000
>                       memory: 10T
>                       nvidia.com/gpu: 48
>                       nvidia.com/a100: 4
>                   properties:
>                     priority.offset: "10"
>                     preemption.policy: disabled
>               - name: clemson
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 256
>                     memory: 2T
>                     nvidia.com/gpu: 24
>               - name: nysernet
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 1000
>                     memory: 5T
>                     nvidia.com/gpu: 16
>               - name: gpn
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 5000
>                     memory: 50T
>                     nvidia.com/gpu: 256
>                     nvidia.com/a100: 16
>               - name: sdsu
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 1000
>                     memory: 15T
>                     nvidia.com/gpu: 112
>                     nvidia.com/a100: 64
>                 queues:
>                   - name: sdsu-jupyterhub
>                     parent: false
>                     properties:
>                       preemption.policy: disabled
>                       priority.offset: "10"
>                     resources:
>                       guaranteed:
>                         vcore: 700
>                         memory: 5T
>                         nvidia.com/gpu: 100
>               - name: tide
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 592
>                     memory: 15T
>                     nvidia.com/gpu: 72
>                 queues:
>                   - name: rook-tide
>                     parent: false
>                     properties:
>                       preemption.policy: disabled
>                       priority.offset: "10"
>                     resources:
>                       guaranteed:
>                         vcore: 500
>                         memory: 1T
>               - name: ucsc
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 500
>                     memory: 4T
>                     nvidia.com/gpu: 256
>               - name: ucsd
>                 parent: true
>                 properties:
>                   application.sort.policy: fair
>                 resources:
>                   guaranteed:
>                     vcore: 40000
>                     memory: 40T
>                     nvidia.com/gpu: 512
>                     nvidia.com/a100: 100
>                 queues:
>                   - name: ry
>                     parent: true
>                     properties:
>                       application.sort.policy: fair
>                     resources:
>                       guaranteed:
>                         vcore: 512
>                         memory: 8T
>                         nvidia.com/gpu: 144
>                   - name: suncave
>                     parent: false
>                     properties:
>                       preemption.policy: disabled
>                       priority.offset: "10"
>                     resources:
>                       guaranteed:
>                         vcore: 1000
>                         memory: 1T
>                   - name: dimm
>                     parent: false
>                     properties:
>                       preemption.policy: disabled
>                       priority.offset: "1000"
>                     resources:
>                       guaranteed:
>                         vcore: 1000
>                         memory: 1T
>                   - name: haosu
>                     parent: true
>                     properties:
>                       application.sort.policy: fair
>                     resources:
>                       guaranteed:
>                         vcore: 5000
>                         memory: 10T
>                         nvidia.com/gpu: 120
>                     queues:
>                       - name: rook-haosu
>                         parent: false
>                         properties:
>                           preemption.policy: disabled
>                           priority.offset: "10"
>                         resources:
>                           guaranteed:
>                             vcore: 1000
>                             memory: 1T
> kind: ConfigMap
> metadata:
>   creationTimestamp: "2023-12-21T06:09:12Z"
>   name: yunikorn-configs
>   namespace: yunikorn
>   resourceVersion: "7764804169"
>   uid: 5b9b2c04-57af-4cab-84f8-b5f018952f9c
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org