[jira] [Created] (YUNIKORN-2540) clean up constants in pkg/cache/context_test.go
Wilfred Spiegelenburg created YUNIKORN-2540:
--------------------------------------------

             Summary: clean up constants in pkg/cache/context_test.go
                 Key: YUNIKORN-2540
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2540
             Project: Apache YuniKorn
          Issue Type: Improvement
          Components: shim - kubernetes
            Reporter: Wilfred Spiegelenburg


Constants are duplicated in {{pkg/cache/context_test.go}}. For example, {{fakeNodeName}} is defined multiple times in the file. We should define the constants for the tests once, at the top of the file.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
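For YUNIKORN-2540, a minimal sketch of the layout being proposed. Only the name fakeNodeName comes from the issue; its value, the second constant and the two helper functions are made up for illustration and are not taken from context_test.go.

```go
package main

import "fmt"

// Test constants declared once, at the top of the file.
// Only the name fakeNodeName comes from the issue; the values,
// fakeAppID and the helpers below are hypothetical.
const (
	fakeNodeName = "fake-node"
	fakeAppID    = "fake-app"
)

// nodeLabel and podKey stand in for two tests that previously
// each declared their own copy of fakeNodeName.
func nodeLabel() string { return "node=" + fakeNodeName }

func podKey() string { return fakeAppID + "/" + fakeNodeName }

func main() {
	fmt.Println(nodeLabel())
	fmt.Println(podKey())
}
```

With a single const block, changing a value means touching one line, and the compiler rejects an accidental second declaration in the same scope.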
[jira] [Created] (YUNIKORN-2539) Add optional deadlock detection
Craig Condit created YUNIKORN-2539:
-----------------------------------

             Summary: Add optional deadlock detection
                 Key: YUNIKORN-2539
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2539
             Project: Apache YuniKorn
          Issue Type: Improvement
          Components: core - scheduler, shim - kubernetes
            Reporter: Craig Condit
            Assignee: Craig Condit


We make heavy use of sync.Mutex and sync.RWMutex in our code. Unfortunately, while these are very performant, they can lead to difficult-to-diagnose deadlocks. If we substitute our own locking routines, we can optionally enable deadlock detection.

See [https://github.com/sasha-s/go-deadlock] for a possible solution.
[jira] [Resolved] (YUNIKORN-2525) dispatcher.Stop() waits an extra second unnecessarily
[ https://issues.apache.org/jira/browse/YUNIKORN-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko resolved YUNIKORN-2525.
------------------------------------
    Fix Version/s: 1.6.0
       Resolution: Fixed

> dispatcher.Stop() waits an extra second unnecessarily
> -----------------------------------------------------
>
>                 Key: YUNIKORN-2525
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2525
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: shim - kubernetes
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.6.0
>
>
> {{dispatcher.Stop()}} sometimes takes an extra second to shut down properly. This slows down unit tests. On my machine, {{context_test.go}} runs for 19-20 seconds. With some improvements, this can be reduced to 1 second.
[jira] [Closed] (YUNIKORN-2534) [Yunikorn] Quota enforcement checks are failing when we have max-application set to 0
[ https://issues.apache.org/jira/browse/YUNIKORN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit closed YUNIKORN-2534.
----------------------------------

> [Yunikorn] Quota enforcement checks are failing when we have max-application set to 0
> -------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2534
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2534
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Rajesh Kanhaiya Lal
>            Priority: Major
>         Attachments: yunikorn-configs-fresh.yaml
[jira] [Resolved] (YUNIKORN-2534) [Yunikorn] Quota enforcement checks are failing when we have max-application set to 0
[ https://issues.apache.org/jira/browse/YUNIKORN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit resolved YUNIKORN-2534.
------------------------------------
      Assignee:     (was: Manikandan R)
    Resolution: Not A Bug

This is not a bug. A value of zero is indistinguishable from unset, and we explicitly treat it the same.

> [Yunikorn] Quota enforcement checks are failing when we have max-application set to 0
> -------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2534
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2534
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Rajesh Kanhaiya Lal
>            Priority: Major
>         Attachments: yunikorn-configs-fresh.yaml
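The resolution comment on YUNIKORN-2534 can be illustrated with a small sketch of why an explicit zero and a missing value look the same once decoded into an unsigned integer field. The queueConfig type, its field names and the hasAppLimit helper are hypothetical, and JSON from the standard library stands in for the YAML config file; this is not the YuniKorn config code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queueConfig is a hypothetical, trimmed-down queue definition.
type queueConfig struct {
	Name            string `json:"name"`
	MaxApplications uint64 `json:"maxapplications"`
}

// parseQueue decodes a queue definition from JSON.
func parseQueue(doc string) (queueConfig, error) {
	var q queueConfig
	err := json.Unmarshal([]byte(doc), &q)
	return q, err
}

// hasAppLimit treats zero as "no limit configured".
func hasAppLimit(q queueConfig) bool { return q.MaxApplications != 0 }

func main() {
	explicitZero, _ := parseQueue(`{"name": "root.default", "maxapplications": 0}`)
	unset, _ := parseQueue(`{"name": "root.default"}`)

	// Both documents decode to the same struct value.
	fmt.Println(explicitZero == unset) // true
	// So a limit check cannot tell "0" apart from "not set".
	fmt.Println(hasAppLimit(explicitZero)) // false
}
```

After decoding, the field holds the zero value in both cases, so any later validation or enforcement step sees no limit at all.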
[jira] [Resolved] (YUNIKORN-2520) PVC errors in AssumePod() are not handled properly
[ https://issues.apache.org/jira/browse/YUNIKORN-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg resolved YUNIKORN-2520.
---------------------------------------------
    Fix Version/s: 1.6.0
       Resolution: Fixed

Changes merged to master

Volume issues should be handled correctly now.

> PVC errors in AssumePod() are not handled properly
> --------------------------------------------------
>
>                 Key: YUNIKORN-2520
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2520
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.6.0
>
>
> When there is an error caused by a volume operation in {{Context.AssumePod()}}, the allocation on core side will not be removed.
> Although we check the result from {{UpdateAllocation}}, the error handling is just logging:
> {noformat}
> if err := callback.UpdateAllocation(response); err != nil {
>     rmp.handleUpdateResponseError(rmID, err)
> }
> ...
> func (rmp *RMProxy) handleUpdateResponseError(rmID string, err error) {
>     log.Log(log.RMProxy).Error("failed to handle response",
>         zap.String("rmID", rmID),
>         zap.Error(err))
> }
> {noformat}
> I suggest moving volume-related code to {{Task.postTaskAllocated()}}. In this case, the task will transition to "Failed" state and we'll have allocationID available, so we can release both the ask and the allocation:
> {noformat}
> func (task *Task) releaseAllocation() {
>     ...
>     var releaseRequest *si.AllocationRequest
>     s := TaskStates()
>     switch task.GetTaskState() {
>     case s.New, s.Pending, s.Scheduling, s.Rejected:
>         releaseRequest = common.CreateReleaseAskRequestForTask(
>             task.applicationID, task.taskID, task.application.partition)  <-- release ask + allocation if possible
>     default:
>         if task.allocationID == "" {
>             ... log error ...
>             return
>         }
>         releaseRequest = common.CreateReleaseAllocationRequestForTask(
>             task.applicationID, task.taskID, task.allocationID, task.application.partition, task.terminationType)
>     }
>     ...
> {noformat}
[jira] [Created] (YUNIKORN-2538) Shim cache context pre-allocate slice
Wilfred Spiegelenburg created YUNIKORN-2538:
--------------------------------------------

             Summary: Shim cache context pre-allocate slice
                 Key: YUNIKORN-2538
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2538
             Project: Apache YuniKorn
          Issue Type: Improvement
          Components: shim - kubernetes
            Reporter: Wilfred Spiegelenburg


When building the reason string from all volume failure reasons, we should allocate the slice once, sized to the reasons object that is returned.

See [review comment|https://github.com/apache/yunikorn-k8shim/pull/810#discussion_r1550882867]
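The change asked for in YUNIKORN-2538 is small enough to sketch. The conflictReason type, the buildReason function and the example strings are hypothetical stand-ins, not the shim's actual code; the point is only the single make call with the capacity taken from the input.

```go
package main

import (
	"fmt"
	"strings"
)

// conflictReason is a hypothetical stand-in for the reason type
// returned by the volume check.
type conflictReason string

// buildReason joins all failure reasons into one string.
func buildReason(reasons []conflictReason) string {
	// Allocate once with the final capacity so append never has to
	// grow and copy the backing array.
	parts := make([]string, 0, len(reasons))
	for _, r := range reasons {
		parts = append(parts, string(r))
	}
	return strings.Join(parts, ", ")
}

func main() {
	// Illustrative reason strings only.
	reasons := []conflictReason{"volume node affinity conflict", "unbound volume claim"}
	fmt.Println(buildReason(reasons))
}
```

Starting from a nil slice would also work, but append may then reallocate as the slice grows; sizing up front avoids that when the length is known.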
[jira] [Created] (YUNIKORN-2537) cleanup UpdateAllocation in callback
Wilfred Spiegelenburg created YUNIKORN-2537:
--------------------------------------------

             Summary: cleanup UpdateAllocation in callback
                 Key: YUNIKORN-2537
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2537
             Project: Apache YuniKorn
          Issue Type: Improvement
          Components: shim - kubernetes
            Reporter: Wilfred Spiegelenburg


UpdateAllocation needs a cleanup: {{getTask()}} already checks for the application, so there is no need to retrieve the application when we process response.New. Sending an event should be linked to the existence of the task, not of the application. On top of that, we already have the appID in the task, so we do not need to get it from the app.

The same logic needs to be applied to the whole function. We already do it for the release.* handling.
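The shape of the cleanup described in YUNIKORN-2537 can be sketched with a toy model. Every type and function here other than the name getTask is a simplified, hypothetical stand-in and does not mirror the shim's real signatures. The sketch shows one lookup that covers both the application and the task, with the application ID read from the task.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the shim's cache objects.
type task struct {
	applicationID string
	taskID        string
}

type application struct {
	tasks map[string]*task
}

type context struct {
	applications map[string]*application
}

// getTask returns nil when either the application or the task is unknown,
// so callers do not need a separate application lookup.
func (c *context) getTask(appID, taskID string) *task {
	app, ok := c.applications[appID]
	if !ok {
		return nil
	}
	return app.tasks[taskID]
}

// handleNewAllocation decides whether to send an event based on the task
// alone and takes the application ID from the task itself.
func (c *context) handleNewAllocation(appID, taskID, nodeID string) (string, bool) {
	t := c.getTask(appID, taskID)
	if t == nil {
		return "", false
	}
	event := fmt.Sprintf("allocated: app=%s task=%s node=%s", t.applicationID, t.taskID, nodeID)
	return event, true
}

func main() {
	ctx := &context{applications: map[string]*application{
		"app-1": {tasks: map[string]*task{
			"task-1": {applicationID: "app-1", taskID: "task-1"},
		}},
	}}
	fmt.Println(ctx.handleNewAllocation("app-1", "task-1", "node-1"))
	fmt.Println(ctx.handleNewAllocation("app-2", "task-1", "node-1"))
}
```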
[jira] [Created] (YUNIKORN-2536) Create a design doc
Rainie Li created YUNIKORN-2536:
--------------------------------

             Summary: Create a design doc
                 Key: YUNIKORN-2536
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2536
             Project: Apache YuniKorn
          Issue Type: Sub-task
            Reporter: Rainie Li
            Assignee: Rainie Li
[jira] [Created] (YUNIKORN-2535) [Umbrella] YuniKorn: Dynamically Adjust Queue to Ensure App Service Level Objectives (SLO)
Rainie Li created YUNIKORN-2535:
--------------------------------

             Summary: [Umbrella] YuniKorn: Dynamically Adjust Queue to Ensure App Service Level Objectives (SLO)
                 Key: YUNIKORN-2535
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2535
             Project: Apache YuniKorn
          Issue Type: New Feature
          Components: core - common, core - scheduler, shim - kubernetes
            Reporter: Rainie Li
            Assignee: Rainie Li


We want to guarantee SLOs for critical apps when a cluster has limited resources. YuniKorn can track the actual status of applications and dynamically adjust queues (resources, priority, etc.) based on the SLO of the applications.

Here is the [initial proposal Feature 2|https://docs.google.com/document/d/1c8tzmEgl32o6_0eDxQ1ZiMRcD-1QRvdu-3PABNZQX30/edit].

We had some discussions with the community during the meetup on 04/03/24.

Next step: we will create a more detailed design doc and review it with the community.
[jira] [Created] (YUNIKORN-2534) [Yunikorn] Quota enforcement checks are failing when we have max-application set to 0
Rajesh Kanhaiya Lal created YUNIKORN-2534:
------------------------------------------

             Summary: [Yunikorn] Quota enforcement checks are failing when we have max-application set to 0
                 Key: YUNIKORN-2534
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2534
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: shim - kubernetes
            Reporter: Rajesh Kanhaiya Lal
            Assignee: Manikandan R
         Attachments: yunikorn-configs-fresh.yaml


The max-application checks do not work when max-application is set to 0 in the yunikorn-config file.

Config validation is also skipped when max-application is set to 0. For example, the check that a child queue's max-application must be less than or equal to the parent queue's is not enforced.

The YuniKorn config file is attached.

The user and group tracking API also does not report max-application in the response.

{code:java}
curl --location 'http://127.0.0.1:9080/ws/v1/partition/default/usage/users'
[
    {
        "userName": "nobody",
        "groups": {
            "ts333w3": "*",
            "ts433": "*",
            "ts544": "*",
            "ts633": "*"
        },
        "queues": {
            "queuePath": "root",
            "resourceUsage": {
                "Resources": {
                    "memory": 3,
                    "pods": 3,
                    "vcore": 300
                }
            },
            "runningApplications": [
                "ts333w3",
                "ts433",
                "ts544"
            ],
            "children": [
                {
                    "queuePath": "root.default",
                    "resourceUsage": {
                        "Resources": {
                            "memory": 3,
                            "pods": 3,
                            "vcore": 300
                        }
                    },
                    "runningApplications": [
                        "ts333w3",
                        "ts433",
                        "ts544"
                    ]
                }
            ]
        }
    }
]
{code}

Could you please take a look?
[jira] [Created] (YUNIKORN-2533) Implement String() for TrackedResource
Wilfred Spiegelenburg created YUNIKORN-2533:
--------------------------------------------

             Summary: Implement String() for TrackedResource
                 Key: YUNIKORN-2533
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2533
             Project: Apache YuniKorn
          Issue Type: Improvement
          Components: core - common
            Reporter: Wilfred Spiegelenburg


To fix the way TrackedResource objects are logged, the type should implement the String() function.