[jira] [Created] (YUNIKORN-2540) clean up constants in pkg/cache/context_test.go

2024-04-04 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2540:
---

 Summary: clean up constants in pkg/cache/context_test.go
 Key: YUNIKORN-2540
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2540
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


Constants are duplicated in {{pkg/cache/context_test.go}}.

For example, {{fakeNodeName}} is defined multiple times in the file. We should 
define the test constants once, at a central point at the top of the file.
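
A minimal sketch of the proposed layout, assuming a single package-level const block. Only {{fakeNodeName}} comes from the issue; {{fakePodName}}, the helper function, and all string values are placeholders invented for illustration.

```go
package main

import "fmt"

// Shared test constants, declared once at the top of the test file.
// fakeNodeName is named in the issue; fakePodName and both values are
// placeholders chosen for this sketch.
const (
	fakeNodeName = "fake-node"
	fakePodName  = "fake-pod"
)

// describePlacement stands in for test helpers that would otherwise
// redeclare their own local copy of the node name.
func describePlacement(pod string) string {
	return fmt.Sprintf("%s on %s", pod, fakeNodeName)
}

func main() {
	fmt.Println(describePlacement(fakePodName))
}
```

With one declaration, changing the node name for every test is a single-line edit.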



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-2539) Add optional deadlock detection

2024-04-04 Thread Craig Condit (Jira)
Craig Condit created YUNIKORN-2539:
--

 Summary: Add optional deadlock detection
 Key: YUNIKORN-2539
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2539
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - scheduler, shim - kubernetes
Reporter: Craig Condit
Assignee: Craig Condit


We make heavy use of sync.Mutex and sync.RWMutex in our code. Unfortunately, 
while these are very performant, they can lead to difficult-to-diagnose 
deadlocks.

If we substitute our own locking routines, we can optionally enable deadlock 
detection. See [https://github.com/sasha-s/go-deadlock] for a possible solution.
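
To illustrate the idea of substituting our own locking routines, here is a stdlib-only sketch. It is not how go-deadlock is implemented and not a proposed patch: the {{Mutex}} wrapper, the {{detectionEnabled}} and {{lockTimeout}} knobs, and the {{reportSuspect}} hook are all hypothetical names. It only reports when a lock has been waited on for longer than a timeout.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical knobs for this sketch; set them before any lock is used.
var (
	detectionEnabled = false
	lockTimeout      = 100 * time.Millisecond
	// reportSuspect is called at most once per Lock() call that waits too long.
	reportSuspect = func(waited time.Duration) {
		fmt.Printf("possible deadlock: waited %v for a lock\n", waited)
	}
)

// Mutex is a drop-in wrapper: call sites keep using Lock and Unlock.
type Mutex struct {
	mu sync.Mutex
}

// Lock behaves like sync.Mutex.Lock when detection is off. When detection is
// on, it polls with TryLock and reports once if the wait exceeds lockTimeout.
func (m *Mutex) Lock() {
	if !detectionEnabled {
		m.mu.Lock()
		return
	}
	start := time.Now()
	reported := false
	for !m.mu.TryLock() {
		if waited := time.Since(start); !reported && waited > lockTimeout {
			reportSuspect(waited)
			reported = true
		}
		time.Sleep(time.Millisecond)
	}
}

// Unlock releases the underlying mutex.
func (m *Mutex) Unlock() { m.mu.Unlock() }

// runDemo holds a lock longer than the timeout and returns how many
// suspected deadlocks were reported while a second Lock() call waited.
func runDemo() int {
	detectionEnabled = true
	lockTimeout = 20 * time.Millisecond
	reports := 0
	reportSuspect = func(waited time.Duration) { reports++ }

	var m Mutex
	m.Lock()
	go func() {
		time.Sleep(200 * time.Millisecond) // hold the lock past lockTimeout
		m.Unlock()
	}()
	m.Lock() // waits, reports once, then succeeds when the lock is freed
	m.Unlock()
	return reports
}

func main() {
	fmt.Println("reports:", runDemo())
}
```

Polling with TryLock is used here only to keep the sketch short; a real detector would more likely track lock ownership and ordering, and the disabled path stays a plain {{sync.Mutex}} call so the cost is limited to one boolean check.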






[jira] [Resolved] (YUNIKORN-2525) dispatcher.Stop() waits an extra second unnecessarily

2024-04-04 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2525.

Fix Version/s: 1.6.0
   Resolution: Fixed

> dispatcher.Stop() waits an extra second unnecessarily
> -
>
> Key: YUNIKORN-2525
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2525
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> {{dispatcher.Stop()}} sometimes takes an extra second to shut down 
> properly. This slows down unit tests. On my machine, {{context_test.go}} runs 
> for 19-20 seconds. With some improvements, this can be reduced to 1 second.
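
The notification does not say where the extra second comes from or how it was fixed. As a generic illustration only (a hypothetical {{dispatcher}} type, not the k8shim one), a stop path avoids fixed waits when the event loop selects on a stop channel and {{Stop()}} blocks on a done channel instead of sleeping or polling.

```go
package main

import (
	"fmt"
	"time"
)

// dispatcher is a hypothetical stand-in, not the k8shim dispatcher.
type dispatcher struct {
	events  chan string
	stop    chan struct{}
	done    chan struct{}
	handled int // only touched by the run goroutine
}

func newDispatcher() *dispatcher {
	d := &dispatcher{
		events: make(chan string, 16),
		stop:   make(chan struct{}),
		done:   make(chan struct{}),
	}
	go d.run()
	return d
}

// run handles events until the stop channel is closed.
func (d *dispatcher) run() {
	defer close(d.done)
	for {
		select {
		case <-d.events:
			d.handled++
		case <-d.stop:
			return
		}
	}
}

// Stop signals the loop and returns as soon as the loop has exited.
func (d *dispatcher) Stop() {
	close(d.stop)
	<-d.done
}

// stopLatencyMillis measures how long Stop() takes, in milliseconds.
func stopLatencyMillis() int64 {
	d := newDispatcher()
	d.events <- "event"
	start := time.Now()
	d.Stop()
	return time.Since(start).Milliseconds()
}

func main() {
	fmt.Println("Stop() took", stopLatencyMillis(), "ms")
}
```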






[jira] [Closed] (YUNIKORN-2534) [Yunikorn] Quota enforcement checks are failing when we have max-application set to 0

2024-04-04 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit closed YUNIKORN-2534.
--

> [Yunikorn] Quota enforcement checks are failing when we have max-application 
> set to 0
> -
>
> Key: YUNIKORN-2534
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2534
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Rajesh Kanhaiya Lal
>Priority: Major
> Attachments: yunikorn-configs-fresh.yaml
>
>
> The max-application checks do not work when max-application is set to 0 
> in the yunikorn-config file.
> Config validation is also skipped when max-application is set to 0. For 
> example, the check that a child queue's max-application is less than or 
> equal to the parent queue's does not run.
> The YuniKorn config file is attached.
> The user and group tracking API also does not include max-application in the 
> response.
>  
> {code:java}
> curl --location 'http://127.0.0.1:9080/ws/v1/partition/default/usage/users'
> [
>     {
>         "userName": "nobody",
>         "groups": {
>             "ts333w3": "*",
>             "ts433": "*",
>             "ts544": "*",
>             "ts633": "*"
>         },
>         "queues": {
>             "queuePath": "root",
>             "resourceUsage": {
>                 "Resources": {
>                     "memory": 3,
>                     "pods": 3,
>                     "vcore": 300
>                 }
>             },
>             "runningApplications": [
>                 "ts333w3",
>                 "ts433",
>                 "ts544"
>             ],
>             "children": [
>                 {
>                     "queuePath": "root.default",
>                     "resourceUsage": {
>                         "Resources": {
>                             "memory": 3,
>                             "pods": 3,
>                             "vcore": 300
>                         }
>                     },
>                     "runningApplications": [
>                         "ts333w3",
>                         "ts433",
>                         "ts544"
>                     ]
>                 }
>             ]
>         }
>     }
> ] {code}
> Could you please take a look?
>  






[jira] [Resolved] (YUNIKORN-2534) [Yunikorn] Quota enforcement checks are failing when we have max-application set to 0

2024-04-04 Thread Craig Condit (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Condit resolved YUNIKORN-2534.

  Assignee: (was: Manikandan R)
Resolution: Not A Bug

This is not a bug. A value of zero is indistinguishable from unset, and we 
explicitly treat it the same.
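
The point about zero and unset can be sketched in Go, assuming the limit is stored in a plain unsigned integer field. This thread does not show the actual config struct, so the type names below are hypothetical, and JSON is used instead of YAML only to keep the example stdlib-only. The pointer variant is shown purely for contrast, not as a suggested change.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// limitByValue mimics a config struct with a plain integer limit.
type limitByValue struct {
	MaxApplications uint64 `json:"maxapplications"`
}

// limitByPointer is the contrast case: nil means "not present in the input".
type limitByPointer struct {
	MaxApplications *uint64 `json:"maxapplications"`
}

// parseValue returns the limit as a plain integer.
func parseValue(doc string) uint64 {
	var l limitByValue
	if err := json.Unmarshal([]byte(doc), &l); err != nil {
		panic(err)
	}
	return l.MaxApplications
}

// isSet reports whether the key was present, using the pointer variant.
func isSet(doc string) bool {
	var l limitByPointer
	if err := json.Unmarshal([]byte(doc), &l); err != nil {
		panic(err)
	}
	return l.MaxApplications != nil
}

func main() {
	unset := `{}`
	zero := `{"maxapplications": 0}`
	// With a plain integer, both inputs decode to the same value.
	fmt.Println(parseValue(unset) == parseValue(zero)) // true
	// Only the pointer variant can tell them apart.
	fmt.Println(isSet(unset), isSet(zero)) // false true
}
```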







[jira] [Resolved] (YUNIKORN-2520) PVC errors in AssumePod() are not handled properly

2024-04-04 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YUNIKORN-2520.
-
Fix Version/s: 1.6.0
   Resolution: Fixed

Changes merged to master

Volume issues should be handled correctly now.

> PVC errors in AssumePod() are not handled properly
> --
>
> Key: YUNIKORN-2520
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2520
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> When there is an error caused by a volume operation in 
> {{Context.AssumePod()}}, the allocation on the core side will not be removed.
> Although we check the result from {{UpdateAllocation}}, the error handling is 
> just logging:
> {noformat}
> if err := callback.UpdateAllocation(response); err != nil {
>     rmp.handleUpdateResponseError(rmID, err)
> }
> ...
> func (rmp *RMProxy) handleUpdateResponseError(rmID string, err error) {
>     log.Log(log.RMProxy).Error("failed to handle response",
>         zap.String("rmID", rmID),
>         zap.Error(err))
> }{noformat}
> I suggest moving volume-related code to {{Task.postTaskAllocated()}}. In 
> this case, the task will transition to the "Failed" state and we'll have 
> allocationID available, so we can release both the ask and the allocation:
> {noformat}
> func (task *Task) releaseAllocation() {
>     ...
>     var releaseRequest *si.AllocationRequest
>     s := TaskStates()
>     switch task.GetTaskState() {
>     case s.New, s.Pending, s.Scheduling, s.Rejected:
>         // release ask + allocation if possible
>         releaseRequest = common.CreateReleaseAskRequestForTask(
>             task.applicationID, task.taskID, task.application.partition)
>     default:
>         if task.allocationID == "" {
>             ... log error ...
>             return
>         }
>         releaseRequest = common.CreateReleaseAllocationRequestForTask(
>             task.applicationID, task.taskID, task.allocationID,
>             task.application.partition, task.terminationType)
>     }
> ...{noformat}
>  






[jira] [Created] (YUNIKORN-2538) Shim cache context pre-allocate slice

2024-04-04 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2538:
---

 Summary: Shim cache context pre-allocate slice
 Key: YUNIKORN-2538
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2538
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


When building the reason string from all volume failure reasons, we should 
allocate the slice once, sized from the reasons object that is returned.

See [review comment|https://github.com/apache/yunikorn-k8shim/pull/810#discussion_r1550882867]
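
A minimal sketch of the pre-allocation pattern, assuming the reasons arrive as a slice of a string-like type. The {{conflictReason}} type and {{joinReasons}} function are hypothetical names, not the shim's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// conflictReason is a hypothetical stand-in for the reason type returned
// by the volume binding check.
type conflictReason string

// joinReasons builds one reason string from all failure reasons.
func joinReasons(reasons []conflictReason) string {
	// Allocate once: length 0, capacity equal to the number of reasons,
	// so append never has to grow the backing array.
	parts := make([]string, 0, len(reasons))
	for _, r := range reasons {
		parts = append(parts, string(r))
	}
	return strings.Join(parts, ", ")
}

func main() {
	fmt.Println(joinReasons([]conflictReason{"reason one", "reason two"}))
}
```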






[jira] [Created] (YUNIKORN-2537) cleanup UpdateAllocation in callback

2024-04-04 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2537:
---

 Summary: cleanup UpdateAllocation in callback
 Key: YUNIKORN-2537
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2537
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


UpdateAllocation needs a cleanup: {{getTask()}} already checks for the 
application, so there is no need to retrieve the application when we process 
response.New. Sending an event should depend on the existence of the task, not 
of the application.

On top of that, the appID is already in the task, so we do not need to get 
it from the app.

The same logic needs to be applied to the whole function; we already do it for 
the release.* handling.






[jira] [Created] (YUNIKORN-2536) Create a design doc

2024-04-04 Thread Rainie Li (Jira)
Rainie Li created YUNIKORN-2536:
---

 Summary: Create a design doc
 Key: YUNIKORN-2536
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2536
 Project: Apache YuniKorn
  Issue Type: Sub-task
Reporter: Rainie Li
Assignee: Rainie Li









[jira] [Created] (YUNIKORN-2535) [Umbrella] YuniKorn: Dynamically Adjust Queue to Ensure App Service Level Objectives (SLO)

2024-04-04 Thread Rainie Li (Jira)
Rainie Li created YUNIKORN-2535:
---

 Summary: [Umbrella] YuniKorn: Dynamically Adjust Queue to Ensure 
App Service Level Objectives (SLO)
 Key: YUNIKORN-2535
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2535
 Project: Apache YuniKorn
  Issue Type: New Feature
  Components: core - common, core - scheduler, shim - kubernetes
Reporter: Rainie Li
Assignee: Rainie Li


We want to guarantee SLOs for critical apps when a cluster has limited resources.
YuniKorn can track the actual status of applications and dynamically adjust 
queues (resources, priority, etc.) based on the SLOs of those applications.

Here is the [initial proposal Feature 2|https://docs.google.com/document/d/1c8tzmEgl32o6_0eDxQ1ZiMRcD-1QRvdu-3PABNZQX30/edit].

We had some discussions with the community during the meetup on 04/03/24.

Next step: we will create a more detailed design doc and review it with the 
community.






[jira] [Created] (YUNIKORN-2534) [Yunikorn] Quota enforcement checks are failing when we have max-application set to 0

2024-04-04 Thread Rajesh Kanhaiya Lal (Jira)
Rajesh Kanhaiya Lal created YUNIKORN-2534:
-

 Summary: [Yunikorn] Quota enforcement checks are failing when we 
have max-application set to 0
 Key: YUNIKORN-2534
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2534
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: shim - kubernetes
Reporter: Rajesh Kanhaiya Lal
Assignee: Manikandan R
 Attachments: yunikorn-configs-fresh.yaml

The max-application checks do not work when max-application is set to 0 in 
the yunikorn-config file.

Config validation is also skipped when max-application is set to 0. For 
example, the check that a child queue's max-application is less than or equal 
to the parent queue's does not run.

The YuniKorn config file is attached.

The user and group tracking API also does not include max-application in the 
response.

 
{code:java}
curl --location 'http://127.0.0.1:9080/ws/v1/partition/default/usage/users'

[
    {
        "userName": "nobody",
        "groups": {
            "ts333w3": "*",
            "ts433": "*",
            "ts544": "*",
            "ts633": "*"
        },
        "queues": {
            "queuePath": "root",
            "resourceUsage": {
                "Resources": {
                    "memory": 3,
                    "pods": 3,
                    "vcore": 300
                }
            },
            "runningApplications": [
                "ts333w3",
                "ts433",
                "ts544"
            ],
            "children": [
                {
                    "queuePath": "root.default",
                    "resourceUsage": {
                        "Resources": {
                            "memory": 3,
                            "pods": 3,
                            "vcore": 300
                        }
                    },
                    "runningApplications": [
                        "ts333w3",
                        "ts433",
                        "ts544"
                    ]
                }
            ]
        }
    }
] {code}
Could you please take a look?

 






[jira] [Created] (YUNIKORN-2533) Implement String() for TrackedResource

2024-04-04 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2533:
---

 Summary: Implement String() for TrackedResource
 Key: YUNIKORN-2533
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2533
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - common
Reporter: Wilfred Spiegelenburg


To fix the way TrackedResource objects are logged, the type should implement 
the String() function.
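
A sketch of what implementing {{fmt.Stringer}} could look like. The struct layout here (a nested map guarded by an embedded RWMutex) and the output format are assumptions made for illustration; the issue does not specify either.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// trackedResource is a hypothetical stand-in: usage per instance type,
// then per resource name.
type trackedResource struct {
	sync.RWMutex
	usage map[string]map[string]int64
}

// sortedKeys returns the keys of m in ascending order.
func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// String implements fmt.Stringer, so logging the value prints its content
// in a stable (sorted) order instead of a raw struct dump.
func (tr *trackedResource) String() string {
	if tr == nil {
		return "TrackedResource{}"
	}
	tr.RLock()
	defer tr.RUnlock()

	var sb strings.Builder
	sb.WriteString("TrackedResource{")
	for i, instance := range sortedKeys(tr.usage) {
		if i > 0 {
			sb.WriteString(", ")
		}
		sb.WriteString(instance + ": {")
		resources := tr.usage[instance]
		for j, name := range sortedKeys(resources) {
			if j > 0 {
				sb.WriteString(", ")
			}
			fmt.Fprintf(&sb, "%s: %d", name, resources[name])
		}
		sb.WriteString("}")
	}
	sb.WriteString("}")
	return sb.String()
}

func main() {
	tr := &trackedResource{usage: map[string]map[string]int64{
		"m5.large":  {"memory": 1024, "vcore": 2000},
		"c5.xlarge": {"pods": 3},
	}}
	fmt.Println(tr)
}
```

Because the method has a pointer receiver, the value must be logged as a pointer for {{String()}} to be picked up by {{fmt}} and by loggers that honor {{fmt.Stringer}}.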


