[jira] [Created] (YUNIKORN-2654) Remove unused code in k8shim context

2024-05-30 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2654:
---

 Summary: Remove unused code in k8shim context
 Key: YUNIKORN-2654
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2654
 Project: Apache YuniKorn
  Issue Type: Task
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg


The NotifyApplicationComplete and NotifyApplicationFail functions are not 
called by anything and are unused code.

The K8shim does not trigger the application completion or failure. This is 
triggered by the core when the application no longer has any activity 
registered.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Updated] (YUNIKORN-2653) Gang scheduling K8s event formatting compliance

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YUNIKORN-2653:
-
Labels: pull-request-available  (was: )

> Gang scheduling K8s event formatting compliance
> ---
>
> Key: YUNIKORN-2653
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2653
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>  Labels: pull-request-available
>
> The K8s events API defines rules for the content of the fields within an 
> event. Adjust the content of the gang scheduling related events to comply 
> with those rules.
> This change focuses on the reason and action fields only:
>  * 'reason' is the reason this event is generated. 'reason' should be short 
> and unique; it should be in UpperCamelCase format (starting with a capital 
> letter).
>  * 'action' explains what happened with the regarding object, or what action 
> the ReportingController took on its behalf; it should be in UpperCamelCase 
> format (starting with a capital letter).
> Neither field may contain spaces or long text.






[jira] [Created] (YUNIKORN-2653) Gang scheduling K8s event formatting compliance

2024-05-30 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YUNIKORN-2653:
---

 Summary: Gang scheduling K8s event formatting compliance
 Key: YUNIKORN-2653
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2653
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The K8s events API defines rules for the content of the fields within an 
event. Adjust the content of the gang scheduling related events to comply 
with those rules.
This change focuses on the reason and action fields only:
 * 'reason' is the reason this event is generated. 'reason' should be short 
and unique; it should be in UpperCamelCase format (starting with a capital 
letter).
 * 'action' explains what happened with the regarding object, or what action 
the ReportingController took on its behalf; it should be in UpperCamelCase 
format (starting with a capital letter).

Neither field may contain spaces or long text.
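
The formatting rule above can be checked mechanically. The following is a minimal sketch only, not code from the shim: the function name isUpperCamelCase and the sample values are made up for illustration. It tests the shape the rule asks for (leading capital, letters and digits only, no spaces) and does not try to judge whether a value such as an all-caps word is good camel case.

```go
package main

import (
	"fmt"
	"unicode"
)

// isUpperCamelCase reports whether s starts with an upper-case letter and
// contains only letters and digits, so no spaces or punctuation.
// Hypothetical helper, not part of the YuniKorn code base.
func isUpperCamelCase(s string) bool {
	if s == "" {
		return false
	}
	for i, r := range s {
		if i == 0 && !unicode.IsUpper(r) {
			return false
		}
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) {
			return false
		}
	}
	return true
}

func main() {
	// Sample values are illustrative only, not actual event reasons.
	for _, reason := range []string{"GangScheduling", "gangScheduling", "Gang scheduling"} {
		fmt.Printf("%q valid: %v\n", reason, isUpperCamelCase(reason))
	}
}
```

A check like this could sit in a unit test that walks over the reason and action constants, so a non-compliant value is caught before it reaches the cluster.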






[jira] [Commented] (YUNIKORN-182) fix lint issues

2024-05-30 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850896#comment-17850896
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-182:


File a new Jira for this one. It needs to be fixed in every HTTP server we 
create in our code; those are spread over multiple repositories and all need 
to be checked:
{code:java}
pkg/cmd/admissioncontroller/main.go:143:15: G112: Potential Slowloris Attack 
because ReadHeaderTimeout is not configured in the http.Server (gosec) {code}
This one should get an ignore from the lint side; we do not need 
crypto-quality randomness here:
{code:java}
test/e2e/framework/helpers/common/utils.go:105:18: G404: Use of weak random 
number generator (math/rand instead of crypto/rand) (gosec)
b[i] = letters[rand.Intn(len(letters))]{code}
All the ineffective assignments and shadowing remarks can and should be fixed.

Formatting issues can and should be fixed.

The function length ones are dubious, and we should probably just add the 
{{//nolint:funlen}} remark on them, especially since they are almost all test 
functions.
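
As an illustration of the two gosec remarks above, here is a small sketch. It is not the code from the admission controller or the e2e helpers: the function names, the 10 second timeout and the letter set are assumptions made for the example. It sets ReadHeaderTimeout on an http.Server (the fix G112 asks for) and suppresses G404 with a nolint directive where math/rand is good enough.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

const letters = "abcdefghijklmnopqrstuvwxyz"

// newServer builds an http.Server with ReadHeaderTimeout configured, which is
// what gosec G112 checks for. The timeout value is an assumption.
func newServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           handler,
		ReadHeaderTimeout: 10 * time.Second,
	}
}

// randomSuffix returns n random lower-case letters. Names generated for tests
// do not need crypto/rand, so the G404 finding is ignored on purpose.
func randomSuffix(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))] //nolint:gosec // not security sensitive
	}
	return string(b)
}

func main() {
	srv := newServer("localhost:0", http.NotFoundHandler())
	fmt.Println("ReadHeaderTimeout:", srv.ReadHeaderTimeout)
	fmt.Println("suffix length:", len(randomSuffix(8)))
}
```

Keeping the server construction in one helper per repository would make it easier to verify that every server gets the timeout.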

> fix lint issues
> ---
>
> Key: YUNIKORN-182
> URL: https://issues.apache.org/jira/browse/YUNIKORN-182
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: build
>Reporter: Wilfred Spiegelenburg
>Assignee: Yun Sun
>Priority: Minor
>  Labels: pull-request-available
>
> When we added the lint test most major issues were fixed. There are still a 
> lot of issues, especially in tests, that need to be fixed.
> This is a container Jira to track that work on both the k8shim and the core 
> repos.
> Work should be split into multiple parts (per linter?)






[jira] [Commented] (YUNIKORN-2581) Expose running placement rules in REST

2024-05-30 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850889#comment-17850889
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2581:
-

code change committed, working on documentation before closing

> Expose running placement rules in REST
> --
>
> Key: YUNIKORN-2581
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2581
> Project: Apache YuniKorn
>  Issue Type: New Feature
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
>
> Since placement rules are now always used and the recovery rule was 
> introduced, the queue config no longer correctly shows the running rules.
> Also, if a config update has been rejected, for any reason, the rules would 
> not be correct.
> Exposing the configured rules from the placement manager works around all 
> these issues.






(yunikorn-core) branch master updated: [YUNIKORN-2581] Expose running placement rules in REST (#857)

2024-05-30 Thread wilfreds
This is an automated email from the ASF dual-hosted git repository.

wilfreds pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git


The following commit(s) were added to refs/heads/master by this push:
 new 81d69749 [YUNIKORN-2581] Expose running placement rules in REST (#857)
81d69749 is described below

commit 81d697494d972d01c01135dd981e1ec2fc89f57d
Author: Wilfred Spiegelenburg 
AuthorDate: Fri May 31 11:26:57 2024 +1000

[YUNIKORN-2581] Expose running placement rules in REST (#857)

Add a new REST endpoint under the partition that returns the currently
active placement rules:
/ws/v1/partition/:partition/placementrules
The call retrieves the active set from the placement manager. This will
include the recovery rule. The rule interface is extended with a new
private method ruleDAO() that must be implemented by all rules and
returns the RuleDAO object.

Add the currently active set of placement rules to the state dump. This
handles multiple partitions. It wraps the list of RuleDAO objects in a
RuleDAOInfo per partition.
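
A client call against the new endpoint could look like the sketch below. Only the path /ws/v1/partition/:partition/placementrules comes from the commit message. The function names, the assumption that the response is a JSON array of rule objects, and the stub payload are illustrative; the stub server only stands in for a running scheduler so the example works offline.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// fetchPlacementRules queries the placement rules of one partition and decodes
// the response generically, since the exact RuleDAO fields are not shown here.
func fetchPlacementRules(baseURL, partition string) ([]map[string]any, error) {
	endpoint := baseURL + "/ws/v1/partition/" + url.PathEscape(partition) + "/placementrules"
	resp, err := http.Get(endpoint)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var rules []map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&rules); err != nil {
		return nil, err
	}
	return rules, nil
}

// newStubServer stands in for a scheduler and serves a fixed payload for one
// partition. Hypothetical helper used only to keep the sketch self-contained.
func newStubServer(partition, payload string) *httptest.Server {
	path := "/ws/v1/partition/" + partition + "/placementrules"
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != path {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, payload)
	}))
}

func main() {
	// The payload is a made-up placeholder, not real scheduler output.
	stub := newStubServer("default", `[{"name":"placeholder-rule"}]`)
	defer stub.Close()

	rules, err := fetchPlacementRules(stub.URL, "default")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("rules returned:", len(rules))
}
```

Against a real deployment the base URL would be the address of the scheduler REST service instead of the stub.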

Closes: #857

Signed-off-by: Wilfred Spiegelenburg 
---
 go.mod|  1 +
 go.sum|  2 +
 pkg/scheduler/partition.go|  9 ++-
 pkg/scheduler/placement/filter.go | 47 -
 pkg/scheduler/placement/filter_test.go| 65 +-
 pkg/scheduler/placement/fixed_rule.go | 19 +
 pkg/scheduler/placement/fixed_rule_test.go| 38 ++
 pkg/scheduler/placement/placement.go  | 23 +--
 pkg/scheduler/placement/placement_test.go | 51 --
 pkg/scheduler/placement/provided_rule.go  | 17 +
 pkg/scheduler/placement/provided_rule_test.go | 33 +
 pkg/scheduler/placement/recovery_rule.go  | 10 +++
 pkg/scheduler/placement/recovery_rule_test.go |  9 +++
 pkg/scheduler/placement/rule.go   | 44 
 pkg/scheduler/placement/rule_test.go  |  9 +++
 pkg/scheduler/placement/tag_rule.go   | 18 +
 pkg/scheduler/placement/tag_rule_test.go  | 38 ++
 pkg/scheduler/placement/testrule.go   | 17 +
 pkg/scheduler/placement/user_rule.go  | 17 +
 pkg/scheduler/placement/user_rule_test.go | 33 +
 pkg/webservice/dao/rule_info.go   | 39 +++
 pkg/webservice/handlers.go| 32 +
 pkg/webservice/handlers_test.go   | 99 ---
 pkg/webservice/routes.go  |  6 ++
 pkg/webservice/state_dump.go  |  2 +
 25 files changed, 623 insertions(+), 55 deletions(-)

diff --git a/go.mod b/go.mod
index 65b2c5a8..5f180926 100644
--- a/go.mod
+++ b/go.mod
@@ -33,6 +33,7 @@ require (
github.com/prometheus/common v0.45.0
github.com/sasha-s/go-deadlock v0.3.1
go.uber.org/zap v1.26.0
+   golang.org/x/exp v0.0.0-20240409090435-93d18d7e34b8
golang.org/x/net v0.21.0
golang.org/x/time v0.5.0
google.golang.org/grpc v1.58.3
diff --git a/go.sum b/go.sum
index 35ee67e4..f04fef47 100644
--- a/go.sum
+++ b/go.sum
@@ -50,6 +50,8 @@ go.uber.org/multierr v1.10.0 h1:S0h4aNzvfcFsC3dRF1jLoaov7oRaKqRGC/pUEJ2yvPQ=
 go.uber.org/multierr v1.10.0/go.mod h1:20+QtiLqy0Nd6FdQB9TLXag12DsQkrbs3htMFfDN80Y=
 go.uber.org/zap v1.26.0 h1:sI7k6L95XOKS281NhVKOFCUNIvv9e0w4BF8N3u+tCRo=
 go.uber.org/zap v1.26.0/go.mod h1:dtElttAiwGvoJ/vj4IwHBS/gXsEu/pZ50mUIRWuG0so=
+golang.org/x/exp v0.0.0-20240409090435-93d18d7e34b8 h1:ESSUROHIBHg7USnszlcdmjBEwdMj9VUvU+OPk4yl2mc=
+golang.org/x/exp v0.0.0-20240409090435-93d18d7e34b8/go.mod h1:/lliqkxwWAhPjf5oSOIJup2XcqJaw8RGS6k3TGEc7GI=
 golang.org/x/net v0.23.0 h1:7EYJ93RZ9vYSZAIb2x3lnuvqO5zneoD6IvWjuhfxjTs=
 golang.org/x/net v0.23.0/go.mod h1:JKghWKKOSdJwpW2GEx0Ja7fmaKnMsbu+MWVZTokSYmg=
 golang.org/x/sys v0.18.0 h1:DBdB3niSjOA/O0blCZBqDefyWNYveAYMNF1Wum0DYQ4=
diff --git a/pkg/scheduler/partition.go b/pkg/scheduler/partition.go
index 5662bd75..fc2c5404 100644
--- a/pkg/scheduler/partition.go
+++ b/pkg/scheduler/partition.go
@@ -481,14 +481,19 @@ func (pc *PartitionContext) getQueueInternal(name string) *objects.Queue {
return queue
 }
 
-// Get the queue info for the whole queue structure to pass to the webservice
+// GetPartitionQueues builds the queue info for the whole queue structure to pass to the webservice
 func (pc *PartitionContext) GetPartitionQueues() dao.PartitionQueueDAOInfo {
partitionQueueDAOInfo := pc.root.GetPartitionQueueDAOInfo(true)
partitionQueueDAOInfo.Partition = common.GetPartitionNameWithoutClusterID(pc.Name)
return partitionQueueDAOInfo
 }
 
-// Create the recovery queue.
+// GetPlacementRules returns the current active rule set as dao to expose to the webservice
+func (pc *PartitionContext) GetPlacementRules() []*dao.RuleDAO {

[jira] [Resolved] (YUNIKORN-2567) Remove Application reference from applicationEvents

2024-05-30 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2567.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Remove Application reference from applicationEvents
> ---
>
> Key: YUNIKORN-2567
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2567
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







(yunikorn-core) branch master updated: [YUNIKORN-2567] Remove Application reference from applicationEvents (#881)

2024-05-30 Thread pbacsko
This is an automated email from the ASF dual-hosted git repository.

pbacsko pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git


The following commit(s) were added to refs/heads/master by this push:
 new 3917da66 [YUNIKORN-2567] Remove Application reference from 
applicationEvents (#881)
3917da66 is described below

commit 3917da665fc023a86f0b66e234f972cc887dba3b
Author: Peter Bacsko 
AuthorDate: Thu May 30 21:02:19 2024 +0200

[YUNIKORN-2567] Remove Application reference from applicationEvents (#881)

Closes: #881

Signed-off-by: Peter Bacsko 
---
 pkg/scheduler/objects/application.go |  12 +-
 pkg/scheduler/objects/application_events.go  |  47 
 pkg/scheduler/objects/application_events_test.go | 138 +++
 pkg/scheduler/objects/application_state.go   |   4 +-
 pkg/scheduler/objects/application_test.go|   4 +-
 pkg/scheduler/objects/queue.go   |   2 +-
 6 files changed, 77 insertions(+), 130 deletions(-)

diff --git a/pkg/scheduler/objects/application.go b/pkg/scheduler/objects/application.go
index b76466be..80769855 100644
--- a/pkg/scheduler/objects/application.go
+++ b/pkg/scheduler/objects/application.go
@@ -187,8 +187,8 @@ func NewApplication(siApp *si.AddApplicationRequest, ugi security.UserGroup, eve
app.user = ugi
app.rmEventHandler = eventHandler
app.rmID = rmID
-   app.appEvents = newApplicationEvents(app, events.GetEventSystem())
-   app.appEvents.sendNewApplicationEvent()
+   app.appEvents = newApplicationEvents(events.GetEventSystem())
+   app.appEvents.sendNewApplicationEvent(app.ApplicationID)
return app
 }
 
@@ -2106,12 +2106,12 @@ func (sa *Application) updateRunnableStatus(runnableInQueue, runnableByUserLimit
log.Log(log.SchedApplication).Info("Application is now runnable in queue",
zap.String("appID", sa.ApplicationID),
zap.String("queue", sa.queuePath))
-   sa.appEvents.sendAppRunnableInQueueEvent()
+   sa.appEvents.sendAppRunnableInQueueEvent(sa.ApplicationID)
} else {
log.Log(log.SchedApplication).Info("Maximum number of running applications reached the queue limit",
zap.String("appID", sa.ApplicationID),
zap.String("queue", sa.queuePath))
-   sa.appEvents.sendAppNotRunnableInQueueEvent()
+   sa.appEvents.sendAppNotRunnableInQueueEvent(sa.ApplicationID)
}
}
sa.runnableInQueue = runnableInQueue
@@ -2123,14 +2123,14 @@ func (sa *Application) updateRunnableStatus(runnableInQueue, runnableByUserLimit
zap.String("queue", sa.queuePath),
zap.String("user", sa.user.User),
zap.Strings("groups", sa.user.Groups))
-   sa.appEvents.sendAppRunnableQuotaEvent()
+   sa.appEvents.sendAppRunnableQuotaEvent(sa.ApplicationID)
} else {
log.Log(log.SchedApplication).Info("Maximum number of running applications reached the user/group limit",
zap.String("appID", sa.ApplicationID),
zap.String("queue", sa.queuePath),
zap.String("user", sa.user.User),
zap.Strings("groups", sa.user.Groups))
-   sa.appEvents.sendAppNotRunnableQuotaEvent()
+   sa.appEvents.sendAppNotRunnableQuotaEvent(sa.ApplicationID)
}
}
sa.runnableByUserLimit = runnableByUserLimit
diff --git a/pkg/scheduler/objects/application_events.go b/pkg/scheduler/objects/application_events.go
index bb94ffbf..04fe51a0 100644
--- a/pkg/scheduler/objects/application_events.go
+++ b/pkg/scheduler/objects/application_events.go
@@ -20,7 +20,6 @@ package objects
 
 import (
"fmt"
-
"github.com/apache/yunikorn-core/pkg/common"
"github.com/apache/yunikorn-core/pkg/events"
"github.com/apache/yunikorn-scheduler-interface/lib/go/si"
@@ -28,15 +27,14 @@ import (
 
 type applicationEvents struct {
eventSystem events.EventSystem
-   app *Application
 }
 
 func (evt *applicationEvents) sendPlaceholderLargerEvent(ph *Allocation, request *AllocationAsk) {
if !evt.eventSystem.IsEventTrackingEnabled() {
return
}
-   message := fmt.Sprintf("Task group '%s' in application '%s': allocation resources '%s' are not matching placeholder '%s' allocation with ID '%s'", ph.GetTaskGroup(), evt.app.ApplicationID, request.GetAllocatedResource(), ph.GetAllocatedResource(), ph.GetAllocationKey())
-   event := 

[jira] [Created] (YUNIKORN-2652) Expand getApplication() endpoint handler to optionally return resource usage

2024-05-30 Thread Rich Scott (Jira)
Rich Scott created YUNIKORN-2652:


 Summary: Expand getApplication() endpoint handler to optionally 
return resource usage
 Key: YUNIKORN-2652
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2652
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: scheduler-interface
Reporter: Rich Scott


Some users would like to be able to see resource usage (preempted, placeholder 
resource, etc) for applications that have been completed. The 
`getApplication()` endpoint handler should be enhanced to take an optional 
parameter specifying that the user would like details about resources included 
in the response, and a new `ApplicationXXXDAOInfo` object that is a slight 
superset of `ApplicationDAOInfo` should be introduced, and can be used in the 
response.
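
One possible shape for such an optional parameter is sketched below. Everything specific in it is an assumption for illustration: the query parameter name details, the request path, the type names AppInfo and AppDetailInfo (standing in for ApplicationDAOInfo and the proposed superset object) and the placeholder values. It only shows the idea of returning the superset object when the caller asks for it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// AppInfo is a hypothetical stand-in for the existing application object.
type AppInfo struct {
	ApplicationID string `json:"applicationID"`
}

// AppDetailInfo is a hypothetical superset that adds resource usage.
type AppDetailInfo struct {
	AppInfo
	ResourceUsage map[string]int64 `json:"resourceUsage"`
}

// getApplication returns the basic object by default and the superset object
// when the (assumed) query parameter details=true is present.
func getApplication(w http.ResponseWriter, r *http.Request) {
	base := AppInfo{ApplicationID: "app-1"} // placeholder lookup result
	var body any = base
	if r.URL.Query().Get("details") == "true" {
		// Placeholder usage values, not real scheduler data.
		body = AppDetailInfo{AppInfo: base, ResourceUsage: map[string]int64{"vcore": 0}}
	}
	w.Header().Set("Content-Type", "application/json")
	if err := json.NewEncoder(w).Encode(body); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

// call runs the handler in-process and returns the response body.
func call(target string) string {
	rec := httptest.NewRecorder()
	getApplication(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Body.String()
}

func main() {
	fmt.Print(call("/app"))
	fmt.Print(call("/app?details=true"))
}
```

Making the parameter optional keeps existing clients unaffected, since the default response stays unchanged.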






[jira] [Updated] (YUNIKORN-2567) Remove Application reference from applicationEvents

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YUNIKORN-2567:
-
Labels: pull-request-available  (was: )

> Remove Application reference from applicationEvents
> ---
>
> Key: YUNIKORN-2567
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2567
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (YUNIKORN-2643) utils.go WaitFor improvement

2024-05-30 Thread HUAN-IU LIOU (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HUAN-IU LIOU updated YUNIKORN-2643:
---
Summary: utils.go WaitFor improvement   (was: utils.go WaitForCondition 
test coverage improvement )

> utils.go WaitFor improvement 
> -
>
> Key: YUNIKORN-2643
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2643
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: core - common
>Reporter: HUAN-IU LIOU
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Updated] (YUNIKORN-2644) Improve FitInScore function's test coverage in resources.go

2024-05-30 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2644:
---
Fix Version/s: 1.6.0

> Improve FitInScore function's test coverage in resources.go
> --
>
> Key: YUNIKORN-2644
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2644
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: core - common
>Reporter: JunHong Peng
>Assignee: JunHong Peng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-30 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2629:
---
Fix Version/s: 1.5.2

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt
>
>
> Adding a new node after YuniKorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
>   if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the 
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.
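
The locking pattern described above can be reproduced in a few lines. This is a generic sketch and not the fix that was merged: shimContext, getHandler, dispatch and addNode are simplified, hypothetical stand-ins for the shim types, and the 200 ms timeout exists only so the demonstration terminates instead of hanging. With holdLockWhileWaiting set, the dispatcher blocks on the read lock and the wait can never complete; releasing the write lock before waiting lets the event through.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// shimContext is a simplified stand-in: handlers are guarded by one RWMutex.
type shimContext struct {
	lock     sync.RWMutex
	handlers map[string]func()
}

// getHandler needs the read lock, like the handler lookup in the event loop.
func (c *shimContext) getHandler(name string) func() {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.handlers[name]
}

// dispatch mimics the event loop: each event looks up its handler first.
func (c *shimContext) dispatch(events <-chan string) {
	for name := range events {
		if h := c.getHandler(name); h != nil {
			h()
		}
	}
}

// addNode registers a handler, sends one event and waits for it to be handled.
// It reports whether the event was handled before the timeout.
func (c *shimContext) addNode(events chan<- string, holdLockWhileWaiting bool) bool {
	var wg sync.WaitGroup
	wg.Add(1)

	c.lock.Lock()
	c.handlers["node"] = wg.Done
	if !holdLockWhileWaiting {
		c.lock.Unlock() // release before blocking on the event
	}
	events <- "node"

	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()

	handled := false
	select {
	case <-done:
		handled = true
	case <-time.After(200 * time.Millisecond):
		// the dispatcher is stuck behind the write lock
	}
	if holdLockWhileWaiting {
		c.lock.Unlock() // only here so the demo can finish
	}
	<-done
	return handled
}

func main() {
	ctx := &shimContext{handlers: map[string]func(){}}
	events := make(chan string)
	go ctx.dispatch(events)

	fmt.Println("lock held while waiting, handled in time:", ctx.addNode(events, true))
	fmt.Println("lock released before waiting, handled in time:", ctx.addNode(events, false))
}
```

In the real code there is no timeout, so the first variant simply never returns, which matches the stuck wg.Wait() shown in the trace.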






[jira] [Comment Edited] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-30 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850682#comment-17850682
 ] 

Peter Bacsko edited comment on YUNIKORN-2629 at 5/30/24 11:04 AM:
--

Fix has been merged to branch-1.5.

For master, the change is likely going to be different.


was (Author: pbacsko):
FIx has been merged to branch-1.5.

The fix for master is likely going to be different.

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt
>
>






[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-30 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850682#comment-17850682
 ] 

Peter Bacsko commented on YUNIKORN-2629:


Fix has been merged to branch-1.5.

The fix for master is likely going to be different.

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt
>
>






[jira] [Updated] (YUNIKORN-2651) Update the unchecked error for make lint warnings

2024-05-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YUNIKORN-2651:
-
Labels: pull-request-available  (was: )

> Update the unchecked error for make lint warnings
> -
>
> Key: YUNIKORN-2651
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2651
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Chia-Ping Tsai
>Assignee: Yun Sun
>Priority: Major
>  Labels: pull-request-available
>
> Fix the lint warnings about "unhandled error".






[jira] [Resolved] (YUNIKORN-2642) Don't set resources on the recovery queue

2024-05-30 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2642.

Resolution: Fixed

> Don't set resources on the recovery queue
> -
>
> Key: YUNIKORN-2642
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2642
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.2
>
>
> Resource constraints can be set on dynamic queues based on application 
> tags. We should not set them on the recovery queue, because it has no 
> quota.






[jira] [Updated] (YUNIKORN-2642) Don't set resources on the recovery queue

2024-05-30 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2642:
---
 Fix Version/s: 1.6.0
1.5.2
Target Version: 1.6.0, 1.5.2  (was: 1.6.0)

Merged to master & branch-1.5.

> Don't set resources on the recovery queue
> -
>
> Key: YUNIKORN-2642
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2642
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.2
>
>
> Resource constraints can be set on dynamic queues based on application 
> tags. We should not set them on the recovery queue, because it has no 
> quota.






(yunikorn-core) branch branch-1.5 updated: [YUNIKORN-2642] Don't set resources on the recovery queue (#879)

2024-05-30 Thread pbacsko
This is an automated email from the ASF dual-hosted git repository.

pbacsko pushed a commit to branch branch-1.5
in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git


The following commit(s) were added to refs/heads/branch-1.5 by this push:
 new bb07ef0d [YUNIKORN-2642] Don't set resources on the recovery queue 
(#879)
bb07ef0d is described below

commit bb07ef0d22b4067539ac4d2db8a15f1bd031b40c
Author: pbacsko 
AuthorDate: Thu May 30 09:46:23 2024 +0200

[YUNIKORN-2642] Don't set resources on the recovery queue (#879)
---
 pkg/scheduler/objects/queue.go  |  5 +
 pkg/scheduler/partition_test.go | 16 
 2 files changed, 21 insertions(+)

diff --git a/pkg/scheduler/objects/queue.go b/pkg/scheduler/objects/queue.go
index 84d9d7b5..a5edece5 100644
--- a/pkg/scheduler/objects/queue.go
+++ b/pkg/scheduler/objects/queue.go
@@ -729,6 +729,11 @@ func (sq *Queue) AddApplication(app *Application) {
 	appID := app.ApplicationID
 	sq.applications[appID] = app
 	sq.queueEvents.sendNewApplicationEvent(sq.QueuePath, appID)
+	if common.IsRecoveryQueue(sq.QueuePath) {
+		// don't set tag-based resources on the recovery queue
+		return
+	}
+
 	// YUNIKORN-199: update the quota from the namespace
 	// get the tag with the quota
 	quota := app.GetTag(siCommon.AppTagNamespaceResourceQuota)
diff --git a/pkg/scheduler/partition_test.go b/pkg/scheduler/partition_test.go
index 5a275c0e..bd1b5900 100644
--- a/pkg/scheduler/partition_test.go
+++ b/pkg/scheduler/partition_test.go
@@ -979,6 +979,22 @@ func TestAddAppForced(t *testing.T) {
 	assert.Equal(t, common.RecoveryQueueFull, partApp3.GetQueuePath(), "wrong queue path for app3")
 	assert.Check(t, recoveryQueue == partApp3.GetQueue(), "wrong queue for app3")
 	assert.Equal(t, 3, len(recoveryQueue.GetCopyOfApps()), "wrong queue length")
+
+	// add recovered forced apps with resource tags
+	app4 := newApplicationTags("app-4", "default", common.RecoveryQueueFull, map[string]string{
+		siCommon.AppTagCreateForce:                 "true",
+		siCommon.AppTagNamespaceResourceGuaranteed: "{\"resources\":{\"vcore\":{\"value\":111}}}"})
+	err = partition.AddApplication(app4)
+	assert.NilError(t, err, "app4 could not be added")
+	assert.Assert(t, recoveryQueue.GetGuaranteedResource() == nil, "guaranteed resource should be unset")
+	assert.Assert(t, recoveryQueue.GetMaxResource() == nil, "max resource should be unset")
+	app5 := newApplicationTags("app-5", "default", common.RecoveryQueueFull, map[string]string{
+		siCommon.AppTagCreateForce:            "true",
+		siCommon.AppTagNamespaceResourceQuota: "{\"resources\":{\"vcore\":{\"value\":111}}}"})
+	err = partition.AddApplication(app5)
+	assert.NilError(t, err, "app5 could not be added")
+	assert.Assert(t, recoveryQueue.GetGuaranteedResource() == nil, "guaranteed resource should be unset")
+	assert.Assert(t, recoveryQueue.GetMaxResource() == nil, "max resource should be unset")
 }
 
 func TestAddAppForcedWithPlacement(t *testing.T) {
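
The test above passes the quota and guaranteed tags as a JSON string of the form `{"resources":{"vcore":{"value":111}}}`. As a rough sketch of what such a value encodes, the snippet below decodes it with the standard library. The `resourceTag` and `quantity` types and the `parseResourceTag` helper are hypothetical names made up for this illustration; they are not the scheduler's own parsing code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// quantity mirrors the {"value": N} object seen in the test's tag string.
type quantity struct {
	Value int64 `json:"value"`
}

// resourceTag mirrors the top-level {"resources": {...}} object.
type resourceTag struct {
	Resources map[string]quantity `json:"resources"`
}

// parseResourceTag decodes a tag string into a flat name-to-value map.
func parseResourceTag(raw string) (map[string]int64, error) {
	var tag resourceTag
	if err := json.Unmarshal([]byte(raw), &tag); err != nil {
		return nil, fmt.Errorf("invalid resource tag: %w", err)
	}
	out := make(map[string]int64, len(tag.Resources))
	for name, q := range tag.Resources {
		out[name] = q.Value
	}
	return out, nil
}

func main() {
	res, err := parseResourceTag(`{"resources":{"vcore":{"value":111}}}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(res["vcore"]) // 111
}
```

With the fix in place, a tag like this on a recovered application is still valid input, but its values are not applied to the recovery queue.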





(yunikorn-core) branch master updated: [YUNIKORN-2642] Don't set resources on the recovery queue (#878)

2024-05-30 Thread pbacsko
This is an automated email from the ASF dual-hosted git repository.

pbacsko pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git


The following commit(s) were added to refs/heads/master by this push:
 new 37d7e5c5 [YUNIKORN-2642] Don't set resources on the recovery queue (#878)
37d7e5c5 is described below

commit 37d7e5c5431c3aec9686b8bfb787cdccf0834549
Author: Peter Bacsko 
AuthorDate: Thu May 30 09:47:13 2024 +0200

[YUNIKORN-2642] Don't set resources on the recovery queue (#878)

Closes: #878

Signed-off-by: Peter Bacsko 
---
 pkg/scheduler/partition.go  |  5 +++--
 pkg/scheduler/partition_test.go | 16 
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/pkg/scheduler/partition.go b/pkg/scheduler/partition.go
index 11ee05ae..5662bd75 100644
--- a/pkg/scheduler/partition.go
+++ b/pkg/scheduler/partition.go
@@ -320,8 +320,9 @@ func (pc *PartitionContext) AddApplication(app *objects.Application) error {
 	queue := pc.getQueueInternal(queueName)
 
 	// create the queue if necessary
+	isRecoveryQueue := common.IsRecoveryQueue(queueName)
 	if queue == nil {
-		if common.IsRecoveryQueue(queueName) {
+		if isRecoveryQueue {
 			queue, err = pc.createRecoveryQueue()
 			if err != nil {
 				return fmt.Errorf("failed to create recovery queue %s for application %s", common.RecoveryQueueFull, appID)
@@ -341,7 +342,7 @@ func (pc *PartitionContext) AddApplication(app *objects.Application) error {
 
 	guaranteedRes := app.GetGuaranteedResource()
 	maxRes := app.GetMaxResource()
-	if guaranteedRes != nil || maxRes != nil {
+	if !isRecoveryQueue && (guaranteedRes != nil || maxRes != nil) {
 		// set resources based on tags, but only if the queue is dynamic (unmanaged)
 		if queue.IsManaged() {
 			log.Log(log.SchedQueue).Warn("Trying to set resources on a queue that is not an unmanaged leaf",
diff --git a/pkg/scheduler/partition_test.go b/pkg/scheduler/partition_test.go
index eabb1ad4..e99d5761 100644
--- a/pkg/scheduler/partition_test.go
+++ b/pkg/scheduler/partition_test.go
@@ -964,6 +964,22 @@ func TestAddAppForced(t *testing.T) {
 	assert.Equal(t, common.RecoveryQueueFull, partApp3.GetQueuePath(), "wrong queue path for app3")
 	assert.Check(t, recoveryQueue == partApp3.GetQueue(), "wrong queue for app3")
 	assert.Equal(t, 3, len(recoveryQueue.GetCopyOfApps()), "wrong queue length")
+
+	// add recovered forced apps with resource tags
+	app4 := newApplicationTags("app-4", "default", common.RecoveryQueueFull, map[string]string{
+		siCommon.AppTagCreateForce:                 "true",
+		siCommon.AppTagNamespaceResourceGuaranteed: "{\"resources\":{\"vcore\":{\"value\":111}}}"})
+	err = partition.AddApplication(app4)
+	assert.NilError(t, err, "app4 could not be added")
+	assert.Assert(t, recoveryQueue.GetGuaranteedResource() == nil, "guaranteed resource should be unset")
+	assert.Assert(t, recoveryQueue.GetMaxResource() == nil, "max resource should be unset")
+	app5 := newApplicationTags("app-5", "default", common.RecoveryQueueFull, map[string]string{
+		siCommon.AppTagCreateForce:            "true",
+		siCommon.AppTagNamespaceResourceQuota: "{\"resources\":{\"vcore\":{\"value\":111}}}"})
+	err = partition.AddApplication(app5)
+	assert.NilError(t, err, "app5 could not be added")
+	assert.Assert(t, recoveryQueue.GetGuaranteedResource() == nil, "guaranteed resource should be unset")
+	assert.Assert(t, recoveryQueue.GetMaxResource() == nil, "max resource should be unset")
 }
 
 func TestAddAppForcedWithPlacement(t *testing.T) {

