[jira] [Updated] (YUNIKORN-2641) Ensure createTime has same semantics for ask and allocation

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YUNIKORN-2641:
-
Labels: pull-request-available  (was: )

> Ensure createTime has same semantics for ask and allocation
> ---
>
> Key: YUNIKORN-2641
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2641
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Craig Condit
>Assignee: Craig Condit
>Priority: Major
>  Labels: pull-request-available
>
> The createTime fields in Allocation and AllocationAsk are not used 
> consistently. Ensure that the field is always set, and that it is not 
> modified later.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-2641) Ensure createTime has same semantics for ask and allocation

2024-05-23 Thread Craig Condit (Jira)
Craig Condit created YUNIKORN-2641:
--

 Summary: Ensure createTime has same semantics for ask and 
allocation
 Key: YUNIKORN-2641
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2641
 Project: Apache YuniKorn
  Issue Type: Sub-task
  Components: core - scheduler
Reporter: Craig Condit
Assignee: Craig Condit


The createTime fields in Allocation and AllocationAsk are not used consistently. 
Ensure that the field is always set, and that it is not modified later.






[jira] [Updated] (YUNIKORN-2633) Unnecessary warning from Partition when adding an application

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YUNIKORN-2633:
-
Labels: pull-request-available  (was: )

> Unnecessary warning from Partition when adding an application
> -
>
> Key: YUNIKORN-2633
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2633
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
>
> The following is printed when adding an application:
> {noformat}
> 2024-05-17T21:53:04.716+0200  WARN  core.scheduler.queue  scheduler/partition.go:344  Trying to set resources on a queue that is not an unmanaged leaf  {"queueName": "root.default"}
> {noformat}
> This message is only supposed to be printed when the application defines a 
> guaranteed or max resource. After YUNIKORN-2547, it is always printed if the 
> queue is managed.






[jira] [Updated] (YUNIKORN-2564) [Umbrella] Move xxxEvents types to a different package

2024-05-23 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2564:
---
Target Version: 1.6.0

> [Umbrella] Move xxxEvents types to a different package
> --
>
> Key: YUNIKORN-2564
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2564
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> There are several Events that can be moved to a different package:
> * queueEvents
> * applicationEvents
> * askEvents
> * nodeEvents
> There are numerous files in {{pkg/scheduler/objects}}. This is an opportunity 
> to clean it up a bit and move these under e.g. 
> {{pkg/scheduler/objects/events}}.






[jira] [Resolved] (YUNIKORN-2566) Remove AllocationAsk reference from askEvents

2024-05-23 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2566.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Remove AllocationAsk reference from askEvents
> -
>
> Key: YUNIKORN-2566
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2566
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Resolved] (YUNIKORN-2565) Remove Node reference from nodeEvents

2024-05-23 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2565.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Remove Node reference from nodeEvents
> -
>
> Key: YUNIKORN-2565
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2565
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







(yunikorn-core) branch master updated: [YUNIKORN-2566] Remove AllocationAsk reference from askEvents (#868)

2024-05-23 Thread pbacsko
This is an automated email from the ASF dual-hosted git repository.

pbacsko pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git


The following commit(s) were added to refs/heads/master by this push:
 new b0721e24 [YUNIKORN-2566] Remove AllocationAsk reference from askEvents 
(#868)
b0721e24 is described below

commit b0721e242b4936c4801aea3e885f8e796d5f4f89
Author: Peter Bacsko 
AuthorDate: Thu May 23 12:31:16 2024 +0200

[YUNIKORN-2566] Remove AllocationAsk reference from askEvents (#868)

Closes: #868

Signed-off-by: Peter Bacsko 
---
 pkg/scheduler/objects/allocation_ask.go  | 14 ++---
 pkg/scheduler/objects/allocation_ask_test.go |  4 +-
 pkg/scheduler/objects/application_test.go|  4 +-
 pkg/scheduler/objects/ask_events.go  | 38 +++--
 pkg/scheduler/objects/ask_events_test.go | 80 +++-
 pkg/scheduler/objects/node_test.go   |  2 +-
 6 files changed, 60 insertions(+), 82 deletions(-)

diff --git a/pkg/scheduler/objects/allocation_ask.go 
b/pkg/scheduler/objects/allocation_ask.go
index 4701beae..e213a1b6 100644
--- a/pkg/scheduler/objects/allocation_ask.go
+++ b/pkg/scheduler/objects/allocation_ask.go
@@ -77,9 +77,9 @@ func NewAllocationAsk(allocationKey string, applicationID 
string, allocatedResou
allocatedResource: allocatedResource,
allocLog:  make(map[string]*AllocationLogEntry),
resKeyPerNode: make(map[string]string),
+   askEvents: newAskEvents(events.GetEventSystem()),
}
aa.resKeyWithoutNode = reservationKeyWithoutNode(applicationID, 
allocationKey)
-   aa.askEvents = newAskEvents(aa, events.GetEventSystem())
return aa
 }
 
@@ -99,6 +99,7 @@ func NewAllocationAskFromSI(ask *si.AllocationAsk) 
*AllocationAsk {
originator:ask.Originator,
allocLog:  make(map[string]*AllocationLogEntry),
resKeyPerNode: make(map[string]string),
+   askEvents: newAskEvents(events.GetEventSystem()),
}
// this is a safety check placeholder and task group name must be set 
as a combo
// order is important as task group can be set without placeholder but 
not the other way around
@@ -108,7 +109,6 @@ func NewAllocationAskFromSI(ask *si.AllocationAsk) 
*AllocationAsk {
return nil
}
saa.resKeyWithoutNode = reservationKeyWithoutNode(ask.ApplicationID, 
ask.AllocationKey)
-   saa.askEvents = newAskEvents(saa, events.GetEventSystem())
return saa
 }
 
@@ -260,7 +260,7 @@ func (aa *AllocationAsk) LogAllocationFailure(message 
string, allocate bool) {
 }
 
 func (aa *AllocationAsk) SendPredicateFailedEvent(message string) {
-   aa.askEvents.sendPredicateFailed(message)
+   aa.askEvents.sendPredicateFailed(aa.allocationKey, aa.applicationID, 
message, aa.GetAllocatedResource())
 }
 
 // GetAllocationLog returns a list of log entries corresponding to allocation 
preconditions not being met
@@ -344,7 +344,7 @@ func (aa *AllocationAsk) setHeadroomCheckFailed(headroom 
*resources.Resource, qu
defer aa.Unlock()
if !aa.headroomCheckFailed {
aa.headroomCheckFailed = true
-   aa.askEvents.sendRequestExceedsQueueHeadroom(headroom, queue)
+   aa.askEvents.sendRequestExceedsQueueHeadroom(aa.allocationKey, 
aa.applicationID, headroom, aa.allocatedResource, queue)
}
 }
 
@@ -353,7 +353,7 @@ func (aa *AllocationAsk) setHeadroomCheckPassed(queue 
string) {
defer aa.Unlock()
if aa.headroomCheckFailed {
aa.headroomCheckFailed = false
-   aa.askEvents.sendRequestFitsInQueue(queue)
+   aa.askEvents.sendRequestFitsInQueue(aa.allocationKey, 
aa.applicationID, queue, aa.allocatedResource)
}
 }
 
@@ -362,7 +362,7 @@ func (aa *AllocationAsk) setUserQuotaCheckFailed(available 
*resources.Resource)
defer aa.Unlock()
if !aa.userQuotaCheckFailed {
aa.userQuotaCheckFailed = true
-   aa.askEvents.sendRequestExceedsUserQuota(available)
+   aa.askEvents.sendRequestExceedsUserQuota(aa.allocationKey, 
aa.applicationID, available, aa.allocatedResource)
}
 }
 
@@ -371,6 +371,6 @@ func (aa *AllocationAsk) setUserQuotaCheckPassed() {
defer aa.Unlock()
if aa.userQuotaCheckFailed {
aa.userQuotaCheckFailed = false
-   aa.askEvents.sendRequestFitsInUserQuota()
+   aa.askEvents.sendRequestFitsInUserQuota(aa.allocationKey, 
aa.applicationID, aa.allocatedResource)
}
 }
diff --git a/pkg/scheduler/objects/allocation_ask_test.go 
b/pkg/scheduler/objects/allocation_ask_test.go
index b2667c7d..a9739754 100644
--- a/pkg/scheduler/objects/allocation_ask_test.go
+++ b/pkg/scheduler/objects/allocation_ask_test.go
@@ -213,12 +213,12 @@ 

(yunikorn-core) branch master updated: [YUNIKORN-2565] Remove Node reference from nodeEvents (#867)

2024-05-23 Thread pbacsko
This is an automated email from the ASF dual-hosted git repository.

pbacsko pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git


The following commit(s) were added to refs/heads/master by this push:
 new b6809ea8 [YUNIKORN-2565] Remove Node reference from nodeEvents (#867)
b6809ea8 is described below

commit b6809ea80d4459705ea740aa574ec4ef48aea829
Author: Peter Bacsko 
AuthorDate: Thu May 23 12:24:03 2024 +0200

[YUNIKORN-2565] Remove Node reference from nodeEvents (#867)

Closes: #867

Signed-off-by: Peter Bacsko 
---
 pkg/scheduler/objects/node.go |  22 +++---
 pkg/scheduler/objects/node_events.go  |  46 ++---
 pkg/scheduler/objects/node_events_test.go | 111 +++---
 pkg/scheduler/objects/node_test.go|   2 +-
 pkg/scheduler/objects/utilities_test.go   |   2 +-
 5 files changed, 77 insertions(+), 106 deletions(-)

diff --git a/pkg/scheduler/objects/node.go b/pkg/scheduler/objects/node.go
index 6ea167ef..09efa2d3 100644
--- a/pkg/scheduler/objects/node.go
+++ b/pkg/scheduler/objects/node.go
@@ -77,7 +77,7 @@ func NewNode(proto *si.NodeInfo) *Node {
schedulable:   true,
listeners: make([]NodeListener, 0),
}
-   sn.nodeEvents = newNodeEvents(sn, events.GetEventSystem())
+   sn.nodeEvents = newNodeEvents(events.GetEventSystem())
// initialise available resources
var err error
sn.availableResource, err = 
resources.SubErrorNegative(sn.totalResource, sn.occupiedResource)
@@ -165,7 +165,7 @@ func (sn *Node) SetCapacity(newCapacity 
*resources.Resource) *resources.Resource
delta := resources.Sub(newCapacity, sn.totalResource)
sn.totalResource = newCapacity
sn.refreshAvailableResource()
-   sn.nodeEvents.sendNodeCapacityChangedEvent()
+   sn.nodeEvents.sendNodeCapacityChangedEvent(sn.NodeID, 
sn.totalResource.Clone())
return delta
 }
 
@@ -184,7 +184,7 @@ func (sn *Node) SetOccupiedResource(occupiedResource 
*resources.Resource) {
return
}
sn.occupiedResource = occupiedResource
-   sn.nodeEvents.sendNodeOccupiedResourceChangedEvent()
+   sn.nodeEvents.sendNodeOccupiedResourceChangedEvent(sn.NodeID, 
sn.occupiedResource.Clone())
sn.refreshAvailableResource()
 }
 
@@ -234,7 +234,7 @@ func (sn *Node) SetSchedulable(schedulable bool) {
sn.Lock()
defer sn.Unlock()
sn.schedulable = schedulable
-   sn.nodeEvents.sendNodeSchedulableChangedEvent(sn.schedulable)
+   sn.nodeEvents.sendNodeSchedulableChangedEvent(sn.NodeID, sn.schedulable)
 }
 
 // Can this node be used in scheduling.
@@ -304,7 +304,7 @@ func (sn *Node) RemoveAllocation(allocationKey string) 
*Allocation {
delete(sn.allocations, allocationKey)
sn.allocatedResource.SubFrom(alloc.GetAllocatedResource())
sn.availableResource.AddTo(alloc.GetAllocatedResource())
-   sn.nodeEvents.sendAllocationRemovedEvent(alloc.allocationKey, 
alloc.allocatedResource)
+   sn.nodeEvents.sendAllocationRemovedEvent(sn.NodeID, 
alloc.allocationKey, alloc.allocatedResource)
return alloc
}
 
@@ -327,7 +327,7 @@ func (sn *Node) AddAllocation(alloc *Allocation) bool {
sn.allocations[alloc.GetAllocationKey()] = alloc
sn.allocatedResource.AddTo(res)
sn.availableResource.SubFrom(res)
-   sn.nodeEvents.sendAllocationAddedEvent(alloc.allocationKey, res)
+   sn.nodeEvents.sendAllocationAddedEvent(sn.NodeID, 
alloc.allocationKey, res)
return true
}
return false
@@ -490,7 +490,7 @@ func (sn *Node) Reserve(app *Application, ask 
*AllocationAsk) error {
return fmt.Errorf("reservation does not fit on node %s, appID 
%s, ask %s", sn.NodeID, app.ApplicationID, ask.GetAllocatedResource().String())
}
sn.reservations[appReservation.getKey()] = appReservation
-   sn.nodeEvents.sendReservedEvent(ask.GetAllocatedResource(), 
ask.GetAllocationKey())
+   sn.nodeEvents.sendReservedEvent(sn.NodeID, ask.GetAllocatedResource(), 
ask.GetAllocationKey())
// reservation added successfully
return nil
 }
@@ -512,7 +512,7 @@ func (sn *Node) unReserve(app *Application, ask 
*AllocationAsk) (int, error) {
}
if _, ok := sn.reservations[resKey]; ok {
delete(sn.reservations, resKey)
-   sn.nodeEvents.sendUnreservedEvent(ask.GetAllocatedResource(), 
ask.GetAllocationKey())
+   sn.nodeEvents.sendUnreservedEvent(sn.NodeID, 
ask.GetAllocatedResource(), ask.GetAllocationKey())
return 1, nil
}
// reservation was not found
@@ -587,9 +587,11 @@ func (sn *Node) getListeners() []NodeListener {
 }
 
 func (sn *Node) SendNodeAddedEvent() {
-   

[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YUNIKORN-2629:
-
Labels: pull-request-available  (was: )

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: updateNode_deadlock_trace.txt
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the 
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.






[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-23 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848895#comment-17848895
 ] 

Peter Bacsko commented on YUNIKORN-2629:


Note: I have a solution for this, which basically just unlocks/re-acquires the 
mutex: 
https://github.com/pbacsko/incubator-yunikorn-k8shim/commit/4853d08d2dc310d9eed0426203a4df7b3c6ba73a.
I don't know if this is the approach we want to follow, but this solves the 
problem - it's been tested.
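The unlock/re-acquire approach can be sketched as follows. All type and function names here are hypothetical stand-ins for the shim's Context and registerNodes(); the sketch only illustrates dropping the write lock before blocking on the WaitGroup, so the event path can take a read lock and deliver NodeAccepted:

```go
package main

import (
	"fmt"
	"sync"
)

// shimContext is an illustrative stand-in for the shim's Context, which holds
// a write lock in addNode() while waiting for NodeAccepted events.
type shimContext struct {
	lock  sync.RWMutex
	nodes []string
}

// registerNodes sketches the fix: release the lock before blocking on
// wg.Wait(), then re-acquire it once the NodeAccepted events have arrived.
func (c *shimContext) registerNodes(names []string, accepted <-chan string) []string {
	c.lock.Lock()
	var wg sync.WaitGroup
	wg.Add(len(names))
	go func() {
		// stand-in for the dispatcher's event loop delivering NodeAccepted
		for range names {
			<-accepted
			c.lock.RLock() // handlers read shared state; this blocks while addNode holds the write lock
			c.lock.RUnlock()
			wg.Done()
		}
	}()
	c.lock.Unlock() // release so event processing can make progress
	wg.Wait()       // without the Unlock above, this would deadlock
	c.lock.Lock()   // re-acquire to finish the update
	defer c.lock.Unlock()
	c.nodes = append(c.nodes, names...)
	return c.nodes
}

func main() {
	accepted := make(chan string, 2)
	accepted <- "node-1"
	accepted <- "node-2"
	c := &shimContext{}
	fmt.Println(c.registerNodes([]string{"node-1", "node-2"}, accepted))
}
```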

> Adding a node can result in a deadlock
> --
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Affects Versions: 1.5.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
> Attachments: updateNode_deadlock_trace.txt
>
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, 
> func(event interface{}) {
>   nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>   if !ok {
>   return
>   }
>   [...] removed for clarity
>   wg.Done()
>   })
>   defer dispatcher.UnregisterEventHandler(handlerID, 
> dispatcher.EventTypeNode)
>   if err := 
> ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
>   Nodes: nodesToRegister,
>   RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>   }); err != nil {
>   log.Log(log.ShimContext).Error("Failed to register nodes", 
> zap.Error(err))
>   return nil, err
>   }
>   // wait for all responses to accumulate
>   wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the 
> event handler, which is returned from Context:
> {noformat}
> go func() {
>   for {
>   select {
>   case event := <-getDispatcher().eventChan:
>   switch v := event.(type) {
>   case events.TaskEvent:
>   getEventHandler(EventTypeTask)(v)  <--- 
> eventually calls Context.getTask()
>   case events.ApplicationEvent:
>   getEventHandler(EventTypeApp)(v)
>   case events.SchedulerNodeEvent:
>   getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.






[jira] [Commented] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does

2024-05-23 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848889#comment-17848889
 ] 

Peter Bacsko commented on YUNIKORN-2637:


OK. Let's get this in and we'll see what happens then.

> finalizePods should ignore pods like registerPods does
> --
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
>
> The initialisation code is a two-step process for pods: first, list all pods 
> and add them to the system in registerPods(). This returns a list of the pods 
> processed.
> The second step happens after event handlers are turned on and nodes have 
> been cleaned up, etc. During the second step, pods from the first step are 
> checked and removed. However, pods that were already in a terminated state in 
> step 1 get removed again. Although the step should be idempotent, this is 
> unneeded. When iterating over the existing pods, any pod in a terminal state 
> should be skipped.
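The proposed skip can be sketched like this. The pod struct and phase strings are illustrative stand-ins for the real v1.Pod handling in the shim; finalizePods here is a hypothetical simplification:

```go
package main

import "fmt"

// pod is a minimal stand-in for v1.Pod; phase values mirror Kubernetes pod phases.
type pod struct {
	name  string
	phase string
}

// isTerminal mirrors the check registerPods() already applies:
// Succeeded and Failed pods are done and need no further handling.
func isTerminal(p pod) bool {
	return p.phase == "Succeeded" || p.phase == "Failed"
}

// finalizePods sketches the proposed behaviour: skip pods that were already in
// a terminal state in step 1 instead of removing them a second time.
func finalizePods(existing []pod) []string {
	var removed []string
	for _, p := range existing {
		if isTerminal(p) {
			continue // already gone from scheduling; removing again is unneeded
		}
		removed = append(removed, p.name)
	}
	return removed
}

func main() {
	pods := []pod{{"a", "Running"}, {"b", "Succeeded"}, {"c", "Failed"}, {"d", "Pending"}}
	fmt.Println(finalizePods(pods)) // only the non-terminal pods are processed
}
```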






[jira] [Assigned] (YUNIKORN-2569) Helm upgrade behaviour

2024-05-23 Thread Tseng Hsi-Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tseng Hsi-Huang reassigned YUNIKORN-2569:
-

Assignee: Tseng Hsi-Huang

> Helm upgrade behaviour
> --
>
> Key: YUNIKORN-2569
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2569
> Project: Apache YuniKorn
>  Issue Type: Test
>Reporter: Manikandan R
>Assignee: Tseng Hsi-Huang
>Priority: Major
>
> Need to test the YuniKorn upgrade behaviour through Helm.
> For example, 
> 1. Create a cluster using kind create.
> 2. Deploy an old version of YuniKorn (say, 1.2, 1.3 or 1.4) using helm install.
> 3. Run sanity checks to ensure the deployed version works as expected.
> 4. Upgrade the YuniKorn version to the latest master (1.6) using helm upgrade.
> 5. Document the behaviour, especially when there are any issues.
> Repeat for each old version (1.2, 1.3 and 1.4).






[jira] [Commented] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does

2024-05-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848863#comment-17848863
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2637:
-

The case we are solving here is the correct removal of a pod that was 
registered and then stopped. In this case, if the pod was assigned to a node it 
gets removed, including from the core. If it was not assigned to a node, the 
request gets removed. I think both core and k8shim are affected by this after 
looking at the details in YUNIKORN-2526.

So I am not sure if that is the root cause of the difference...

> finalizePods should ignore pods like registerPods does
> --
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
>
> The initialisation code is a two-step process for pods: first, list all pods 
> and add them to the system in registerPods(). This returns a list of the pods 
> processed.
> The second step happens after event handlers are turned on and nodes have 
> been cleaned up, etc. During the second step, pods from the first step are 
> checked and removed. However, pods that were already in a terminated state in 
> step 1 get removed again. Although the step should be idempotent, this is 
> unneeded. When iterating over the existing pods, any pod in a terminal state 
> should be skipped.






[jira] [Assigned] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does

2024-05-23 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned YUNIKORN-2637:
--

Assignee: Wilfred Spiegelenburg

> finalizePods should ignore pods like registerPods does
> --
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
>
> The initialisation code is a two step process for pods: first list all pods 
> and add them to the system in registerPods(). This returns a list of pods 
> processed.
> The second step happens after event handlers are turned on and nodes have 
> been cleaned up etc. During the second step pods from the first step are 
> checked and removed. However pods that were already in a terminated state in 
> step 1 get removed again. Although the step should be idempotent this is 
> unneeded. When iterating over the existing pods any pod in a terminal state 
> should be skipped.






[jira] [Updated] (YUNIKORN-2636) Admission Controller ignores existing Queue/ApplicationID annotations

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YUNIKORN-2636:
-
Labels: pull-request-available  (was: )

> Admission Controller ignores existing Queue/ApplicationID annotations
> -
>
> Key: YUNIKORN-2636
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2636
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Yu-Lin Chen
>Assignee: Yu-Lin Chen
>Priority: Major
>  Labels: pull-request-available
>
> The admission controller should patch a missing queueName/applicationID onto 
> a pod if they were not set.
> However, if the queueName/applicationID were only set as annotations, AM will 
> still patch the default values onto the pod. The updatePodLabel() function 
> only checks existing labels.
>  * 
> [https://github.com/apache/yunikorn-k8shim/blob/master/pkg/admission/util.go#L39-L40]
>  * 
> [https://github.com/apache/yunikorn-k8shim/blob/6f2800f689e9e341c736a6af8cbf178a711a9423/pkg/admission/util.go#L53]
>  
> {*}What will impact users{*}:
> When the placement rule is set to "provided" and queueName/applicationID are 
> only set as annotations, the application will be submitted to the default 
> queue instead of the queue set in the pod annotation.
> {*}Steps to reproduce the issue{*}:
> 1. Update placement rule to 'provided'
> {code:java}
> apiVersion: v1
> kind: ConfigMap
> metadata:
>   name: yunikorn-configs
>   namespace: yunikorn
> data:
>   log.level: "DEBUG"
>   queues.yaml: |
> partitions:
>   - name: default
> queues:
>   - name: root
> submitacl: '*'
> placementrules:
>   - name: provided
> create: true  {code}
> 2. Create pod with queue/app-id annotations 
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   labels:
> app: sleep
>   annotations:
> yunikorn.apache.org/app-id: "application-sleep-0001"
> yunikorn.apache.org/queue: "root.sandbox"
>   name: pod-with-annotations-only
> spec:
>   schedulerName: yunikorn
>   restartPolicy: Never
>   containers:
> - name: sleep-6000s
>   image: "alpine:latest"
>   command: ["sleep", "6000"]
>   resources:
> requests:
>   cpu: "100m"
>   memory: "500M"{code}
> 3. Check the labels below in the pod; these are added by AM.
>  * queue=root.default
>  * applicationId=yunikorn-default-autogen
> 4. Check the app's queue:
>  * 
> [http://localhost:9889/ws/v1/partition/default/application/application-sleep-0001]
>  
> The app is in "root.default", but it should be in "root.sandbox". 
> (Please check the queue ordering: 
> [Doc|https://yunikorn.apache.org/docs/user_guide/labels_and_annotations_in_yunikorn#annotations-in-yunikorn])
> To fix this issue, we should let AM check all the possible locations.
> Notes: As part of the future change in YUNIKORN-2504, if the same metadata 
> (queue or applicationID) is specified in multiple locations and the values 
> are inconsistent, the pod should be rejected.
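The proposed fix (checking all possible locations) can be sketched as below. The helper name metadataValue is hypothetical and is not the actual util.go API; the keys and defaults are taken from the report:

```go
package main

import "fmt"

// metadataValue is a hypothetical helper: it consults labels AND annotations
// for each candidate key before falling back to a default, instead of only
// checking labels as updatePodLabel() does today.
func metadataValue(labels, annotations map[string]string, keys []string, def string) string {
	for _, k := range keys {
		if v, ok := labels[k]; ok && v != "" {
			return v
		}
		if v, ok := annotations[k]; ok && v != "" {
			return v
		}
	}
	return def
}

func main() {
	// mirrors the pod from the reproduction steps: queue/app-id only in annotations
	labels := map[string]string{"app": "sleep"}
	annotations := map[string]string{
		"yunikorn.apache.org/app-id": "application-sleep-0001",
		"yunikorn.apache.org/queue":  "root.sandbox",
	}
	queue := metadataValue(labels, annotations,
		[]string{"yunikorn.apache.org/queue", "queue"}, "root.default")
	appID := metadataValue(labels, annotations,
		[]string{"yunikorn.apache.org/app-id", "applicationId"}, "yunikorn-default-autogen")
	fmt.Println(queue, appID)
}
```

With this lookup the annotated pod lands in root.sandbox instead of being patched back to the defaults.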






[jira] [Updated] (YUNIKORN-2631) Support canonical labels for queue/applicationId in Admission Controller

2024-05-23 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YUNIKORN-2631:

Labels: pull-request-available release-notes  (was: pull-request-available)

> Support canonical labels for queue/applicationId in Admission Controller
> 
>
> Key: YUNIKORN-2631
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2631
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: shim - kubernetes
>Reporter: Yu-Lin Chen
>Assignee: Yu-Lin Chen
>Priority: Major
>  Labels: pull-request-available, release-notes
>
> The admission controller adds the applicationID and queue labels to a pod if 
> they are not already set on the pod.
> According to the new policy defined in YUNIKORN-1351, the admission controller 
> will change to patching the canonical label/annotation in future releases:
>  * yunikorn.apache.org/app-id (canonical label)
>  * yunikorn.apache.org/queue (canonical label)
> To avoid an upgrade problem where the admission controller gets started 
> first, AM needs to generate both the canonical and non-canonical labels in 
> 1.6.0. (This ensures that the 1.5.0 scheduler can understand labels generated 
> by the 1.6.0 admission controller.) In 1.7.0, we can switch to generating only 
> the canonical label in AM.
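The dual-write transition can be sketched like this. patchBoth is a hypothetical helper; the canonical keys come from the issue, and applicationId/queue are assumed as the legacy label names:

```go
package main

import "fmt"

// patchBoth sketches the 1.6.0 transition behaviour: write both the canonical
// and the legacy keys so a 1.5.0 scheduler still finds the values it knows.
func patchBoth(labels map[string]string, appID, queue string) map[string]string {
	patched := make(map[string]string, len(labels)+4)
	for k, v := range labels {
		patched[k] = v
	}
	// canonical (new) keys
	patched["yunikorn.apache.org/app-id"] = appID
	patched["yunikorn.apache.org/queue"] = queue
	// non-canonical (legacy) keys, kept during the upgrade window
	patched["applicationId"] = appID
	patched["queue"] = queue
	return patched
}

func main() {
	out := patchBoth(map[string]string{"app": "sleep"}, "application-0001", "root.default")
	fmt.Println(out["yunikorn.apache.org/app-id"], out["applicationId"])
}
```

In 1.7.0 the two legacy assignments would simply be dropped, leaving only the canonical keys.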






[jira] [Updated] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does

2024-05-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YUNIKORN-2637:
-
Labels: pull-request-available  (was: )

> finalizePods should ignore pods like registerPods does
> --
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
>
> The initialisation code is a two-step process for pods: first, list all pods 
> and add them to the system in registerPods(). This returns a list of the pods 
> processed.
> The second step happens after event handlers are turned on and nodes have 
> been cleaned up, etc. During the second step, pods from the first step are 
> checked and removed. However, pods that were already in a terminated state in 
> step 1 get removed again. Although the step should be idempotent, this is 
> unneeded. When iterating over the existing pods, any pod in a terminal state 
> should be skipped.


