[jira] [Updated] (YUNIKORN-2641) Ensure createTime has same semantics for ask and allocation
[ https://issues.apache.org/jira/browse/YUNIKORN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated YUNIKORN-2641:
    Labels: pull-request-available (was: )
> Ensure createTime has same semantics for ask and allocation
>
> Key: YUNIKORN-2641
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2641
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: core - scheduler
> Reporter: Craig Condit
> Assignee: Craig Condit
> Priority: Major
> Labels: pull-request-available
>
> The createTime field in Allocation and AllocationAsk is not used consistently. Ensure that the field is always set, and that it is not modified later.
--
This message was sent by Atlassian Jira (v8.20.10#820010)
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Created] (YUNIKORN-2641) Ensure createTime has same semantics for ask and allocation
Craig Condit created YUNIKORN-2641:
    Summary: Ensure createTime has same semantics for ask and allocation
    Key: YUNIKORN-2641
    URL: https://issues.apache.org/jira/browse/YUNIKORN-2641
    Project: Apache YuniKorn
    Issue Type: Sub-task
    Components: core - scheduler
    Reporter: Craig Condit
    Assignee: Craig Condit
The createTime field in Allocation and AllocationAsk is not used consistently. Ensure that the field is always set, and that it is not modified later.
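A minimal sketch of the "always set, never modified" rule described in the issue. The `Allocation` type below is a hypothetical, heavily trimmed stand-in, and `NewAllocation` / `NewAllocationAt` are illustrative names, not the yunikorn-core API. The idea shown is only that the constructor is the single place where `createTime` is written and that no setter exists.

```go
package main

import (
	"fmt"
	"time"
)

// Allocation is a hypothetical, simplified stand-in for the scheduler object.
type Allocation struct {
	allocationKey string
	createTime    time.Time // written once in the constructor, never reassigned
}

// NewAllocationAt builds an allocation with an explicit creation time.
// A zero time is replaced with the current time so the field is never unset.
func NewAllocationAt(key string, createTime time.Time) *Allocation {
	if createTime.IsZero() {
		createTime = time.Now()
	}
	return &Allocation{allocationKey: key, createTime: createTime}
}

// NewAllocation builds an allocation stamped with the current time.
func NewAllocation(key string) *Allocation {
	return NewAllocationAt(key, time.Now())
}

// GetCreateTime is the only accessor; there is deliberately no setter.
func (a *Allocation) GetCreateTime() time.Time {
	return a.createTime
}

func main() {
	a := NewAllocation("alloc-1")
	b := NewAllocationAt("alloc-2", time.Time{})
	fmt.Println(!a.GetCreateTime().IsZero()) // true
	fmt.Println(!b.GetCreateTime().IsZero()) // true
}
```

Applying the same constructor-only pattern to both the ask and the allocation would give the field the same semantics on both types.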
[jira] [Updated] (YUNIKORN-2633) Unnecessary warning from Partition when adding an application
[ https://issues.apache.org/jira/browse/YUNIKORN-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated YUNIKORN-2633:
    Labels: pull-request-available (was: )
> Unnecessary warning from Partition when adding an application
>
> Key: YUNIKORN-2633
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2633
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
>
> The following is printed when adding an application:
> {noformat}
> 2024-05-17T21:53:04.716+0200 WARN core.scheduler.queue scheduler/partition.go:344 Trying to set resources on a queue that is not an unmanaged leaf {"queueName": "root.default"}
> {noformat}
> This message is supposed to be printed when the application defines a guaranteed or max resource. After YUNIKORN-2547 it's always printed if the queue is managed.
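The intended condition can be sketched as follows. This is an illustration of the behaviour the issue describes, not the actual fix: `appResources` and `shouldWarn` are hypothetical names, and "managed" is reduced to a single boolean.

```go
package main

import "fmt"

// appResources is a hypothetical stand-in for the resource settings an
// application may carry when it is submitted; an empty map means "not defined".
type appResources struct {
	guaranteed map[string]int64
	max        map[string]int64
}

// shouldWarn reports whether the warning is meaningful: only when the queue is
// managed AND the application actually tries to set resources on it.
func shouldWarn(queueManaged bool, app appResources) bool {
	definesResources := len(app.guaranteed) > 0 || len(app.max) > 0
	return queueManaged && definesResources
}

func main() {
	plain := appResources{}
	withMax := appResources{max: map[string]int64{"vcore": 1000}}
	fmt.Println(shouldWarn(true, plain))    // false
	fmt.Println(shouldWarn(true, withMax))  // true
	fmt.Println(shouldWarn(false, withMax)) // false
}
```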
[jira] [Updated] (YUNIKORN-2564) [Umbrella] Move xxxEvents types to a different package
[ https://issues.apache.org/jira/browse/YUNIKORN-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Bacsko updated YUNIKORN-2564:
    Target Version: 1.6.0
> [Umbrella] Move xxxEvents types to a different package
>
> Key: YUNIKORN-2564
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2564
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
>
> There are several Events that can be moved to a different package:
> * queueEvents
> * applicationEvents
> * askEvents
> * nodeEvents
> There are numerous files in {{pkg/scheduler/objects}}. This is an opportunity to clean it up a bit and move these under e.g. {{pkg/scheduler/objects/events}}.
[jira] [Resolved] (YUNIKORN-2566) Remove AllocationAsk reference from askEvents
[ https://issues.apache.org/jira/browse/YUNIKORN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Bacsko resolved YUNIKORN-2566.
    Fix Version/s: 1.6.0
    Resolution: Fixed
> Remove AllocationAsk reference from askEvents
>
> Key: YUNIKORN-2566
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2566
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
[jira] [Resolved] (YUNIKORN-2565) Remove Node reference from nodeEvents
[ https://issues.apache.org/jira/browse/YUNIKORN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Bacsko resolved YUNIKORN-2565.
    Fix Version/s: 1.6.0
    Resolution: Fixed
> Remove Node reference from nodeEvents
>
> Key: YUNIKORN-2565
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2565
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: core - scheduler
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.6.0
(yunikorn-core) branch master updated: [YUNIKORN-2566] Remove AllocationAsk reference from askEvents (#868)
This is an automated email from the ASF dual-hosted git repository.
pbacsko pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git
The following commit(s) were added to refs/heads/master by this push:
     new b0721e24 [YUNIKORN-2566] Remove AllocationAsk reference from askEvents (#868)
b0721e24 is described below
commit b0721e242b4936c4801aea3e885f8e796d5f4f89
Author: Peter Bacsko
AuthorDate: Thu May 23 12:31:16 2024 +0200
    [YUNIKORN-2566] Remove AllocationAsk reference from askEvents (#868)
    Closes: #868
    Signed-off-by: Peter Bacsko
---
 pkg/scheduler/objects/allocation_ask.go      | 14 ++---
 pkg/scheduler/objects/allocation_ask_test.go |  4 +-
 pkg/scheduler/objects/application_test.go    |  4 +-
 pkg/scheduler/objects/ask_events.go          | 38 +++--
 pkg/scheduler/objects/ask_events_test.go     | 80 +++-
 pkg/scheduler/objects/node_test.go           |  2 +-
 6 files changed, 60 insertions(+), 82 deletions(-)
diff --git a/pkg/scheduler/objects/allocation_ask.go b/pkg/scheduler/objects/allocation_ask.go
index 4701beae..e213a1b6 100644
--- a/pkg/scheduler/objects/allocation_ask.go
+++ b/pkg/scheduler/objects/allocation_ask.go
@@ -77,9 +77,9 @@ func NewAllocationAsk(allocationKey string, applicationID string, allocatedResou
 		allocatedResource: allocatedResource,
 		allocLog:          make(map[string]*AllocationLogEntry),
 		resKeyPerNode:     make(map[string]string),
+		askEvents:         newAskEvents(events.GetEventSystem()),
 	}
 	aa.resKeyWithoutNode = reservationKeyWithoutNode(applicationID, allocationKey)
-	aa.askEvents = newAskEvents(aa, events.GetEventSystem())
 	return aa
 }
 
@@ -99,6 +99,7 @@ func NewAllocationAskFromSI(ask *si.AllocationAsk) *AllocationAsk {
 		originator:        ask.Originator,
 		allocLog:          make(map[string]*AllocationLogEntry),
 		resKeyPerNode:     make(map[string]string),
+		askEvents:         newAskEvents(events.GetEventSystem()),
 	}
 	// this is a safety check placeholder and task group name must be set as a combo
 	// order is important as task group can be set without placeholder but not the other way around
@@ -108,7 +109,6 @@ func NewAllocationAskFromSI(ask *si.AllocationAsk) *AllocationAsk {
 		return nil
 	}
 	saa.resKeyWithoutNode = reservationKeyWithoutNode(ask.ApplicationID, ask.AllocationKey)
-	saa.askEvents = newAskEvents(saa, events.GetEventSystem())
 	return saa
 }
 
@@ -260,7 +260,7 @@ func (aa *AllocationAsk) LogAllocationFailure(message string, allocate bool) {
 }
 
 func (aa *AllocationAsk) SendPredicateFailedEvent(message string) {
-	aa.askEvents.sendPredicateFailed(message)
+	aa.askEvents.sendPredicateFailed(aa.allocationKey, aa.applicationID, message, aa.GetAllocatedResource())
 }
 
 // GetAllocationLog returns a list of log entries corresponding to allocation preconditions not being met
@@ -344,7 +344,7 @@ func (aa *AllocationAsk) setHeadroomCheckFailed(headroom *resources.Resource, qu
 	defer aa.Unlock()
 	if !aa.headroomCheckFailed {
 		aa.headroomCheckFailed = true
-		aa.askEvents.sendRequestExceedsQueueHeadroom(headroom, queue)
+		aa.askEvents.sendRequestExceedsQueueHeadroom(aa.allocationKey, aa.applicationID, headroom, aa.allocatedResource, queue)
 	}
 }
 
@@ -353,7 +353,7 @@ func (aa *AllocationAsk) setHeadroomCheckPassed(queue string) {
 	defer aa.Unlock()
 	if aa.headroomCheckFailed {
 		aa.headroomCheckFailed = false
-		aa.askEvents.sendRequestFitsInQueue(queue)
+		aa.askEvents.sendRequestFitsInQueue(aa.allocationKey, aa.applicationID, queue, aa.allocatedResource)
 	}
 }
 
@@ -362,7 +362,7 @@ func (aa *AllocationAsk) setUserQuotaCheckFailed(available *resources.Resource)
 	defer aa.Unlock()
 	if !aa.userQuotaCheckFailed {
 		aa.userQuotaCheckFailed = true
-		aa.askEvents.sendRequestExceedsUserQuota(available)
+		aa.askEvents.sendRequestExceedsUserQuota(aa.allocationKey, aa.applicationID, available, aa.allocatedResource)
 	}
 }
 
@@ -371,6 +371,6 @@ func (aa *AllocationAsk) setUserQuotaCheckPassed() {
 	defer aa.Unlock()
 	if aa.userQuotaCheckFailed {
 		aa.userQuotaCheckFailed = false
-		aa.askEvents.sendRequestFitsInUserQuota()
+		aa.askEvents.sendRequestFitsInUserQuota(aa.allocationKey, aa.applicationID, aa.allocatedResource)
 	}
 }
diff --git a/pkg/scheduler/objects/allocation_ask_test.go b/pkg/scheduler/objects/allocation_ask_test.go
index b2667c7d..a9739754 100644
--- a/pkg/scheduler/objects/allocation_ask_test.go
+++ b/pkg/scheduler/objects/allocation_ask_test.go
@@ -213,12 +213,12 @@
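The pattern in this commit, an events helper that no longer holds a pointer back to its owner and instead receives the values it needs on each call, can be sketched in isolation. Everything below is a simplified illustration: `eventSink`, `recorder`, and the lowercase `ask` type are hypothetical, and the real `sendPredicateFailed` also takes the allocated resource.

```go
package main

import "fmt"

// eventSink is a hypothetical replacement for the scheduler's event system.
type eventSink interface {
	publish(msg string)
}

// recorder is a test-friendly sink that keeps published messages in memory.
type recorder struct{ msgs []string }

func (r *recorder) publish(msg string) { r.msgs = append(r.msgs, msg) }

// askEvents holds only the sink; it keeps no reference to the ask that owns it.
type askEvents struct {
	sink eventSink
}

func newAskEvents(sink eventSink) *askEvents {
	return &askEvents{sink: sink}
}

// sendPredicateFailed receives every value it needs as an argument.
func (ae *askEvents) sendPredicateFailed(allocationKey, applicationID, message string) {
	ae.sink.publish(fmt.Sprintf("%s/%s: predicate failed: %s", applicationID, allocationKey, message))
}

// ask is a simplified owner type.
type ask struct {
	allocationKey string
	applicationID string
	events        *askEvents
}

// newAsk can build the events helper inside the struct literal because the
// helper does not need the ask itself.
func newAsk(key, appID string, sink eventSink) *ask {
	return &ask{allocationKey: key, applicationID: appID, events: newAskEvents(sink)}
}

func (a *ask) SendPredicateFailedEvent(message string) {
	a.events.sendPredicateFailed(a.allocationKey, a.applicationID, message)
}

func main() {
	rec := &recorder{}
	a := newAsk("alloc-1", "app-1", rec)
	a.SendPredicateFailedEvent("node selector mismatch")
	fmt.Println(rec.msgs[0]) // app-1/alloc-1: predicate failed: node selector mismatch
}
```

One consequence visible in the diff is that the helper can be created as part of the struct literal rather than in a separate statement after construction.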
(yunikorn-core) branch master updated: [YUNIKORN-2565] Remove Node reference from nodeEvents (#867)
This is an automated email from the ASF dual-hosted git repository.
pbacsko pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/yunikorn-core.git
The following commit(s) were added to refs/heads/master by this push:
     new b6809ea8 [YUNIKORN-2565] Remove Node reference from nodeEvents (#867)
b6809ea8 is described below
commit b6809ea80d4459705ea740aa574ec4ef48aea829
Author: Peter Bacsko
AuthorDate: Thu May 23 12:24:03 2024 +0200
    [YUNIKORN-2565] Remove Node reference from nodeEvents (#867)
    Closes: #867
    Signed-off-by: Peter Bacsko
---
 pkg/scheduler/objects/node.go             |  22 +++---
 pkg/scheduler/objects/node_events.go      |  46 ++---
 pkg/scheduler/objects/node_events_test.go | 111 +++---
 pkg/scheduler/objects/node_test.go        |   2 +-
 pkg/scheduler/objects/utilities_test.go   |   2 +-
 5 files changed, 77 insertions(+), 106 deletions(-)
diff --git a/pkg/scheduler/objects/node.go b/pkg/scheduler/objects/node.go
index 6ea167ef..09efa2d3 100644
--- a/pkg/scheduler/objects/node.go
+++ b/pkg/scheduler/objects/node.go
@@ -77,7 +77,7 @@ func NewNode(proto *si.NodeInfo) *Node {
 		schedulable: true,
 		listeners:   make([]NodeListener, 0),
 	}
-	sn.nodeEvents = newNodeEvents(sn, events.GetEventSystem())
+	sn.nodeEvents = newNodeEvents(events.GetEventSystem())
 	// initialise available resources
 	var err error
 	sn.availableResource, err = resources.SubErrorNegative(sn.totalResource, sn.occupiedResource)
@@ -165,7 +165,7 @@ func (sn *Node) SetCapacity(newCapacity *resources.Resource) *resources.Resource
 	delta := resources.Sub(newCapacity, sn.totalResource)
 	sn.totalResource = newCapacity
 	sn.refreshAvailableResource()
-	sn.nodeEvents.sendNodeCapacityChangedEvent()
+	sn.nodeEvents.sendNodeCapacityChangedEvent(sn.NodeID, sn.totalResource.Clone())
 	return delta
 }
 
@@ -184,7 +184,7 @@ func (sn *Node) SetOccupiedResource(occupiedResource *resources.Resource) {
 		return
 	}
 	sn.occupiedResource = occupiedResource
-	sn.nodeEvents.sendNodeOccupiedResourceChangedEvent()
+	sn.nodeEvents.sendNodeOccupiedResourceChangedEvent(sn.NodeID, sn.occupiedResource.Clone())
 	sn.refreshAvailableResource()
 }
 
@@ -234,7 +234,7 @@ func (sn *Node) SetSchedulable(schedulable bool) {
 	sn.Lock()
 	defer sn.Unlock()
 	sn.schedulable = schedulable
-	sn.nodeEvents.sendNodeSchedulableChangedEvent(sn.schedulable)
+	sn.nodeEvents.sendNodeSchedulableChangedEvent(sn.NodeID, sn.schedulable)
 }
 
 // Can this node be used in scheduling.
@@ -304,7 +304,7 @@ func (sn *Node) RemoveAllocation(allocationKey string) *Allocation {
 		delete(sn.allocations, allocationKey)
 		sn.allocatedResource.SubFrom(alloc.GetAllocatedResource())
 		sn.availableResource.AddTo(alloc.GetAllocatedResource())
-		sn.nodeEvents.sendAllocationRemovedEvent(alloc.allocationKey, alloc.allocatedResource)
+		sn.nodeEvents.sendAllocationRemovedEvent(sn.NodeID, alloc.allocationKey, alloc.allocatedResource)
 		return alloc
 	}
 
@@ -327,7 +327,7 @@ func (sn *Node) AddAllocation(alloc *Allocation) bool {
 		sn.allocations[alloc.GetAllocationKey()] = alloc
 		sn.allocatedResource.AddTo(res)
 		sn.availableResource.SubFrom(res)
-		sn.nodeEvents.sendAllocationAddedEvent(alloc.allocationKey, res)
+		sn.nodeEvents.sendAllocationAddedEvent(sn.NodeID, alloc.allocationKey, res)
 		return true
 	}
 	return false
@@ -490,7 +490,7 @@ func (sn *Node) Reserve(app *Application, ask *AllocationAsk) error {
 		return fmt.Errorf("reservation does not fit on node %s, appID %s, ask %s", sn.NodeID, app.ApplicationID, ask.GetAllocatedResource().String())
 	}
 	sn.reservations[appReservation.getKey()] = appReservation
-	sn.nodeEvents.sendReservedEvent(ask.GetAllocatedResource(), ask.GetAllocationKey())
+	sn.nodeEvents.sendReservedEvent(sn.NodeID, ask.GetAllocatedResource(), ask.GetAllocationKey())
 	// reservation added successfully
 	return nil
 }
@@ -512,7 +512,7 @@ func (sn *Node) unReserve(app *Application, ask *AllocationAsk) (int, error) {
 	}
 	if _, ok := sn.reservations[resKey]; ok {
 		delete(sn.reservations, resKey)
-		sn.nodeEvents.sendUnreservedEvent(ask.GetAllocatedResource(), ask.GetAllocationKey())
+		sn.nodeEvents.sendUnreservedEvent(sn.NodeID, ask.GetAllocatedResource(), ask.GetAllocationKey())
 		return 1, nil
 	}
 	// reservation was not found
@@ -587,9 +587,11 @@ func (sn *Node) getListeners() []NodeListener {
 }
 
 func (sn *Node) SendNodeAddedEvent() {
-
[jira] [Updated] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated YUNIKORN-2629:
    Labels: pull-request-available (was: )
> Adding a node can result in a deadlock
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Affects Versions: 1.5.0
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Blocker
> Labels: pull-request-available
> Attachments: updateNode_deadlock_trace.txt
>
> Adding a new node after Yunikorn state initialization can result in a deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting for the {{NodeAccepted}} event:
> {noformat}
>     dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>         nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>         if !ok {
>             return
>         }
>         [...] removed for clarity
>         wg.Done()
>     })
>     defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
>     if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode({
>         Nodes: nodesToRegister,
>         RmID: schedulerconf.GetSchedulerConf().ClusterID,
>     }); err != nil {
>         log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
>         return nil, err
>     }
>     // wait for all responses to accumulate
>     wg.Wait() <--- shim gets stuck here
> {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the event handler, which is returned from Context:
> {noformat}
> go func() {
>     for {
>         select {
>         case event := <-getDispatcher().eventChan:
>             switch v := event.(type) {
>             case events.TaskEvent:
>                 getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask()
>             case events.ApplicationEvent:
>                 getEventHandler(EventTypeApp)(v)
>             case events.SchedulerNodeEvent:
>                 getEventHandler(EventTypeNode)(v)
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets stuck, so {{registerNodes()}} will never progress.
[jira] [Commented] (YUNIKORN-2629) Adding a node can result in a deadlock
[ https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848895#comment-17848895 ]
Peter Bacsko commented on YUNIKORN-2629:
Note: I have a solution for this, which basically just unlocks/re-acquires the mutex: https://github.com/pbacsko/incubator-yunikorn-k8shim/commit/4853d08d2dc310d9eed0426203a4df7b3c6ba73a. I don't know if this is the approach we want to follow, but this solves the problem - it's been tested.
> Adding a node can result in a deadlock
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Affects Versions: 1.5.0
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Blocker
> Attachments: updateNode_deadlock_trace.txt
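The unlock/re-acquire idea mentioned in the comment can be illustrated with a self-contained sketch. This is not the linked commit: `shimContext`, `isAccepted`, and `registerAndWait` are hypothetical names, and a single goroutine stands in for the dispatcher loop. The point shown is only that the write lock is released before blocking on the `sync.WaitGroup`, so a handler that needs the read lock can still run and signal completion.

```go
package main

import (
	"fmt"
	"sync"
)

// shimContext is a hypothetical stand-in for the shim's Context.
type shimContext struct {
	lock  sync.RWMutex
	nodes map[string]bool // node name -> accepted
}

// isAccepted stands in for any handler path that needs the read lock.
func (c *shimContext) isAccepted(name string) bool {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.nodes[name]
}

// addNode records the node under the write lock, releases the lock while it
// waits for the acknowledgement, then re-acquires it to finish the update.
func (c *shimContext) addNode(name string, events chan<- string, accepted *sync.WaitGroup) {
	c.lock.Lock()
	c.nodes[name] = false
	accepted.Add(1)
	events <- name
	c.lock.Unlock() // release before blocking, otherwise the handler below cannot run

	accepted.Wait()

	c.lock.Lock() // re-acquire to complete the state change
	c.nodes[name] = true
	c.lock.Unlock()
}

// registerAndWait wires a minimal event loop to addNode and reports the result.
func registerAndWait(name string) bool {
	c := &shimContext{nodes: make(map[string]bool)}
	events := make(chan string, 1)
	var accepted sync.WaitGroup
	go func() {
		for n := range events {
			_ = c.isAccepted(n) // needs the read lock, like the task handler in the issue
			accepted.Done()
		}
	}()
	c.addNode(name, events, &accepted)
	close(events)
	return c.isAccepted(name)
}

func main() {
	fmt.Println(registerAndWait("node-1")) // true
}
```

The trade-off of this approach is that other goroutines may observe or change the shared state in the window between `Unlock` and the second `Lock`, so any invariant that spans the wait has to be re-checked after re-acquiring.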
[jira] [Commented] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does
[ https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848889#comment-17848889 ]
Peter Bacsko commented on YUNIKORN-2637:
OK. Let's get this in and we'll see what happens then.
> finalizePods should ignore pods like registerPods does
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Major
> Labels: pull-request-available
>
> The initialisation code is a two step process for pods: first list all pods and add them to the system in registerPods(). This returns a list of pods processed.
> The second step happens after event handlers are turned on and nodes have been cleaned up etc. During the second step pods from the first step are checked and removed. However pods that were already in a terminated state in step 1 get removed again. Although the step should be idempotent this is unneeded. When iterating over the existing pods any pod in a terminal state should be skipped.
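The skip described in the issue can be sketched as below. This is an illustration under assumptions, not the k8shim code: `pod`, `isTerminated`, and `podsToRemove` are hypothetical, the pod type is reduced to a name and a phase, and the sketch assumes step 2 removes pods from step 1 that are no longer present. Treating the `Succeeded` and `Failed` phases as terminal follows the Kubernetes pod lifecycle.

```go
package main

import "fmt"

// pod is a hypothetical, trimmed-down pod record.
type pod struct {
	name  string
	phase string // Kubernetes pod phase: Pending, Running, Succeeded, Failed, Unknown
}

// isTerminated treats Succeeded and Failed as terminal phases.
func isTerminated(p pod) bool {
	return p.phase == "Succeeded" || p.phase == "Failed"
}

// podsToRemove returns the names of pods seen in step 1 that are gone in step 2.
// Pods that were already terminal in step 1 are skipped, since they were
// handled then and removing them again is unneeded.
func podsToRemove(registered []pod, stillPresent map[string]bool) []string {
	var removed []string
	for _, p := range registered {
		if isTerminated(p) {
			continue
		}
		if !stillPresent[p.name] {
			removed = append(removed, p.name)
		}
	}
	return removed
}

func main() {
	registered := []pod{{"a", "Running"}, {"b", "Succeeded"}, {"c", "Pending"}}
	present := map[string]bool{"c": true}
	fmt.Println(podsToRemove(registered, present)) // [a]
}
```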
[jira] [Assigned] (YUNIKORN-2569) Helm upgrade behaviour
[ https://issues.apache.org/jira/browse/YUNIKORN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tseng Hsi-Huang reassigned YUNIKORN-2569:
    Assignee: Tseng Hsi-Huang
> Helm upgrade behaviour
>
> Key: YUNIKORN-2569
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2569
> Project: Apache YuniKorn
> Issue Type: Test
> Reporter: Manikandan R
> Assignee: Tseng Hsi-Huang
> Priority: Major
>
> Need to test the Yunikorn upgrade behaviour through Helm.
> For example,
> 1. Create cluster using kind create.
> 2. Deploy old versions of Yunikorn (say, 1.2 or 1.3 or 1.4) using helm deploy.
> 3. Sanity checks to ensure deployed version is working as expected.
> 4. Upgrade YK version to the latest master (1.6) using helm upgrade.
> 5. Document the behaviour especially when there are any issues.
> Repeat for each old versions (1.2, 1.3 and 1.4).
[jira] [Commented] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does
[ https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848863#comment-17848863 ]
Wilfred Spiegelenburg commented on YUNIKORN-2637:
The case we are solving here is the correct removal of a pod that was registered and then stopped. In this case if the pod was assigned to a node it gets removed, this includes from the core also. In the case that it was not assigned to a node the request gets removed. I think both core and k8shim are affected by this after looking at the details in YUNIKORN-2526. So I am not sure if that is the root cause of the difference...
> finalizePods should ignore pods like registerPods does
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Major
> Labels: pull-request-available
[jira] [Assigned] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does
[ https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Bacsko reassigned YUNIKORN-2637:
    Assignee: Wilfred Spiegelenburg
> finalizePods should ignore pods like registerPods does
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Assignee: Wilfred Spiegelenburg
> Priority: Major
> Labels: pull-request-available
[jira] [Updated] (YUNIKORN-2636) Admission Controller ignores existing Queue/ApplicationID annotations
[ https://issues.apache.org/jira/browse/YUNIKORN-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated YUNIKORN-2636:
    Labels: pull-request-available (was: )
> Admission Controller ignores existing Queue/ApplicationID annotations
>
> Key: YUNIKORN-2636
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2636
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Yu-Lin Chen
> Assignee: Yu-Lin Chen
> Priority: Major
> Labels: pull-request-available
>
> Admission controller should patch missing queueName/applicationID to pod if they were not set.
> However, if the queueName/applicationID were only set in an annotation, AM will still patch the default value to the pod. The updatePodLabel() only checks existing labels.
> * [https://github.com/apache/yunikorn-k8shim/blob/master/pkg/admission/util.go#L39-L40]
> * [https://github.com/apache/yunikorn-k8shim/blob/6f2800f689e9e341c736a6af8cbf178a711a9423/pkg/admission/util.go#L53]
>
> {*}What will impact users{*}:
> When we set the placement rule to "provided" and queueName/applicationID were only set in an annotation, the application will be submitted to the default queue instead of the queue set in the pod annotation.
> {*}Steps to reproduce the issue{*}:
> 1. Update placement rule to 'provided'
> {code:java}
> apiVersion: v1
> kind: ConfigMap
> metadata:
>   name: yunikorn-configs
>   namespace: yunikorn
> data:
>   log.level: "DEBUG"
>   queues.yaml: |
>     partitions:
>       - name: default
>         queues:
>           - name: root
>             submitacl: '*'
>         placementrules:
>           - name: provided
>             create: true
> {code}
> 2. Create pod with queue/app-id annotations
> {code:java}
> apiVersion: v1
> kind: Pod
> metadata:
>   labels:
>     app: sleep
>   annotations:
>     yunikorn.apache.org/app-id: "application-sleep-0001"
>     yunikorn.apache.org/queue: "root.sandbox"
>   name: pod-with-annotations-only
> spec:
>   schedulerName: yunikorn
>   restartPolicy: Never
>   containers:
>     - name: sleep-6000s
>       image: "alpine:latest"
>       command: ["sleep", "6000"]
>       resources:
>         requests:
>           cpu: "100m"
>           memory: "500M"
> {code}
> 3. Check the labels below in the pod; these are added by AM.
> * queue=root.default
> * applicationId=yunikorn-default-autogen
> 4. Check the app's queue:
> * [http://localhost:9889/ws/v1/partition/default/application/application-sleep-0001]
>
> The app is in "root.default", but it should be in "root.sandbox".
> (Please check the queue ordering: [Doc|https://yunikorn.apache.org/docs/user_guide/labels_and_annotations_in_yunikorn#annotations-in-yunikorn])
> To fix this issue, we should let AM check all the possible locations.
> Notes: As part of the future change in YUNIKORN-2504, if the same metadata (queue or applicationID) is specified in multiple locations and their values are inconsistent, the pod should be rejected.
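"Check all the possible locations" can be sketched as a lookup that consults both labels and annotations before deciding to patch a default. This is an assumption-laden illustration, not the admission controller's code: `queueIsSet` and `defaultQueuePatch` are hypothetical helpers, and only the `queue` label and the `yunikorn.apache.org/queue` annotation from the reproduction steps are considered.

```go
package main

import "fmt"

const (
	labelQueue      = "queue"                     // label added by the admission controller in the issue
	annotationQueue = "yunikorn.apache.org/queue" // annotation used in the reproduction pod
)

// queueIsSet reports whether a queue is defined in either location.
func queueIsSet(labels, annotations map[string]string) bool {
	if labels[labelQueue] != "" {
		return true
	}
	return annotations[annotationQueue] != ""
}

// defaultQueuePatch returns the labels to add to the pod. It is empty when the
// pod already names a queue anywhere, so an annotation-only pod is left alone.
func defaultQueuePatch(labels, annotations map[string]string, defaultQueue string) map[string]string {
	if queueIsSet(labels, annotations) {
		return map[string]string{}
	}
	return map[string]string{labelQueue: defaultQueue}
}

func main() {
	annotated := map[string]string{annotationQueue: "root.sandbox"}
	fmt.Println(defaultQueuePatch(nil, annotated, "root.default")) // map[]
	fmt.Println(defaultQueuePatch(nil, nil, "root.default"))       // map[queue:root.default]
}
```

The same check would apply to the application ID; it is left out here to keep the sketch short.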
[jira] [Updated] (YUNIKORN-2631) Support canonical labels for queue/applicationId in Admission Controller
[ https://issues.apache.org/jira/browse/YUNIKORN-2631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wilfred Spiegelenburg updated YUNIKORN-2631:
    Labels: pull-request-available release-notes (was: pull-request-available)
> Support canonical labels for queue/applicationId in Admission Controller
>
> Key: YUNIKORN-2631
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2631
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: shim - kubernetes
> Reporter: Yu-Lin Chen
> Assignee: Yu-Lin Chen
> Priority: Major
> Labels: pull-request-available, release-notes
>
> Admission controller adds applicationID and label to Pod if they are not already set in the Pod.
> According to the new policy defined in YUNIKORN-1351, the Admission Controller will change to patching the canonical label/annotation in future releases.
> * yunikorn.apache.org/app-id (Canonical Label)
> * yunikorn.apache.org/queue (Canonical Label)
> To avoid an upgrade problem where the admission controller gets started first, AM needs to generate both canonical/non-canonical labels in 1.6.0. (This ensures that the 1.5.0 scheduler could understand labels generated in the 1.6.0 admission controller.) In 1.7.0, we can switch to generating only the canonical label in AM.
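The transition plan above can be sketched as a label generator with a compatibility switch. The function `generatedLabels` and its `includeLegacy` flag are hypothetical; the canonical label names come from the issue, and the non-canonical names `applicationId` and `queue` are the ones shown in YUNIKORN-2636.

```go
package main

import "fmt"

const (
	legacyAppIDLabel    = "applicationId"
	legacyQueueLabel    = "queue"
	canonicalAppIDLabel = "yunikorn.apache.org/app-id"
	canonicalQueueLabel = "yunikorn.apache.org/queue"
)

// generatedLabels returns the labels to patch onto a pod. With includeLegacy
// set (the 1.6.0 behaviour described in the issue) both forms are written so
// an older scheduler still understands them; without it only the canonical
// labels are produced (the planned 1.7.0 behaviour).
func generatedLabels(appID, queue string, includeLegacy bool) map[string]string {
	labels := map[string]string{
		canonicalAppIDLabel: appID,
		canonicalQueueLabel: queue,
	}
	if includeLegacy {
		labels[legacyAppIDLabel] = appID
		labels[legacyQueueLabel] = queue
	}
	return labels
}

func main() {
	fmt.Println(len(generatedLabels("app-1", "root.sandbox", true)))  // 4
	fmt.Println(len(generatedLabels("app-1", "root.sandbox", false))) // 2
}
```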
[jira] [Updated] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does
[ https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated YUNIKORN-2637:
    Labels: pull-request-available (was: )
> finalizePods should ignore pods like registerPods does
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Priority: Major
> Labels: pull-request-available