[jira] [Created] (YUNIKORN-2599) AppStateChange/AppTaskCompleted event cannot be handled in many states

2024-05-02 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2599:
--

 Summary: AppStateChange/AppTaskCompleted event cannot be handled 
in many states
 Key: YUNIKORN-2599
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2599
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: shim - yarn
Reporter: Peter Bacsko


After YUNIKORN-2597 was merged, it became clear that we keep sending an 
{{AppStateChange}} event that the state machine cannot handle: no state in 
the FSM object is able to process this event.

{{AppTaskCompleted}} is very similar: it is only processed in the 
{{Resuming}} state.
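
To illustrate the problem, here is a minimal Go sketch of a "can this event be handled in this state" check. The {{transitions}} table and the {{canHandle}} helper below are hypothetical and only encode the two facts stated above; they are not the real transition list or API of the shim.

```go
package main

import "fmt"

// transitions lists, per event, the source states that accept it.
// Illustrative table only: it encodes the two facts from this report,
// not the real transition list of the shim.
var transitions = map[string][]string{
	"AppStateChange":   {},           // no source state accepts it
	"AppTaskCompleted": {"Resuming"}, // accepted in exactly one state
}

// canHandle reports whether event is accepted while the FSM is in state.
func canHandle(state, event string) bool {
	for _, src := range transitions[event] {
		if src == state {
			return true
		}
	}
	return false
}

func main() {
	for _, event := range []string{"AppStateChange", "AppTaskCompleted"} {
		for _, state := range []string{"Running", "Resuming"} {
			fmt.Printf("state=%s event=%s canHandle=%v\n",
				state, event, canHandle(state, event))
		}
	}
}
```

With such a table, every event sent to an application in the {{Running}} state is rejected, which matches the errors in the log.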

Running the test case {{TestApplicationScheduling}} displays the following 
errors:
{noformat}
[...]
2024-05-02T18:08:14.856+0200  ERROR  shim.context  cache/context.go:1316
application event cannot be handled in the current state
{"applicationID": "app0001", "event": "AppStateChange", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123
github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225
2024-05-02T18:08:14.856+0200  INFO  core.scheduler.application
[...]
2024-05-02T18:08:14.857+0200  INFO  core.scheduler.partition
scheduler/partition.go:928  scheduler allocation processed  {"appID": 
"app0001", "allocationKey": "task0002", "allocatedResource": 
"map[memory:1000 pods:1 vcore:1]", "placeholder": false, "targetNode": 
"test.host.02"}
2024-05-02T18:08:14.857+0200  ERROR  shim.context  cache/context.go:1316
application event cannot be handled in the current state
{"applicationID": "app0001", "event": "AppStateChange", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123
github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225
[...]
2024-05-02T18:08:15.856+0200  INFO  shim.fsm  cache/task_state.go:380
Task state transition  {"app": "app0001", "task": "task0001", "taskAlias": 
"default/task0001", "source": "Bound", "destination": "Completed", "event": 
"CompleteTask"}
2024-05-02T18:08:15.856+0200  ERROR  shim.context  cache/context.go:1316
application event cannot be handled in the current state
{"applicationID": "app0001", "event": "AppTaskCompleted", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123
github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225
[...]
2024-05-02T18:08:16.858+0200  INFO  shim.fsm  cache/task_state.go:380
Task state transition  {"app": "app0001", "task": "task0002", "taskAlias": 
"default/task0002", "source": "Bound", "destination": "Completed", "event": 
"CompleteTask"}
2024-05-02T18:08:16.858+0200  ERROR  shim.context  cache/context.go:1316
application event cannot be handled in the current state
{"applicationID": "app0001", "event": "AppTaskCompleted", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123
github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225
[...]
2024-05-02T18:08:16.859+0200  ERROR  shim.context  cache/context.go:1316
application event cannot be handled in the current state
{"applicationID": "app0001", "event": "AppStateChange", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
/home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
/home/bacskop/re

[jira] [Resolved] (YUNIKORN-2597) Improve error messages in Context

2024-05-02 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2597.

Fix Version/s: 1.6.0, 1.5.1
   Resolution: Fixed

Merged to master & branch-1.5.

> Improve error messages in Context
> -
>
> Key: YUNIKORN-2597
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2597
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.1
>
>
> The logging in {{cache.Context}} related to event handling needs some 
> improvement:
> 1) When an error occurs while the task event handler is being retrieved, it 
> logs "failed to handle application event" and the task ID is omitted, which 
> makes debugging harder.
> 2) If {{canHandle()}} returns false, we don't do anything, just return. 
> Again, this makes debugging much harder.
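
The second point can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the code that was merged: the {{app}} type, its {{accepts}} table and the {{handleEvent}} function are hypothetical stand-ins, and only the wording of the error message is taken from the shim's log output.

```go
package main

import (
	"fmt"
	"log"
)

// app is a hypothetical stand-in for an application that owns a state machine.
type app struct {
	id      string
	state   string
	accepts map[string]map[string]bool // state -> event -> accepted
}

// canHandle reports whether the event is accepted in the current state.
// Lookups on a missing state yield a nil map, which safely returns false.
func (a *app) canHandle(event string) bool {
	return a.accepts[a.state][event]
}

// handleEvent returns a descriptive error instead of returning silently
// when the event cannot be handled, so the caller can log it.
func handleEvent(a *app, event string) error {
	if !a.canHandle(event) {
		return fmt.Errorf(
			"application event cannot be handled in the current state: applicationID=%s event=%s state=%s",
			a.id, event, a.state)
	}
	// A real handler would trigger the state transition here.
	return nil
}

func main() {
	a := &app{
		id:    "app0001",
		state: "Running",
		accepts: map[string]map[string]bool{
			"Resuming": {"AppTaskCompleted": true},
		},
	}
	if err := handleEvent(a, "AppTaskCompleted"); err != nil {
		log.Println(err)
	}
}
```

Including the application ID, the event and the current state in the message makes a rejected event visible without a debugger.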



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-2518) Allow recovery queue in REST requests

2024-05-02 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2518.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Allow recovery queue in REST requests
> -
>
> Key: YUNIKORN-2518
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2518
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Assignee: Ted Lin
>Priority: Minor
>  Labels: newbie, pull-request-available
> Fix For: 1.6.0
>
>
> The current checks for the REST requests that require a queue path to be 
> provided prevent looking at the {{root.@recover@}} queue.
> The validator filters the queue names, which makes it impossible to check if 
> the queue has any running applications or pods after initialisation using the 
> REST requests.
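
One possible shape of such a check, sketched in Go: the validator lets the recovery queue path through before applying the normal name filter. The {{queuePart}} pattern and the {{validQueuePath}} function are assumptions made for illustration; the real naming rule and validator in the core may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// recoveryQueue is the queue path named in the issue.
const recoveryQueue = "root.@recover@"

// queuePart is an ASSUMED pattern for a single path segment;
// the real naming rule in the core may differ.
var queuePart = regexp.MustCompile(`^[a-zA-Z0-9_-]{1,64}$`)

// validQueuePath accepts the recovery queue explicitly, then applies
// the normal per-segment name filter to every other path.
func validQueuePath(path string) bool {
	if path == recoveryQueue {
		return true
	}
	for _, part := range strings.Split(path, ".") {
		if !queuePart.MatchString(part) {
			return false
		}
	}
	return true
}

func main() {
	for _, p := range []string{"root.default", "root.@recover@", "root.bad@name"} {
		fmt.Printf("%-16s valid=%v\n", p, validQueuePath(p))
	}
}
```

Special-casing the exact path keeps the general filter strict while still making the recovery queue reachable.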


