[jira] [Resolved] (YUNIKORN-2725) Temporarily disable failing e2e preemption tests

2024-07-04 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2725.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Temporarily disable failing e2e preemption tests
> 
>
> Key: YUNIKORN-2725
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2725
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: shim - kubernetes, test - e2e
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> Disable the following tests to have green builds:
> Verify_preemption_on_priority_queue
> Verify_basic_preemption
> Verify_allow_preemption_tag



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-2725) Temporarily disable failing e2e tests

2024-07-04 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2725:
--

 Summary: Temporarily disable failing e2e tests
 Key: YUNIKORN-2725
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2725
 Project: Apache YuniKorn
  Issue Type: Test
  Components: shim - kubernetes, test - e2e
Reporter: Peter Bacsko
Assignee: Peter Bacsko


Disable the following tests to have green builds:

Verify_preemption_on_priority_queue
Verify_basic_preemption






[jira] [Created] (YUNIKORN-2724) Improve the signature of methods notifyTaskComplete() and ensureAppAndTaskCreated()

2024-07-04 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2724:
--

 Summary: Improve the signature of methods notifyTaskComplete() and 
ensureAppAndTaskCreated()
 Key: YUNIKORN-2724
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2724
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Peter Bacsko


From the review [https://github.com/apache/yunikorn-k8shim/pull/864]

"I also think we need to change the signature for {{notifyTaskComplete(string, 
string)}} to {{notifyTaskComplete(*Application, string)}} Probably better to 
use a separate jira for that as it flows through into {{NotifyTaskComplete()}} 
and some tests. The 2 tests have the application pointer already. It removes a 
number of extra getApplication() calls we really do not need.
Similar for {{ensureAppAndTaskCreated()}} which is only ever called from this 
function. Add a parameter to it to make it: {{ensureAppAndTaskCreated(*v1.Pod, 
*Application)}} and only execute application creation {{if app == nil}}. 
This can be either in this jira or in a separate one."

That is, optimize the methods so that we avoid unnecessary {{GetApplication()}} 
calls.
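
A minimal sketch of the idea, not the shim's actual code: the types, fields and the lookup counter below are hypothetical and only show why passing the application pointer avoids a second lookup.

```go
package main

import "fmt"

// Application is a hypothetical stand-in for the shim's application type.
type Application struct {
	ID        string
	completed map[string]bool
}

// Context is a hypothetical registry that owns the applications.
type Context struct {
	apps    map[string]*Application
	lookups int // counts registry lookups, for illustration only
}

// getApplication looks an application up by ID.
func (c *Context) getApplication(appID string) *Application {
	c.lookups++
	return c.apps[appID]
}

// notifyTaskCompleteByID mirrors the old shape: it only receives the ID
// and has to look the application up again.
func (c *Context) notifyTaskCompleteByID(appID, taskID string) {
	if app := c.getApplication(appID); app != nil {
		app.completed[taskID] = true
	}
}

// notifyTaskComplete mirrors the proposed shape: the caller passes the
// pointer it already holds, so no extra lookup is needed.
func (c *Context) notifyTaskComplete(app *Application, taskID string) {
	if app == nil {
		return
	}
	app.completed[taskID] = true
}

func main() {
	app := &Application{ID: "app-1", completed: map[string]bool{}}
	ctx := &Context{apps: map[string]*Application{app.ID: app}}

	ctx.notifyTaskCompleteByID("app-1", "task-1")
	ctx.notifyTaskComplete(app, "task-2")
	fmt.Println("lookups:", ctx.lookups, "completed:", len(app.completed))
}
```

Only the ID-based call touches the registry; the pointer-based call reuses what the caller already has.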






[jira] [Resolved] (YUNIKORN-2182) Set ReadHeaderTimeout in http server

2024-07-03 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2182.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master.

> Set ReadHeaderTimeout in http server
> 
>
> Key: YUNIKORN-2182
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2182
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - common, webapp
>Reporter: Wilfred Spiegelenburg
>Assignee: Chenchen Lai
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 1.6.0
>
>
> Potential Slowloris Attack because ReadHeaderTimeout is not configured in the 
> http.Server (gosec)
> We do not set ReadTimeout or ReadHeaderTimeout so we do not have a timeout at 
> all at the moment.
> BTW: this is not important for the webtest servers we build as they are just 
> for our tests.
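
For illustration, a minimal sketch of setting the timeout on a standard library server. The address and the 10 second value are placeholders, not the values chosen in the fix.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newServer builds an http.Server with a header read timeout, so a client
// that trickles its request headers is disconnected instead of holding the
// connection open indefinitely. The duration is a placeholder.
func newServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           handler,
		ReadHeaderTimeout: 10 * time.Second,
	}
}

func main() {
	srv := newServer("127.0.0.1:9080", http.NewServeMux())
	fmt.Println("ReadHeaderTimeout:", srv.ReadHeaderTimeout)
	// srv.ListenAndServe() would start serving; omitted to keep the sketch short.
}
```

ReadHeaderTimeout only bounds reading the request headers; ReadTimeout would also bound reading the body.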






[jira] [Resolved] (YUNIKORN-2568) Move all xxxEvents types to objects/events

2024-07-02 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2568.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Move all xxxEvents types to objects/events
> --
>
> Key: YUNIKORN-2568
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2568
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Resolved] (YUNIKORN-2564) [Umbrella] Move xxxEvents types to a different package

2024-07-02 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2564.

Fix Version/s: 1.6.0
   Resolution: Fixed

> [Umbrella] Move xxxEvents types to a different package
> --
>
> Key: YUNIKORN-2564
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2564
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 1.6.0
>
>
> There are several Events that can be moved to a different package:
> * queueEvents
> * applicationEvents
> * askEvents
> * nodeEvents
> There are numerous files in {{pkg/scheduler/objects}}. This is an opportunity 
> to clean it up a bit and move these under eg. 
> {{pkg/scheduler/objects/events}}.






[jira] [Created] (YUNIKORN-2708) Release notes for 1.5.2

2024-06-28 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2708:
--

 Summary: Release notes for 1.5.2
 Key: YUNIKORN-2708
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2708
 Project: Apache YuniKorn
  Issue Type: Sub-task
Reporter: Peter Bacsko









[jira] [Created] (YUNIKORN-2709) Update website for 1.5.2

2024-06-28 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2709:
--

 Summary: Update website for 1.5.2
 Key: YUNIKORN-2709
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2709
 Project: Apache YuniKorn
  Issue Type: Sub-task
  Components: release
Reporter: Peter Bacsko
Assignee: Peter Bacsko









[jira] [Created] (YUNIKORN-2706) [UMBRELLA] YuniKorn 1.5.2 release efforts

2024-06-28 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2706:
--

 Summary: [UMBRELLA] YuniKorn 1.5.2 release efforts
 Key: YUNIKORN-2706
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2706
 Project: Apache YuniKorn
  Issue Type: Task
  Components: release
Reporter: Peter Bacsko
Assignee: Peter Bacsko


This umbrella is to track the work items needed for the 1.5.2 release.

Release manager: Peter Bacsko.

This release only contains bug fixes.






[jira] [Created] (YUNIKORN-2707) Tagging for 1.5.2

2024-06-28 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2707:
--

 Summary: Tagging for 1.5.2
 Key: YUNIKORN-2707
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2707
 Project: Apache YuniKorn
  Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko









[jira] [Resolved] (YUNIKORN-2704) Event publish errors out when predicates fail

2024-06-28 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2704.

Fix Version/s: 1.6.0
   1.5.2
   Resolution: Fixed

Merged to master & branch-1.5

> Event publish errors out when predicates fail
> -
>
> Key: YUNIKORN-2704
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2704
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Mit Desai
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 1.6.0, 1.5.2
>
>
> I consistently see this error in the logs when events are published.
> I did put some debug logs and found that I only get it when the events for 
> untolerated taints are published.
> E0618 17:43:17.858946       1 event_broadcaster.go:270] "Server rejected 
> event (will not retry!)" err="Event \"<>.17da2a31072bb32f\" is 
> invalid: [action: Required value, reason: Required value]" 
> event="\{ObjectMeta:{<>.17da2a31072bb32f  dpi-dev    0 
> 0001-01-01 00:00:00 + UTC   map[] map[] [] [] 
> []},EventTime:2024-06-18 17:43:17.857332069 + UTC 
> m=+84279.014490005,Series:nil,ReportingController:yunikorn,ReportingInstance:yunikorn-yunikorn-scheduler-59bdc88fdc-7h5bt,Action:,Reason:,Regarding:\{Pod
>  <> <> 5c90315c-a07d-4801-9ecc-baf61ee45f11 v1 
> 4323324038 },Related:nil,Note:Predicate failed for request 
> '5c90315c-a07d-4801-9ecc-baf61ee45f11' with message: 'node(s) had untolerated 
> taint \{<>: <>}',Type:Normal,DeprecatedSource:\{ 
> },DeprecatedFirstTimestamp:0001-01-01 00:00:00 + 
> UTC,DeprecatedLastTimestamp:0001-01-01 00:00:00 + UTC,DeprecatedCount:0,}"






[jira] [Resolved] (YUNIKORN-2694) Improve placement rule function's test coverage - 2

2024-06-25 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2694.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Improve placement rule function's test coverage - 2
> --
>
> Key: YUNIKORN-2694
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2694
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: core - common
>Reporter: JunHong Peng
>Assignee: JunHong Peng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Resolved] (YUNIKORN-2683) Unnecessary error is logged when resource usage is increased

2024-06-25 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2683.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Unnecessary error is logged when resource usage is increased
> 
>
> Key: YUNIKORN-2683
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2683
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> The refactored code in YUNIKORN-2542 logs an unnecessary error message:
> {noformat}
>   appGroup := userTracker.getGroupForApp(applicationID)
>   log.Log(log.SchedUGM).Debug("Increasing resource usage for user",
>       zap.String("user", user.User),
>       zap.String("queue path", queuePath),
>       zap.String("application", applicationID),
>       zap.String("group", appGroup),
>       zap.Stringer("resource", usage))
>   groupTracker := m.GetGroupTracker(appGroup)
>   if groupTracker == nil {
>       log.Log(log.SchedUGM).Error("group tracker should be available in groupTrackers map",
>           zap.String("application", applicationID),
>           zap.String("group", appGroup))
>       return
>   }
> ...
> {noformat}
> We don't always have a {{groupTracker}}. The previous code simply called 
> {{increaseTrackedResource()}} on an empty tracker:
> {noformat}
> func (ut *UserTracker) increaseTrackedResource(queuePath string, applicationID string, usage *resources.Resource) {
>   ut.Lock()
>   defer ut.Unlock()
>   ut.events.sendIncResourceUsageForUser(ut.userName, queuePath, usage)
>   hierarchy := strings.Split(queuePath, configs.DOT)
>   ut.queueTracker.increaseTrackedResource(hierarchy, applicationID, user, usage)
>   gt := ut.appGroupTrackers[applicationID]
>   log.Log(log.SchedUGM).Debug("Increasing resource usage for group",
>       zap.String("group", gt.getName()),
>       zap.Strings("queue path", hierarchy),
>       zap.String("application", applicationID),
>       zap.Stringer("resource", usage))
>   gt.increaseTrackedResource(queuePath, applicationID, usage, ut.userName) // <- gt can be nil
> }
> {noformat}
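
One way to treat a missing group tracker as a normal case is sketched below. This is not the merged fix; the types and the function name are hypothetical and much simpler than the scheduler's.

```go
package main

import "fmt"

// groupTracker is a hypothetical, much simplified stand-in for the
// scheduler's group tracker; it only counts tracked usage.
type groupTracker struct {
	name  string
	usage int
}

// manager is a hypothetical stand-in for the user/group manager.
type manager struct {
	groupTrackers map[string]*groupTracker
}

// increaseGroupUsage adds usage to the tracker of the given group, if there
// is one. A missing tracker is treated as a normal case and skipped silently,
// so no error is logged. It reports whether a tracker was updated.
func (m *manager) increaseGroupUsage(group string, usage int) bool {
	gt := m.groupTrackers[group]
	if gt == nil {
		return false
	}
	gt.usage += usage
	return true
}

func main() {
	m := &manager{groupTrackers: map[string]*groupTracker{
		"dev": {name: "dev"},
	}}
	fmt.Println(m.increaseGroupUsage("dev", 5)) // tracker exists
	fmt.Println(m.increaseGroupUsage("", 5))    // no group tracker for this app
}
```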






[jira] [Resolved] (YUNIKORN-2661) Fix hard-coded boolean in setLimit

2024-06-24 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2661.

Fix Version/s: 1.6.0
   1.5.2
   Resolution: Fixed

Merged to master & branch-1.5

> Fix hard-coded boolean in setLimit
> --
>
> Key: YUNIKORN-2661
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2661
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.2
>
>
> Inside the UGM code {{setLimit()}}, we don't pass down {{doWildCardCheck}}, 
> so this variable never reaches the leaves:
> {noformat}
> // Note: Lock free call. The Lock of the linked tracker (UserTracker and GroupTracker) should be held before calling this function.
> func (qt *QueueTracker) setLimit(hierarchy []string, maxResource *resources.Resource, maxApps uint64, useWildCard bool, trackType trackingType, doWildCardCheck bool) {
>   log.Log(log.SchedUGM).Debug("Setting limits",
>       zap.String("queue path", qt.queuePath),
>       zap.Strings("hierarchy", hierarchy),
>       zap.Uint64("max applications", maxApps),
>       zap.Stringer("max resources", maxResource),
>       zap.Bool("use wild card", useWildCard))
>   // depth first: all the way to the leaf, create if not exists
>   // more than 1 in the slice means we need to recurse down
>   if len(hierarchy) > 1 {
>       childName := hierarchy[1]
>       if qt.childQueueTrackers[childName] == nil {
>           qt.childQueueTrackers[childName] = newQueueTracker(qt.queuePath, childName, trackType)
>       }
>       // <-- the last argument should be "doWildCardCheck", not "false"
>       qt.childQueueTrackers[childName].setLimit(hierarchy[1:], maxResource, maxApps, useWildCard, trackType, false)
> ...
> {noformat}
> Fix this and create a unit test for {{setLimit()}}.
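
A simplified sketch of the corrected recursion, assuming a hypothetical tracker type that only records the flag it receives. The real {{setLimit()}} takes more parameters and also sets the limits.

```go
package main

import "fmt"

// queueTracker is a hypothetical, simplified tracker node.
type queueTracker struct {
	name          string
	children      map[string]*queueTracker
	wildCardCheck bool // records the flag value seen at this level
}

func newQueueTracker(name string) *queueTracker {
	return &queueTracker{name: name, children: map[string]*queueTracker{}}
}

// setLimit walks the hierarchy depth first, creating children as needed,
// and forwards doWildCardCheck unchanged to every level.
func (qt *queueTracker) setLimit(hierarchy []string, doWildCardCheck bool) {
	qt.wildCardCheck = doWildCardCheck
	if len(hierarchy) > 1 {
		childName := hierarchy[1]
		if qt.children[childName] == nil {
			qt.children[childName] = newQueueTracker(childName)
		}
		// Pass the caller's value down instead of a literal false.
		qt.children[childName].setLimit(hierarchy[1:], doWildCardCheck)
	}
}

func main() {
	root := newQueueTracker("root")
	root.setLimit([]string{"root", "parent", "leaf"}, true)
	leaf := root.children["parent"].children["leaf"]
	fmt.Println(leaf.name, leaf.wildCardCheck)
}
```

A unit test along the same lines would set the flag at the root and assert that the leaf tracker sees the same value.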






[jira] [Resolved] (YUNIKORN-2516) Update documentation about event.RESTResponseSize

2024-06-21 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2516.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Update documentation about event.RESTResponseSize
> -
>
> Key: YUNIKORN-2516
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2516
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Resolved] (YUNIKORN-2512) Event system properties are not used

2024-06-21 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2512.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Event system properties are not used
> 
>
> Key: YUNIKORN-2512
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2512
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - common
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 1.6.0
>
>
> There are two properties which are not used by the event system:
> # The property "event.requestCapacity" is supposed to determine the size of a 
> slice which is used between the core and shim to transfer events every 2 
> seconds. However, right now it's not used at all; we use the default (1000) 
> every time.
> # The property "RESTResponseSize" is not in the code at all. It is supposed 
> to set the maximum number of entries returned in the batch API. 
> Currently, the hard-coded value is 1.
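
A sketch of how a configured value could replace a hard-coded default, assuming the settings arrive as a plain string map. The helper name and the map are hypothetical; only the key "event.requestCapacity" and the default of 1000 come from the description above.

```go
package main

import (
	"fmt"
	"strconv"
)

const defaultRequestCapacity = 1000 // default named in the issue text

// getPositiveInt returns the configured value for key, or fallback when the
// key is missing, not a number, or not positive.
func getPositiveInt(cfg map[string]string, key string, fallback int) int {
	raw, ok := cfg[key]
	if !ok {
		return fallback
	}
	value, err := strconv.Atoi(raw)
	if err != nil || value <= 0 {
		return fallback
	}
	return value
}

func main() {
	cfg := map[string]string{"event.requestCapacity": "2500"}
	capacity := getPositiveInt(cfg, "event.requestCapacity", defaultRequestCapacity)
	// The slice used to hand events over is sized from the configured value.
	events := make([]string, 0, capacity)
	fmt.Println(cap(events))
}
```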






[jira] [Resolved] (YUNIKORN-2245) Application sorting: improve pending resource filtering

2024-06-21 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2245.

Resolution: Won't Do

> Application sorting: improve pending resource filtering
> ---
>
> Key: YUNIKORN-2245
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2245
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Minor
>
> When sorting applications, we do a filtering on pending resources:
> {noformat}
> func filterOnPendingResources(apps map[string]*Application) []*Application {
>   filteredApps := make([]*Application, 0)
>   for _, app := range apps {
>       // Only look at app when pending-res > 0
>       if resources.StrictlyGreaterThanZero(app.GetPendingResource()) {
>           filteredApps = append(filteredApps, app)
>       }
>   }
>   return filteredApps
> }
> {noformat}
> This filtering is relatively expensive, but necessary, because during the 
> lifecycle of an application, {{sa.pending}} can become 0 and in this case, we 
> don't want to schedule anything from the app.
> Suggested approach is to track total pendingAskRepeats inside the app. That 
> way we don't need to call {{resources.StrictlyGreaterThanZero()}} and we 
> perform a simple integer comparison.
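
The suggested approach could look roughly like the sketch below. The Application type and its methods are hypothetical; only the counter name pendingAskRepeats is taken from the description.

```go
package main

import "fmt"

// Application is a hypothetical, simplified application that keeps a
// running total of pending ask repeats.
type Application struct {
	ID                string
	pendingAskRepeats int
}

// addAsk increases the counter when an ask is added.
func (a *Application) addAsk(repeats int) { a.pendingAskRepeats += repeats }

// removeAsk decreases the counter and never lets it drop below zero.
func (a *Application) removeAsk(repeats int) {
	a.pendingAskRepeats -= repeats
	if a.pendingAskRepeats < 0 {
		a.pendingAskRepeats = 0
	}
}

// filterOnPendingAsks keeps only applications that still have something
// pending, using one integer comparison per application.
func filterOnPendingAsks(apps map[string]*Application) []*Application {
	filtered := make([]*Application, 0, len(apps))
	for _, app := range apps {
		if app.pendingAskRepeats > 0 {
			filtered = append(filtered, app)
		}
	}
	return filtered
}

func main() {
	idle := &Application{ID: "app-idle"}
	busy := &Application{ID: "app-busy"}
	busy.addAsk(3)
	busy.removeAsk(1)

	apps := map[string]*Application{idle.ID: idle, busy.ID: busy}
	for _, app := range filterOnPendingAsks(apps) {
		fmt.Println(app.ID, app.pendingAskRepeats)
	}
}
```

The trade-off is that every code path that adds or removes an ask has to keep the counter in sync.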






[jira] [Closed] (YUNIKORN-2221) Performance improvements phase II

2024-06-21 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko closed YUNIKORN-2221.
--

> Performance improvements phase II
> -
>
> Key: YUNIKORN-2221
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2221
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler, shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Minor
> Fix For: 1.5.0
>
>
> Umbrella JIRA for further performance improvements in YuniKorn.
> The main issues have been addressed in YUNIKORN-1715. However, it's still 
> possible to reduce memory and CPU usage further by doing smaller things.






[jira] [Resolved] (YUNIKORN-2221) Performance improvements phase II

2024-06-21 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2221.

Fix Version/s: 1.5.0
   Resolution: Fixed

> Performance improvements phase II
> -
>
> Key: YUNIKORN-2221
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2221
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler, shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Minor
> Fix For: 1.5.0
>
>
> Umbrella JIRA for further performance improvements in YuniKorn.
> The main issues have been addressed in YUNIKORN-1715. However, it's still 
> possible to reduce memory and CPU usage further by doing smaller things.






[jira] [Resolved] (YUNIKORN-2653) Gang scheduling K8s event formatting compliance

2024-06-19 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2653.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Gang scheduling K8s event formatting compliance
> ---
>
> Key: YUNIKORN-2653
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2653
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> The K8s events provide definitions and rules around the content of the fields 
> within the event. Adjust the content of gang scheduling related events to 
> comply with the rules.
> Focussed on the reason and action fields only.
>  * 'reason' is the reason this event is generated. 'reason' should be short 
> and unique; it should be in UpperCamelCase format (starting with a capital 
> letter). 
>  * 'action' explains what happened with the regarding object, or what action 
> the ReportingController took in the object's name; it should be in 
> UpperCamelCase format (starting with a capital letter). 
> No spaces or long text.
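
A rough check for the format rule is sketched below. The function and the sample strings are made up for illustration, and the check is deliberately loose: it only requires a leading capital and no spaces or punctuation, so an all-caps word would also pass.

```go
package main

import (
	"fmt"
	"unicode"
)

// isUpperCamelCase reports whether s starts with an upper-case letter and
// contains only letters and digits (so no spaces or punctuation).
func isUpperCamelCase(s string) bool {
	if s == "" {
		return false
	}
	for i, r := range s {
		if i == 0 && !unicode.IsUpper(r) {
			return false
		}
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical reason strings, not the ones used by the shim.
	for _, reason := range []string{"GangScheduling", "gangScheduling", "Gang scheduling"} {
		fmt.Println(reason, isUpperCamelCase(reason))
	}
}
```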






[jira] [Created] (YUNIKORN-2683) Unnecessary error is logged when resource usage is increased

2024-06-19 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2683:
--

 Summary: Unnecessary error is logged when resource usage is 
increased
 Key: YUNIKORN-2683
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2683
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler
Reporter: Peter Bacsko


The refactored code in YUNIKORN-2542 logs an unnecessary error message:

{noformat}
    appGroup := userTracker.getGroupForApp(applicationID)
    log.Log(log.SchedUGM).Debug("Increasing resource usage for user",
        zap.String("user", user.User),
        zap.String("queue path", queuePath),
        zap.String("application", applicationID),
        zap.String("group", appGroup),
        zap.Stringer("resource", usage))
    groupTracker := m.GetGroupTracker(appGroup)
    if groupTracker == nil {
        log.Log(log.SchedUGM).Error("group tracker should be available in groupTrackers map",
            zap.String("application", applicationID),
            zap.String("group", appGroup))
        return
    }
...
{noformat}

We don't always have a {{groupTracker}}. The previous code simply called 
{{increaseTrackedResource()}} on an empty tracker:

{noformat}
func (ut *UserTracker) increaseTrackedResource(queuePath string, applicationID string, usage *resources.Resource) {
    ut.Lock()
    defer ut.Unlock()
    ut.events.sendIncResourceUsageForUser(ut.userName, queuePath, usage)
    hierarchy := strings.Split(queuePath, configs.DOT)
    ut.queueTracker.increaseTrackedResource(hierarchy, applicationID, user, usage)
    gt := ut.appGroupTrackers[applicationID]
    log.Log(log.SchedUGM).Debug("Increasing resource usage for group",
        zap.String("group", gt.getName()),
        zap.Strings("queue path", hierarchy),
        zap.String("application", applicationID),
        zap.Stringer("resource", usage))
    gt.increaseTrackedResource(queuePath, applicationID, usage, ut.userName) // <- gt can be nil
}
{noformat}








[jira] [Resolved] (YUNIKORN-2680) Improve placement rule function's test coverage

2024-06-18 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2680.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master.

> Improve placement rule function's test coverage
> --
>
> Key: YUNIKORN-2680
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2680
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: core - common
>Reporter: JunHong Peng
>Assignee: JunHong Peng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Created] (YUNIKORN-2681) Data race in TestGetStream_Limit

2024-06-18 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2681:
--

 Summary: Data race in TestGetStream_Limit
 Key: YUNIKORN-2681
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2681
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler, test - unit
Reporter: Peter Bacsko
Assignee: Peter Bacsko


A data race was detected during a unit test:

{noformat}
==================
WARNING: DATA RACE
Write at 0x0170c220 by goroutine 2575:
  github.com/apache/yunikorn-core/pkg/webservice.NewWebApp()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/webservice.go:82 +0x11c
  github.com/apache/yunikorn-core/pkg/webservice.TestCheckHealthStatusNotFound()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/handlers_test.go:2574 +0x2f
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1742 +0x44

Previous read at 0x0170c220 by goroutine 2542:
  github.com/apache/yunikorn-core/pkg/webservice.getStream()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/handlers.go:1225 +0xbd3
  github.com/apache/yunikorn-core/pkg/webservice.TestGetStream_Limit.gowrap4()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/handlers_test.go:2308 +0x4f

Goroutine 2575 (running) created at:
  testing.(*T).Run()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1742 +0x825
  testing.runTests.func1()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:2161 +0x85
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1689 +0x21e
  testing.runTests()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:2159 +0x8be
  testing.(*M).Run()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:2027 +0xf17
  main.main()
      _testmain.go:163 +0x2e4

Goroutine 2542 (running) created at:
  github.com/apache/yunikorn-core/pkg/webservice.TestGetStream_Limit()
      /home/runner/work/yunikorn-core/yunikorn-core/pkg/webservice/handlers_test.go:2308 +0xbb7
  testing.tRunner()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1689 +0x21e
  testing.(*T).Run.gowrap1()
      /opt/hostedtoolcache/go/1.22.4/x64/src/testing/testing.go:1742 +0x44
==================
2024-06-18T13:40:54.182Z  INFO  core.events  events/event_streaming.go:164  Removing event stream consumer  {"name": "host-1", "creation time": "2024-06-18T13:40:54.181Z"}
2024-06-18T13:40:54.182Z  INFO  core.scheduler.health  webservice/handlers.go:623  Health check is not available
--- FAIL: TestCheckHealthStatusNotFound (0.00s)
    testing.go:1398: race detected during execution of test
{noformat}
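
The trace shows a write in NewWebApp() and a read in getStream() at the same address, which suggests shared package-level state. The sketch below shows one generic way to make such state safe, using a sync.RWMutex. The variable and accessor names are hypothetical and this is not necessarily how the issue was fixed.

```go
package main

import (
	"fmt"
	"sync"
)

// streamLimit is a hypothetical package-level setting shared by goroutines;
// streamLimitMu guards every read and write of it.
var (
	streamLimitMu sync.RWMutex
	streamLimit   int
)

// setStreamLimit updates the shared value under the write lock.
func setStreamLimit(n int) {
	streamLimitMu.Lock()
	defer streamLimitMu.Unlock()
	streamLimit = n
}

// getStreamLimit reads the shared value under the read lock.
func getStreamLimit() int {
	streamLimitMu.RLock()
	defer streamLimitMu.RUnlock()
	return streamLimit
}

func main() {
	var wg sync.WaitGroup
	for i := 1; i <= 4; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			setStreamLimit(n)
			_ = getStreamLimit()
		}(i)
	}
	wg.Wait()
	fmt.Println(getStreamLimit() >= 1)
}
```

Running the tests with the -race flag is what surfaces this class of problem in the first place.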






[jira] [Resolved] (YUNIKORN-2673) Improve newFilter function's test coverage in filter.go

2024-06-18 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2673.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master.

> Improve newFilter function's test coverage in filter.go
> --
>
> Key: YUNIKORN-2673
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2673
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: core - common
>Reporter: JunHong Peng
>Assignee: JunHong Peng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Resolved] (YUNIKORN-2515) Add property event.RESTResponseSize to the batch event handler

2024-06-12 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2515.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Add property event.RESTResponseSize to the batch event handler
> --
>
> Key: YUNIKORN-2515
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2515
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Resolved] (YUNIKORN-2670) Improve util function's test coverage

2024-06-11 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2670.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Improve util function's test coverage
> 
>
> Key: YUNIKORN-2670
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2670
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: core - common
>Reporter: JunHong Peng
>Assignee: JunHong Peng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> Improve the test coverage of the following functions in util.go:
>  * ZeroTimeInUnixNano
>  * GetNewUUID
>  * IsRecoveryQueue






[jira] [Resolved] (YUNIKORN-2669) nil pointer dereference error

2024-06-09 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2669.

Resolution: Duplicate

This looks like a dup of YUNIKORN-2562. The solution for this has been 
delivered in 1.5.1. It's also on master.

> nil pointer dereference error
> -
>
> Key: YUNIKORN-2669
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2669
> Project: Apache YuniKorn
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Junyoung Park
>Assignee: Peter Bacsko
>Priority: Major
>
> Environment: AWS EKS 1.26
> yunikorn-scheduler logs
> {code:java}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x179b2f5]
> goroutine 50 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc000661000, {0xc008ad14a0, 0x24})
>     github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/objects/application.go:1739 +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0xc00046a100?, 0xc01436c880)
>     github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/partition.go:1281 +0x27f
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc000502680?, {0xc02014da60, 0x1, 0xc0112f5ee8?}, {0xc0060f8980, 0xb})
>     github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/context.go:868 +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc00046a100?, 0xc0145e8eb0?)
>     github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/context.go:750 +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000120990)
>     github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/scheduler.go:111 +0x16e
> created by github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in goroutine 1
>     github.com/apache/yunikorn-core@v1.4.0-1/pkg/scheduler/scheduler.go:55 +0x9c
> {code}






[jira] [Resolved] (YUNIKORN-2637) finalizePods should ignore pods like registerPods does

2024-06-07 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2637.

Fix Version/s: 1.6.0
   1.5.2
   Resolution: Fixed

Merged to master & branch-1.5.

> finalizePods should ignore pods like registerPods does
> --
>
> Key: YUNIKORN-2637
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2637
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.2
>
>
> The initialisation code is a two-step process for pods: first list all pods 
> and add them to the system in registerPods(). This returns a list of pods 
> processed.
> The second step happens after event handlers are turned on and nodes have 
> been cleaned up etc. During the second step, pods from the first step are 
> checked and removed. However, pods that were already in a terminated state in 
> step 1 get removed again. Although the step should be idempotent, this is 
> unneeded. When iterating over the existing pods, any pod in a terminal state 
> should be skipped.
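
The skip described above could look roughly like the sketch below. It is only an illustration: the pod and podPhase types and the podsToFinalize helper are made up for this example and are not the shim's actual code; the phase names follow the Kubernetes pod phases.

```go
package main

import "fmt"

// podPhase mirrors the Kubernetes pod phase strings. The type is a stand-in
// for this sketch, not the k8s API type.
type podPhase string

const (
	phasePending   podPhase = "Pending"
	phaseRunning   podPhase = "Running"
	phaseSucceeded podPhase = "Succeeded"
	phaseFailed    podPhase = "Failed"
)

// pod is a hypothetical, minimal pod record.
type pod struct {
	name  string
	phase podPhase
}

// isTerminated reports whether the pod has reached a terminal phase.
func isTerminated(p pod) bool {
	return p.phase == phaseSucceeded || p.phase == phaseFailed
}

// podsToFinalize returns the names of the pods from step one that still need
// to be checked in step two. Pods that were already terminal are skipped.
func podsToFinalize(registered []pod) []string {
	names := make([]string, 0, len(registered))
	for _, p := range registered {
		if isTerminated(p) {
			continue
		}
		names = append(names, p.name)
	}
	return names
}

func main() {
	pods := []pod{
		{"a", phaseRunning},
		{"b", phaseSucceeded},
		{"c", phaseFailed},
		{"d", phasePending},
	}
	fmt.Println(podsToFinalize(pods))
}
```

Only the pods that are still pending or running are handed to the second step; the terminal ones are left alone.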






[jira] [Resolved] (YUNIKORN-2668) Temporarily disable TestUpdateAllocation_NewTask_AssumePodFails

2024-06-07 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2668.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Temporarily disable TestUpdateAllocation_NewTask_AssumePodFails 
> 
>
> Key: YUNIKORN-2668
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2668
> Project: Apache YuniKorn
>  Issue Type: Task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> The test case TestUpdateAllocation_NewTask_AssumePodFails occasionally fails 
> due to a deadlock problem described in YUNIKORN-2629. Until that ticket is 
> resolved, let's disable this test for the time being, so upstream tests don't 
> fail.






[jira] [Created] (YUNIKORN-2668) Temporarily disable TestUpdateAllocation_NewTask_AssumePodFails

2024-06-07 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2668:
--

 Summary: Temporarily disable 
TestUpdateAllocation_NewTask_AssumePodFails 
 Key: YUNIKORN-2668
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2668
 Project: Apache YuniKorn
  Issue Type: Task
Reporter: Peter Bacsko
Assignee: Peter Bacsko


The test case TestUpdateAllocation_NewTask_AssumePodFails occasionally fails 
due to a deadlock problem described in YUNIKORN-2629. Until that ticket is 
resolved, let's disable this test for the time being, so upstream tests don't 
fail.






[jira] [Resolved] (YUNIKORN-2561) Support topology spread constraints on placeholder pods

2024-06-06 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2561.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Support topology spread constraints on placeholder pods
> ---
>
> Key: YUNIKORN-2561
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2561
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Jacob Salway
>Assignee: Jacob Salway
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> If a pod has a topology spread constraint with `whenUnsatisfiable: 
> DoNotSchedule` and is used as part of a task group, it is not 
> possible to pass the constraint to the placeholder pods created by YuniKorn.
> This can result in placeholder pods being placed on a node that would violate 
> the original pod's topology spread constraint.






[jira] [Resolved] (YUNIKORN-2643) utils.go WaitForCondition improvement

2024-06-06 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2643.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master. Thanks [~mean-world] for the contribution.

> utils.go WaitForCondition improvement 
> --
>
> Key: YUNIKORN-2643
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2643
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: core - common
>Reporter: HUAN-IU LIOU
>Assignee: HUAN-IU LIOU
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Resolved] (YUNIKORN-2663) Improve ACL struct function's test coverage

2024-06-06 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2663.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master.

> Improve ACL struct function's test coverage
> --
>
> Key: YUNIKORN-2663
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2663
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: core - common
>Reporter: JunHong Peng
>Assignee: JunHong Peng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> Remove unreachable code in NewACL func
> Improve the following functions' test coverage in acl.go
>  * TestSetUsers
>  * TestSetGroups






[jira] [Resolved] (YUNIKORN-2666) Fix DeepEqual comparison in Test_fixedRule_ruleDAO

2024-06-06 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2666.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Fix DeepEqual comparison in Test_fixedRule_ruleDAO 
> ---
>
> Key: YUNIKORN-2666
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2666
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler, test - unit
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> The test case {{Test_fixedRule_ruleDAO/filter}} can randomly fail due to the 
> non-deterministic nature of map key iteration:
> {noformat}
> fixed_rule_test.go:285: assertion failed: 
> --- tt.want
> +++ ruleDAO
>   {
>   Name:   "fixed",
>   Parameters: {"create": "true", "qualified": "false", 
> "queue": "default"},
>   Filter: {
>   Type: "allow",
>   UserList: nil,
>   GroupList: []string{
> - "group1",
> + "group2",
> - "group2",
> + "group1",
>   },
>   UserExp:  "",
>   GroupExp: "",
>   },
>   ParentRule: nil,
>   }
> {noformat}
> We use {{maps.Keys()}} when we create the user list and group list in 
> {{FilterDAO}}. 
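
One possible way to make such a comparison deterministic is to sort the keys before they are compared. The sketch below assumes the group list is built from a map with string keys; sortedKeys is a hypothetical helper written for this example, and the actual fix in the ticket may differ.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// sortedKeys returns the keys of m in ascending order, so the result does not
// depend on Go's unspecified map iteration order.
func sortedKeys(m map[string]bool) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	groups := map[string]bool{"group2": true, "group1": true}
	want := []string{"group1", "group2"}
	// With sorted keys, DeepEqual no longer depends on iteration order.
	fmt.Println(reflect.DeepEqual(want, sortedKeys(groups)))
}
```

Sorting on the test side (both the expected and the actual slice) would work equally well if the production order should stay untouched.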






[jira] [Created] (YUNIKORN-2666) Fix DeepEqual comparison in Test_fixedRule_ruleDAO

2024-06-06 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2666:
--

 Summary: Fix DeepEqual comparison in Test_fixedRule_ruleDAO 
 Key: YUNIKORN-2666
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2666
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler, test - unit
Reporter: Peter Bacsko


The test case {{Test_fixedRule_ruleDAO/filter}} can randomly fail due to the 
non-deterministic nature of map key iteration:

{noformat}
fixed_rule_test.go:285: assertion failed: 
--- tt.want
+++ ruleDAO
  {
Name:   "fixed",
Parameters: {"create": "true", "qualified": "false", "queue": 
"default"},
Filter: {
Type: "allow",
UserList: nil,
GroupList: []string{
-   "group1",
+   "group2",
-   "group2",
+   "group1",
},
UserExp:  "",
GroupExp: "",
},
ParentRule: nil,
  }
{noformat}

We use {{maps.Keys()}} when we create the user list and group list in 
{{FilterDAO}}. 






[jira] [Resolved] (YUNIKORN-2650) Complete or remove web_server_test#TestProxy

2024-06-06 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2650.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Complete or remove web_server_test#TestProxy
> 
>
> Key: YUNIKORN-2650
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2650
> Project: Apache YuniKorn
>  Issue Type: Test
>Reporter: Chia-Ping Tsai
>Assignee: Chenchen Lai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> web_server_test has an empty test case: TestProxy [0]. It seems to me there is 
> already a proxy-related test [1].
> [0] 
> https://github.com/apache/yunikorn-k8shim/blob/58adfe941d2d8dae5544af8b49e435f304678807/pkg/webtest/web_server_test.go#L82
> [1] 
> https://github.com/apache/yunikorn-k8shim/blob/58adfe941d2d8dae5544af8b49e435f304678807/pkg/webtest/web_server_test.go#L73






[jira] [Resolved] (YUNIKORN-2514) Update documentation about event.requestCapacity

2024-06-05 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2514.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Update documentation about event.requestCapacity
> 
>
> Key: YUNIKORN-2514
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2514
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: documentation
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Resolved] (YUNIKORN-2654) Remove unused code in k8shim context

2024-06-04 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2654.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Remove unused code in k8shim context
> 
>
> Key: YUNIKORN-2654
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2654
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Chenchen Lai
>Priority: Minor
>  Labels: newbie, pull-request-available
> Fix For: 1.6.0
>
>
> The NotifyApplicationComplete and NotifyApplicationFail functions are not 
> called by anything and are unused code.
> The K8shim does not trigger the application completion or failure. This is 
> triggered by the core when the application no longer has any activity 
> registered.






[jira] [Resolved] (YUNIKORN-2647) Flaky test TestUpdateNodeCapacity

2024-06-04 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2647.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Flaky test TestUpdateNodeCapacity
> -
>
> Key: YUNIKORN-2647
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2647
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: test - unit
>Reporter: Wilfred Spiegelenburg
>Assignee: Tseng Hsi-Huang
>Priority: Minor
>  Labels: newbie, pull-request-available
> Fix For: 1.6.0
>
>
> Same as we saw in YUNIKORN-2573, the single node update test might fail:
> {code:java}
> --- FAIL: TestUpdateNodeCapacity (0.03s)
>     operation_test.go:446: Expected partition resource map[memory:1 vcore:2], doesn't match with actual partition resource map[memory:1 vcore:2]
> {code}
> We calculate the delta resources when updating the node capacity, and with 
> that delta we update the resources in the partition.
> The test fails when the calls happen in the following order, same as for 
> multiple nodes:
> node.SetCapacity() -> waitForAvailableNodeResource() -> 
> partitionInfo.GetTotalPartitionResource() -> 
> partition.updatePartitionResource()
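
A common way to avoid this kind of ordering problem in a test is to poll for the expected value instead of reading it once. The sketch below is an assumption-laden illustration: waitFor is a hypothetical helper, not the WaitForCondition function from the code base, and the atomic counter only stands in for the partition total that is updated asynchronously.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// waitFor polls cond every interval until it returns true or the timeout
// elapses. It returns an error on timeout.
func waitFor(interval, timeout time.Duration, cond func() bool) error {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timeout waiting for condition")
		}
		time.Sleep(interval)
	}
}

func main() {
	var total atomic.Int64 // stand-in for the partition total resource
	go func() {
		time.Sleep(20 * time.Millisecond)
		total.Store(2) // simulates the delayed partition update
	}()
	// Wait for the partition value itself, not only for the node value.
	err := waitFor(5*time.Millisecond, time.Second, func() bool {
		return total.Load() == 2
	})
	fmt.Println(err, total.Load())
}
```

The point of the sketch is that the assertion runs only after the value under test has been observed, regardless of which goroutine gets scheduled first.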






[jira] [Resolved] (YUNIKORN-2659) Improve config validator function's test coverage

2024-06-04 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2659.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Improve config validator function's test coverage
> 
>
> Key: YUNIKORN-2659
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2659
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: core - common
>Reporter: JunHong Peng
>Assignee: JunHong Peng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> Improve the following functions' test coverage in configvalidator.go
>  * checkPlacementRule 
>  * checkLimitResource 
>  * checkLimit 






[jira] [Created] (YUNIKORN-2661) Fix hard-coded boolean in setLimit

2024-06-03 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2661:
--

 Summary: Fix hard-coded boolean in setLimit
 Key: YUNIKORN-2661
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2661
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko


Inside the UGM code {{setLimit()}}, we don't pass down {{doWildCardCheck}}, so 
this variable never reaches the leaves:

{noformat}
// Note: Lock free call. The Lock of the linked tracker (UserTracker and GroupTracker) should be held before calling this function.
func (qt *QueueTracker) setLimit(hierarchy []string, maxResource *resources.Resource, maxApps uint64, useWildCard bool, trackType trackingType, doWildCardCheck bool) {
	log.Log(log.SchedUGM).Debug("Setting limits",
		zap.String("queue path", qt.queuePath),
		zap.Strings("hierarchy", hierarchy),
		zap.Uint64("max applications", maxApps),
		zap.Stringer("max resources", maxResource),
		zap.Bool("use wild card", useWildCard))
	// depth first: all the way to the leaf, create if not exists
	// more than 1 in the slice means we need to recurse down
	if len(hierarchy) > 1 {
		childName := hierarchy[1]
		if qt.childQueueTrackers[childName] == nil {
			qt.childQueueTrackers[childName] = newQueueTracker(qt.queuePath, childName, trackType)
		}
		qt.childQueueTrackers[childName].setLimit(hierarchy[1:], maxResource, maxApps, useWildCard, trackType, false)
...
{noformat}

Fix this and create a unit test for {{setLimit()}}.
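
A stripped-down sketch of the intended behaviour, with the flag forwarded unchanged on each recursive call. The tracker type, newTracker and the wildcardCheck field are simplified stand-ins invented for this example; the real QueueTracker carries more state, and the part of the original function elided above is not reproduced here.

```go
package main

import "fmt"

// tracker is a simplified, hypothetical stand-in for QueueTracker.
type tracker struct {
	path          string
	children      map[string]*tracker
	wildcardCheck bool // records the flag value that arrived at this tracker
}

func newTracker(path string) *tracker {
	return &tracker{path: path, children: make(map[string]*tracker)}
}

// setLimit walks down the hierarchy depth first and forwards doWildCardCheck
// as received, instead of a hard-coded false.
func (t *tracker) setLimit(hierarchy []string, doWildCardCheck bool) {
	if len(hierarchy) > 1 {
		childName := hierarchy[1]
		child, ok := t.children[childName]
		if !ok {
			child = newTracker(t.path + "." + childName)
			t.children[childName] = child
		}
		child.setLimit(hierarchy[1:], doWildCardCheck)
		return
	}
	// Leaf of the requested hierarchy: this is where the flag is consumed.
	t.wildcardCheck = doWildCardCheck
}

func main() {
	root := newTracker("root")
	root.setLimit([]string{"root", "parent", "leaf"}, true)
	leaf := root.children["parent"].children["leaf"]
	fmt.Println(leaf.path, leaf.wildcardCheck)
}
```

A unit test for the real function could follow the same shape: call setLimit on the root with the flag set and check what the leaf tracker observed.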






[jira] [Resolved] (YUNIKORN-2649) Improve CalculateAbsUsedCapacity & CompUsageRatio function's test coverage in resources.go

2024-05-31 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2649.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Improve CalculateAbsUsedCapacity & CompUsageRatio function's test coverage in 
> resources.go
> -
>
> Key: YUNIKORN-2649
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2649
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: core - common
>Reporter: JunHong Peng
>Assignee: JunHong Peng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Resolved] (YUNIKORN-2581) Expose running placement rules in REST

2024-05-31 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2581.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master.

> Expose running placement rules in REST
> --
>
> Key: YUNIKORN-2581
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2581
> Project: Apache YuniKorn
>  Issue Type: New Feature
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> Since placement rules are now always used and the recovery rule was 
> introduced, the queue config does not correctly show the running rules.
> Also, if a config update has been rejected for any reason, the rules would 
> not be correct.
> Exposing the configured rules from the placement manager works around all 
> these issues.






[jira] [Resolved] (YUNIKORN-2646) Deadlock detected during preemption

2024-05-31 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2646.

Fix Version/s: 1.6.0
   1.5.2
   Resolution: Fixed

> Deadlock detected during preemption
> ---
>
> Key: YUNIKORN-2646
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2646
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Dmitry
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.2
>
> Attachments: yunikorn-logs-lock.txt.gz
>
>
> Hitting deadlocks in 1.5.1
> The log is attached






[jira] [Resolved] (YUNIKORN-2542) Consistent logging and tracker handling for increment/decrement

2024-05-31 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2542.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master. Thanks [~Tseng Hsi-Huang] for the contribution.

> Consistent logging and tracker handling for increment/decrement
> ---
>
> Key: YUNIKORN-2542
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2542
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Tseng Hsi-Huang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> We log DEBUG output and use {{GroupTracker}} inconsistently in {{Manager}} 
> and in {{UserTracker}}.
> E.g.
> {{Manager.IncreaseTrackedResource()}}: only a single log output with DEBUG 
> level
> {{Manager.DecreaseTrackedResource()}}: multiple log statements, also handles 
> the group tracker which is not the case with increments
> This also affects {{UserTracker}}: log handling is different 
> in {{increaseTrackedResource()}}/{{decreaseTrackedResource()}}






[jira] [Resolved] (YUNIKORN-2567) Remove Application reference from applicationEvents

2024-05-30 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2567.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Remove Application reference from applicationEvents
> ---
>
> Key: YUNIKORN-2567
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2567
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Resolved] (YUNIKORN-2642) Don't set resources on the recovery queue

2024-05-30 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2642.

Resolution: Fixed

> Don't set resources on the recovery queue
> -
>
> Key: YUNIKORN-2642
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2642
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.2
>
>
> The resource constraints can be set on dynamic queues based on application 
> tags. We should not set them on the recovery queue, because there's no quota 
> on it.
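
The guard could be sketched as below. Everything here is illustrative: the queue type and applyTagLimits are hypothetical, and the recovery queue name "root.@recovery@" is an assumption made for this example rather than something stated in this ticket.

```go
package main

import (
	"fmt"
	"strings"
)

// recoveryQueueFull is the assumed full name of the recovery queue.
const recoveryQueueFull = "root.@recovery@"

// isRecoveryQueue reports whether name refers to the recovery queue.
func isRecoveryQueue(name string) bool {
	return strings.EqualFold(name, recoveryQueueFull)
}

// queue is a hypothetical, minimal queue record.
type queue struct {
	name    string
	dynamic bool
	maxRes  map[string]int64
}

// applyTagLimits sets the max resource taken from application tags, but only
// on dynamic queues that are not the recovery queue. It reports whether the
// limit was applied.
func applyTagLimits(q *queue, tagMax map[string]int64) bool {
	if !q.dynamic || isRecoveryQueue(q.name) || len(tagMax) == 0 {
		return false
	}
	q.maxRes = tagMax
	return true
}

func main() {
	limits := map[string]int64{"memory": 1024, "vcore": 2}
	dyn := &queue{name: "root.dynamic", dynamic: true}
	rec := &queue{name: recoveryQueueFull, dynamic: true}
	fmt.Println(applyTagLimits(dyn, limits), applyTagLimits(rec, limits))
}
```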






[jira] [Resolved] (YUNIKORN-2635) test coverage improvement: same priority case in sorter

2024-05-26 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2635.

Fix Version/s: 1.6.0
   Resolution: Fixed

> test coverage improvement: same priority case in sorter 
> 
>
> Key: YUNIKORN-2635
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2635
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: core - scheduler
>Reporter: Chen Yu Teng
>Assignee: Chen Yu Teng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-2633) Unnecessary warning from Partition when adding an application

2024-05-25 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2633.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Unnecessary warning from Partition when adding an application
> -
>
> Key: YUNIKORN-2633
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2633
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> The following is printed when adding an application:
> {noformat}
> 2024-05-17T21:53:04.716+0200  WARN  core.scheduler.queue  scheduler/partition.go:344  Trying to set resources on a queue that is not an unmanaged leaf  {"queueName": "root.default"}
> {noformat}
> This message is supposed to be printed when the application defines a 
> guaranteed or max resource. After YUNIKORN-2547 it's always printed if the 
> queue is managed.






[jira] [Created] (YUNIKORN-2642) Don't set resources on the recovery queue

2024-05-24 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2642:
--

 Summary: Don't set resources on the recovery queue
 Key: YUNIKORN-2642
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2642
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko


The resource constraints can be set on dynamic queues based on application 
tags. We should not set them on the recovery queue, because there's no quota on 
it.






[jira] [Resolved] (YUNIKORN-2566) Remove AllocationAsk reference from askEvents

2024-05-23 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2566.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Remove AllocationAsk reference from askEvents
> -
>
> Key: YUNIKORN-2566
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2566
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Resolved] (YUNIKORN-2565) Remove Node reference from nodeEvents

2024-05-23 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2565.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Remove Node reference from nodeEvents
> -
>
> Key: YUNIKORN-2565
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2565
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>







[jira] [Resolved] (YUNIKORN-2618) Streamline AsyncRMCallback UpdateAllocation

2024-05-22 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2618.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Streamline AsyncRMCallback UpdateAllocation
> ---
>
> Key: YUNIKORN-2618
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2618
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Wilfred Spiegelenburg
>Assignee: Yun Sun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> If a task is not found, nil is returned from {{context.getTask}}. In the 
> {{response.New}} processing we should just log that fact and proceed to the 
> next alloc. This simplifies the flow as we never need to check for a nil task. 
> We should never have a pod in the cache that does not exist as a task on an 
> application.
> We retrieve the application using the application ID from the response, but 
> never use the object. We only use the application ID to pass into an event. 
> The context event handler then does the exact same lookup again to process 
> the event on the app.
> We need to become much smarter in this area: we do double or triple lookups 
> and generate async events that just change the state of the app or task or 
> kick off another event.
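
The "log and proceed to the next alloc" flow might be sketched as follows. The task, taskCache and allocation types and the processNew function are hypothetical simplifications for this example and do not reflect the shim's real types or the shape of the scheduler response.

```go
package main

import (
	"fmt"
	"log"
)

// task is a hypothetical, minimal task record.
type task struct{ id string }

// taskCache is a stand-in for the shim context, keyed by "appID/taskID".
type taskCache map[string]*task

// getTask returns the task, or nil when it is not known.
func (c taskCache) getTask(appID, taskID string) *task {
	return c[appID+"/"+taskID]
}

// allocation is a simplified stand-in for one new allocation in a response.
type allocation struct{ appID, taskID string }

// processNew handles the new allocations. A missing task is logged and
// skipped, so the rest of the loop never deals with a nil task. It returns
// the IDs of the tasks that were processed.
func processNew(c taskCache, allocs []allocation) []string {
	var handled []string
	for _, a := range allocs {
		t := c.getTask(a.appID, a.taskID)
		if t == nil {
			log.Printf("task not found, skipping allocation: app=%s task=%s", a.appID, a.taskID)
			continue
		}
		handled = append(handled, t.id)
	}
	return handled
}

func main() {
	cache := taskCache{"app-1/task-1": &task{id: "task-1"}}
	allocs := []allocation{{"app-1", "task-1"}, {"app-1", "missing"}}
	fmt.Println(processNew(cache, allocs))
}
```

The single lookup per allocation is deliberate here: the task is fetched once and used directly, rather than looked up again by a later event handler.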






[jira] [Resolved] (YUNIKORN-2611) [UMBRELLA] YuniKorn 1.5.1 release efforts

2024-05-22 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2611.

Fix Version/s: 1.5.1
   Resolution: Fixed

> [UMBRELLA] YuniKorn 1.5.1 release efforts
> -
>
> Key: YUNIKORN-2611
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2611
> Project: Apache YuniKorn
>  Issue Type: Task
>  Components: release
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Blocker
> Fix For: 1.5.1
>
>
> This umbrella is to track the work items needed for the 1.5.1 release.
> Release manager: Peter Bacsko.
> This release only consists of bug fixes. Use the filter 
> [https://issues.apache.org/jira/issues/?filter=12353383] to see the list of 
> deliverables.






[jira] [Resolved] (YUNIKORN-2614) Update website for 1.5.1

2024-05-22 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2614.

 Fix Version/s: 1.5.1
Target Version: 1.5.1
Resolution: Fixed

> Update website for 1.5.1
> 
>
> Key: YUNIKORN-2614
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2614
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: release
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.1
>
>







[jira] [Created] (YUNIKORN-2639) Clarify release procedure for minor releases

2024-05-21 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2639:
--

 Summary: Clarify release procedure for minor releases
 Key: YUNIKORN-2639
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2639
 Project: Apache YuniKorn
  Issue Type: Task
  Components: release
Reporter: Peter Bacsko


After the release of 1.5.1, we realized that we need to properly define the 
release process for a minor release. This needs to be properly documented.

The clarification should cover things like:
# What it can and can't include (no features, bug fixes only)
# How to publish docs? Shall we keep the current "a.b.c" version on the website 
or remove it and publish "a.b.c+1"?
# Communication: possible difference in release notes, announcement, etc.






[jira] [Created] (YUNIKORN-2633) Unnecessary warning from Partition when adding an application

2024-05-17 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2633:
--

 Summary: Unnecessary warning from Partition when adding an 
application
 Key: YUNIKORN-2633
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2633
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko


The following is printed when adding an application:

{noformat}
2024-05-17T21:53:04.716+0200    WARN    core.scheduler.queue    scheduler/partition.go:344    Trying to set resources on a queue that is not an unmanaged leaf    {"queueName": "root.default"}
{noformat}

This message is supposed to be printed only when the application defines a
guaranteed or max resource.






[jira] [Resolved] (YUNIKORN-2613) Release notes for 1.5.1

2024-05-17 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2613.

Fix Version/s: 1.5.1
   Resolution: Fixed

> Release notes for 1.5.1
> ---
>
> Key: YUNIKORN-2613
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2613
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.1
>
>







[jira] [Resolved] (YUNIKORN-2632) Data race in IncAllocatedResource

2024-05-17 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2632.

Fix Version/s: 1.6.0
   1.5.2
   Resolution: Fixed

> Data race in IncAllocatedResource
> -
>
> Key: YUNIKORN-2632
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2632
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.2
>
>
> After YUNIKORN-2548, we accidentally make an unlocked access to 
> {{Queue.allocatedResource}}.
> {noformat}
> WARNING: DATA RACE
> Read at 0x00c000578a00 by goroutine 52:
>   
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).IncAllocatedResource()
>   
> /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/queue.go:1032
>  +0x6b
>   
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNode()
>   
> /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/application.go:1495
>  +0x184
>   
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNodes.func1()
>   
> /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/application.go:1402
>  +0x144
>   
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*treeIterator).ForEachNode.func1()
>   
> /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/node_iterator.go:42
>  +0x95
>   github.com/google/btree.(*node[go.shape.interface { 
> Less(github.com/google/btree.Item) bool }]).iterate()
>   
> /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:522 
> +0x6f1
>   github.com/google/btree.(*node[go.shape.interface { 
> Less(github.com/google/btree.Item) bool }]).iterate()
>   
> /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 
> +0x448
>   github.com/google/btree.(*node[go.shape.interface { 
> Less(github.com/google/btree.Item) bool }]).iterate()
>   
> /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 
> +0x448
>   github.com/google/btree.(*node[go.shape.interface { 
> Less(github.com/google/btree.Item) bool }]).iterate()
>   
> /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 
> +0x448
>   github.com/google/btree.(*BTreeG[go.shape.interface { 
> Less(github.com/google/btree.Item) bool }]).Ascend()
>   
> /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:779 
> +0x108
>   github.com/google/btree.(*BTree).Ascend()
>   
> /home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:1029 
> +0x108
>   
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*treeIterator).ForEachNode()
> ...
> Previous write at 0x00c000578a00 by goroutine 49:
>   
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).DecAllocatedResource()
>   
> /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/queue.go:1101
>  +0x212
>   
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation()
>   
> /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/partition.go:1357
>  +0x17b4
>   
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases()
>   
> /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/context.go:870
>  +0xba
>   
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent()
>   
> /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/context.go:750
>  +0x1e4
>   github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent()
>   
> /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/scheduler.go:133
>  +0x28d
>   
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService.gowrap1()
>   
> /home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/scheduler.go:60
>  +0x33
>  {noformat}






[jira] [Created] (YUNIKORN-2632) Data race in IncAllocatedResource

2024-05-17 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2632:
--

 Summary: Data race in IncAllocatedResource
 Key: YUNIKORN-2632
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2632
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko


After YUNIKORN-2548, we accidentally make an unlocked access to 
{{Queue.allocatedResource}}.

{noformat}
WARNING: DATA RACE
Read at 0x00c000578a00 by goroutine 52:
  
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).IncAllocatedResource()
  
/home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/queue.go:1032
 +0x6b
  github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNode()
  
/home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/application.go:1495
 +0x184
  
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).tryNodes.func1()
  
/home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/application.go:1402
 +0x144
  
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*treeIterator).ForEachNode.func1()
  
/home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/node_iterator.go:42
 +0x95
  github.com/google/btree.(*node[go.shape.interface { 
Less(github.com/google/btree.Item) bool }]).iterate()
  
/home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:522 
+0x6f1
  github.com/google/btree.(*node[go.shape.interface { 
Less(github.com/google/btree.Item) bool }]).iterate()
  
/home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 
+0x448
  github.com/google/btree.(*node[go.shape.interface { 
Less(github.com/google/btree.Item) bool }]).iterate()
  
/home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 
+0x448
  github.com/google/btree.(*node[go.shape.interface { 
Less(github.com/google/btree.Item) bool }]).iterate()
  
/home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:510 
+0x448
  github.com/google/btree.(*BTreeG[go.shape.interface { 
Less(github.com/google/btree.Item) bool }]).Ascend()
  
/home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:779 
+0x108
  github.com/google/btree.(*BTree).Ascend()
  
/home/bacskop/go/pkg/mod/github.com/google/btree@v1.1.2/btree_generic.go:1029 
+0x108
  
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*treeIterator).ForEachNode()
...
Previous write at 0x00c000578a00 by goroutine 49:
  
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Queue).DecAllocatedResource()
  
/home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/objects/queue.go:1101
 +0x212
  
github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation()
  
/home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/partition.go:1357
 +0x17b4
  
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases()
  
/home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/context.go:870
 +0xba
  
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent()
  
/home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/context.go:750
 +0x1e4
  github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent()
  
/home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/scheduler.go:133
 +0x28d
  
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService.gowrap1()
  
/home/bacskop/go/pkg/mod/github.com/apache/yunikorn-core@v1.5.1-1/pkg/scheduler/scheduler.go:60
 +0x33
 {noformat}
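
As an illustration of the kind of race reported above, the following minimal sketch guards a counter with a lock on both the increment and the decrement path. The Queue type here is a hypothetical, heavily simplified stand-in (an int64 instead of a resource object) and is not the yunikorn-core implementation or its actual fix.

```go
package main

import (
	"fmt"
	"sync"
)

// Queue is a hypothetical stand-in for the scheduler queue.
type Queue struct {
	sync.RWMutex
	allocated int64 // stands in for allocatedResource
	max       int64
}

// IncAllocated checks the limit and updates the counter under the write lock,
// so the read used for the check cannot race with a concurrent decrement.
func (q *Queue) IncAllocated(delta int64) error {
	q.Lock()
	defer q.Unlock()
	if q.allocated+delta > q.max {
		return fmt.Errorf("allocation of %d exceeds maximum %d", delta, q.max)
	}
	q.allocated += delta
	return nil
}

// DecAllocated lowers the counter under the same lock.
func (q *Queue) DecAllocated(delta int64) {
	q.Lock()
	defer q.Unlock()
	q.allocated -= delta
}

// Allocated returns the current value under the read lock.
func (q *Queue) Allocated() int64 {
	q.RLock()
	defer q.RUnlock()
	return q.allocated
}

func main() {
	q := &Queue{max: 1000}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); _ = q.IncAllocated(2) }()
		go func() { defer wg.Done(); q.DecAllocated(1) }()
	}
	wg.Wait()
	fmt.Println("allocated:", q.Allocated())
}
```

Running a program like this with `go run -race` is one way to confirm that the detector no longer reports the access.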






[jira] [Created] (YUNIKORN-2629) Adding a node can result in a deadlock

2024-05-16 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2629:
--

 Summary: Adding a node can result in a deadlock
 Key: YUNIKORN-2629
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: shim - kubernetes
Reporter: Peter Bacsko
Assignee: Peter Bacsko


Adding a new node after YuniKorn state initialization can result in a deadlock.

The problem is that {{Context.addNode()}} holds a lock while we're waiting for 
the {{NodeAccepted}} event:
{noformat}
dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
    nodeEvent, ok := event.(CachedSchedulerNodeEvent)
    if !ok {
        return
    }
    [...] removed for clarity
    wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode(&si.NodeRequest{
    Nodes: nodesToRegister,
    RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
    log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
    return nil, err
}

// wait for all responses to accumulate
wg.Wait()  <--- shim gets stuck here
 {noformat}
If tasks are being processed, then the dispatcher will try to retrieve the
event handler, which is returned from Context:
{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}

Since {{addNode()}} is holding a write lock, the event processing loop gets 
stuck.
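
One general way to avoid this pattern is to release the lock before blocking on the wait group. The sketch below shows that idea with hypothetical, simplified types (Context, a channel of handler functions standing in for the dispatcher); it is not the shim's code and not necessarily the fix that was chosen for this issue.

```go
package main

import (
	"fmt"
	"sync"
)

// Context is a hypothetical stand-in for the shim context.
type Context struct {
	lock  sync.RWMutex
	nodes map[string]bool // node name -> accepted
	tasks map[string]string
}

// getTask takes the read lock, like the task event path described above.
func (c *Context) getTask(id string) string {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.tasks[id]
}

// addNode updates state under the write lock, releases it, and only then
// waits for the acceptance event. Holding the lock across wg.Wait() would
// block the task handler queued ahead of the node event, and with it the
// whole event loop.
func (c *Context) addNode(name string, events chan<- func()) {
	var wg sync.WaitGroup
	wg.Add(1)

	c.lock.Lock()
	c.nodes[name] = false
	c.lock.Unlock() // released before blocking

	events <- func() { _ = c.getTask("task-1") } // task event, needs the read lock
	events <- func() { // node accepted event
		c.lock.Lock()
		c.nodes[name] = true
		c.lock.Unlock()
		wg.Done()
	}
	wg.Wait()
}

func main() {
	events := make(chan func(), 8)
	go func() { // single-threaded dispatcher loop
		for handle := range events {
			handle()
		}
	}()
	c := &Context{nodes: map[string]bool{}, tasks: map[string]string{"task-1": "app-1"}}
	c.addNode("node-1", events)
	c.lock.RLock()
	fmt.Println("node accepted:", c.nodes["node-1"])
	c.lock.RUnlock()
}
```

Moving `c.lock.Unlock()` after `wg.Wait()` in this sketch reproduces the hang: the dispatcher goroutine blocks in getTask and never reaches the node event.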






[jira] [Resolved] (YUNIKORN-2612) Tagging for 1.5.1

2024-05-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2612.

Fix Version/s: 1.5.1
   Resolution: Fixed

> Tagging for 1.5.1
> -
>
> Key: YUNIKORN-2612
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2612
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: release
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Fix For: 1.5.1
>
>
> Tagging for updating dependencies (SI/core/k8shim).
> No branching is needed because we'll deliver the release from branch-1.5 
> directly as we did with incubator minor releases.






[jira] [Resolved] (YUNIKORN-2623) Create unit tests for Clients

2024-05-14 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2623.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Create unit tests for Clients
> -
>
> Key: YUNIKORN-2623
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2623
> Project: Apache YuniKorn
>  Issue Type: Test
>  Components: shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> Follow-up on YUNIKORN-2621.
> Create proper coverage for {{clients.Clients}}. See PR comment
> https://github.com/apache/yunikorn-k8shim/pull/838#issuecomment-2105557568.






[jira] [Created] (YUNIKORN-2623) Create unit test coverage for Clients

2024-05-13 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2623:
--

 Summary: Create unit test coverage for Clients
 Key: YUNIKORN-2623
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2623
 Project: Apache YuniKorn
  Issue Type: Test
  Components: shim - kubernetes
Reporter: Peter Bacsko
Assignee: Peter Bacsko


Follow-up on YUNIKORN-2621.

Create proper coverage for {{clients.Clients}}. See PR comment
https://github.com/apache/yunikorn-k8shim/pull/838#issuecomment-2105557568.






[jira] [Resolved] (YUNIKORN-2620) Remove redundant variable `errorExpected` from configvalidator_test.go

2024-05-11 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2620.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master.

> Remove redundant variable `errorExpected` from configvalidator_test.go
> --
>
> Key: YUNIKORN-2620
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2620
> Project: Apache YuniKorn
>  Issue Type: Improvement
>Reporter: Chia-Ping Tsai
>Assignee: Yun Sun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> This is similar to YUNIKORN-2598. We can check the existence of `validateFunc`
> instead of having an extra boolean flag.
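
A minimal sketch of the idea, assuming a table-driven test where a nil `validateFunc` means "no error expected". The testCase struct, checkQueueName and run are hypothetical and only illustrate the pattern; they are not taken from configvalidator_test.go.

```go
package main

import (
	"errors"
	"fmt"
)

// testCase has no separate errorExpected flag: a nil validateFunc means the
// call must succeed, a non-nil one means an error is expected and inspected.
type testCase struct {
	name         string
	input        string
	validateFunc func(err error) bool
}

// checkQueueName is a hypothetical function under test.
func checkQueueName(name string) error {
	if name == "" {
		return errors.New("queue name must not be empty")
	}
	return nil
}

// run reports whether the test case passes.
func run(tc testCase) bool {
	err := checkQueueName(tc.input)
	if tc.validateFunc == nil {
		return err == nil
	}
	return err != nil && tc.validateFunc(err)
}

func main() {
	cases := []testCase{
		{name: "valid name", input: "root.default"},
		{name: "empty name", input: "", validateFunc: func(err error) bool {
			return err.Error() == "queue name must not be empty"
		}},
	}
	for _, tc := range cases {
		fmt.Println(tc.name, "passed:", run(tc))
	}
}
```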






[jira] [Created] (YUNIKORN-2614) Update website for 1.5.1

2024-05-08 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2614:
--

 Summary: Update website for 1.5.1
 Key: YUNIKORN-2614
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2614
 Project: Apache YuniKorn
  Issue Type: Sub-task
  Components: release
Reporter: Peter Bacsko









[jira] [Created] (YUNIKORN-2613) Release notes for 1.5.1

2024-05-08 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2613:
--

 Summary: Release notes for 1.5.1
 Key: YUNIKORN-2613
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2613
 Project: Apache YuniKorn
  Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko









[jira] [Created] (YUNIKORN-2612) Tagging for 1.5.1

2024-05-08 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2612:
--

 Summary: Tagging for 1.5.1
 Key: YUNIKORN-2612
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2612
 Project: Apache YuniKorn
  Issue Type: Sub-task
Reporter: Peter Bacsko






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Created] (YUNIKORN-2611) [UMBRELLA] YuniKorn 1.5.1 release efforts

2024-05-08 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2611:
--

 Summary: [UMBRELLA] YuniKorn 1.5.1 release efforts
 Key: YUNIKORN-2611
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2611
 Project: Apache YuniKorn
  Issue Type: Task
  Components: release
Reporter: Peter Bacsko
Assignee: Peter Bacsko









[jira] [Resolved] (YUNIKORN-2600) Update K8s dependency to 1.29.4

2024-05-06 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2600.

Fix Version/s: 1.6.0
   1.5.1
   Resolution: Fixed

Merged to master and branch-1.5.

> Update K8s dependency to 1.29.4
> ---
>
> Key: YUNIKORN-2600
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2600
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.1
>
>
> A security vulnerability was fixed in 1.29.4. Update K8s dependency to this 
> version.
>  






[jira] [Created] (YUNIKORN-2602) Fix spelling/grammar in configvalidator

2024-05-04 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2602:
--

 Summary: Fix spelling/grammar in configvalidator
 Key: YUNIKORN-2602
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2602
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - common
Reporter: Peter Bacsko


Let's fix some minor grammar issues in configvalidator.

E.g.: "existed" -> "existing", but there could be other mistakes.






[jira] [Created] (YUNIKORN-2600) Update K8s dependency to 1.29.4

2024-05-03 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2600:
--

 Summary: Update K8s dependency to 1.29.4
 Key: YUNIKORN-2600
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2600
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: shim - kubernetes
Reporter: Peter Bacsko
Assignee: Peter Bacsko


A security vulnerability was fixed in 1.29.4. Update K8s dependency to this 
version.

 






[jira] [Resolved] (YUNIKORN-2472) REST API returns subtree by default

2024-05-03 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2472.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master.

> REST API returns subtree by default
> ---
>
> Key: YUNIKORN-2472
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2472
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - common, documentation
>Affects Versions: 1.5.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Ted Lin
>Priority: Minor
>  Labels: newbie, pull-request-available
> Fix For: 1.6.0
>
>
> The subtree query parameter is interpreted the opposite of what would be 
> expected.
> If you call {{/ws/v1/partition/default/queue/root?subtree}} then you do not 
> get the subtree. If you call {{/ws/v1/partition/default/queue/root}} you get 
> the whole tree rooted at root.
> We have not documented the new API yet so before we add it to the docs we 
> should fix the behaviour:
>  * subtree given: return the whole tree
>  * subtree missing: return one level
> The code fix is as simple as a ! in a single call and inverting the test 
> cases to pass or not pass {{?subtree}}
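
The intended semantics (parameter present: whole tree, parameter missing: one level) can be sketched with the standard library alone. The helper name wantSubtree is hypothetical, and this is an illustration of the presence check rather than the handler code in yunikorn-core.

```go
package main

import (
	"fmt"
	"net/url"
)

// wantSubtree reports whether the request URL carries the subtree parameter.
// Only presence matters; "?subtree" without a value counts as given.
func wantSubtree(rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	return u.Query().Has("subtree"), nil
}

func main() {
	for _, raw := range []string{
		"/ws/v1/partition/default/queue/root?subtree",
		"/ws/v1/partition/default/queue/root",
	} {
		subtree, err := wantSubtree(raw)
		if err != nil {
			fmt.Println("invalid URL:", err)
			continue
		}
		if subtree {
			fmt.Println(raw, "-> return the whole tree")
		} else {
			fmt.Println(raw, "-> return one level")
		}
	}
}
```

An accidental negation of the presence check gives exactly the inverted behaviour described in the issue.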






[jira] [Resolved] (YUNIKORN-2573) Flaky test TestUpdateNodeCapacityWithMultipleNodes

2024-05-03 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2573.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master.

> Flaky test TestUpdateNodeCapacityWithMultipleNodes
> --
>
> Key: YUNIKORN-2573
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2573
> Project: Apache YuniKorn
>  Issue Type: Bug
>Reporter: Arthur Wang
>Assignee: Arthur Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> [GitHub pipeline|https://github.com/apache/yunikorn-core/actions/runs/8770718393/job/24067600801]
> GitHub CI occasionally fails.
>  
> Root cause:
> [https://github.com/apache/yunikorn-core/blob/a1a10f8e8621288c6919aad269540b44c6e20227/pkg/scheduler/context.go#L665]
>  
> {code:java}
> partition.updatePartitionResource(node.SetCapacity(resources.NewResourceFromProto(sr)))
>  {code}
>  
> We calculate the delta resources by updating node capacity.
> Then we update resources map in partition.
> The test fails when the calls happen in the following order:
> node.SetCapacity() -> 
> [waitForAvailableNodeResource()|https://github.com/apache/yunikorn-core/blob/a1a10f8e8621288c6919aad269540b44c6e20227/pkg/scheduler/tests/operation_test.go#L520]
>  ->  
> [partitionInfo.GetTotalPartitionResource()|https://github.com/apache/yunikorn-core/blob/a1a10f8e8621288c6919aad269540b44c6e20227/pkg/scheduler/tests/operation_test.go#L525]
>   -> partition.updatePartitionResource()






[jira] [Created] (YUNIKORN-2599) AppStateChange/AppTaskCompleted event cannot be handled in many states

2024-05-02 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2599:
--

 Summary: AppStateChange/AppTaskCompleted event cannot be handled 
in many states
 Key: YUNIKORN-2599
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2599
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: shim - yarn
Reporter: Peter Bacsko


After YUNIKORN-2597 got merged, it became clear that we keep sending an 
{{AppStateChange}} event which cannot be handled by the state machine. There 
isn't any state in the FSM object which would actually be able to process this 
event.

{{AppTaskCompleted}} is very similar: it is only processed in the {{Resuming}}
state.
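
To make the problem concrete, here is a small hand-rolled sketch of a transition table and a `canHandle` check. The table is hypothetical: only the single entry for {{AppTaskCompleted}} in {{Resuming}} is taken from the description above, and the shim's real state machine is not reproduced here.

```go
package main

import "fmt"

// transition identifies an event arriving in a given state.
type transition struct{ event, state string }

// handled lists the event/state pairs the state machine accepts.
// Assumption for illustration: AppStateChange has no entry at all,
// AppTaskCompleted is accepted only in Resuming.
var handled = map[transition]bool{
	{event: "AppTaskCompleted", state: "Resuming"}: true,
}

// canHandle reports whether the event can be processed in the current state.
func canHandle(event, state string) bool {
	return handled[transition{event: event, state: state}]
}

func main() {
	for _, tr := range []transition{
		{"AppStateChange", "Running"},
		{"AppTaskCompleted", "Running"},
		{"AppTaskCompleted", "Resuming"},
	} {
		if !canHandle(tr.event, tr.state) {
			fmt.Printf("application event cannot be handled in the current state: event=%s state=%s\n", tr.event, tr.state)
			continue
		}
		fmt.Printf("event handled: event=%s state=%s\n", tr.event, tr.state)
	}
}
```

With a table like this, every {{AppStateChange}} sent by the shim ends up in the error branch, which matches the repeated error lines in the log.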

If someone runs the test case TestApplicationScheduling, the following errors 
are displayed:
{noformat}
[...]
2024-05-02T18:08:14.856+0200    ERROR    shim.context    cache/context.go:1316    application event cannot be handled in the current state    {"applicationID": "app0001", "event": "AppStateChange", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123
github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225
2024-05-02T18:08:14.856+0200    INFO    core.scheduler.application
[...]
2024-05-02T18:08:14.857+0200    INFO    core.scheduler.partition    scheduler/partition.go:928    scheduler allocation processed    {"appID": "app0001", "allocationKey": "task0002", "allocatedResource": "map[memory:1000 pods:1 vcore:1]", "placeholder": false, "targetNode": "test.host.02"}
2024-05-02T18:08:14.857+0200    ERROR    shim.context    cache/context.go:1316    application event cannot be handled in the current state    {"applicationID": "app0001", "event": "AppStateChange", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123
github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225
[...]
2024-05-02T18:08:15.856+0200    INFO    shim.fsm    cache/task_state.go:380    Task state transition    {"app": "app0001", "task": "task0001", "taskAlias": "default/task0001", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
2024-05-02T18:08:15.856+0200    ERROR    shim.context    cache/context.go:1316    application event cannot be handled in the current state    {"applicationID": "app0001", "event": "AppTaskCompleted", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123
github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225
[...]
2024-05-02T18:08:16.858+0200    INFO    shim.fsm    cache/task_state.go:380    Task state transition    {"app": "app0001", "task": "task0002", "taskAlias": "default/task0002", "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
2024-05-02T18:08:16.858+0200    ERROR    shim.context    cache/context.go:1316    application event cannot be handled in the current state    {"applicationID": "app0001", "event": "AppTaskCompleted", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:123
github.com/apache/yunikorn-k8shim/pkg/dispatcher.Start.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/dispatcher/dispatcher.go:225
[...]
2024-05-02T18:08:16.859+0200    ERROR    shim.context    cache/context.go:1316    application event cannot be handled in the current state    {"applicationID": "app0001", "event": "AppStateChange", "state": "Running"}
github.com/apache/yunikorn-k8shim/pkg/shim.newShimSchedulerInternal.(*Context).ApplicationEventHandler.func1
    /home/bacskop/repos/yunikorn-k8shim/pkg/cache/context.go:1316
github.com/apache/yunikorn-k8shim/pkg/dispatcher.getEventHandler.func1

[jira] [Resolved] (YUNIKORN-2597) Improve error messages in Context

2024-05-02 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2597.

Fix Version/s: 1.6.0
   1.5.1
   Resolution: Fixed

Merged to master & branch-1.5.

> Improve error messages in Context
> -
>
> Key: YUNIKORN-2597
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2597
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.1
>
>
> The logging in {{cache.Context}} related to event handling needs some 
> improvement:
> 1) When an error occurs while the task event handler is being retrieved, it 
> logs "failed to handle application event" and the task ID is omitted, which 
> makes debugging harder.
> 2) If {{canHandle()}} returns false, we don't do anything, just return. 
> Again, this makes debugging much harder.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org



[jira] [Resolved] (YUNIKORN-2518) Allow recovery queue in REST requests

2024-05-02 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2518.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Allow recovery queue in REST requests
> -
>
> Key: YUNIKORN-2518
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2518
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - common
>Reporter: Wilfred Spiegelenburg
>Assignee: Ted Lin
>Priority: Minor
>  Labels: newbie, pull-request-available
> Fix For: 1.6.0
>
>
> The current checks for the REST requests that require a queue path to be 
> provided prevent looking at the {{root.@recover@}} queue.
> The validator filters the queue names, which makes it impossible to check via 
> the REST requests whether the queue has any running applications or pods 
> after initialisation.






[jira] [Created] (YUNIKORN-2597) Fix error messages in Context

2024-04-30 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2597:
--

 Summary: Fix error messages in Context
 Key: YUNIKORN-2597
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2597
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: shim - kubernetes
Reporter: Peter Bacsko
Assignee: Peter Bacsko


The logging in Context related to event handling needs some improvement:

1) When an error occurs while the task event handler is being retrieved, it logs 
"failed to handle application event" and the task ID is omitted, which makes 
debugging harder.
2) If {{canHandle()}} returns false, we don't do anything, just return. Again, 
this makes debugging much harder.
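
A minimal sketch of the behaviour requested in the two points above. The type taskEvent and the functions canHandle() and handleTaskEvent() are hypothetical stand-ins, not the actual code in cache.Context; the sketch only illustrates that a rejected event is reported and that the message carries the task ID.

```go
package main

import "fmt"

// taskEvent is a hypothetical, simplified event; the real shim types differ.
type taskEvent struct {
	appID  string
	taskID string
	event  string
}

// canHandle is a hypothetical state check: only "CompleteTask" is allowed
// while the task is in the "Bound" state.
func canHandle(state, event string) bool {
	return state == "Bound" && event == "CompleteTask"
}

// handleTaskEvent reports rejected events instead of returning silently and
// always includes the task ID in the message.
func handleTaskEvent(state string, ev taskEvent) error {
	if !canHandle(state, ev.event) {
		return fmt.Errorf("task event cannot be handled in the current state: "+
			"applicationID=%s taskID=%s event=%s state=%s",
			ev.appID, ev.taskID, ev.event, state)
	}
	return nil
}

func main() {
	ev := taskEvent{appID: "app0001", taskID: "task0002", event: "CompleteTask"}
	if err := handleTaskEvent("Completed", ev); err != nil {
		fmt.Println(err)
	}
}
```

With this shape, a caller that logs the returned error gets the application ID, task ID, event and state in one line, which is the information missing from the current messages.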






[jira] [Resolved] (YUNIKORN-2297) Update the unit test for CheckQueuesStructure

2024-04-30 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2297.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master.

> Update the unit test for CheckQueuesStructure
> -
>
> Key: YUNIKORN-2297
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2297
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - common, test - unit
>Reporter: Kuan Po Tseng
>Assignee: Yun Sun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> As discussed in 
> [https://github.com/apache/yunikorn-core/pull/763#discussion_r1439053430],
> many methods are not covered by configvalidator_test.go, e.g. 
> checkQueuesStructure, checkQueues, etc.






[jira] [Resolved] (YUNIKORN-2583) Possible log spew on DEBUG level from objects.Node

2024-04-25 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2583.

Resolution: Won't Do

> Possible log spew on DEBUG level from objects.Node
> --
>
> Key: YUNIKORN-2583
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2583
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - common
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
>
> When a predicate fails for a given node, we print a message on DEBUG level. 
> The problem is, we keep doing this in every scheduling cycle, flooding the 
> logs:
> {noformat}
> 2024-04-17T21:32:19.446Z DEBUG core.scheduler.node objects/node.go:403 
> running predicates failed {"allocationKey": 
> "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", 
> "allocateFlag": true, "error": "predicates were not running because pod or 
> node was not found in cache"}
> 2024-04-17T21:33:24.417Z DEBUG core.scheduler.node objects/node.go:403 
> running predicates failed {"allocationKey": 
> "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", 
> "allocateFlag": true, "error": "predicates were not running because pod or 
> node was not found in cache"}
> 2024-04-17T21:33:24.417Z DEBUG core.scheduler.node objects/node.go:403 
> running predicates failed {"allocationKey": 
> "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", 
> "allocateFlag": true, "error": "predicates were not running because pod or 
> node was not found in cache"}
> {noformat}
> Another problematic part is {{preAllocateCheck()}} when we have an allocation 
> ask with a zero resource:
> {noformat}
> func (sn *Node) preAllocateCheck(res *resources.Resource, resKey string) bool {
>   // cannot allocate zero or negative resource
>   if !resources.StrictlyGreaterThanZero(res) {
>       // <-- will be printed from every node
>       log.Log(log.SchedNode).Debug("pre alloc check: requested resource is zero",
>           zap.String("nodeID", sn.NodeID))
>       return false
>   }
> ...
> {noformat}
> We need to reduce the amount of output with RateLimitedLogger.






[jira] [Created] (YUNIKORN-2583) Possible log spew on DEBUG level when predicates are evaluated

2024-04-24 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2583:
--

 Summary: Possible log spew on DEBUG level when predicates are 
evaluated
 Key: YUNIKORN-2583
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2583
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - common
Reporter: Peter Bacsko
Assignee: Peter Bacsko


When a predicate fails for a given node, we print a message on DEBUG level. The 
problem is, we keep doing this in every scheduling cycle, flooding the logs:

{noformat}
2024-04-17T21:32:19.446Z DEBUG core.scheduler.node objects/node.go:403 running 
predicates failed {"allocationKey": "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", 
"nodeID": "kwok-node-zzn7w", "allocateFlag": true, "error": "predicates were 
not running because pod or node was not found in cache"}

2024-04-17T21:33:24.417Z DEBUG core.scheduler.node objects/node.go:403 running 
predicates failed {"allocationKey": "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", 
"nodeID": "kwok-node-zzn7w", "allocateFlag": true, "error": "predicates were 
not running because pod or node was not found in cache"}

2024-04-17T21:33:24.417Z DEBUG core.scheduler.node objects/node.go:403 running 
predicates failed {"allocationKey": "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", 
"nodeID": "kwok-node-zzn7w", "allocateFlag": true, "error": "predicates were 
not running because pod or node was not found in cache"}
{noformat}

We need to reduce the amount of output with the RateLimitedLogger.






[jira] [Resolved] (YUNIKORN-2544) [UMBRELLA] Fix Yunikorn potential locking issues

2024-04-22 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2544.

Resolution: Fixed

All subtasks have been resolved, closing ticket.

> [UMBRELLA] Fix Yunikorn potential locking issues
> 
>
> Key: YUNIKORN-2544
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2544
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> Go tool [go-deadlock|https://github.com/sasha-s/go-deadlock/] identified 
> several potential deadlocks in Yunikorn.
> Some of these do not cause problems right now, but a lock-related change in 
> the future can trigger a deadlock.






[jira] [Created] (YUNIKORN-2574) totalPartitionResource should not be mutated with AddTo/SubFrom

2024-04-22 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2574:
--

 Summary: totalPartitionResource should not be mutated with 
AddTo/SubFrom
 Key: YUNIKORN-2574
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2574
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler
Affects Versions: 1.5.0, 1.4.0
Reporter: Peter Bacsko
Assignee: Peter Bacsko


There is a potential data race in PartitionContext: the field 
"totalPartitionResource" is mutated in place. The problem is that the method 
{{GetTotalPartitionResource()}} does not clone it. 

{noformat}
func (pc *PartitionContext) GetTotalPartitionResource() *resources.Resource {
pc.RLock()
defer pc.RUnlock()

return pc.totalPartitionResource
}
{noformat}

In general, we should prefer the immutable approach for variables like this, 
just like in {{objects.Queue}}:
{noformat}
func (sq *Queue) IncAllocatedResource(alloc *resources.Resource, nodeReported 
bool) error {
// check this queue: failure stops checks if the allocation is not part 
of a node addition
newAllocated := resources.Add(sq.allocatedResource, alloc)
[ ... removed ... ]
sq.Lock()
defer sq.Unlock()
// all OK update this queue
sq.allocatedResource = newAllocated
sq.updateAllocatedResourceMetrics()
return nil
}

// incPendingResource increments pending resource of this queue and its parents.
func (sq *Queue) incPendingResource(delta *resources.Resource) {
// update the parent
if sq.parent != nil {
sq.parent.incPendingResource(delta)
}
// update this queue
sq.Lock()
defer sq.Unlock()
sq.pending = resources.Add(sq.pending, delta)
sq.updatePendingResourceMetrics()
}
{noformat}
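
The difference between the two approaches can be shown with a reduced sketch. Resource, add() and partition below are simplified, hypothetical stand-ins for resources.Resource, resources.Add() and PartitionContext; they only demonstrate why replacing the field makes an uncloned getter safe.

```go
package main

import (
	"fmt"
	"sync"
)

// Resource is a simplified, hypothetical stand-in for resources.Resource.
type Resource map[string]int64

// add returns a new Resource and leaves both inputs untouched.
func add(a, b Resource) Resource {
	out := make(Resource, len(a)+len(b))
	for k, v := range a {
		out[k] = v
	}
	for k, v := range b {
		out[k] += v
	}
	return out
}

// partition is a hypothetical stand-in for PartitionContext.
type partition struct {
	sync.RWMutex
	total Resource
}

// addNodeResource replaces the field with a freshly built value instead of
// mutating the existing object in place.
func (p *partition) addNodeResource(delta Resource) {
	p.Lock()
	defer p.Unlock()
	p.total = add(p.total, delta)
}

// getTotal can return the value without cloning: a value that was handed out
// earlier is never modified afterwards.
func (p *partition) getTotal() Resource {
	p.RLock()
	defer p.RUnlock()
	return p.total
}

func main() {
	p := &partition{total: Resource{"vcore": 4}}
	snapshot := p.getTotal()
	p.addNodeResource(Resource{"vcore": 2})
	fmt.Println(snapshot["vcore"], p.getTotal()["vcore"]) // prints "4 6"
}
```

The snapshot taken before the update still reads 4, so a caller that holds on to the returned value never observes a concurrent in-place modification.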








[jira] [Resolved] (YUNIKORN-2562) Nil pointer panic in Application.ReplaceAllocation()

2024-04-22 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2562.

Fix Version/s: 1.6.0
   1.5.1
   Resolution: Fixed

> Nil pointer panic in Application.ReplaceAllocation()
> 
>
> Key: YUNIKORN-2562
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
> Project: Apache YuniKorn
>  Issue Type: Bug
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.1
>
>
> The following panic was generated during placeholder replacement:
> {noformat}
> 2024-04-16T13:46:58.583Z INFO shim.cache.task cache/task.go:542 
> releasing allocations {"numOfAsksToRelease": 1, 
> "numOfAllocationsToRelease": 1}
> 2024-04-16T13:46:58.583Z INFO shim.fsm cache/task_state.go:380 
> Task state transition {"app": "application-spark-abrdrsmo8no2", "task": 
> "cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": 
> "obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", 
> "source": "Bound", "destination": "Completed", "event": "CompleteTask"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.application 
> objects/application.go:616 ask removed successfully from application 
> {"appID": "application-spark-abrdrsmo8no2", "ask": 
> "cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
> 2024-04-16T13:46:58.584Z INFO core.scheduler.partition 
> scheduler/partition.go:1281 replacing placeholder allocation 
> {"appID": "application-spark-abrdrsmo8no2", "allocationID": 
> "cd73be15-af61-4248-89e1-d3296e72214e"}
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]
> goroutine 117 [running]:
> github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600,
>  {0xc007710cf0, 0x24})
>   
> github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745
>  +0x615
> github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?,
>  0xc009786700)
>   
> github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 
> +0x28b
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?,
>  {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
>   github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 
> +0x9e
> github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?,
>  0xc0071a3f10?)
>   github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 
> +0xa5
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
>   github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 
> +0x1c5
> created by 
> github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in 
> goroutine 1
>   github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 
> +0x9c
> {noformat}






[jira] [Created] (YUNIKORN-2568) Move all xxxEvents types to objects/events

2024-04-17 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2568:
--

 Summary: Move all xxxEvents types to objects/events
 Key: YUNIKORN-2568
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2568
 Project: Apache YuniKorn
  Issue Type: Sub-task
  Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko









[jira] [Created] (YUNIKORN-2566) Remove AllocationAsk reference from askEvents

2024-04-17 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2566:
--

 Summary: Remove AllocationAsk reference from askEvents
 Key: YUNIKORN-2566
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2566
 Project: Apache YuniKorn
  Issue Type: Sub-task
Reporter: Peter Bacsko
Assignee: Peter Bacsko









[jira] [Created] (YUNIKORN-2567) Remove Application reference from applicationEvents

2024-04-17 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2567:
--

 Summary: Remove Application reference from applicationEvents
 Key: YUNIKORN-2567
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2567
 Project: Apache YuniKorn
  Issue Type: Sub-task
  Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko









[jira] [Created] (YUNIKORN-2564) [Umbrella] Move xxxEvents types to a different package

2024-04-17 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2564:
--

 Summary: [Umbrella] Move xxxEvents types to a different package
 Key: YUNIKORN-2564
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2564
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko


There are several Events that can be moved to a different package:

* queueEvents
* applicationEvents
* askEvents
* nodeEvents

There are numerous files in {{pkg/scheduler/objects}}. This is an opportunity 
to clean it up a bit and move these under e.g. {{pkg/scheduler/objects/events}}.






[jira] [Created] (YUNIKORN-2565) Remove Node reference from nodeEvents

2024-04-17 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2565:
--

 Summary: Remove Node reference from nodeEvents
 Key: YUNIKORN-2565
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2565
 Project: Apache YuniKorn
  Issue Type: Sub-task
  Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko









[jira] [Created] (YUNIKORN-2563) [shim] Enable deadlock detection during unit tests

2024-04-16 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2563:
--

 Summary: [shim] Enable deadlock detection during unit tests
 Key: YUNIKORN-2563
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2563
 Project: Apache YuniKorn
  Issue Type: Sub-task
  Components: shim - kubernetes, test - unit
Reporter: Peter Bacsko









[jira] [Resolved] (YUNIKORN-2553) Enable deadlock detection during unit tests

2024-04-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2553.

 Fix Version/s: 1.6.0
1.5.1
Target Version: 1.6.0, 1.5.1
Resolution: Fixed

> Enable deadlock detection during unit tests
> ---
>
> Key: YUNIKORN-2553
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2553
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler, shim - kubernetes
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.1
>
>
> To enhance testing, we can enable deadlock detection when the unit tests are 
> executed with "make test".
> This gives us instant feedback about locking issues.






[jira] [Created] (YUNIKORN-2562) Nil pointer in Application.ReplaceAllocation()

2024-04-16 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2562:
--

 Summary: Nil pointer in Application.ReplaceAllocation()
 Key: YUNIKORN-2562
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2562
 Project: Apache YuniKorn
  Issue Type: Bug
  Components: core - scheduler
Reporter: Peter Bacsko


The following panic was generated during placeholder release:

{noformat}
2024-04-16T13:46:58.583Z INFO shim.cache.task cache/task.go:542 
releasing allocations {"numOfAsksToRelease": 1, "numOfAllocationsToRelease": 
1}
2024-04-16T13:46:58.583Z INFO shim.fsm cache/task_state.go:380 
Task state transition {"app": "application-spark-abrdrsmo8no2", "task": 
"cd73be15-af61-4248-89e1-d3296e72214e", "taskAlias": 
"obem-spark/tg-application-spark-abrdrsmo8n-spark-driver-y71h0amzo5", "source": 
"Bound", "destination": "Completed", "event": "CompleteTask"}
2024-04-16T13:46:58.584Z INFO core.scheduler.application 
objects/application.go:616 ask removed successfully from application 
{"appID": "application-spark-abrdrsmo8no2", "ask": 
"cd73be15-af61-4248-89e1-d3296e72214e", "pendingDelta": "map[]"}
2024-04-16T13:46:58.584Z INFO core.scheduler.partition 
scheduler/partition.go:1281 replacing placeholder allocation 
{"appID": "application-spark-abrdrsmo8no2", "allocationID": 
"cd73be15-af61-4248-89e1-d3296e72214e"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x17e1255]

goroutine 117 [running]:
github.com/apache/yunikorn-core/pkg/scheduler/objects.(*Application).ReplaceAllocation(0xc008c46600,
 {0xc007710cf0, 0x24})

github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/objects/application.go:1745
 +0x615
github.com/apache/yunikorn-core/pkg/scheduler.(*PartitionContext).removeAllocation(0x?,
 0xc009786700)

github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/partition.go:1284 +0x28b
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).processAllocationReleases(0xc00be64ba0?,
 {0xc00bb1af90, 0x1, 0x40a0fa?}, {0x1e0d902, 0x9})
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:870 
+0x9e
github.com/apache/yunikorn-core/pkg/scheduler.(*ClusterContext).handleRMUpdateAllocationEvent(0xc0005f5f58?,
 0xc0071a3f10?)
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/context.go:750 
+0xa5
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).handleRMEvent(0xc000700540)
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:133 
+0x1c5
created by 
github.com/apache/yunikorn-core/pkg/scheduler.(*Scheduler).StartService in 
goroutine 1
github.com/apache/yunikorn-core@v1.5.0-3/pkg/scheduler/scheduler.go:60 
+0x9c
{noformat}






[jira] [Resolved] (YUNIKORN-2552) Recursive locking when sending remove queue event

2024-04-15 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2552.

Fix Version/s: 1.6.0
   1.5.1
   Resolution: Fixed

> Recursive locking when sending remove queue event
> -
>
> Key: YUNIKORN-2552
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2552
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.1
>
>
> When sending a queue event from {{queueEvents}}, we acquire the read lock 
> again. 
> {noformat}
> objects.(*Queue).IsManaged { sq.RLock() } <
> objects.(*Queue).IsManaged { func (sq *Queue) IsManaged() bool { }
> objects.(*queueEvents).sendRemoveQueueEvent { } }
> objects.(*Queue).RemoveQueue { sq.queueEvents.sendRemoveQueueEvent() }
> scheduler.(*partitionManager).cleanQueues { // all OK update the queue 
> hierarchy and partition }
> scheduler.(*partitionManager).cleanQueues { if children := 
> queue.GetCopyOfChildren(); len(children) != 0 { }
> scheduler.(*partitionManager).cleanRoot { 
> manager.cleanQueues(manager.pc.root) }
> {noformat}
> {{RemoveQueue()}} already has the read lock.






[jira] [Resolved] (YUNIKORN-2550) Fix locking in PartitionContext

2024-04-15 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2550.

 Fix Version/s: 1.6.0
1.5.1
Target Version: 1.6.0, 1.5.1
Resolution: Fixed

> Fix locking in PartitionContext
> ---
>
> Key: YUNIKORN-2550
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2550
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - common
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0, 1.5.1
>
>
> Possible deadlock was detected:
> {noformat}
> placement.(*AppPlacementManager).initialise { m.Lock() } <
> placement.(*AppPlacementManager).initialise { } }
> placement.(*AppPlacementManager).UpdateRules { 
> log.Log(log.Config).Info("Building new rule list for placement manager") }
> scheduler.(*PartitionContext).updatePartitionDetails { err := 
> pc.placementManager.UpdateRules(conf.PlacementRules) }
> scheduler.(*ClusterContext).updateSchedulerConfig { err = 
> part.updatePartitionDetails(p) }
> scheduler.(*ClusterContext).processRMConfigUpdateEvent { err = 
> cc.updateSchedulerConfig(conf, rmID) }
> scheduler.(*Scheduler).handleRMEvent { case *rmevent.RMConfigUpdateEvent: }
> scheduler.(*PartitionContext).GetQueue { pc.RLock() } <
> scheduler.(*PartitionContext).GetQueue { func (pc *PartitionContext) 
> GetQueue(name string) *objects.Queue { }
> placement.(*providedRule).placeApplication { // if we cannot create the queue 
> must exist }
> placement.(*AppPlacementManager).PlaceApplication { queueName, err = 
> checkRule.placeApplication(app, m.queueFn) }
> scheduler.(*PartitionContext).AddApplication { err := 
> pc.getPlacementManager().PlaceApplication(app) }
> scheduler.(*ClusterContext).handleRMUpdateApplicationEvent { schedApp := 
> objects.NewApplication(app, ugi, cc.rmEventHandler, request.RmID) }
> scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: }
> {noformat}
> Lock order is different between {{PartitionContext}} and 
> {{AppPlacementManager}}.
> There's also an interference between {{PartitionContext}} and an 
> {{Application}} object:
> {noformat}
> objects.(*Application).SetTerminatedCallback { sa.Lock() } <
> objects.(*Application).SetTerminatedCallback { func (sa *Application) 
> SetTerminatedCallback(callback func(appID string)) { }
> scheduler.(*PartitionContext).AddApplication { 
> app.SetTerminatedCallback(pc.moveTerminatedApp) }
> scheduler.(*ClusterContext).handleRMUpdateApplicationEvent { schedApp := 
> objects.NewApplication(app, ugi, cc.rmEventHandler, request.RmID) }
> scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: }
> scheduler.(*PartitionContext).GetNode { pc.RLock() } <
> scheduler.(*PartitionContext).GetNode { func (pc *PartitionContext) 
> GetNode(nodeID string) *objects.Node { }
> objects.(*Application).tryPlaceholderAllocate { // resource usage should not 
> change anyway between placeholder and real one at this point }
> objects.(*Queue).TryPlaceholderAllocate { for _, app := range 
> sq.sortApplications(true) { }
> objects.(*Queue).TryPlaceholderAllocate { for _, child := range 
> sq.sortQueues() { }
> scheduler.(*PartitionContext).tryPlaceholderAllocate { alloc := 
> pc.root.TryPlaceholderAllocate(pc.GetNodeIterator, pc.GetNode) }
> scheduler.(*ClusterContext).schedule { // nothing reserved that can be 
> allocated try normal allocate }
> scheduler.(*Scheduler).MultiStepSchedule { // Note, this sleep only works in 
> tests. }
> tests.TestDupReleasesInGangScheduling { // and it waits for the shim's 
> confirmation }
> {noformat}
> There's no need to have a locked access for {{PartitionContext.nodes}}. The 
> base implementation of {{NodeCollection}} ({{baseNodeCollection}}) is already 
> internally synchronized. The "nodes" field is set once. Therefore, no locking 
> is necessary when accessing it.
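
The last point can be sketched as follows. The types nodeCollection and partition are hypothetical, simplified stand-ins for baseNodeCollection and PartitionContext; the sketch assumes, as described above, that the collection is internally synchronized and that the field is assigned only once at construction.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeCollection is a hypothetical stand-in for an internally synchronized
// node collection: every access goes through its own lock.
type nodeCollection struct {
	mu    sync.RWMutex
	nodes map[string]string // nodeID -> hostname
}

func (nc *nodeCollection) add(id, host string) {
	nc.mu.Lock()
	defer nc.mu.Unlock()
	nc.nodes[id] = host
}

func (nc *nodeCollection) get(id string) (string, bool) {
	nc.mu.RLock()
	defer nc.mu.RUnlock()
	host, ok := nc.nodes[id]
	return host, ok
}

// partition is a hypothetical stand-in for PartitionContext.
type partition struct {
	sync.RWMutex                 // guards the mutable fields (not shown here)
	nodes        *nodeCollection // set once in newPartition, never reassigned
}

func newPartition() *partition {
	return &partition{nodes: &nodeCollection{nodes: map[string]string{}}}
}

// getNode does not take the partition lock: the field never changes and the
// collection protects its own state.
func (p *partition) getNode(id string) (string, bool) {
	return p.nodes.get(id)
}

func main() {
	p := newPartition()
	p.nodes.add("node-1", "host-a")
	// getNode works even while the partition lock is held by the caller,
	// so it cannot take part in a lock-order cycle with the partition.
	p.Lock()
	host, ok := p.getNode("node-1")
	p.Unlock()
	fmt.Println(host, ok) // prints "host-a true"
}
```

Because getNode() never touches the partition lock, a call chain that already holds an Application lock no longer acquires the partition lock through it.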






[jira] [Created] (YUNIKORN-2554) Remove "rules" field from PartitionContext

2024-04-12 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2554:
--

 Summary: Remove "rules" field from PartitionContext
 Key: YUNIKORN-2554
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2554
 Project: Apache YuniKorn
  Issue Type: Improvement
  Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko


The "rules" field inside the PartitionContext is obsolete.

It is set but nothing reads it. It can also become out of sync with the 
contents of the placement manager.






[jira] [Resolved] (YUNIKORN-2549) Fixing lint issues

2024-04-12 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2549.

Fix Version/s: 1.6.0
   Resolution: Fixed

Merged to master.

> Fixing lint issues
> --
>
> Key: YUNIKORN-2549
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2549
> Project: Apache YuniKorn
>  Issue Type: Improvement
>  Components: shim - kubernetes
>Reporter: Michael Akinyemi
>Assignee: Michael Akinyemi
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> Fixing lint issues in test/e2e/framework/helpers/common/utils.go:
> test/e2e/framework/helpers/common/utils.go:60:9: wrapperFunc: use 
> strings.ReplaceAll method in `strings.Replace(name, "/", "-", -1)` (gocritic)
>     return strings.Replace(name, "/", "-", -1)
>            ^






[jira] [Created] (YUNIKORN-2553) Integrate deadlock detection with unit test

2024-04-11 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2553:
--

 Summary: Integrate deadlock detection with unit test
 Key: YUNIKORN-2553
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2553
 Project: Apache YuniKorn
  Issue Type: Sub-task
  Components: core - scheduler, shim - kubernetes
Reporter: Peter Bacsko
Assignee: Peter Bacsko


To enhance testing, we can enable deadlock detection when the unit test is 
executed with "make test".

This gives us instant feedback about locking issues.






[jira] [Created] (YUNIKORN-2552) Recursive locking when sending Queue events

2024-04-11 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2552:
--

 Summary: Recursive locking when sending Queue events
 Key: YUNIKORN-2552
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2552
 Project: Apache YuniKorn
  Issue Type: Sub-task
  Components: core - scheduler
Reporter: Peter Bacsko
Assignee: Peter Bacsko


When sending a queue event from {{queueEvents}}, we acquire the read lock 
again. 

{noformat}
objects.(*Queue).IsManaged { sq.RLock() } <
objects.(*Queue).IsManaged { func (sq *Queue) IsManaged() bool { }
objects.(*queueEvents).sendRemoveQueueEvent { } }
objects.(*Queue).RemoveQueue { sq.queueEvents.sendRemoveQueueEvent() }
scheduler.(*partitionManager).cleanQueues { // all OK update the queue 
hierarchy and partition }
scheduler.(*partitionManager).cleanQueues { if children := 
queue.GetCopyOfChildren(); len(children) != 0 { }
scheduler.(*partitionManager).cleanRoot { manager.cleanQueues(manager.pc.root) }
{noformat}
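
Recursive read locking is a problem with Go's sync.RWMutex because a writer that calls Lock() between the two RLock() calls blocks the second reader, and the first reader never releases its lock. One common way to avoid it is sketched below; this is not necessarily how the issue will be fixed. The queue type and sendRemoveQueueEvent() are hypothetical, simplified stand-ins for objects.Queue and the queueEvents method.

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a hypothetical, heavily simplified stand-in for objects.Queue.
type queue struct {
	sync.RWMutex
	name    string
	managed bool
}

// IsManaged is the public accessor: it takes the read lock.
func (q *queue) IsManaged() bool {
	q.RLock()
	defer q.RUnlock()
	return q.isManagedInternal()
}

// isManagedInternal must only be called with the lock already held.
func (q *queue) isManagedInternal() bool {
	return q.managed
}

// sendRemoveQueueEvent receives the values it needs as arguments, so it does
// not have to call back into the locked accessors of the queue.
func sendRemoveQueueEvent(name string, managed bool) string {
	return fmt.Sprintf("queue removed: %s (managed=%t)", name, managed)
}

// RemoveQueue holds the read lock for its whole body.
func (q *queue) RemoveQueue() string {
	q.RLock()
	defer q.RUnlock()
	// calling q.IsManaged() here would acquire the read lock a second time
	return sendRemoveQueueEvent(q.name, q.isManagedInternal())
}

func main() {
	q := &queue{name: "root.default", managed: true}
	fmt.Println(q.RemoveQueue())
}
```

The lock is taken exactly once per call chain: public methods lock, internal helpers assume the lock is held, and the event code works on plain values.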






[jira] [Created] (YUNIKORN-2550) Fix locking in PartitionContext

2024-04-11 Thread Peter Bacsko (Jira)
Peter Bacsko created YUNIKORN-2550:
--

 Summary: Fix locking in PartitionContext
 Key: YUNIKORN-2550
 URL: https://issues.apache.org/jira/browse/YUNIKORN-2550
 Project: Apache YuniKorn
  Issue Type: Sub-task
  Components: core - common
Reporter: Peter Bacsko
Assignee: Peter Bacsko


A possible deadlock was detected:

{noformat}
~/repos/yunikorn-core/pkg/scheduler/partition.go:448 
scheduler.(*PartitionContext).GetQueue { pc.RLock() } <
~/repos/yunikorn-core/pkg/scheduler/partition.go:447 
scheduler.(*PartitionContext).GetQueue { func (pc *PartitionContext) 
GetQueue(name string) *objects.Queue { }
~/repos/yunikorn-core/pkg/scheduler/placement/provided_rule.go:107 
placement.(*providedRule).placeApplication { // if we cannot create the queue 
must exist }
~/repos/yunikorn-core/pkg/scheduler/placement/placement.go:125 
placement.(*AppPlacementManager).PlaceApplication { queueName, err = 
checkRule.placeApplication(app, m.queueFn) }
~/repos/yunikorn-core/pkg/scheduler/partition.go:309 
scheduler.(*PartitionContext).AddApplication { err := 
pc.getPlacementManager().PlaceApplication(app) }
~/repos/yunikorn-core/pkg/scheduler/context.go:523 
scheduler.(*ClusterContext).handleRMUpdateApplicationEvent { schedApp := 
objects.NewApplication(app, ugi, cc.rmEventHandler, request.RmID) }
~/repos/yunikorn-core/pkg/scheduler/scheduler.go:130 
scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: }
{noformat}

The lock order differs between {{PartitionContext}} and {{AppPlacementManager}}.
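
A minimal sketch of one way to keep the lock order consistent, assuming the placement manager calls back into the partition to look up queues: the partition does not hold its own lock while calling into the placement manager, so the only nesting that can occur is placement lock first, partition lock second. The types below are simplified stand-ins with hypothetical fields and are not the actual yunikorn-core implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// PlacementManager is a simplified stand-in; queueFn calls back into the partition.
type PlacementManager struct {
	sync.RWMutex
	queueFn func(name string) bool
}

// PlaceApplication holds the placement lock and then (via queueFn) the
// partition read lock: the order is always placement -> partition.
func (m *PlacementManager) PlaceApplication(queue string) error {
	m.RLock()
	defer m.RUnlock()
	if !m.queueFn(queue) {
		return errors.New("queue does not exist: " + queue)
	}
	return nil
}

// Partition is a simplified stand-in for the partition context.
type Partition struct {
	sync.RWMutex
	queues map[string]bool
	apps   map[string]string
	pm     *PlacementManager
}

func NewPartition(queues []string) *Partition {
	p := &Partition{queues: map[string]bool{}, apps: map[string]string{}}
	for _, q := range queues {
		p.queues[q] = true
	}
	p.pm = &PlacementManager{queueFn: p.GetQueue}
	return p
}

// GetQueue reports whether the queue exists, under the partition read lock.
func (p *Partition) GetQueue(name string) bool {
	p.RLock()
	defer p.RUnlock()
	return p.queues[name]
}

// getPlacementManager releases the partition lock before returning.
func (p *Partition) getPlacementManager() *PlacementManager {
	p.RLock()
	defer p.RUnlock()
	return p.pm
}

// AddApplication runs placement WITHOUT the partition lock held, then takes
// the write lock only for the mutation, so partition -> placement never nests.
func (p *Partition) AddApplication(appID, queue string) error {
	if err := p.getPlacementManager().PlaceApplication(queue); err != nil {
		return err
	}
	p.Lock()
	defer p.Unlock()
	p.apps[appID] = queue
	return nil
}

func main() {
	p := NewPartition([]string{"root.default"})
	fmt.Println(p.AddApplication("app-1", "root.default"))
	fmt.Println(p.AddApplication("app-2", "root.missing"))
}
```

The trade-off in this sketch is that the queue could change between the placement step and the write-locked step, so real code would need to re-validate or tolerate that window.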






[jira] [Resolved] (YUNIKORN-2543) Fix locking in RMProxy

2024-04-10 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YUNIKORN-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-2543.

Fix Version/s: 1.6.0
   Resolution: Fixed

> Fix locking in RMProxy
> --
>
> Key: YUNIKORN-2543
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2543
> Project: Apache YuniKorn
>  Issue Type: Sub-task
>  Components: core - scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.0
>
>
> After merging YUNIKORN-2539, we already saw a potential issue with 
> {{rmproxy.RMProxy}} and {{cache.Context}}:
> Goroutine 1:
> {noformat}
> github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:307
>  rmproxy.(*RMProxy).GetResourceManagerCallback ??? <
> github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:306
>  rmproxy.(*RMProxy).GetResourceManagerCallback ???
> github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:359
>  rmproxy.(*RMProxy).UpdateNode ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:1603 
> cache.(*Context).updateNodeResources ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:484 
> cache.(*Context).updateNodeOccupiedResources ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:392 
> cache.(*Context).updateForeignPod ???
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:286 
> cache.(*Context).UpdatePod ???
> {noformat}
> Goroutine 2:
> {noformat}
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:847 
> cache.(*Context).ForgetPod ??? <
> github.com/apache/yunikorn-k8shim/pkg/cache/context.go:846 
> cache.(*Context).ForgetPod ???
> github.com/apache/yunikorn-k8shim/pkg/cache/scheduler_callback.go:104 
> cache.(*AsyncRMCallback).UpdateAllocation ???
> github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:162
>  rmproxy.(*RMProxy).triggerUpdateAllocation ???
> github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:150
>  rmproxy.(*RMProxy).processRMReleaseAllocationEvent ???
> github.com/apache/yunikorn-core@v0.0.0-20240405160823-c94a7d938c41/pkg/rmproxy/rmproxy.go:234
>  rmproxy.(*RMProxy).handleRMEvents ???
> {noformat}
> Right now this seems to be safe because we only call {{RLock()}} in the 
> {{RMProxy}} methods. However, should any of this change, we're in trouble due 
> to lock ordering (Cache->RMProxy and RMProxy->Cache).
> We need to investigate why we use only {{RLock()}} and whether it's needed at 
> all. If nothing is modified, then we can drop the mutex completely.
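
To illustrate the last point in the description: if the guarded state is populated once at construction and never modified afterwards, concurrent readers need no mutex at all. The sketch below is a hedged example under that assumption. {{Proxy}}, {{Callback}}, {{NewProxy}}, and {{GetCallback}} are hypothetical names and do not reflect how {{RMProxy}} is actually structured.

```go
package main

import (
	"fmt"
	"sync"
)

// Callback is a hypothetical callback type used only for this sketch.
type Callback func(nodeID string) string

// Proxy holds state that is written only in NewProxy and read-only afterwards,
// so no mutex is required for concurrent readers.
type Proxy struct {
	callbacks map[string]Callback
}

// NewProxy copies the input so that later changes by the caller cannot
// mutate the proxy's map behind its back.
func NewProxy(cbs map[string]Callback) *Proxy {
	copied := make(map[string]Callback, len(cbs))
	for id, cb := range cbs {
		copied[id] = cb
	}
	return &Proxy{callbacks: copied}
}

// GetCallback only reads the map; concurrent map reads without any writer
// are safe in Go.
func (p *Proxy) GetCallback(rmID string) (Callback, bool) {
	cb, ok := p.callbacks[rmID]
	return cb, ok
}

func main() {
	p := NewProxy(map[string]Callback{
		"rm-1": func(nodeID string) string { return "updated " + nodeID },
	})

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			if cb, ok := p.GetCallback("rm-1"); ok {
				_ = cb(fmt.Sprintf("node-%d", n))
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("all readers finished")
}
```

If the state does need to change after construction, this approach no longer applies and some form of synchronization has to stay.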





