-------
Preface
=======

There have been quite a few informal discussions on this topic, and it's time 
that we bring this formally to the yunikorn-dev mailing list for further 
discussion...


-------
Summary
=======

We are actively planning to deprecate the YuniKorn plugin mode for eventual 
removal. Plugin mode has been an experimental feature since YuniKorn 1.0.0, but 
it has not proven to be as stable or performant as our default deployment mode. 
Additionally, it has become a large maintenance burden -- even for contributors 
who do not actively use it.


-------
History
=======

To adequately explain the current situation and why this is being planned, it's 
helpful to understand some of the history of both Kubernetes and YuniKorn and 
how they interact.

Approximately three years ago, the Kubernetes community decided to implement an 
internal Plugin API to help streamline the Kubernetes scheduler codebase. This 
API is also known as the Scheduling Framework [1]. At the time of the 
announcement, very few plugins had been implemented, and the API was positioned 
as a way to extend scheduler functionality in an easier fashion. The choice to 
name it a "plugin" API unfortunately invokes a lot of incorrect connotations, 
especially around intended use. When most developers think of "plugins" they 
think of 3rd party extensions to things like web browsers. The Kubernetes 
Scheduler Plugin API is an internal API framework, primarily meant for use by 
internal components, as evidenced by the fact that it only exists in the 
internal kubernetes project, and not in any of the externally visible (and 
public) modules. To make use of the Kubernetes scheduling framework, all 
plugins must be compiled together from source into a single unified scheduler 
binary.

At the time of the announcement, it seemed to those of us working on YuniKorn 
at Cloudera that this could provide a cleaner way for YuniKorn to integrate 
with Kubernetes and hopefully provide a version of YuniKorn which would have 
improved compatibility with the default Kubernetes scheduler. Work began on an 
internal prototype at Cloudera, which had a number of significant limitations 
but did (somewhat) work. That prototype was largely rewritten and contributed 
upstream as part of YuniKorn 1.0.0 in May of 2022 and marked as experimental. 
Since YuniKorn 1.0, ongoing enhancements have been made to this feature. 
However, nearly two years after the initial public implementation, the plugin 
mode has not lived up to its promise and in fact has hindered progress on 
achieving a stable YuniKorn scheduler (more on this later).

In the meantime, much has changed in the implementation of the upstream 
Kubernetes scheduler. The scheduler has moved from a monolithic collection of 
features into a simple event loop that calls into scheduler plugins to perform 
all of the scheduling tasks. There is no longer any core functionality that is 
implemented outside of the plugins themselves.

Somewhat counterintuitively, this has resulted in increased stability for the 
standard YuniKorn deployment model. Prior to the existence of the plugin API, 
YuniKorn contained a lot of logic to essentially re-implement functionality 
from the default scheduler in the k8shim. While this worked, it created 
potential incompatibilities as the two codebases evolved independently. As the 
plugin API became more stable and more core functionality was implemented with 
it, YuniKorn transitioned to calling into those plugins for that functionality. 
Today, the standard deployment of YuniKorn leverages all of the upstream 
Kubernetes scheduler functionality by calling into the same plugins that the 
default scheduler does. This means we have never been more compatible than we 
are today.

At the same time, we now have multiple years of data to indicate that the 
plugin version of YuniKorn has not improved compatibility or stability at all 
(in fact quite the opposite).


------------------------------------
YuniKorn -- Standard vs. plugin mode
====================================

In the standard YuniKorn deployment mode, YuniKorn acts as a standalone 
scheduler, grouping pods into applications, assigning those applications to 
queues, and processing the requests in those queues using configurable 
policies. When requests are satisfied, YuniKorn binds each pod to a node, and 
proceeds with the next request. As part of determining where (or if) a pod may 
be scheduled, YuniKorn calls into the default scheduler plugins to evaluate the 
suitability of a pod to a particular node. This means that as new plugins are 
added to the default scheduler, we automatically gain the same (compatible) 
functionality within YuniKorn simply by building (and testing) against a newer 
Kubernetes release.
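
To make that flow concrete, the sketch below shows the shape of the evaluation 
loop in standard mode: YuniKorn picks the next request, asks the same filter 
plugins the default scheduler uses (NodeResourcesFit, TaintToleration, etc.) 
whether each (pod, node) pair is feasible, and then binds the pod itself. The 
names here (FilterPlugin, findNodeFor) are illustrative stand-ins, not the 
actual k8shim code, which calls the real interfaces in 
k8s.io/kubernetes/pkg/scheduler/framework.

    // Illustrative sketch only -- simplified stand-ins for the upstream
    // scheduler framework types; not the actual k8shim implementation.
    package main

    import "fmt"

    // FilterPlugin is a stand-in for an upstream filter plugin such as
    // NodeResourcesFit or TaintToleration.
    type FilterPlugin interface {
        Name() string
        Filter(pod, node string) bool
    }

    // findNodeFor mirrors standard mode: every upstream filter plugin must
    // accept the (pod, node) pair before YuniKorn binds the pod itself.
    func findNodeFor(pod string, nodes []string, plugins []FilterPlugin) (string, bool) {
        for _, node := range nodes {
            fits := true
            for _, p := range plugins {
                if !p.Filter(pod, node) {
                    fits = false
                    break
                }
            }
            if fits {
                return node, true // standard mode: YuniKorn binds the pod here
            }
        }
        return "", false
    }

    func main() {
        node, ok := findNodeFor("pod-1", []string{"node-a", "node-b"}, nil)
        fmt.Println(node, ok)
    }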

When YuniKorn itself is built as a plugin to the default scheduler, the 
situation is much more complex. It's helpful to visualize the resulting 
scheduler as having a "split-brain" architecture. On the one side, we have 
YuniKorn operating much as it normally does, processing pods into applications 
and queues, making scheduling decisions (including calling into the official 
Kubernetes scheduler plugins). The one major difference is that pods are not 
bound by this scheduler, they are simply marked internally as ready. In the 
other half of the brain, we have the default Kubernetes scheduler codebase 
running, with a special "yunikorn" plugin defined as the last one in the plugin 
chain. This plugin primarily implements the PreFilter and Filter scheduler API 
functions. The PreFilter function is given a candidate pod and asked whether it 
is schedulable. If that succeeds, the Filter function is then called with the 
same candidate pod once for each node that might host it and asked whether that 
combination is valid. The "yunikorn" plugin's PreFilter implementation simply 
returns success if the real YuniKorn scheduler has already allocated the pod, 
and failure otherwise. Its Filter implementation checks that the node being 
evaluated matches the node YuniKorn has assigned.
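
To make that gating behavior concrete, here is a minimal sketch of the logic 
described above. The types are simplified stand-ins rather than the real 
Scheduling Framework interfaces (the actual plugin implements PreFilter/Filter 
from k8s.io/kubernetes/pkg/scheduler/framework inside the k8shim), and the 
allocation map is a hypothetical placeholder for YuniKorn's internal state.

    // Minimal sketch of the "yunikorn" plugin gating logic. Simplified,
    // hypothetical types -- not the real framework interfaces or k8shim code.
    package main

    import "fmt"

    // allocations records pods that the YuniKorn core has already placed
    // (pod UID -> node name).
    type allocations map[string]string

    type yunikornPlugin struct {
        allocated allocations
    }

    // PreFilter: admit the pod only if YuniKorn has already allocated it;
    // otherwise the default scheduler treats it as unschedulable.
    func (p *yunikornPlugin) PreFilter(podUID string) bool {
        _, ok := p.allocated[podUID]
        return ok
    }

    // Filter: accept only the single node that YuniKorn chose for the pod.
    func (p *yunikornPlugin) Filter(podUID, nodeName string) bool {
        node, ok := p.allocated[podUID]
        return ok && node == nodeName
    }

    func main() {
        p := &yunikornPlugin{allocated: allocations{"pod-1": "node-a"}}
        fmt.Println(p.PreFilter("pod-1"))        // true: already allocated
        fmt.Println(p.Filter("pod-1", "node-a")) // true: matches YuniKorn's choice
        fmt.Println(p.Filter("pod-1", "node-b")) // false: wrong node
        fmt.Println(p.PreFilter("pod-2"))        // false: YuniKorn hasn't seen it yet
    }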

There are a number of limitations in the Plugin API that make this level of 
complexity necessary. By design, plugins are not allowed to interact with the 
scheduler directly, and must wait for plugin lifecycle methods (such as Filter 
and PreFilter) to be called on them by the scheduler. Plugins are also not 
allowed to interact with other plugins. YuniKorn requires both of these 
abilities in order to function at all.

Direct access to the scheduler is necessary in order to promote a pod back to a 
schedulable queue when it becomes ready. Since we do not have this ability when 
running in plugin mode, we have to resort to ugly hacks such as modifying a 
live pod in the API server so that the Kubernetes scheduler will pick it up and 
re-evaluate it.
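
As an illustration of that hack, a benign write to the live pod is enough to 
generate an update event that the default scheduler notices, prompting it to 
re-evaluate the pod. A sketch of such a "nudge" using client-go might look like 
the following; the annotation key is purely illustrative and is not what 
YuniKorn actually writes.

    // Hypothetical sketch of "nudging" a pod so the default scheduler
    // re-evaluates it. The annotation key is made up for illustration.
    package example

    import (
        "context"
        "fmt"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
    )

    func nudgePod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
        // Patching an annotation creates a pod update event, which causes the
        // scheduler to reconsider a pod sitting in its unschedulable queue.
        patch := []byte(fmt.Sprintf(
            `{"metadata":{"annotations":{"example.com/requeue":"%d"}}}`,
            time.Now().UnixNano()))
        _, err := client.CoreV1().Pods(namespace).Patch(
            ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
        return err
    }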

YuniKorn needs to be able to interact with plugins to perform its own 
evaluations of (pod, node) combinations. Since we have no access to the plugin 
chain instantiated by the Kubernetes scheduler (and in fact no access to the 
scheduler object itself), we instantiate a parallel plugin chain with the same 
configuration. This means we have duplicate watchers, duplicate caches, 
duplicate plugins, and duplicate processing chains.

Because of this, there is no guarantee which of the two halves of our 
"split-brain" scheduler will process a new pod first. If it happens to be 
YuniKorn, we mark the pod schedulable (assuming it fits) and wait for the 
Kubernetes scheduler to interact with the yunikorn plugin. However, if the 
Kubernetes scheduler picks it up first, it will immediately ask the yunikorn 
plugin whether or not the pod is schedulable, and since the plugin has no 
knowledge of it yet, it must respond negatively. This results in the pod being 
moved to the "unschedulable" queue within the Kubernetes scheduler, where it 
may remain for quite some time, leading to difficult-to-diagnose scheduling 
delays.

Even worse, because parallel state is kept by the two schedulers, and each copy 
evolves independently as cluster state changes, it's possible for the plugin 
chain that the Kubernetes scheduler uses and the one used internally by 
YuniKorn to arrive at different conclusions about whether a particular pod is 
schedulable on a particular node. When this happens, YuniKorn internally 
believes the pod is schedulable while the Kubernetes scheduler does not, 
leaving the pod in limbo, making no forward progress. We have observed this 
behavior in real clusters, and there really is no solution.

After almost three years working on this feature, we are still left with 
fundamentally unsolvable issues such as this that arise because of the 
inability to shoehorn YuniKorn's extensive functionality into the 
(purposefully) limited Scheduler Plugin API.

Due to all the duplicate processing and data structures required to 
implement YuniKorn as a plugin, as well as the inherent inefficiencies of the 
plugin API, we see scheduling throughput improvements of 2-4x and nearly half 
the memory usage when using the standard YuniKorn deployment mode vs. the 
plugin implementation. The standard deployment model is also much more stable, 
as there is a single source of truth for YuniKorn and scheduler plugins to use. 
Since we call into all the standard plugins as part of pod / node evaluation, 
we support ALL features that the default scheduler does within YuniKorn.


---------------------
Impact on development
=====================

The plugin feature also imposes a drain on the development process. It doubles 
our testing efforts, as we need to spin up twice as many end-to-end testing 
scenarios as before (one for each Kubernetes release we support x 2 for both 
scheduler implementations). Contributors often don't test with the plugin 
version early, and because the two models are architecturally very different, 
it's very common for a developer to push a new PR, wait nearly an hour for the 
e2e tests to complete, and then find that the board is half green (standard 
mode) and half red (plugin mode). This results in longer dev cycles and a major 
loss of productivity, all of which would be eliminated if we no longer needed 
to maintain two implementations.


------------------------
Impact on supportability
========================

Many of you may recall the pain caused during the YuniKorn 1.4.0 release cycle 
as we were forced to drop support for Kubernetes 1.23 and below from our 
support matrix. It was simply impossible to build a YuniKorn release that could 
work on both Kubernetes 1.23 and 1.27 simultaneously. That limitation was, in 
fact, caused by the existence of the plugin mode. Had the plugin functionality 
not been integrated, we would have been able to build against newer Kubernetes 
releases and still function at runtime on 
older clusters. We discovered this at the time, but decided it was best to not 
fragment the release by having some builds available on old Kubernetes releases 
and others not.

The very low-level and internal nature of the plugin API makes this an ongoing 
risk for future release efforts as well. Considering that upstream 
Kubernetes is currently in discussions to fundamentally redesign how resources 
are used, this risk may become reality much sooner than we would like. It's not 
inconceivable that we may end up with a "flag day" of something like Kubernetes 
1.32 being unsupportable at the same time as 1.33 (versions are chosen for 
illustrative purposes, not predictive of when breakage may occur). This risk is 
much higher when deployment of YuniKorn in plugin mode is required.


------------------
Migration concerns
==================

For the most part, the standard and plugin deployment modes are interchangeable 
(by design). Plugin mode is activated by setting the helm variable 
"enableSchedulerPlugin" to "true", so reverting to the standard mode 
can be as simple as setting that variable to "false". This is especially true 
if YuniKorn is being run with out-of-box default configuration. It is expected 
that the "enableSchedulerPlugin" attribute will be ignored, beginning with the 
same release where the plugin stops being enabled by default.

There is one area in which the two implementations differ behaviorally that may 
need to be addressed depending on how YuniKorn is being used. The YuniKorn 
admission controller supports a pair of configuration settings 
("admissionController.filtering.labelNamespaces" and 
"admissionController.filtering.noLabelNamespaces") which allow pods to be 
tagged with "schedulerName: yunikorn" but not have an Application ID assigned 
to them if one was not already present. This is typically used in plugin mode 
to send non-YuniKorn pods to the YuniKorn scheduler but have the normal 
YuniKorn queueing logic bypassed.

When using this feature, non-labeled pods arrive at the YuniKorn scheduler 
without an Application ID assigned, causing the yunikorn plugin to disable 
itself and use only the Kubernetes scheduler processing chain. In the standard 
YuniKorn deployment mode (as of YuniKorn 1.4+), these pods are automatically 
assigned a synthetic Application ID and processed in the same way as all other 
pods. Therefore, it is important to ensure that these pods are able to be 
mapped into an appropriate queue. When using the default, out of the box 
configuration, this already occurs, as YuniKorn ships with a single default 
queue and all pods map to it. However, with custom configurations, it is 
necessary to ensure that a queue exists and existing workloads can map 
successfully to it (ideally via placement rules). For maximal compatibility, 
this queue should be unlimited in size (no quota). 

We understand that this is a gap in behavior that would need to be fixed when 
migrating from the plugin mode to the standard mode. We do not have a change or 
solution ready for that gap yet. However, the placement rules and queue 
configuration are flexible enough to allow us to create a fix for this. We 
believe we will be able to provide the first steps towards closing that gap as 
part of the next release.


------------------
Potential timeline
==================

There are known users of the plugin feature in the community, so care must be 
taken in how and when the feature is removed. We need to give users time to 
migrate. We propose the following release timeline:

- YuniKorn 1.6.0 - Announce the deprecation of the plugin model, but no code 
changes.
- YuniKorn 1.7.0 - Emit warnings when the plugin mode is active, but nothing 
else.
- YuniKorn 1.8.0 - Stop testing and building the plugin as part of the normal 
development cycle. [**]
- YuniKorn 1.9.0 - Remove the implementation entirely.

[**] We do not intend to break compilation of the plugin scheduler as part of 
the 1.8.0 release, but will no longer provide pre-compiled binaries. Users 
could still build the plugin themselves if required, but it would be untested 
and unsupported.

Given YuniKorn releases tend to arrive at approximately 4 month intervals, and 
we are midway through the 1.6.0 development cycle, this gives roughly 18 months 
until the feature will be removed completely (of course, this is only an 
estimate and not a commitment). For context, this is nearly as long as the 
feature has been available publicly at all.



---------------------------------
Frequently asked questions (FAQs)
=================================

- Why can't we just keep the existing plugin implementation around? Surely it 
can't be that difficult to maintain.

It's not simply a matter of difficulty in maintenance, though that is certainly 
a concern. There are several "if plugin mode" branches in the k8shim that would 
be eliminated. Additionally, half of our e2e tests, which are run on every PR 
push, would no longer need to be run (and diagnosed when they fail). More 
importantly, we insulate ourselves from future Kubernetes code changes as we no 
longer need to reach as deeply into private Kubernetes APIs. This has already 
proven to be an issue during the Kubernetes 1.23 - 1.24 transition, and is very 
likely to be an issue again. We would like to ensure support for the widest 
list of Kubernetes releases possible, and eliminating this code makes that much 
easier.


- Can YuniKorn support custom external scheduler plugins? Would this support 
change when the YuniKorn plugin mode no longer exists?

YuniKorn currently does not support building with external scheduler plugins. 
While that is in theory possible, due to the duplicate plugin lists that the 
Kubernetes and YuniKorn schedulers use, it is extremely complex and 
non-trivial. Even custom configuration for existing plugins is problematic. 
Eliminating yunikorn as a plugin actually makes this much more viable, as we 
could introduce functionality to customize the configuration of existing 
plugins, and users could patch YuniKorn with external plugins much more easily. 


- Don't I need the plugin mode in order to deploy YuniKorn on a large existing 
cluster without fear of breaking things?

No. In fact, using the plugin mode in this way introduces much more potential 
instability than the standard deployment does, and it means that instead 
of two schedulers in the cluster, you now have three (two of them just happen 
to live in the same process). The plugin mode is known to be slow, consume 
large amounts of memory, and be unstable under load.

It's also a myth that reusing the default scheduler code leads to better 
compatibility with the default scheduler. In addition to the instability plugin 
mode introduces, you are still building a custom scheduler that is very likely 
built against a different Kubernetes release than what your cluster is running. 
For example, unless you are running a Kubernetes 1.29.2 cluster with default 
configurations (which is what YuniKorn 1.5.0 uses), your scheduler 
implementation is not going to match the underlying cluster at all. This is one 
of the reasons we run extensive end-to-end testing to help catch potential 
issues, but this isn't something that improves by using plugin mode.

In short, regardless of which implementation you use, there's no substitute for 
adequate testing in non-prod environments. There may be a perception that 
plugin mode reduces this burden, but it really doesn't. It adds significant 
complexity and instability which cannot be addressed.

Since the default YuniKorn deployment mode calls into all the scheduler plugins 
just as the default Kubernetes scheduler does, and in much the same way, it 
actually has the highest compatibility with the default Kubernetes scheduler. 
This isn't just theoretical -- we have multiple years of data running both 
implementations on a large variety of clusters that bears this out. Standard 
mode simply works better.


--------------
External links
==============

[1] Scheduling Framework: 
https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
