-------
Preface
=======

There have been quite a few informal discussions on this topic, and it's time that we bring this formally to the yunikorn-dev mailing list for further discussion.
-------
Summary
=======

We are actively planning to deprecate the YuniKorn plugin mode for eventual removal. This has been an experimental feature since YuniKorn 1.0.0, but it has not proven to be as stable or performant as our default deployment mode. Additionally, it has proven to be a large maintenance burden -- even for contributors who do not actively use it.

-------
History
=======

To adequately explain the current situation and why this removal is being planned, it's helpful to understand some of the history of both Kubernetes and YuniKorn and how they interact.

Approximately three years ago, the Kubernetes community decided to implement an internal plugin API to help streamline the Kubernetes scheduler codebase. This API is also known as the Scheduling Framework [1]. At the time of the announcement, very few plugins had been implemented, and the API was positioned as an easier way to extend scheduler functionality.

The choice to name it a "plugin" API unfortunately invokes a lot of incorrect connotations, especially around intended use. When most developers think of "plugins", they think of third-party extensions to things like web browsers. The Kubernetes Scheduler Plugin API is an internal API framework, primarily meant for use by internal components, as evidenced by the fact that it exists only in the internal kubernetes project, and not in any of the externally visible (and public) modules. To make use of the Kubernetes scheduling framework, all plugins must be compiled together from source into a single unified scheduler binary.

At the time of the announcement, it seemed to those of us working on YuniKorn at Cloudera that this could provide a cleaner way for YuniKorn to integrate with Kubernetes, and hopefully yield a version of YuniKorn with improved compatibility with the default Kubernetes scheduler. Work began on an internal prototype at Cloudera which had a number of significant limitations but did (somewhat) work.
That prototype was largely rewritten and contributed upstream as part of YuniKorn 1.0.0 in May of 2022, marked as experimental. Since YuniKorn 1.0, ongoing enhancements have been made to this feature. However, nearly two years after the initial public implementation, the plugin mode has not lived up to its promise and in fact has hindered progress on achieving a stable YuniKorn scheduler (more on this later).

In the meantime, much has changed in the implementation of the upstream Kubernetes scheduler. The scheduler has moved from a monolithic collection of features to a simple event loop that calls into scheduler plugins to perform all of the scheduling tasks. There is no longer any core functionality implemented outside of the plugins themselves.

Somewhat counterintuitively, this has resulted in increased stability for the standard YuniKorn deployment model. Prior to the existence of the plugin API, YuniKorn contained a lot of logic in the k8shim that essentially re-implemented functionality from the default scheduler. While this worked, it created potential incompatibilities as the two codebases evolved independently. As the plugin API became more stable and more core functionality was implemented with it, YuniKorn transitioned to calling into those plugins for that functionality. Today, the standard deployment of YuniKorn leverages all of the upstream Kubernetes scheduler functionality by calling into the same plugins that the default scheduler does. This means we have never been more compatible than we are today. At the same time, we now have multiple years of data indicating that the plugin version of YuniKorn has not improved compatibility or stability at all (in fact, quite the opposite).

------------------------------------
YuniKorn -- Standard vs. plugin mode
====================================

In the standard YuniKorn deployment mode, YuniKorn acts as a standalone scheduler, grouping pods into applications, assigning those applications to queues, and processing the requests in those queues using configurable policies. When requests are satisfied, YuniKorn binds each pod to a node and proceeds with the next request. As part of determining where (or if) a pod may be scheduled, YuniKorn calls into the default scheduler plugins to evaluate the suitability of a pod for a particular node. This means that as new plugins are added to the default scheduler, we automatically gain the same (compatible) functionality within YuniKorn simply by building (and testing) against a newer Kubernetes release.

When YuniKorn itself is built as a plugin to the default scheduler, the situation is much more complex. It's helpful to visualize the resulting scheduler as having a "split-brain" architecture. On one side, we have YuniKorn operating much as it normally does, processing pods into applications and queues and making scheduling decisions (including calling into the official Kubernetes scheduler plugins). The one major difference is that pods are not bound by this scheduler; they are simply marked internally as ready. In the other half of the brain, we have the default Kubernetes scheduler codebase running, with a special "yunikorn" plugin defined as the last one in the plugin chain. This plugin primarily implements the PreFilter and Filter scheduler API functions. The PreFilter function is given a candidate pod and asked if it is schedulable. If that returns true, the Filter function is then called with the same candidate pod once for each node that may be schedulable and asked if that (pod, node) combination is valid. The "yunikorn" plugin PreFilter implementation simply returns true if the real YuniKorn scheduler has assigned the pod, and false otherwise.
The Filter implementation checks that the node YuniKorn has assigned matches the requested node.

There are a number of limitations in the plugin API that make this level of complexity necessary. By design, plugins are not allowed to interact with the scheduler directly, and must wait for plugin lifecycle methods (such as PreFilter and Filter) to be called on them by the scheduler. Plugins are also not allowed to interact with other plugins. YuniKorn requires both of these abilities in order to function at all. Direct access to the scheduler is necessary in order to promote a pod back to a schedulable queue when it becomes ready. Since we do not have this ability when running in plugin mode, we have to resort to ugly hacks such as modifying a live pod in the API server so that the Kubernetes scheduler will pick it up and re-evaluate it. YuniKorn also needs to be able to interact with plugins to perform its own evaluations of (pod, node) combinations. Since we have no access to the plugin chain instantiated by the Kubernetes scheduler (and in fact no access to the scheduler object itself), we instantiate a parallel plugin chain with the same configuration. This means we have duplicate watchers, duplicate caches, duplicate plugins, and duplicate processing chains.

Because of this, there is no guarantee which of the two halves of our "split-brain" scheduler will process a new pod first. If it happens to be YuniKorn, we mark the pod schedulable (assuming it fits) and wait for the Kubernetes scheduler to interact with the yunikorn plugin. However, if the Kubernetes scheduler picks it up first, it will immediately ask the yunikorn plugin whether or not the pod is schedulable, and since the plugin has no knowledge of the pod yet, it must respond negatively. This results in the pod being moved to the "unschedulable" queue within the Kubernetes scheduler, where it may remain for quite some time, leading to difficult-to-diagnose scheduling delays.
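To make the pass-through behavior of the "yunikorn" plugin concrete, here is a deliberately simplified sketch of the PreFilter/Filter logic described above. The types and names are illustrative only; the real implementation uses the internal Kubernetes Scheduling Framework interfaces (PreFilterPlugin, FilterPlugin) from k8s.io/kubernetes, which are not reproduced here.

```go
package main

import "fmt"

// yunikornPlugin models the pass-through plugin: it only agrees to schedule
// a pod once the YuniKorn half of the "split-brain" scheduler has already
// allocated that pod to a node.
type yunikornPlugin struct {
	allocations map[string]string // pod name -> node chosen by YuniKorn
}

// PreFilter answers "is this pod schedulable at all?" It succeeds only if
// YuniKorn has already made an allocation; a pod seen first by the default
// scheduler is rejected and lands in the "unschedulable" queue.
func (p *yunikornPlugin) PreFilter(pod string) bool {
	_, ok := p.allocations[pod]
	return ok
}

// Filter is called once per candidate node; it accepts only the node that
// YuniKorn selected, forcing the default scheduler to honor that decision.
func (p *yunikornPlugin) Filter(pod, node string) bool {
	assigned, ok := p.allocations[pod]
	return ok && assigned == node
}

func main() {
	p := &yunikornPlugin{allocations: map[string]string{"pod-a": "node-1"}}

	fmt.Println(p.PreFilter("pod-a"))        // true: YuniKorn allocated it
	fmt.Println(p.PreFilter("pod-b"))        // false: race, YuniKorn has not seen it yet
	fmt.Println(p.Filter("pod-a", "node-1")) // true: matches YuniKorn's choice
	fmt.Println(p.Filter("pod-a", "node-2")) // false: wrong node
}
```

The "pod-b" case illustrates the race discussed above: a pod the default scheduler processes before YuniKorn does must be rejected, because the plugin has no way to query or wait on the YuniKorn core.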
Even worse, because parallel state is kept between the two schedulers, and that state changes independently as cluster state changes, it's possible for the plugin chain that the Kubernetes scheduler uses and the one used internally by YuniKorn to arrive at different conclusions about whether a particular pod is schedulable on a particular node. When this happens, YuniKorn internally believes the pod is schedulable while the Kubernetes scheduler does not, leaving the pod in limbo, unable to make forward progress. We have observed this behavior in real clusters, and there really is no solution. After almost three years working on this feature, we are still left with fundamentally unsolvable issues such as this that arise from the inability to shoehorn YuniKorn's extensive functionality into the (purposefully) limited Scheduler Plugin API.

Due to all the duplicate processing and data structures required to implement YuniKorn as a plugin, as well as the inherent inefficiencies of the plugin API, we see scheduling throughput improvements of 2-4x and nearly half the memory usage when using the standard YuniKorn deployment mode vs. the plugin implementation. The standard deployment model is also much more stable, as there is a single source of truth for YuniKorn and the scheduler plugins to use. Since we call into all the standard plugins as part of pod / node evaluation, we support ALL features that the default scheduler does within YuniKorn.

---------------------
Impact on development
=====================

The plugin feature also imposes a drain on the development process. It doubles our testing efforts, as we need to spin up twice as many end-to-end testing scenarios as before (one for each Kubernetes release we support, times two for both scheduler implementations).
Contributors often don't test with the plugin version early, and because the two models are architecturally very different, it's very common for a developer to push a new PR, wait nearly an hour for the e2e tests to complete, only to find that the board is half green (standard mode) and half red (plugin mode). This results in increased dev cycles and a major loss of productivity, both of which would be eliminated if we no longer needed to maintain two implementations.

------------------------
Impact on supportability
========================

Many of you may recall the pain caused during the YuniKorn 1.4.0 release cycle, when we were forced to drop support for Kubernetes 1.23 and below from our support matrix. It was simply impossible to build a YuniKorn release that could work on both Kubernetes 1.23 and 1.27 simultaneously. That limitation was caused by the existence of the plugin mode. Had we not been limited by having the plugin functionality integrated, we would have been able to build against newer Kubernetes releases and still function at runtime on older clusters. We discovered this at the time, but decided it was best not to fragment the release by having some builds available on old Kubernetes releases and others not.

The very low-level and internal nature of the plugin API makes this an ongoing risk for future release efforts as well. Considering that upstream Kubernetes is currently discussing a fundamental redesign of how resources are used, this risk may become reality much sooner than we would like. It's not inconceivable that we could end up with a "flag day" where it is impossible to support something like Kubernetes 1.32 and 1.33 in the same release (versions are chosen for illustrative purposes, not as a prediction of when breakage may occur). This risk is much higher when deployment of YuniKorn in plugin mode is required.
------------------
Migration concerns
==================

For the most part, the standard and plugin deployment modes are interchangeable (by design). Plugin mode is activated by setting the helm variable "enableSchedulerPlugin" to "true", so reverting to the standard mode can be as simple as setting that variable to "false". This is especially true if YuniKorn is being run with the out-of-the-box default configuration. It is expected that the "enableSchedulerPlugin" attribute will be ignored, beginning with the same release where the plugin stops being enabled by default.

There is one area in which the two implementations differ behaviorally that may need to be addressed, depending on how YuniKorn is being used. The YuniKorn admission controller supports a pair of configuration settings ("admissionController.filtering.labelNamespaces" and "admissionController.filtering.noLabelNamespaces") which allow pods to be tagged with "schedulerName: yunikorn" without having an Application ID assigned to them if one was not already present. This is typically used in plugin mode to send non-YuniKorn pods to the YuniKorn scheduler while bypassing the normal YuniKorn queueing logic. When using this feature, non-labeled pods arrive at the YuniKorn scheduler without an Application ID assigned, causing the yunikorn plugin to disable itself and use only the Kubernetes scheduler processing chain.

In the standard YuniKorn deployment mode (as of YuniKorn 1.4+), these pods are automatically assigned a synthetic Application ID and processed in the same way as all other pods. Therefore, it is important to ensure that these pods can be mapped into an appropriate queue. When using the default, out-of-the-box configuration, this already occurs, as YuniKorn ships with a single default queue and all pods map to it. However, with custom configurations, it is necessary to ensure that a queue exists and that existing workloads can map successfully to it (ideally via placement rules).
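As an illustrative sketch only (not a recommendation for any particular deployment), a minimal custom queue configuration with a catch-all placement rule might look something like the following; consult the YuniKorn documentation for the authoritative schema before using it:

```yaml
partitions:
  - name: default
    placementrules:
      # Example rule: place each pod into a queue named after its namespace,
      # creating the queue on demand if it does not already exist.
      - name: tag
        value: namespace
        create: true
    queues:
      - name: root
        # Allow any user to submit; no quota set, so queues are unlimited.
        submitacl: '*'
```

With a rule like this, previously unlabeled pods that now receive a synthetic Application ID still land in a usable queue rather than being rejected.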
For maximal compatibility, this queue should be unlimited in size (no quota). We understand that this is a behavioral gap that would need to be addressed when migrating from the plugin mode to the standard mode. We do not have a change or solution ready for that gap yet. However, the placement rules and queue configuration are flexible enough to allow us to create a fix for this. We believe we will be able to provide the first steps towards closing that gap as part of the next release.

------------------
Potential timeline
==================

There are known users of the plugin feature in the community, so care must be taken in how and when the feature is removed. We need to give users time to migrate. We propose the following release timeline:

- YuniKorn 1.6.0 - Announce the deprecation of the plugin mode, but make no code changes.
- YuniKorn 1.7.0 - Emit warnings when the plugin mode is active, but nothing else.
- YuniKorn 1.8.0 - Stop testing and building the plugin as part of the normal development cycle. [**]
- YuniKorn 1.9.0 - Remove the implementation entirely.

[**] We do not intend to break compilation of the plugin scheduler as part of the 1.8.0 release, but we will no longer provide pre-compiled binaries. Users could still build the plugin themselves if required, but it would be untested and unsupported.

Given that YuniKorn releases tend to arrive at approximately 4-month intervals, and we are midway through the 1.6.0 development cycle, this gives roughly 18 months until the feature is removed completely (of course, this is only an estimate and not a commitment). For context, this is nearly as long as the feature has been publicly available at all.

---------------------------------
Frequently asked questions (FAQs)
=================================

- Why can't we just keep the existing plugin implementation around? Surely it can't be that difficult to maintain.

It's not simply a matter of difficulty in maintenance, though that is certainly a concern.
There are several "if plugin mode" branches in the k8shim that would be eliminated. Additionally, half of our e2e tests, which are run on every PR push, would no longer need to be run (and diagnosed when they fail). More importantly, we insulate ourselves from future Kubernetes code changes, as we no longer need to reach as deeply into private Kubernetes APIs. This has already proven to be an issue during the Kubernetes 1.23 - 1.24 transition, and is very likely to be an issue again. We would like to ensure support for the widest list of Kubernetes releases possible, and eliminating this code makes that much easier.

- Can YuniKorn support custom external scheduler plugins? Would this support change when the YuniKorn plugin mode no longer exists?

YuniKorn currently does not support building with external scheduler plugins. While that is in theory possible, it is extremely complex and non-trivial due to the duplicate plugin lists that the Kubernetes and YuniKorn schedulers use. Even custom configuration for existing plugins is problematic. Eliminating yunikorn as a plugin actually makes this much more viable: we could introduce functionality to customize the configuration of existing plugins, and users could patch YuniKorn with external plugins much more easily.

- Don't I need the plugin mode in order to deploy YuniKorn on a large existing cluster without fear of breaking things?

No. In fact, using the plugin mode in this way introduces much more potential instability than the standard deployment does, and means that instead of two schedulers in the cluster, you now have three (two of them just happen to live in the same process). The plugin mode is known to be slow, consume large amounts of memory, and be unstable under load. It's also a myth that reusing the default scheduler code leads to better compatibility with the default scheduler.
In addition to the instability plugin mode introduces, you are still building a custom scheduler that is very likely built against a different Kubernetes release than the one your cluster is running. For example, unless you are running a Kubernetes 1.29.2 cluster with default configurations (which is what YuniKorn 1.5.0 builds against), your scheduler implementation is not going to match the underlying cluster exactly. This is one of the reasons we run extensive end-to-end testing to help catch potential issues, but this isn't something that improves by using plugin mode. In short, regardless of which implementation you use, there's no substitute for adequate testing in non-prod environments. The perception may be that plugin mode reduces this burden, but it really doesn't. It adds significant complexity and instability which cannot be addressed.

Since the default YuniKorn deployment mode calls into all the scheduler plugins just as the default Kubernetes scheduler does, and in much the same way, it actually has the highest compatibility with the default Kubernetes scheduler. This isn't just theoretical -- we have multiple years of data running both implementations on a large variety of clusters that bears this out. Standard mode simply works better.

--------------
External links
--------------

[1] Scheduling Framework: https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/