[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472951#comment-17472951 ]

Craig Condit commented on YUNIKORN-941:
---------------------------------------

Committed #346 to master for admission controller changes.

> split scheduler and admission controller deployment
> ---------------------------------------------------
>
>                 Key: YUNIKORN-941
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-941
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>            Reporter: Kinga Marton
>            Assignee: Craig Condit
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: logs_322.zip
>
>
> To support proper YuniKorn upgrades and restarts we should move the admission
> controller out of the scheduler deployment and make it a separate deployment.
> This could also allow the admission controller to be made highly available and
> allow simpler no-downtime upgrades.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472861#comment-17472861 ]

Craig Condit commented on YUNIKORN-941:
---------------------------------------

Committed #60 for the helm chart changes; will commit #346 once the e2e tests run successfully.
[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472261#comment-17472261 ]

Craig Condit commented on YUNIKORN-941:
---------------------------------------

PR #346 opened for shim-side changes, and #60 for release (helm chart) changes.
[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17469824#comment-17469824 ]

Wilfred Spiegelenburg commented on YUNIKORN-941:
------------------------------------------------

The admission controller code has been updated since the draft of this change went in. I think it is better to finish the change from v1beta1 to the v1 version via YUNIKORN-938. It can be handled separately and without making changes to the way we do the certs etc. I have asked [~pbacsko] to look at that jira.
[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17461723#comment-17461723 ]

Craig Condit commented on YUNIKORN-941:
---------------------------------------

PR #346 has been opened as an alternative approach, with the admission controller doing its own cert management and webhook registration on startup. This avoids the race conditions and also doesn't require an init container, which simplifies the setup dramatically.
[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457032#comment-17457032 ]

Peter Bacsko commented on YUNIKORN-941:
---------------------------------------

I think the commit needs to be reverted and we should start working on the replacement of {{admission_util.sh}} and leverage the {{initContainers}} approach.

> Issue Type: Improvement
> Assignee: Peter Bacsko
> Priority: Blocker
[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457031#comment-17457031 ]

Peter Bacsko commented on YUNIKORN-941:
---------------------------------------

[~wwei] as Kinga explained, she ran into some unexpected issues regarding secrets. This is what happens when k8s wants to start the admission controller:

{noformat}
Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    5m4s                 default-scheduler  Successfully assigned yunikorn/yunikorn-admission-controller-5c46b58647-spxwk to yk8s-worker
  Warning  FailedMount  3m1s                 kubelet            Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[kube-api-access-55zht webhook-tls-certs]: timed out waiting for the condition
  Warning  FailedMount  54s (x10 over 5m4s)  kubelet            MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found
  Warning  FailedMount  47s                  kubelet            Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[webhook-tls-certs kube-api-access-55zht]: timed out waiting for the condition
{noformat}

This is from https://github.com/apache/incubator-yunikorn-k8shim/runs/4440291100?check_suite_focus=true

We can no longer create the secrets in the {{postStart}} / {{exec}} section. See Kinga's comment [above|https://issues.apache.org/jira/browse/YUNIKORN-941?focusedCommentId=17455091=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17455091].

> Issue Type: Improvement
> Assignee: Peter Bacsko
> Priority: Blocker
[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456681#comment-17456681 ]

Weiwei Yang commented on YUNIKORN-941:
--------------------------------------

PR https://github.com/apache/incubator-yunikorn-release/pull/50 caused the e2e failures; this needs more investigation. I attached the failure log to this JIRA.
[~pbacsko], [~kmarton] please take a look. Thanks

> Issue Type: Improvement
> Assignee: Peter Bacsko
> Priority: Blocker
[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455372#comment-17455372 ]

Chaoran Yu commented on YUNIKORN-941:
-------------------------------------

[~kmarton] Thanks for digging into it. I second your proposal.

> Issue Type: Improvement
> Assignee: Peter Bacsko
> Priority: Blocker
[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17455091#comment-17455091 ]

Kinga Marton commented on YUNIKORN-941:
---------------------------------------

[~yuchaoran], [~wilfreds] I suggest moving this issue out of the 0.12 release, and removing the changes in the release repository from the 0.12 branch once that branch is created.

I am suggesting this because I found the root cause of the failing precommit: the secret does not exist yet at the point where we want to mount it. The secret is created by the admission_util.sh script, which runs in a post start hook. And here we have a chicken and egg problem:
* the secret needs the TLS certs, which are created from the admission controller code, so in the current setup we cannot create the secret in an init container.

Instead of continuing to hack around the admission controller, I suggest removing the admission_util.sh script and using init containers to create all the necessary certificates and secrets, but this is a bigger piece of work.

There is a good article about how we can create the admission controllers in a more elegant way than we do now: [https://www.velotio.com/engineering-blog/managing-tls-certificate-for-kubernetes-admission-webhook]

> Issue Type: Improvement
> Assignee: Kinga Marton
> Priority: Blocker
[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17452833#comment-17452833 ]

Kinga Marton commented on YUNIKORN-941:
---------------------------------------

Thank you [~yuchaoran2011] for the review in the release repository. Can you please check the shim-side changes as well? These two depend on each other.

[https://github.com/apache/incubator-yunikorn-k8shim/pull/331]

> Issue Type: Improvement
> Assignee: Kinga Marton
> Priority: Blocker
[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17452463#comment-17452463 ]

Kinga Marton commented on YUNIKORN-941:
---------------------------------------

Some notes on the newly created charts:
* As [~yuchaoran2011] suggested, we will use a subchart for the admission controller.
* During a YK upgrade we want to make sure that no pods are handled by the default scheduler while YK is down, so it is essential to have the admission controller running while the scheduler is upgraded. With subcharts this is possible with the following steps:
** Update the admission controller. (helm upgrade only performs an upgrade if there are changes, so for this step we need to make sure that there are no changes in the scheduler.)
** After the admission controller is updated, update the helm deployment again, this time including the scheduler changes. Since helm will detect that the admission controller is already up to date, it won't touch it.
* We need to do the upgrade in these two steps because during a normal upgrade helm aggregates all the manifests into one and then sorts them by type and alphabetically, but it does not wait for the dependencies to be installed first. See more details in the following Helm documentation: [https://helm.sh/docs/topics/charts/#operational-aspects-of-using-dependencies]

> Issue Type: Improvement
> Assignee: Kinga Marton
> Priority: Major
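The two-step upgrade described in this comment could be scripted roughly as follows. This is only a sketch in Go that builds the two helm invocations in order. The release name, chart path, namespace and both values files are hypothetical; the assumption is that the first values file changes only the admission controller subchart and the second adds the scheduler changes. Only the standard `helm upgrade` flags `--namespace`, `--values` and `--wait` are used.

```go
package main

import (
	"fmt"
	"strings"
)

// upgradeStep is one helm invocation in the two-step upgrade.
type upgradeStep struct {
	description string
	args        []string // arguments passed to the helm binary
}

// twoPhaseUpgrade returns the helm invocations in the order they should run:
// first a release update that changes only the admission controller, then a
// second update that also carries the scheduler changes.
// All parameter values are supplied by the caller; none come from the charts.
func twoPhaseUpgrade(release, chart, namespace, admissionOnlyValues, fullValues string) []upgradeStep {
	base := []string{"upgrade", release, chart, "--namespace", namespace, "--wait"}
	// Copy base for every step so the steps never share a backing array.
	withValues := func(file string) []string {
		args := append([]string{}, base...)
		return append(args, "--values", file)
	}
	return []upgradeStep{
		{description: "upgrade the admission controller only", args: withValues(admissionOnlyValues)},
		{description: "upgrade the scheduler", args: withValues(fullValues)},
	}
}

func main() {
	// Hypothetical names, for illustration only.
	steps := twoPhaseUpgrade("yunikorn", "./yunikorn", "yunikorn",
		"admission-only.yaml", "full.yaml")
	for i, s := range steps {
		fmt.Printf("step %d (%s): helm %s\n", i+1, s.description, strings.Join(s.args, " "))
	}
}
```

The sketch only prints the commands; running them (for example with `os/exec`) and stopping when the first step fails is left to the caller. The `--wait` flag makes helm block until the updated resources are ready, so the second step starts only after the admission controller is running again.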
[jira] [Commented] (YUNIKORN-941) split scheduler and admission controller deployment
[ https://issues.apache.org/jira/browse/YUNIKORN-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448112#comment-17448112 ]

Kinga Marton commented on YUNIKORN-941:
---------------------------------------

I created the following two PRs:

[https://github.com/apache/incubator-yunikorn-release/pull/50]

[https://github.com/apache/incubator-yunikorn-k8shim/pull/331]

In these PRs I just moved the admission controller related things out of the scheduler image. Now that we have it in a different deployment, independent of the scheduler, I would move forward and try to remove the admission_util.sh script and handle the admission controller from helm charts or from code, without running shell scripts.

> Issue Type: Improvement
> Assignee: Kinga Marton
> Priority: Major