[ https://issues.apache.org/jira/browse/YUNIKORN-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Bacsko resolved YUNIKORN-1998.
------------------------------------
    Fix Version/s: 1.4.0
       Resolution: Fixed

> Stale AdmissionControllerConf was used in e2e test
> --------------------------------------------------
>
>                 Key: YUNIKORN-1998
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1998
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: test - e2e
>            Reporter: Yu-Lin Chen
>            Assignee: Yu-Lin Chen
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.4.0
>
>
> In the e2e test for "user_group_limit", we update the YuniKorn ConfigMap
> before submitting sleep pods.
> [https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/user_group_limit/user_group_limit_test.go#L67-L70]
> However, the AdmissionControllerConf in the admission controller does not
> reflect the change immediately. This time gap leads to the following error
> in the CI flow:
> [https://github.com/apache/yunikorn-k8shim/actions/runs/6282144385/job/17062987755?pr=677#step:5:2346]
> We need a way to ensure that the AdmissionControllerConf has been updated
> before the e2e test submits new pods.
> The issue can be reproduced by introducing a 1-second delay where the
> admission controller updates the AdmissionControllerConf. Here is the
> example code.
> {code:go}
> func (h *configMapUpdateHandler) OnUpdate(_, newObj interface{}) {
>     cm := utils.Convert2ConfigMap(newObj)
>
>     // Sleep for 1 second to delay the AdmissionControllerConf update
>     // (added only to reproduce the issue).
>     time.Sleep(1 * time.Second)
>
>     if idx, ok := h.configMapIndex(cm); ok {
>         h.conf.configUpdated(idx, cm)
>     }
> } {code}
> [https://github.com/apache/yunikorn-k8shim/blob/master/pkg/admission/conf/am_conf.go#L237-L238]
> Below are the admission controller (AM) logs without and with the 1-second
> delay.
> Success without the 1-second delay:
> * (AM configMapUpdateHandler) 05:36:39.184Z. AdmissionControllerConf trying
> to upgrade config
> * (AM configMapUpdateHandler) 05:36:39.185Z. AdmissionControllerConf config
> upgraded
> * (AM Webhook) 05:36:39.218Z. AM receives AdmissionReview request
> * (AM Webhook) 05:36:39.221Z. AM checks pods with the new config, e2e test
> passed
> Failure with the 1-second delay:
> * (AM configMapUpdateHandler) 08:19:31.025Z. AdmissionControllerConf trying
> to upgrade config
> * (AM Webhook) 08:19:31.067Z. AM receives AdmissionReview request
> * (AM Webhook) 08:19:31.069Z. *AM checks pods with stale config*, *e2e test
> failed*
> * (AM configMapUpdateHandler) 08:19:32.026Z. AdmissionControllerConf config
> upgraded
> In my kind cluster (v1.24.15), the gap between the config being upgraded and
> the pod check is only 0.036 seconds (05:36:39.185Z to 05:36:39.221Z).
> It is possible that the admission controller in the CI flow runs those steps
> in a different order.
> Possible solutions:
> 1. In the e2e test, sleep for 1 second between updating the ConfigMap and
> submitting new sleep pods. (This is a quick fix that assumes the time gap is
> always less than 1 second.)
> 2. In the e2e test, check the current value of the AdmissionControllerConf
> before submitting a new pod. (How can a client dump the current
> AdmissionControllerConf? I need more advice on this.)
> Since this issue only impacts the e2e test for now, we can go with solution
> #1 as a quick fix. It would be better, though, to let a client check whether
> the AdmissionControllerConf is up to date.
> Please let me know if I have misunderstood anything.