Yu-Lin Chen created YUNIKORN-1998:
-------------------------------------

             Summary: Stale AdmissionControllerConf was used in e2e test
                 Key: YUNIKORN-1998
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1998
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: test - e2e
            Reporter: Yu-Lin Chen


In the e2e test for "user_group_limit", we update the YuniKorn ConfigMap before submitting 
sleep pods.  
[https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/user_group_limit/user_group_limit_test.go#L67-L70]

However, the AdmissionControllerConf in the admission controller doesn't 
immediately reflect the change. This time gap leads to the following error in the CI flow: 
 
[https://github.com/apache/yunikorn-k8shim/actions/runs/6282144385/job/17062987755?pr=677#step:5:2346]


We need to find a way to ensure that AdmissionControllerConf has been updated before 
the e2e test submits new pods.


This issue can be reproduced by introducing a 1 second delay when the admission 
controller updates the AdmissionControllerConf. Here is the example code:
{code:java}
func (h *configMapUpdateHandler) OnUpdate(_, newObj interface{}) {
    cm := utils.Convert2ConfigMap(newObj)
    
    // Sleep 1 second to delay the AdmissionControllerConf update
    time.Sleep(1 * time.Second)
    
    if idx, ok := h.configMapIndex(cm); ok {
        h.conf.configUpdated(idx, cm)
    }
}
{code}
[https://github.com/apache/yunikorn-k8shim/blob/master/pkg/admission/conf/am_conf.go#L237-L238]




Below are the AM logs before and after adding the 1 sec delay:



Success without the 1 sec delay:
 * (AM configMapUpdateHandler) 05:36:39.184Z  AdmissionControllerConf trying to upgrade config
 * (AM configMapUpdateHandler) 05:36:39.185Z  AdmissionControllerConf config upgraded
 * (AM Webhook)                05:36:39.218Z  AM receives AdmissionReview request
 * (AM Webhook)                05:36:39.221Z  AM checks Pods with new config, E2E test passed

Failed with the 1 sec delay:
 * (AM configMapUpdateHandler) 08:19:31.025Z  AdmissionControllerConf trying to upgrade config
 * (AM Webhook)                08:19:31.067Z  AM receives AdmissionReview request
 * (AM Webhook)                08:19:31.069Z  {color:#FF0000}*AM checks Pods with stale config*{color}, {color:#FF0000}*E2E test failed*{color}
 * (AM configMapUpdateHandler) 08:19:32.026Z  AdmissionControllerConf config upgraded

In my kind cluster (v1.24.15), the gap is only 0.033 sec. It's possible that 
the admission controller in the CI flow runs those steps in a different order.



Possible solutions:
1. In the e2e test, sleep for 1 second between updating the ConfigMap and 
submitting new sleep pods. (This is a quick fix; it assumes the time gap is always 
less than 1 sec.) 
2. In the e2e test, check the current AdmissionControllerConf value before submitting 
a new pod. (How can a client dump the current AdmissionControllerConf? I need to seek 
more advice on this.)

Since this issue only impacts the e2e test for now, we can go with solution #1 as a 
quick fix.  But it would be better to allow a client to check whether 
AdmissionControllerConf is up to date.

Please kindly let me know if I have any misunderstandings. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
