[ 
https://issues.apache.org/jira/browse/YUNIKORN-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492936#comment-17492936
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-946:
------------------------------------------------

[~anuraagn] it would be good to revisit this with a build that includes the fix 
for YUNIKORN-876. That change fixed a memory leak on the shim side that left 
pod references around. The leak could also have left the pods in the core, and 
thus in the web UI.

However, there is a more likely candidate: YUNIKORN-766, which fixed the 
applicationID ordering. Before that fix, executors that really belonged to the 
same app could end up with two different app IDs. That fact was mentioned in 
the slack channel when [~ashutosh-pepper] saw the leak. I think it was 
dismissed way too quickly.
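
To make the ordering problem concrete, here is a minimal sketch of how a 
lookup-order mismatch could give two executors of the same app different IDs. 
It is an illustration only, not the shim code: the function name, the label 
keys and the label values are all assumptions made up for this example.

```python
# Hypothetical illustration only: this is NOT the YuniKorn shim code.
# The label keys and values below are assumed for the example.

def resolve_app_id(labels, lookup_order):
    """Return the value of the first key from lookup_order found in labels."""
    for key in lookup_order:
        value = labels.get(key)
        if value:
            return value
    return None


# Two possible lookup orders over the same (assumed) label keys.
ORDER_A = ["applicationId", "spark-app-selector"]
ORDER_B = ["spark-app-selector", "applicationId"]

# Two executors of one app: the first carries both labels, the second only one.
executor_1 = {"applicationId": "app-0001", "spark-app-selector": "spark-abc123"}
executor_2 = {"spark-app-selector": "spark-abc123"}

# With ORDER_A the executors resolve to different IDs, so they would be
# tracked as two separate apps.
ids_a = {resolve_app_id(p, ORDER_A) for p in (executor_1, executor_2)}

# With ORDER_B both executors resolve to the same ID.
ids_b = {resolve_app_id(p, ORDER_B) for p in (executor_1, executor_2)}

print(len(ids_a), len(ids_b))
```

If pods of one app are split over two IDs like this, cleanup that is keyed on 
one ID could plausibly miss the pods tracked under the other, which would 
match the symptom described in this issue.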

> Accounting resources for deleted executor pods
> ----------------------------------------------
>
>                 Key: YUNIKORN-946
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-946
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 0.11
>            Reporter: Ashutosh Singh
>            Priority: Critical
>         Attachments: image-2021-11-16-23-17-42-819.png, 
> image-2021-11-16-23-18-28-349.png
>
>
> Even when executors are deleted, the YK UI shows that resources are consumed by 
> the already deleted pod. _kubectl get pods_ does not show the executor, but the 
> YK UI still shows the deleted pod consuming resources, even after a few hours. 
> This results in leaked cluster resources.
> Steps:
>  # Run a spark application using k8s spark operator
>  # Wait for executors to be in running state.
>  # Delete the application using `kubectl delete sparkapplications <appName>` 
> OR `kubectl delete -f <yaml-file>`
>  # All the driver and executor pods are deleted. Check with `kubectl get pods`.
>  # However, the YK UI still shows some of the executors as running and consuming 
> resources. This leaks resources: they are counted as used and cannot be 
> allocated to pending pods.
> More details: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1637126093006900]
> !image-2021-11-16-23-18-28-349.png|width=534,height=323!
>  
> !image-2021-11-16-23-17-42-819.png|width=583,height=353!



