[ 
https://issues.apache.org/jira/browse/YUNIKORN-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493801#comment-17493801
 ] 

Anuraag Nalluri commented on YUNIKORN-946:
------------------------------------------

We can conclude that this bug is caused by the issue fixed in YUNIKORN-776.

To reach this conclusion, we built the scheduler at two commits: the one 
preceding and the one following the merge of YUNIKORN-776. On both builds we ran 
spark-pi applications and supplied custom applicationIds that conflict with the 
default spark-generated job IDs. Before YUNIKORN-776, we can see the application 
is initially created under the spark-generated job ID, while the completion 
event surfaces on the dashboard for the custom applicationId we provided. This 
means the api-server's pod-delete event informed the wrong application, leaving 
a hanging allocation under the application keyed by the spark job ID.

With the commit following YUNIKORN-776, we started 3 spark-pi applications with 
custom applicationIds. In _all_ cases the allocation was both issued for and 
released from the provided applicationId. This makes sense because the logic now 
always checks the applicationId first, before falling back to the 
spark-generated app ID: 
[https://github.com/apache/incubator-yunikorn-k8shim/pull/288/files]
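
To make the before/after difference concrete, here is a minimal, hedged sketch 
of the ID-resolution ordering. This is not the actual k8shim code; the label 
keys ("applicationId", "spark-app-selector") and the function names are 
assumptions used only for illustration:

{code:go}
// Minimal sketch of the ID-resolution mismatch; label keys and function
// names are assumptions, not the real k8shim implementation.
package main

import "fmt"

// Pod labels, reduced to a plain map for illustration.
type Pod struct {
	Labels map[string]string
}

// resolveAppID prefers the user-supplied applicationId and falls back to
// the spark-generated ID, mirroring the post-YUNIKORN-776 ordering.
func resolveAppID(pod Pod) string {
	if id, ok := pod.Labels["applicationId"]; ok {
		return id
	}
	return pod.Labels["spark-app-selector"]
}

// resolveAppIDOld mirrors the assumed pre-fix ordering on the path that
// first registered the allocation: the spark-generated ID wins there,
// while the delete/complete path resolved the custom applicationId, so
// the two events targeted different applications.
func resolveAppIDOld(pod Pod) string {
	if id, ok := pod.Labels["spark-app-selector"]; ok {
		return id
	}
	return pod.Labels["applicationId"]
}

func main() {
	pod := Pod{Labels: map[string]string{
		"applicationId":      "my-custom-app-id", // supplied by the user
		"spark-app-selector": "spark-a1b2c3",     // generated by spark
	}}
	// Pre-fix: allocation registered under one ID, release sent to the other,
	// leaving a hanging allocation under the spark job ID's application.
	fmt.Println("pre-fix add path:   ", resolveAppIDOld(pod)) // spark-a1b2c3
	fmt.Println("pre-fix delete path:", resolveAppID(pod))    // my-custom-app-id
	// Post-fix: both paths use the applicationId-first lookup, so the
	// allocation is issued for and released from the same application.
	fmt.Println("post-fix both paths:", resolveAppID(pod)) // my-custom-app-id
}
{code}

The key point is that after YUNIKORN-776 every path resolves a pod to the same 
application, so the release always finds the allocation it needs to free.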

Attached screenshots to this ticket showing both of these scenarios. Thank you 
[~ashutosh-pepper] for reporting and [~wilfreds] for providing additional 
context. 

> Accounting resources for deleted executor pods
> ----------------------------------------------
>
>                 Key: YUNIKORN-946
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-946
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 0.11
>            Reporter: Ashutosh Singh
>            Assignee: Anuraag Nalluri
>            Priority: Critical
>         Attachments: image-2021-11-16-23-17-42-819.png, 
> image-2021-11-16-23-18-28-349.png
>
>
> Even when executors are deleted, the YK UI shows that resources are consumed 
> by the already-deleted pods. _kubectl get pods_ does not show the executor, 
> but the YK UI keeps showing the deleted pod consuming resources even after a 
> few hours. 
> It results in leaked cluster resources.
> Steps:
>  # Run a spark application using k8s spark operator
>  # Wait for executors to be in running state.
>  # Delete the application using `kubectl delete sparkapplications <appName>` 
> OR `kubectl delete -f <yaml-file>`
>  # All the driver and executor pods would be deleted. check `kubectl get pods`
>  # However, the YK UI still shows some of the executors running and consuming 
> resources. This leaks the resources, as they are considered used and cannot 
> be allocated to pending pods.
> More details: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1637126093006900]
> !image-2021-11-16-23-18-28-349.png|width=534,height=323!
>  
> !image-2021-11-16-23-17-42-819.png|width=583,height=353!



