[ 
https://issues.apache.org/jira/browse/YUNIKORN-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440190#comment-17440190
 ] 

Wilfred Spiegelenburg edited comment on YUNIKORN-936 at 11/8/21, 6:17 AM:
--------------------------------------------------------------------------

That could well be the case. I had not seen that jira. I based this jira on the
code analysis I did while reviewing the changes, stepping through the process
flow in my head. I did not trace the code in a debugger or look at any logs
when I logged this. I did some minor checks in the shim code to make sure my
assumptions about how that code works were correct.


> app and node recovery event ordering
> ------------------------------------
>
>                 Key: YUNIKORN-936
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-936
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - common
>            Reporter: Wilfred Spiegelenburg
>            Priority: Major
>
> While working on YUNIKORN-905 a number of unit tests failed due to event
> ordering. Looking at the change, we might have had an issue in the RMProxy
> for a long time.
> An update request could contain apps, asks and nodes, and processing
> followed that same order. During recovery the order was, and still is,
> important. However, there was never an ordering requirement on the events
> sent by a shim, nor did the shim use complex update events to enforce this
> ordering.
> An event to recover a node could arrive in a separate UpdateRequest from the
> applications that should be recovered. That means we relied on the goroutine
> and event ordering to hopefully do things correctly: i.e. events sent by the
> shim to create new apps would be processed before node recovery started. Even
> in the previous implementation there was no guarantee that all the
> applications were added before a node was recovered. The unit tests in the
> core relied on that processing order to make sure it worked.
> That is not the real-world scenario, and thus a dangerous assumption.
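
The ordering dependency described in the issue can be illustrated with a
minimal Go sketch. This is not the YuniKorn core or RMProxy code: the
UpdateRequest, NodeRecovery and Scheduler types below are hypothetical,
simplified stand-ins, and asks are left out. It only shows that processing
apps before nodes inside one request guarantees nothing once the apps and the
node arrive in separate requests.

```go
package main

import "fmt"

// NodeRecovery is a hypothetical record: a node being recovered and the
// applications that already have allocations on it.
type NodeRecovery struct {
	NodeID string
	AppIDs []string
}

// UpdateRequest is a simplified, hypothetical stand-in for a shim update.
type UpdateRequest struct {
	NewApps []string
	Nodes   []NodeRecovery
}

// Scheduler only tracks which applications are known (assumption: the real
// core keeps far more state than this).
type Scheduler struct {
	apps map[string]bool
}

func NewScheduler() *Scheduler {
	return &Scheduler{apps: make(map[string]bool)}
}

// Process handles apps before nodes within a single request and returns the
// application IDs that node recovery referenced but could not find.
func (s *Scheduler) Process(req UpdateRequest) []string {
	for _, id := range req.NewApps {
		s.apps[id] = true
	}
	var missing []string
	for _, n := range req.Nodes {
		for _, id := range n.AppIDs {
			if !s.apps[id] {
				missing = append(missing, id)
			}
		}
	}
	return missing
}

func main() {
	apps := UpdateRequest{NewApps: []string{"app-1"}}
	node := UpdateRequest{Nodes: []NodeRecovery{
		{NodeID: "node-1", AppIDs: []string{"app-1"}},
	}}

	// One combined request: the in-request order puts apps first.
	combined := NewScheduler()
	fmt.Println("combined:", combined.Process(UpdateRequest{
		NewApps: apps.NewApps,
		Nodes:   node.Nodes,
	}))

	// Two separate requests, with the node request handled first.
	split := NewScheduler()
	fmt.Println("node first:", split.Process(node))
	fmt.Println("apps second:", split.Process(apps))
}
```

In the combined case nothing is reported missing. In the split case the node
recovery reports app-1 as unknown, because nothing forces the app request to
be handled before the node request.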



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
