[
https://issues.apache.org/jira/browse/YUNIKORN-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440190#comment-17440190
]
Wilfred Spiegelenburg edited comment on YUNIKORN-936 at 11/8/21, 6:17 AM:
--------------------------------------------------------------------------
That could well be the case. I had not seen that jira. I based this jira on the
code analysis I did stepping through the process flow in my head while I was
reviewing the changes. I did not trace the code in a debugger or look at any
logs when I logged this. I did some minor checks in the shim code to
make sure my assumptions on how that code worked were correct.
> app and node recovery event ordering
> ------------------------------------
>
> Key: YUNIKORN-936
> URL: https://issues.apache.org/jira/browse/YUNIKORN-936
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - common
> Reporter: Wilfred Spiegelenburg
> Priority: Major
>
> While working on YUNIKORN-905 a number of unit tests failed due to event
> ordering. Looking at the change we might have had an issue in the RMProxy for
> a long time.
> An update request could contain apps, asks and nodes. Processing was ordered
> like that too. During recovery the order was/is important. There was never an
> order requirement on the events sent by a shim, nor any use of complex update
> events by the shim to support this ordering.
> An event to recover a node could be a separate UpdateRequest from the
> applications that should be recovered. That means we relied on the go routine
> and event ordering to hopefully do things correctly: i.e. events sent by the
> shim to create new apps would be processed before node recovery started. Even
> in the previous implementation there was no guarantee that all the
> applications were added before a node was recovered. The unit tests in the
> core used the order processing dependency to make sure it worked.
> That is not the real-world scenario, and thus a dangerous assumption.