I agree, but there is a core element here that might be worth considering for IA, which is the ability to flag a node as [temporarily] unhealthy or unavailable so application logic can use that as a part of the IA toolset. Just a thought... :)
Thanks,
Raymond.

On Fri, Feb 19, 2021 at 2:05 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:

> Hello!
>
> This sounds like a too detailed and peculiar scenario that should be taken care of on the application level, as you already do.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> On Wed, 17 Feb 2021 at 23:50, Raymond Wilson <raymond_wil...@trimble.com> wrote:
>
>> Hi Ilya,
>>
>> Sorry, that was a response to another problem!
>>
>> In this case, we have a more asynchronous mode of query-response where the processing node can asynchronously send back a response to a query. The reasons for this are: (1) some responses are effectively streams of data and we can't structure them as a single response, and (2) we can have thousands of concurrent requests per node, which causes thread pool exhaustion and response starvation due to the synchronous nature of the IComputeFunc.Invoke() method.
>>
>> eg: We may have a request sequence like this where A, B and C are nodes in the grid
>>
>> Request: A -> B -> C
>> Response: C -> B -> A
>>
>> If node B goes away unexpectedly, requests executing on 'C' can't send their response and the request fails.
>>
>> From the perspective of A, it may attempt a retry after failing to receive the response from B, but that's unsatisfactory for other reasons.
>>
>> I have built a POC that permits nodes to emit an application level availability state which requestors can use to exclude certain nodes from their request topology projections. This means a node being removed due to auto-scale down or container scheduling can gracefully exit the grid after ensuring the active requests it is involved in can complete normally. In the case above, node B would be a client node providing services through a web API gateway (A) and requesting results from co-located processing on node C.
>>
>> Thanks,
>> Raymond.
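A minimal sketch of the application-level availability state described in the message above, assuming plain Java with no Ignite dependency (the thread itself uses Ignite.NET). The names `Availability`, `AvailabilityRegistry`, `announce` and `eligible` are hypothetical, node IDs are plain strings, and how the state is propagated between nodes is deliberately left out. In a real grid the `isAvailable` check could presumably back a projection predicate such as `ClusterGroup.forPredicate`, but that wiring is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical application-level availability states; not an Ignite type. */
enum Availability { AVAILABLE, DRAINING }

/** Hypothetical registry of the state each node has announced. */
class AvailabilityRegistry {
    private final Map<String, Availability> states = new ConcurrentHashMap<>();

    /** Records the state a node has announced, e.g. DRAINING before scale-in. */
    void announce(String nodeId, Availability state) {
        states.put(nodeId, state);
    }

    /** Nodes that never announced a state are treated as available. */
    boolean isAvailable(String nodeId) {
        return states.getOrDefault(nodeId, Availability.AVAILABLE) == Availability.AVAILABLE;
    }

    /** Keeps only the nodes a requestor may include in its projection, preserving order. */
    List<String> eligible(List<String> topology) {
        List<String> result = new ArrayList<>();
        for (String nodeId : topology) {
            if (isAvailable(nodeId)) {
                result.add(nodeId);
            }
        }
        return result;
    }
}

class AvailabilityDemo {
    public static void main(String[] args) {
        AvailabilityRegistry registry = new AvailabilityRegistry();
        List<String> topology = Arrays.asList("B1", "B2", "B3");

        // B2 is about to be scaled in, so it asks requestors to stop targeting it.
        registry.announce("B2", Availability.DRAINING);

        // Requestors build their projection from the remaining nodes only.
        System.out.println(registry.eligible(topology));
    }
}
```

The filter is applied by the requestor at projection time, so a draining node stays in the grid and can still deliver responses for requests it is already part of.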
>>
>>
>> On Thu, Feb 18, 2021 at 9:15 AM Raymond Wilson <raymond_wil...@trimble.com> wrote:
>>
>>> Hi Ilya,
>>>
>>> That is the current method we use to stop the grid.
>>>
>>> However, this can leave uncheckpointed changes in the in-memory stores (only in the WAL), so when we restart the grid it goes into the cache recovery mode, which is very slow.
>>>
>>> Raymond.
>>>
>>> On Thu, Feb 18, 2021 at 3:34 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> Why can't you just use Ignite.stop(instanceName, false)?
>>>>
>>>> Just make sure your projections are not singleton and the tasks will be rolled over.
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> On Tue, 9 Feb 2021 at 06:41, Raymond Wilson <raymond_wil...@trimble.com> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> We have a very similar requirement to the one described in this item: https://issues.apache.org/jira/browse/IGNITE-10872
>>>>>
>>>>> Namely, when removing a node from an Ignite grid, we want to do two things:
>>>>>
>>>>> 1. Prevent new requests from reaching it
>>>>> 2. Allow all running requests the node is involved in to complete before it terminates.
>>>>>
>>>>> The solution outlined in 10872 partially solves these elements within our architecture in that it allows Ignite to pause shutdown of the node until all requests are completed (and, I assume, prevent new requests from reaching the node being shut down).
>>>>>
>>>>> In our architecture the phrase 'requests the node is involved in' may be opaque from the perspective of Ignite due to an asynchronous calling model we are using to permit very large numbers of concurrent requests to execute without saturating the Ignite thread pools.
>>>>> What this means is that a node that may be a candidate to be shut down may be waiting for a response from another node on the grid in a way that Ignite can't see, so Ignite would determine the node was safe to shut down when it is not.
>>>>>
>>>>> A good example of this in our system is an Apply style Ignite call where the request is sent to one of a set of nodes. That set of nodes may scale in/out due to request demand. On a scale in operation, the node to be removed needs to be excluded from the topology projection constructed to perform the Apply() against. Once we are satisfied the node is involved in no further requests (eg: by a simple timeout) then we would proceed with actual shut down of that node.
>>>>>
>>>>> I have not seen any capability in Ignite today where a node can be 'un-blessed'; does one exist? Or should we construct this facility within our application logic layer?
>>>>>
>>>>> Thanks,
>>>>> Raymond.
>>>>>
>>>>> --
>>>>> <http://www.trimble.com/>
>>>>> Raymond Wilson
>>>>> Solution Architect, Civil Construction Software Systems (CCSS)
>>>>> 11 Birmingham Drive | Christchurch, New Zealand
>>>>> raymond_wil...@trimble.com
>>>>>
>>>>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
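
The scale-in sequence from the original message (stop admitting requests, let in-flight ones finish, fall back to a timeout, then shut down) could be sketched roughly as follows. This is plain Java with no Ignite dependency, and `RequestTracker`, `tryBegin`, `end` and `drain` are hypothetical names. It assumes the application wraps every request a node takes part in, including the asynchronous ones Ignite cannot see, in a `tryBegin`/`end` pair.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical per-node counter of requests the application knows are in flight. */
class RequestTracker {
    private int inFlight;
    private boolean draining;

    /** Admits a request unless draining has started; callers must reject on false. */
    synchronized boolean tryBegin() {
        if (draining) {
            return false;
        }
        inFlight++;
        return true;
    }

    /** Marks one admitted request as finished and wakes a waiting drain() when idle. */
    synchronized void end() {
        inFlight--;
        if (inFlight == 0) {
            notifyAll();
        }
    }

    /**
     * Stops admitting requests, then waits up to timeoutMillis for in-flight ones.
     * Returns true if the node became idle, false on timeout or interruption.
     */
    synchronized boolean drain(long timeoutMillis) {
        draining = true;
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (inFlight > 0) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            try {
                TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}

class DrainDemo {
    public static void main(String[] args) {
        RequestTracker tracker = new RequestTracker();
        tracker.tryBegin(); // one request is in flight on this node

        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(50); // simulated work
            } catch (InterruptedException ignored) {
                // fall through and finish the request
            }
            tracker.end();
        });
        worker.start();

        boolean drained = tracker.drain(1000);
        // Only after drain() returns would the application stop the node.
        System.out.println("drained=" + drained + ", accepts new=" + tracker.tryBegin());
    }
}
```

A timeout is kept as the fallback because a peer that disappears mid-request would otherwise block the shutdown indefinitely.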