Hello!

This sounds like too detailed and peculiar a scenario, and it should be taken care of at the application level, as you already do.
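To make "at the application level" a bit more concrete, below is a minimal sketch of the drain pattern discussed in this thread: a node flags itself as draining, requestors leave it out of the set of nodes they target, and shutdown proceeds once in-flight requests have finished or a timeout expires. It is plain Java with no Ignite calls. `DrainSketch`, `NodeState`, `tryBegin`, `end`, `drain` and `projection` are hypothetical names made up for illustration, and how the availability state is published to other nodes is deliberately left out.

```java
import java.util.Collection;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

// Hypothetical application-level availability tracking; not an Ignite API.
final class DrainSketch {

    static final class NodeState {
        private final String id;
        private boolean draining; // set once the node is scheduled for removal
        private int inFlight;     // requests this node is currently involved in

        NodeState(String id) { this.id = id; }

        String id() { return id; }

        synchronized boolean isAvailable() { return !draining; }

        // Register a new request; refused once the node is draining.
        synchronized boolean tryBegin() {
            if (draining) return false;
            inFlight++;
            return true;
        }

        // Mark a request as finished and wake up a waiting drain() call.
        synchronized void end() {
            inFlight--;
            notifyAll();
        }

        // Stop accepting work, then wait until idle or until the timeout expires.
        // Returns true only if no requests remain in flight.
        synchronized boolean drain(long timeoutMillis) {
            draining = true;
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            while (inFlight > 0) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) return false;
                try {
                    wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    // Ids of the nodes a requestor may still target (the "projection").
    static List<String> projection(Collection<NodeState> nodes) {
        return nodes.stream()
                .filter(NodeState::isAvailable)
                .map(NodeState::id)
                .collect(Collectors.toList());
    }
}
```

In this sketch a node being scaled in would call `drain(timeout)` and stop only after it returns, while requestors build their target list through `projection(...)` and wrap each request in `tryBegin()` / `end()`. The check and the counter update share one lock so that a request cannot slip in between the draining flag being set and the in-flight count being read.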

Regards,
--
Ilya Kasnacheev

Wed, 17 Feb 2021 at 23:50, Raymond Wilson <raymond_wil...@trimble.com>:

> Hi Ilya,
>
> Sorry, that was a response to another problem!
>
> In this case, we have a more asynchronous mode of query-response where the
> processing node can asynchronously send back a response to a query. The
> reasons for this are: (1) Some responses are effectively streams of data
> and we can't structure them as a single response, and (2) we can have
> thousands of concurrent requests per node, which causes thread pool
> exhaustion and response starvation due to the synchronous nature of the
> IComputeFunc.Invoke() method.
>
> eg: We may have a request sequence like this where A, B and C are nodes in
> the grid
>
> Request: A -> B -> C
> Response: C -> B -> A
>
> If node B goes away unexpectedly, requests executing on 'C' can't send
> their response and the request fails.
>
> From the perspective of A, it may attempt a retry after failing to receive
> the response from B, but that's unsatisfactory for other reasons.
>
> I have built a POC that permits nodes to emit an application level
> availability state which requestors can use to exclude certain nodes from
> their request topology projections. This means a node being removed due to
> auto-scale down or container scheduling can gracefully exit the grid after
> ensuring the active requests it is involved in can complete normally. In
> the case above, node B would be a client node providing services through a
> web API gateway (A) and requesting results from co-located processing on
> node C.
>
> Thanks,
> Raymond.
>
>
> On Thu, Feb 18, 2021 at 9:15 AM Raymond Wilson <raymond_wil...@trimble.com>
> wrote:
>
>> Hi Ilya,
>>
>> That is the current method we use to stop the grid.
>>
>> However, this can leave uncheckpointed changes in the in-memory stores
>> (only in the WAL), so when we restart the grid it goes into the cache
>> recovery mode which is very slow.
>>
>> Raymond.
>>
>> On Thu, Feb 18, 2021 at 3:34 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com>
>> wrote:
>>
>>> Hello!
>>>
>>> Why can't you just use Ignite.stop(instanceName, false)?
>>>
>>> Just make sure your projections are not singleton and the tasks will be
>>> rolled over.
>>>
>>> Regards,
>>> --
>>> Ilya Kasnacheev
>>>
>>>
>>> Tue, 9 Feb 2021 at 06:41, Raymond Wilson <raymond_wil...@trimble.com>:
>>>
>>>> All,
>>>>
>>>> We have a very similar requirement as described in this item:
>>>> https://issues.apache.org/jira/browse/IGNITE-10872
>>>>
>>>> Namely, when removing a node from an Ignite grid, we want to do two
>>>> things:
>>>>
>>>> 1. Prevent new requests from reaching it
>>>> 2. Allow all running requests the node is involved in to complete
>>>> before it terminates.
>>>>
>>>> The solution outlined in 10872 partially solves these elements within
>>>> our architecture in that it allows Ignite to pause shutdown of the node
>>>> until all requests are completed (and, I assume, prevent new requests from
>>>> reaching the node being shut down).
>>>>
>>>> In our architecture the phrase 'requests the node is involved in' may
>>>> be opaque from the context of Ignite due to an asynchronous calling model
>>>> we are using to permit very large numbers of concurrent requests to execute
>>>> without saturating the Ignite thread pools. What this means is that a node
>>>> that may be a candidate to be shut down may be waiting for a response from
>>>> another node on the grid in a way that Ignite can't see, so Ignite would
>>>> determine the node was safe to shut down when it is not.
>>>>
>>>> A good example of this in our system is an Apply style Ignite call
>>>> where the request is sent to one of a set of nodes. That set of nodes may
>>>> scale in/out due to request demand. On a scale in operation, the node to be
>>>> removed needs to be excluded from the topology projection constructed to
>>>> perform the Apply() against.
>>>> Once we are satisfied the node has no further
>>>> requests involved (eg: by a simple timeout) then we would proceed with
>>>> actual shut down of that node.
>>>>
>>>> I have not seen any capability in Ignite today where a node can be
>>>> 'un-blessed'; does one exist? Or should we construct this facility within
>>>> our application logic layer?
>>>>
>>>> Thanks,
>>>> Raymond.
>>>>
>>>>
>>>> --
>>>> <http://www.trimble.com/>
>>>> Raymond Wilson
>>>> Solution Architect, Civil Construction Software Systems (CCSS)
>>>> 11 Birmingham Drive | Christchurch, New Zealand
>>>> raymond_wil...@trimble.com
>>>>
>>>>
>>>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>