Re: Graceful shutdown and request draining of Ignite servers

Raymond Wilson Wed, 17 Feb 2021 12:50:14 -0800

Hi Ilya,

Sorry, that was a response to another problem!


In this case, we have a more asynchronous mode of query-response where the
processing node can asynchronously send back a response to a query. The
reasons for this are: (1) Some responses are effectively streams of data
and we can't structure them as a single response, and (2) we can have
thousands of concurrent requests per node, which causes thread pool
exhaustion and response starvation due to the synchronous nature of the
IComputeFunc.Invoke() method.
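
A minimal sketch of this asynchronous request/response pattern, in plain
Java with no Ignite calls. PendingRequests and its methods are hypothetical
names for illustration, not an Ignite API and not necessarily how the system
described here is built; the messaging layer that delivers the response is
assumed and left out.

```java
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not an Ignite API: correlates responses that arrive
// later as separate messages with the requests that are waiting for them.
final class PendingRequests<T> {
    private final ConcurrentMap<UUID, CompletableFuture<T>> pending = new ConcurrentHashMap<>();

    // Registers a request without blocking a thread; the caller sends
    // requestId along with the request so the responder can echo it back.
    CompletableFuture<T> register(UUID requestId) {
        CompletableFuture<T> future = new CompletableFuture<>();
        pending.put(requestId, future);
        return future;
    }

    // Invoked by the (assumed) message listener when a response arrives.
    // Returns false if the id is unknown, e.g. already completed or failed.
    boolean complete(UUID requestId, T response) {
        CompletableFuture<T> future = pending.remove(requestId);
        return future != null && future.complete(response);
    }

    // Fails a waiting request, e.g. when the node relaying it has left the grid.
    boolean fail(UUID requestId, Throwable cause) {
        CompletableFuture<T> future = pending.remove(requestId);
        return future != null && future.completeExceptionally(cause);
    }

    // Number of requests still waiting for a response.
    int inFlight() {
        return pending.size();
    }
}
```

Because no thread is parked while a request is outstanding, thousands of
requests can be pending at once. The sketch covers single responses only; a
streamed response would need a per-request callback or queue instead of a
single future.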

E.g., we may have a request sequence like this, where A, B and C are nodes in
the grid:

Request: A -> B -> C
Response: C -> B -> A

If node B goes away unexpectedly, requests executing on C can't send
their responses and those requests fail.

From the perspective of A, it may attempt a retry after failing to receive
the response from B, but that's unsatisfactory for other reasons.

I have built a POC that permits nodes to emit an application level
availability state which requestors can use to exclude certain nodes from
their request topology projections. This means a node being removed due to
auto-scale down or container scheduling can gracefully exit the grid after
ensuring the active requests it is involved in can complete normally. In
the case above, node B would be a client node providing services through a
web api gateway (A) and requesting results from co-located processing on
node C.
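
The idea of an application-level availability state with draining could be
sketched roughly as follows, in plain Java with no Ignite calls. DrainableNode,
its State values and the eligible() helper are hypothetical names, and the
state is modelled in-process; how a real node publishes its state to
requestors across the grid is an assumption left out of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of one node's application-level availability state.
final class DrainableNode {
    enum State { AVAILABLE, DRAINING }

    private State state = State.AVAILABLE;
    private int inFlight = 0;

    synchronized State state() {
        return state;
    }

    // Called when a request starts involving this node.
    // Returns false once the node is draining, so the caller picks another node.
    synchronized boolean tryBeginRequest() {
        if (state == State.DRAINING) {
            return false;
        }
        inFlight++;
        return true;
    }

    // Called when a request involving this node has fully completed.
    synchronized void endRequest() {
        inFlight--;
        notifyAll();
    }

    // Marks the node as draining and waits until no requests are in flight.
    // Returns true if drained, false on timeout or interruption.
    synchronized boolean drain(long timeoutMillis) {
        state = State.DRAINING;
        long deadline = System.currentTimeMillis() + timeoutMillis;
        try {
            while (inFlight > 0) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false;
                }
                wait(remaining);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Requestor side: keep only nodes that still advertise AVAILABLE when
    // building the set of nodes to send a request to.
    static List<String> eligible(Map<String, DrainableNode> nodes) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, DrainableNode> e : nodes.entrySet()) {
            if (e.getValue().state() == State.AVAILABLE) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }
}
```

In this model a node being scaled in would call drain() first and only stop
its Ignite instance once drain() returns true (or the timeout policy decides
to give up).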

Thanks,
Raymond.


On Thu, Feb 18, 2021 at 9:15 AM Raymond Wilson <raymond_wil...@trimble.com>
wrote:

> Hi Ilya,
>
> That is the current method we use to stop the grid.
>
> However, this can leave uncheckpointed changes in the in-memory stores
> (only in the WAL), so when we restart the grid it goes into the cache
> recovery mode which is very slow.
>
> Raymond.
>
> On Thu, Feb 18, 2021 at 3:34 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com>
> wrote:
>
>> Hello!
>>
>> Why can't you just use Ignite.stop(instanceName, false)?
>>
>> Just make sure your projections are not singleton and the tasks will be
>> rolled over.
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> On Tue, Feb 9, 2021 at 06:41 Raymond Wilson <raymond_wil...@trimble.com>
>> wrote:
>>
>>> All,
>>>
>>> We have a requirement very similar to the one described in this item:
>>> https://issues.apache.org/jira/browse/IGNITE-10872
>>>
>>> Namely, when removing a node from an Ignite grid, we want to do two
>>> things:
>>>
>>> 1. Prevent new requests from reaching it
>>> 2. Allow all running requests the node is involved in to complete before
>>> it terminates.
>>>
>>> The solution outlined in 10872 partially solves these elements within
>>> our architecture in that it allows Ignite to pause shutdown of the node
>>> until all requests are completed (and, I assume, prevent new requests from
>>> reaching the node being shut down).
>>>
>>> In our architecture the phrase 'requests the node is involved in' may
>>> be opaque from the context of Ignite due to an asynchronous calling model
>>> we are using to permit very large numbers of concurrent requests to execute
>>> without saturating the Ignite thread pools. What this means is that a node
>>> that may be a candidate to be shut down may be waiting for a response from
>>> another node on the grid in a way that Ignite can't see, so would determine
>>> the node was safe to shut down when it is not.
>>>
>>> A good example of this in our system is an Apply style Ignite call where
>>> the request is sent to one of a set of nodes. That set of nodes may scale
>>> in/out due to request demand. On a scale in operation, the node to be
>>> removed needs to be excluded from the topology projection constructed to
>>> perform the Apply() against. Once we are satisfied the node has no further
>>> requests in flight (e.g., by a simple timeout), we would proceed with the
>>> actual shutdown of that node.
>>>
>>> I have not seen any capability in Ignite today where a node can be
>>> 'un-blessed'; does one exist? Or should we construct this facility within
>>> our application logic layer?
>>>
>>> Thanks,
>>> Raymond.
>>>
>>>
>>> --
>>> <http://www.trimble.com/>
>>> Raymond Wilson
>>> Solution Architect, Civil Construction Software Systems (CCSS)
>>> 11 Birmingham Drive | Christchurch, New Zealand
>>> raymond_wil...@trimble.com
>>>
>>>
>>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>>
>>
>
>


