I agree, but there is a core element here that might be worth considering
for IA, which is the ability to flag a node as [temporarily] unhealthy or
unavailable so application logic can use that as a part of the IA toolset.
Just a thought... :)
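
To make that concrete, here is a minimal plain-Java sketch of such a flag,
not Ignite code. `AvailabilityRegistry` and its methods are hypothetical
names, node ids are plain strings, and how the flag is propagated between
nodes is left out:

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

/** Hypothetical application-level registry of nodes flagged as unavailable. */
class AvailabilityRegistry {
    // Ids of nodes that have announced they should receive no new requests.
    private final Set<String> unavailable = ConcurrentHashMap.newKeySet();

    void markUnavailable(String nodeId) { unavailable.add(nodeId); }

    void markAvailable(String nodeId) { unavailable.remove(nodeId); }

    boolean isAvailable(String nodeId) { return !unavailable.contains(nodeId); }

    /** Returns the candidates that are still eligible, preserving their order. */
    List<String> eligible(List<String> candidates) {
        return candidates.stream()
                .filter(this::isAvailable)
                .collect(Collectors.toList());
    }
}
```

A requestor would call `eligible(...)` on its candidate list before building
a projection. In Ignite the same check could presumably be supplied as the
predicate to `ClusterGroup.forPredicate`, though that wiring is an
assumption and is not shown here.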

Thanks,
Raymond.

On Fri, Feb 19, 2021 at 2:05 AM Ilya Kasnacheev <ilya.kasnach...@gmail.com>
wrote:

> Hello!
>
> This sounds like too detailed and peculiar a scenario, one that should be
> handled at the application level, as you already do.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> On Wed, Feb 17, 2021 at 23:50, Raymond Wilson <raymond_wil...@trimble.com> wrote:
>
>> Hi Ilya,
>>
>> Sorry, that was a response to another problem!
>>
>> In this case, we have a more asynchronous mode of query-response where
>> the processing node can asynchronously send back a response to a query. The
>> reasons for this are: (1) Some responses are effectively streams of data
>> and we can't structure them as a single response, and (2) we can have
>> thousands of concurrent requests per node, which causes thread pool
>> exhaustion and response starvation due to the synchronous nature of the
>> IComputeFunc.Invoke() method.
>>
>> E.g., we may have a request sequence like this, where A, B and C are nodes
>> in the grid:
>>
>> Request: A -> B -> C
>> Response: C -> B -> A
>>
>> If node B goes away unexpectedly, requests executing on 'C' can't send
>> their response and the request fails.
>>
>> From the perspective of A, it may attempt a retry after failing to
>> receive the response from B, but that's unsatisfactory for other reasons.
>>
>> I have built a POC that permits nodes to emit an application-level
>> availability state which requestors can use to exclude certain nodes from
>> their request topology projections. This means a node being removed due to
>> auto-scale down or container scheduling can gracefully exit the grid after
>> ensuring the active requests it is involved in can complete normally. In
>> the case above, node B would be a client node providing services through a
>> web API gateway (A) and requesting results from co-located processing on
>> node C.
>>
>> Thanks,
>> Raymond.
>>
>>
>> On Thu, Feb 18, 2021 at 9:15 AM Raymond Wilson <
>> raymond_wil...@trimble.com> wrote:
>>
>>> Hi Ilya,
>>>
>>> That is the current method we use to stop the grid.
>>>
>>> However, this can leave uncheckpointed changes in the in-memory stores
>>> (only in the WAL), so when we restart the grid it goes into the cache
>>> recovery mode which is very slow.
>>>
>>> Raymond.
>>>
>>> On Thu, Feb 18, 2021 at 3:34 AM Ilya Kasnacheev <
>>> ilya.kasnach...@gmail.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> Why can't you just use Ignite.stop(instanceName, false)?
>>>>
>>>> Just make sure your projections are not singleton and the tasks will be
>>>> rolled over.
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> On Tue, Feb 9, 2021 at 06:41, Raymond Wilson <raymond_wil...@trimble.com> wrote:
>>>>
>>>>> All,
>>>>>
>>>>> We have a requirement very similar to the one described in this item:
>>>>> https://issues.apache.org/jira/browse/IGNITE-10872
>>>>>
>>>>> Namely, when removing a node from an Ignite grid, we want to do two
>>>>> things:
>>>>>
>>>>> 1. Prevent new requests from reaching it
>>>>> 2. Allow all running requests the node is involved in to complete
>>>>> before it terminates.
>>>>>
>>>>> The solution outlined in 10872 partially solves these elements within
>>>>> our architecture in that it allows Ignite to pause shutdown of the node
>>>>> until all requests are completed (and, I assume, prevent new requests from
>>>>> reaching the node being shut down).
>>>>>
>>>>> In our architecture, the phrase 'requests the node is involved in' may
>>>>> be opaque from Ignite's perspective due to an asynchronous calling model
>>>>> we are using to permit very large numbers of concurrent requests to
>>>>> execute without saturating the Ignite thread pools. This means that a
>>>>> node that is a candidate for shutdown may be waiting for a response from
>>>>> another node on the grid in a way that Ignite can't see, so Ignite would
>>>>> determine the node was safe to shut down when it is not.
>>>>>
>>>>> A good example of this in our system is an Apply-style Ignite call
>>>>> where the request is sent to one of a set of nodes. That set of nodes may
>>>>> scale in/out due to request demand. On a scale-in operation, the node to
>>>>> be removed needs to be excluded from the topology projection constructed
>>>>> to perform the Apply() against. Once we are satisfied the node is
>>>>> involved in no further requests (e.g., after a simple timeout), we would
>>>>> proceed with the actual shutdown of that node.
>>>>>
>>>>> I have not seen any capability in Ignite today where a node can be
>>>>> 'un-blessed'; does one exist? Or should we construct this facility within
>>>>> our application logic layer?
>>>>>
>>>>> Thanks,
>>>>> Raymond.
>>>>>
>>>>>
>>>>> --
>>>>> <http://www.trimble.com/>
>>>>> Raymond Wilson
>>>>> Solution Architect, Civil Construction Software Systems (CCSS)
>>>>> 11 Birmingham Drive | Christchurch, New Zealand
>>>>> raymond_wil...@trimble.com
>>>>>
>>>>>
>>>>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>>>>
>>>>
>>>
>>
>>
>
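
For reference, the two requirements from the original message (stop new
requests reaching the node, let running ones finish) can be sketched as a
small drain gate. This is plain Java under assumptions, not an Ignite
facility: `DrainGate`, `tryBegin`, `end` and `awaitDrained` are hypothetical
names, and the application is assumed to call them around every request it
tracks, including the asynchronous ones Ignite cannot see.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical gate tracking in-flight requests on a node about to leave. */
class DrainGate {
    private int inFlight = 0;
    private boolean draining = false;

    /** Registers a new request; returns false once draining has begun. */
    synchronized boolean tryBegin() {
        if (draining) return false;
        inFlight++;
        return true;
    }

    /** Marks a request complete and wakes any thread in awaitDrained. */
    synchronized void end() {
        inFlight--;
        notifyAll();
    }

    /**
     * Stops admitting requests, then waits up to the timeout for in-flight
     * ones. Returns true if the node drained, false on timeout or interrupt.
     */
    synchronized boolean awaitDrained(long timeout, TimeUnit unit) {
        draining = true;
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        try {
            while (inFlight > 0) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) return false;
                TimeUnit.NANOSECONDS.timedWait(this, remaining);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }
}
```

A node being scaled in would call `awaitDrained` and only then proceed to
stop; a `false` return means requests were still outstanding when the wait
ended.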

