Do you await the Task returned by ICache<K,V>.PutAsync() (with or without
ConfigureAwait(false)), or do you have some kind of handler to track whether
any particular put operation failed?
https://ignite.apache.org/docs/latest/net-specific/net-async
https://devblogs.microsoft.com/dotnet/configureawait-faq/
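
For illustration only (the names and types below are placeholders, not your
actual code), here is a minimal C# sketch of awaiting the put so a failure
actually surfaces. Note that ConfigureAwait(false) only controls where the
continuation resumes; it is the await itself that lets you observe success
or failure:

using System;
using System.Threading.Tasks;
using Apache.Ignite.Core.Cache;

public static class ReliablePut
{
    // 'cache', 'key' and 'value' stand in for your own key struct and byte[] payload.
    public static async Task PutTrackedAsync<TK, TV>(ICache<TK, TV> cache, TK key, TV value)
    {
        try
        {
            // ConfigureAwait(false) just skips resuming on the captured context;
            // awaiting the Task is what surfaces a failed put as an exception.
            await cache.PutAsync(key, value).ConfigureAwait(false);
        }
        catch (Exception ex)
        {
            // Log, retry, or flag the key for later reconciliation here.
            Console.WriteLine($"Put for key {key} failed: {ex}");
            throw;
        }
    }
}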

The fastest way to write to the cache basically says "don't know; don't care"
whether any particular operation failed for any reason, because the returned
Task is never observed.

There is also a cache configuration setting which governs how the server
acknowledges put operations.
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/CacheWriteSynchronizationMode.html

The fastest cache sync mode is FULL_ASYNC which, like a fire-and-forget
PutAsync() in .NET, lets the cache acknowledge the put as done even though we
don't know and don't care whether any or all of the replicas have been
completely written. PRIMARY_SYNC (the default) only guarantees that the
primary partition has been updated. FULL_SYNC waits until all replicas have
been successfully updated. Even if you have callbacks in .NET to track
whether anything was missed in the PutAsync(), they can only report what the
cache's sync mode is able to assert.
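
As a rough sketch of where that knob lives in the .NET API (the cache name
and key struct below are hypothetical), assuming the standard
Apache.Ignite.Core configuration types:

using Apache.Ignite.Core;
using Apache.Ignite.Core.Cache;
using Apache.Ignite.Core.Cache.Configuration;

public static class CacheSetup
{
    // Hypothetical key struct standing in for your own.
    public struct MyKey { public long Id; }

    public static ICache<MyKey, byte[]> CreateReplicatedCache(IIgnite ignite)
    {
        var cacheCfg = new CacheConfiguration("my-replicated-cache")  // hypothetical name
        {
            CacheMode = CacheMode.Replicated,
            // FullSync: the put completes only after every replica is updated.
            // PrimarySync (the default) and FullAsync relax that guarantee for speed.
            WriteSynchronizationMode = CacheWriteSynchronizationMode.FullSync
        };

        return ignite.GetOrCreateCache<MyKey, byte[]>(cacheCfg);
    }
}

FullSync is the slowest of the three modes, which is exactly the tradeoff
being discussed here.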

If you are using persistence, Ignite first writes changes to a partition into
a Write-Ahead Log (WAL). Data in the WAL is not yet in the cache, but the
cluster has it and can finish an interrupted cache mutation as long as the
WAL record for that operation is complete. If you write to the cache with an
async call and then immediately try to read that write back, you may be
racing against the WAL sync. The WAL itself has multiple consistency modes.
The default mode is LOG_ONLY, which allows the OS to acknowledge that the WAL
records containing your updates have been buffered before they have actually
been written to disk. It is therefore possible for Ignite to believe
something has been written and then be killed immediately afterward, before
the OS has flushed the memory-mapped buffer to disk.
https://ignite.apache.org/docs/2.11.1/persistence/native-persistence#wal-modes
https://superuser.com/questions/1288890/will-buffer-be-automatically-flushed-to-disk-when-a-process-exits
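
If you want each WAL record fsynced before an update is acknowledged, the
switch lives in DataStorageConfiguration. This is only a sketch assuming the
usual Apache.Ignite.Core types, with persistence enabled on the default data
region:

using Apache.Ignite.Core;
using Apache.Ignite.Core.Configuration;

var igniteCfg = new IgniteConfiguration
{
    DataStorageConfiguration = new DataStorageConfiguration
    {
        // Fsync forces each WAL record to disk before the write is acknowledged;
        // LogOnly (the default) lets the OS buffer it, which is the gap described above.
        WalMode = WalMode.Fsync,
        DefaultDataRegionConfiguration = new DataRegionConfiguration
        {
            Name = "default",
            PersistenceEnabled = true
        }
    }
};

var ignite = Ignition.Start(igniteCfg);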

There are lots of tunables which let any implementation shift the balance
between consistency and performance. If consistency is sacrificed for
performance, especially under heavy load, some data loss is possible, because
the door is left open in many places for things to slip through the cracks. I
invite you to experiment with FULL_SYNC cache mode and FSYNC WAL mode, and to
use synchronous puts in dedicated threads when writing critical data to the
cache, or at least use callbacks to detect when something was interrupted.
That also gives you an opportunity to recover if this turns out to be a
corner case that doesn't justify a fully consistency-maximized cache setup.
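
A hedged sketch of what "synchronous puts for critical data" could look like
(the helper, its name, and the retry policy are all hypothetical, and the
ContainsKey read-back is only a rough sanity check against the node answering
the call, not proof that every replica has the value):

using System;
using Apache.Ignite.Core.Cache;

public static class CriticalWrite
{
    // Hypothetical helper: a blocking put with a verify-and-retry loop for data
    // you cannot afford to lose. Tune maxAttempts to your own tolerance.
    public static void PutCritical<TK, TV>(ICache<TK, TV> cache, TK key, TV value, int maxAttempts = 3)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                cache.Put(key, value);          // synchronous: blocks until acknowledged
                if (cache.ContainsKey(key))     // cheap read-back sanity check
                    return;
            }
            catch (CacheException ex)
            {
                Console.WriteLine($"Attempt {attempt} failed for {key}: {ex.Message}");
            }
        }

        throw new InvalidOperationException($"Could not durably write key {key}");
    }
}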

It's entirely possible your cache system is only set up to provide "best
effort" consistency while your compute tasks need more. At GridGain this is
often felt by our commercial clients when their clusters accrete new
workloads that were not part of the original design. I've only been with
GridGain a short while, but I've heard this question raised more than once,
and the tradeoffs and consistency design are well documented, so my guess is
it's a popular subject for reasons similar to your experience.

Are there any signs of strain in any of the Ignite servers' monitoring
metrics when these events happen? What kind of accumulated workload is on
the cluster at those times, and how often does this happen?


On Wed, Nov 22, 2023 at 1:50 PM Raymond Wilson <raymond_wil...@trimble.com>
wrote:

> Hi Jeremy,
>
> My initial query was to see if this had been observed by others.
>
> To answer some of your questions:
>
> Do we specifically check that an element is added to all nodes
> participating in a replicated cache: No, we do not (we take it on trust
> Ignite sorts that out ;) )
>
> Do we think it is a race condition? No, for three reasons: (1) The grid
> was restarted in the interval between initial addition of the element and
> the time the three nodes were failing to perform the Get(), (2) This
> particular element failed on the same three nodes over many invocations of
> the request over a substantial time period and (3) a subsequent grid
> restart fixed the problem.
>
> From our logs we don't see delays, timeouts or Ignite logged errors
> relating to the Get().
>
> In terms of troubleshooting this has been a bit tricky. In this instance
> only this one element (of many thousands of similar elements with similar
> cluster compute requests being made across them) failed. And only within
> the duration between a pair of grid restarts.
>
> The replicated cache update is just a simple ICache<K, V>.PutAsync() with
> a key struct and a byte[] array as payload. In terms of the distributed
> compute code it is just performing a simple ICache<K, V>.GetAsync() with
> the key struct.
>
> So far it seems like the three failing nodes just temporarily 'forgot'
> they had this element, and remembered it again after the restart.
>
> For context, this is the first time we have seen this specific issue on a
> system that has been running in production for 2+ years now. We have seen
> numerous instances with replicated caches where Ignite has (permanently)
> failed to write at least one, but not all, copies of the element where grid
> restarts do not correct the issue. This does not feel the same though.
>
> Raymond.
>
>
>
>
>
> On Thu, Nov 23, 2023 at 6:50 AM Jeremy McMillan <
> jeremy.mcmil...@gridgain.com> wrote:
>
>> I suspect a race condition with async mode caches. This is a naive guess
>> though, as we don't have enough details. I'll assume this is a plea for
>> help in troubleshooting methodology and the question is really "what should
>> we look at next?"
>>
>> The real answer comes from tracing the insert of element E and subsequent
>> cache get() failures. Do we know if E was completely inserted into each
>> replicated cache backup partition prior to the get()? Do we know if the
>> reported cache get() failure was actually a fully functioning cache lookup
>> and retrieval that failed during lookup, or were there timeouts or
>> exceptions indicating something abnormal was happening?
>>
>> What steps did you take to troubleshoot this issue, and what is the
>> cluster and cache configuration in play? What does the code look like for
>> the updates to the replicated cache, and what does the code look like for
>> the distributed compute operation?
>>
>> On Tue, Nov 21, 2023 at 5:21 PM Raymond Wilson <
>> raymond_wil...@trimble.com> wrote:
>>
>>> Hi,
>>>
>>> We have been triaging an odd issue we encountered in a system using
>>> Ignite v2.15 and the C# client.
>>>
>>> We have a replicated cache across four nodes, let's call them P0, P1, P2
>>> & P3. Because the cache is replicated every item added to the cache is
>>> present in each of P0, P1, P2 and P3.
>>>
>>> Some time ago an element (E) was added to this cache (among many
>>> others). A number of system restarts have occurred since that time.
>>>
>>> We started observing an issue where a query running across P0/P1/P2/P3
>>> as a cluster compute operation needed to load element E on each of the
>>> nodes to perform that query. Node P0 succeeded, while nodes P1, P2 & P3 all
>>> reported that element E did not exist.
>>>
>>> This situation persisted until the cluster was restarted, after which
>>> the same query that had been failing now succeeded as all four 'P' nodes
>>> were able to read element E.
>>>
>>> There were no Ignite errors reported in the context of these
>>> failing queries to indicate unhappiness in the Ignite nodes.
>>>
>>> This seems like very strange behaviour. Are there any suggestions as to
>>> what could be causing this failure to read the replicated value on the
>>> three failing nodes, especially as the element 'came back' after a cluster
>>> restart?
>>>
>>> Thanks,
>>> Raymond.
>>>
>>>
>>>
>>>
>>> --
>>> <http://www.trimble.com/>
>>> Raymond Wilson
>>> Trimble Distinguished Engineer, Civil Construction Software (CCS)
>>> 11 Birmingham Drive | Christchurch, New Zealand
>>> raymond_wil...@trimble.com
>>>
>>>
>>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>>
>>
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Trimble Distinguished Engineer, Civil Construction Software (CCS)
> 11 Birmingham Drive | Christchurch, New Zealand
> raymond_wil...@trimble.com
>
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>
