Hi Jeremy,

Thanks for the insightful reply.

To cover your points (and fill in some info I should have provided):

- We use Ignite persistence
- We use the FSYNC WAL mode
- We do not use the FULL_SYNC cache write synchronization mode for our
REPLICATED caches (we use PRIMARY_SYNC). Because we use persistence we
would have to rebuild the entire grid to change it, which is a pain when
you have terabytes of data. I discovered this while investigating the
other REPLICATED cache oddities we have seen (an annoying oversight). If
we do ever rebuild I will definitely use FULL_SYNC. (Incidentally, the
Ignite documentation is fairly unequivocal that this does not matter in
terms of all nodes eventually being updated, as long as you do not need
dependable low-latency read-after-write consistency for those caches,
which we do not right now.) A configuration sketch covering these three
points follows.
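
For reference, the relevant configuration is roughly shaped like the sketch
below (illustrative only; the cache and region names are placeholders, and
the write synchronization mode shown is the FULL_SYNC setting we would adopt
on a rebuild, not what is deployed today):

using Apache.Ignite.Core;
using Apache.Ignite.Core.Cache.Configuration;
using Apache.Ignite.Core.Configuration;

var cfg = new IgniteConfiguration
{
    // Native persistence with the strict (fsync-on-commit) WAL mode.
    DataStorageConfiguration = new DataStorageConfiguration
    {
        WalMode = WalMode.Fsync,
        DefaultDataRegionConfiguration = new DataRegionConfiguration
        {
            Name = "default",            // placeholder region name
            PersistenceEnabled = true
        }
    },
    CacheConfiguration = new[]
    {
        new CacheConfiguration("example-replicated-cache")   // placeholder name
        {
            CacheMode = CacheMode.Replicated,
            // Today we run PrimarySync; FullSync is what we would choose
            // if we rebuilt the grid.
            WriteSynchronizationMode = CacheWriteSynchronizationMode.FullSync
        }
    }
};

using var ignite = Ignition.Start(cfg);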

We do have good exception tracking around Put and similar operations, but
the Put in question occurred so long ago that any logging related to it has
since been retired, and in this instance a failure would also have surfaced
at the user level.

ConfigureAwait() can be a vexed topic. We use .NET 7, which has a different
async context model to .NET Framework; we are not running async operations
in UI contexts (this is all web-services back-end implementation), and we do
not care which thread-pool thread execution resumes on. The async calls are
awaited, so we get exposure to exceptions and have strong reporting and
handling of those.
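
In shape, the put path is essentially this (a minimal sketch; the method and
variable names are illustrative, not our actual code):

using System;
using System.Threading.Tasks;
using Apache.Ignite.Core.Cache;

// Sketch: the put is awaited directly, so any failure surfaces as an
// exception here. No ConfigureAwait(false), since this back-end code has no
// UI SynchronizationContext and we don't care which thread-pool thread the
// continuation resumes on.
static async Task PutWithReportingAsync<TK>(ICache<TK, byte[]> cache, TK key, byte[] payload)
{
    try
    {
        await cache.PutAsync(key, payload);
    }
    catch (Exception ex)
    {
        // Stand-in for our real structured logging / alerting.
        Console.WriteLine($"Cache put failed for key {key}: {ex}");
        throw;
    }
}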

This particular issue has been observed only once. In terms of workload, it
is very bursty, with a relatively low (to zero) baseline on the read side
but a fairly dependable base load on the ingest side.

I am familiar with the WAL model, and we chose the FSYNC mode specifically
to favour consistency/durability of our data. I hope the caveat that "the
door is left open in many places to allow things to slip between the
cracks" does not apply in that scenario :)

Raymond.


On Thu, Nov 23, 2023 at 11:45 AM Jeremy McMillan <
jeremy.mcmil...@gridgain.com> wrote:

> Do you do ConfigureAwait(false) in the code that does ICache<K,
> V>.PutAsync(), or do you have some kind of handler to track whether
> there were problems with any particular put operation?
> https://ignite.apache.org/docs/latest/net-specific/net-async
> https://devblogs.microsoft.com/dotnet/configureawait-faq/
>
> The fastest way to do this (firing off PutAsync() without observing the
> returned task) basically says "don't know; don't care" whether any
> particular operation failed for any reason.
>
> There is also cache configuration which governs how the server
> acknowledges put operations.
>
> https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/CacheWriteSynchronizationMode.html
>
> The fastest cache sync mode is FULL_ASYNC, which, like .NET
> ConfigureAwait(false), says the cache can acknowledge the put as done, but
> we don't know and don't care if any or all of the replicas have been
> completely written. The mode PRIMARY_SYNC only guarantees the primary
> partition has been updated. FULL_SYNC waits until all replicas have been
> successfully updated. Even if you have callbacks in .Net to track whether
> anything got missed in the PutAsync(), it only knows what the cache's sync
> mode can assert.
>
> If you are using persistence, Ignite writes changes to a partition first
> into a Write Ahead Log (WAL). Data in the WAL is not yet in the cache, but
> the cluster has it, and can finish an interrupted cache mutation if the WAL
> record for that operation is complete. If you write to the cache with an
> async call, and then you attempt to read the write from the cache
> immediately, you may be racing against the WAL sync. The WAL itself has
> multiple consistency modes. The default mode is LOG_ONLY, which allows the
> OS to acknowledge that the WAL records containing your updates have been
> buffered before they have been written to disk. It is possible for Ignite
> to think something has been written, but if the process is killed
> immediately afterwards, the MMAP buffer may not have been flushed to disk
> by the OS.
>
> https://ignite.apache.org/docs/2.11.1/persistence/native-persistence#wal-modes
>
> https://superuser.com/questions/1288890/will-buffer-be-automatically-flushed-to-disk-when-a-process-exits
>
> There are lots of tunables which allow any implementation to shift the
> priority from consistency to performance. If consistency is sacrificed for
> performance, especially under heavy load, some data loss can be possible
> because the door is left open in many places to allow things to slip
> between the cracks. I invite you to experiment with FULL_SYNC cache mode and
> FSYNC WAL mode, and use synchronous puts in special threads when writing
> critical data to the cache, or at least use callbacks to detect whether
> something got interrupted, which also provides an opportunity to recover
> in case this is a corner case that doesn't justify a consistency-maximized
> cache setup.
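>
> For example, something along these lines (a rough sketch with placeholder
> names) would observe put failures without blocking the caller:
>
> using System;
> using System.Threading.Tasks;
> using Apache.Ignite.Core.Cache;
>
> // Sketch only: fire the put and attach a continuation so a faulted
> // PutAsync() is at least logged and can trigger recovery. Names are
> // placeholders.
> static Task PutWithFailureCallback<TK>(ICache<TK, byte[]> cache, TK key, byte[] payload)
> {
>     return cache.PutAsync(key, payload).ContinueWith(t =>
>     {
>         if (t.IsFaulted)
>         {
>             Console.WriteLine($"Put of {key} failed: {t.Exception}");
>             // retry or enqueue for recovery here
>         }
>     }, TaskContinuationOptions.ExecuteSynchronously);
> }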
>
> It's entirely possible your cache system is only set up to provide "best
> effort" consistency, while your compute tasks need more. At GridGain, this
> is often felt by our commercial clients when their clusters accrete new
> workloads that were not a part of the original design. I've only been with
> GridGain a short while, but I've heard this question raised more than once,
> and the tradeoffs and consistency design are well documented, so my guess is
> that it's popular subject matter for reasons similar to your experience.
>
> Are there any signs of strain in any of the Ignite servers' monitoring
> metrics when these events happen? What kind of accumulated workload is on
> the cluster at those times, and how often does this happen?
>
>
> On Wed, Nov 22, 2023 at 1:50 PM Raymond Wilson <raymond_wil...@trimble.com>
> wrote:
>
>> Hi Jeremy,
>>
>> My initial query was to see if this had been observed by others.
>>
>> To answer some of your questions:
>>
>> Do we specifically check that an element is added to all nodes
>> participating in a replicated cache: No, we do not (we take it on trust
>> that Ignite sorts that out ;) )
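>>
>> (In hindsight, a broadcast diagnostic along these lines could confirm
>> whether each node's local copy holds a given key; a sketch only, with
>> placeholder cache and key type names:)
>>
>> using System;
>> using Apache.Ignite.Core;
>> using Apache.Ignite.Core.Cache;
>> using Apache.Ignite.Core.Compute;
>> using Apache.Ignite.Core.Resource;
>>
>> // Placeholder stand-in for our real key struct.
>> [Serializable] public struct MyKey { public long Id; }
>>
>> // Runs on every server node and reports whether the key is visible in
>> // that node's local cache storage.
>> [Serializable]
>> public class HasKeyLocally : IComputeFunc<bool>
>> {
>>     [InstanceResource] private IIgnite _ignite;
>>
>>     public MyKey Key;
>>
>>     public bool Invoke()
>>     {
>>         var cache = _ignite.GetCache<MyKey, byte[]>("example-cache");  // placeholder name
>>         // Note: LocalPeek only sees entries currently in memory; with
>>         // native persistence an entry may live on disk only, so 'false'
>>         // here is a prompt to also try a read, not proof of absence.
>>         return cache.TryLocalPeek(Key, out _, CachePeekMode.All);
>>     }
>> }
>>
>> // Usage (one result per server node):
>> // ICollection<bool> present = ignite.GetCompute().Broadcast(new HasKeyLocally { Key = key });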
>>
>> Do we think it is a race condition? No, for three reasons: (1) The grid
>> was restarted in the interval between initial addition of the element and
>> the time the three nodes were failing to perform the Get(), (2) This
>> particular element failed on the same three nodes over many invocations of
>> the request over a substantial time period and (3) a subsequent grid
>> restart fixed the problem.
>>
>> From our logs we don't see delays, timeouts or Ignite logged errors
>> relating to the Get().
>>
>> In terms of troubleshooting this has been a bit tricky. In this instance
>> only this one element (of many thousands of similar elements with similar
>> cluster compute requests being made across them) failed, and only within
>> the window between a pair of grid restarts.
>>
>> The replicated cache update is just a simple ICache<K, V>.PutAsync() with
>> a key struct and a byte[] array as payload. In terms of the distributed
>> compute code it is just performing a simple ICache<K, V>.GetAsync() with
>> the key struct.
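>>
>> In shape it is essentially the following (a simplified sketch; the
>> generic key stands in for our real key struct and the names are
>> illustrative):
>>
>> using System.Threading.Tasks;
>> using Apache.Ignite.Core.Cache;
>>
>> // Write side: a plain put of the byte[] payload under the struct key.
>> static Task WriteAsync<TK>(ICache<TK, byte[]> cache, TK key, byte[] payload) =>
>>     cache.PutAsync(key, payload);
>>
>> // Read side, which in reality runs later inside the cluster compute
>> // operation on each node: fetch the same key back (GetAsync surfaces a
>> // missing key as an exception).
>> static Task<byte[]> ReadAsync<TK>(ICache<TK, byte[]> cache, TK key) =>
>>     cache.GetAsync(key);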
>>
>> So far it seems like the three failing nodes just temporarily 'forgot'
>> they had this element, and remembered it again after the restart.
>>
>> For context, this is the first time we have seen this specific issue on a
>> system that has been running in production for 2+ years now. We have seen
>> numerous instances with replicated caches where Ignite has (permanently)
>> failed to write at least one, but not all, of the copies of an element,
>> and grid restarts did not correct the issue. This does not feel the same,
>> though.
>>
>> Raymond.
>>
>>
>>
>>
>>
>> On Thu, Nov 23, 2023 at 6:50 AM Jeremy McMillan <
>> jeremy.mcmil...@gridgain.com> wrote:
>>
>>> I suspect a race condition with async mode caches. This is a naive guess
>>> though, as we don't have enough details. I'll assume this is a plea for
>>> help in troubleshooting methodology and the question is really "what should
>>> we look at next?"
>>>
>>> The real answer comes from tracing the insert of element E and
>>> subsequent cache get() failures. Do we know if E was completely inserted
>>> into each replicated cache backup partition prior to the get()? Do we know
>>> if the reported cache get() failure was actually a fully functioning cache
>>> lookup and retrieval that failed during lookup, or were there timeouts or
>>> exceptions indicating something abnormal was happening?
>>>
>>> What steps did you take to troubleshoot this issue, and what is the
>>> cluster and cache configuration in play? What does the code look like for
>>> the updates to the replicated cache, and what does the code look like for
>>> the distributed compute operation?
>>>
>>> On Tue, Nov 21, 2023 at 5:21 PM Raymond Wilson <
>>> raymond_wil...@trimble.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> We have been triaging an odd issue we encountered in a system using
>>>> Ignite v2.15 and the C# client.
>>>>
>>>> We have a replicated cache across four nodes, let's call them P0, P1, P2
>>>> & P3. Because the cache is replicated every item added to the cache is
>>>> present in each of P0, P1, P2 and P3.
>>>>
>>>> Some time ago an element (E) was added to this cache (among many
>>>> others). A number of system restarts have occurred since that time.
>>>>
>>>> We started observing an issue where a query running across P0/P1/P2/P3
>>>> as a cluster compute operation needed to load element E on each of the
>>>> nodes to perform that query. Node P0 succeeded, while nodes P1, P2 & P3 all
>>>> reported that element E did not exist.
>>>>
>>>> This situation persisted until the cluster was restarted, after which
>>>> the same query that had been failing now succeeded as all four 'P' nodes
>>>> were able to read element E.
>>>>
>>>> There were no Ignite errors reported in the context of these
>>>> failing queries to indicate unhappiness in the Ignite nodes.
>>>>
>>>> This seems like very strange behaviour. Are there any suggestions as to
>>>> what could be causing this failure to read the replicated value on the
>>>> three failing nodes, especially as the element 'came back' after a cluster
>>>> restart?
>>>>
>>>> Thanks,
>>>> Raymond.
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

-- 
Raymond Wilson
Trimble Distinguished Engineer, Civil Construction Software (CCS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wil...@trimble.com

