[Replying onto correct thread]

As a follow-up to the email below, we are starting to collect evidence that
replicated caches within our Ignite grid fail to replicate values in a small
number of cases.

In the cases we have observed so far, in a cluster of 4 nodes participating
in a replicated cache, only one node reports having the correct value for a
key, and the other three report having no value for that key.
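
One way to gather this kind of per-node evidence is to broadcast a local peek
to every data node of the cache. The following is a minimal sketch only. It
assumes Ignite.NET thick nodes, string keys and byte[] values, and a
hypothetical cache name and key supplied by the caller. The closure class must
be loadable on the server nodes (for example via peer assembly loading), and
whether a local peek sees entries that currently reside only on disk should be
verified before relying on the result.

```csharp
using System;
using Apache.Ignite.Core;
using Apache.Ignite.Core.Cache;
using Apache.Ignite.Core.Compute;
using Apache.Ignite.Core.Resource;

// Runs on each node and reports whether that node holds a local copy of the key.
// The class name, key type and value type are assumptions for illustration.
[Serializable]
public class LocalPeekFunc : IComputeFunc<string>
{
    // Injected by Ignite on the node where the closure executes.
    [InstanceResource] private IIgnite _ignite;

    public string CacheName { get; set; }
    public string Key { get; set; }

    public string Invoke()
    {
        var cache = _ignite.GetCache<string, byte[]>(CacheName);

        // Look only at this node's own primary and backup copies.
        var found = cache.TryLocalPeek(Key, out _,
            CachePeekMode.Primary, CachePeekMode.Backup);

        var nodeId = _ignite.GetCluster().GetLocalNode().Id;
        return $"{nodeId}: {(found ? "present" : "missing")}";
    }
}

public static class ReplicaCheck
{
    // Prints one line per data node of the given cache.
    public static void Report(IIgnite ignite, string cacheName, string key)
    {
        var results = ignite.GetCluster()
            .ForDataNodes(cacheName)
            .GetCompute()
            .Broadcast(new LocalPeekFunc { CacheName = cacheName, Key = key });

        foreach (var line in results)
            Console.WriteLine(line);
    }
}
```

For a healthy replicated cache, every data node would be expected to report
the key as present.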

The documentation is fairly emphatic that CacheWriteSynchronizationMode does
not affect consistency for replicated caches. As noted below, we use
PrimarySync (the default) for these caches, which would suggest a potential
failure mode that prevents the backup copies from being written once the
primary copy has been written.
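
For reference, the two write synchronization settings under discussion can be
sketched in C# as below. This is an illustrative fragment only: the cache names
are hypothetical, it assumes the Apache.Ignite NuGet package, and it is not
taken from any actual deployment.

```csharp
using Apache.Ignite.Core.Cache.Configuration;

// Hypothetical cache names, for illustration only.
var primarySyncCache = new CacheConfiguration("example-replicated-primary-sync")
{
    CacheMode = CacheMode.Replicated,
    // PrimarySync (the default): the write returns once the primary copy
    // has been updated; the remaining copies are updated asynchronously.
    WriteSynchronizationMode = CacheWriteSynchronizationMode.PrimarySync
};

var fullSyncCache = new CacheConfiguration("example-replicated-full-sync")
{
    CacheMode = CacheMode.Replicated,
    // FullSync: the write returns only after all participating nodes
    // have acknowledged it.
    WriteSynchronizationMode = CacheWriteSynchronizationMode.FullSync
};
```

Either configuration would then be passed to GetOrCreateCache on the IIgnite
instance.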

We are continuing to investigate and would be interested in any
suggestions you may have as to the likely cause.

Thanks,
Raymond.

On Thu, Jul 27, 2023 at 12:38 PM Raymond Wilson <raymond_wil...@trimble.com>
wrote:

> Hi,
>
> I have a query regarding data safety of replicated caches in the case of
> hard failure of the compute resource but where the storage resource is
> available when the node returns.
>
> We are using Ignite 2.15 with the C# client.
>
> We have a number of these caches, each with four nodes participating, all
> using the default PrimarySync write synchronization mode. All data storage
> configurations use WalMode = WalMode.Fsync.
>
> We have logic performing writes against these caches which will continue
> once the primary node for the replicated cache has written the data item.
>
> I am unsure of the guarantees made by Ignite at this point in the event of
> failure. Specifically, hard/red-button failure of compute hardware
> resources and/or abrupt (but recoverable) detachment of storage resources.
>
> Scenario one: Primary node returns "OK", then immediately fails (before a
> checkpoint). When the primary node returns, should I expect the replicated
> value to be in the primary, and to appear in all other nodes too?
>
> Scenario two: Primary node returns "OK", then a secondary node immediately
> fails (before achieving the write and so before any checkpoint). When the
> secondary node returns, should I expect the replicated value to be in the
> recovered secondary node?
>
> In relation to these scenarios, does setting the cache write
> synchronization mode to FullSync improve the safety of the write, given
> that all nodes must acknowledge the write before it returns?
>
> If there is an improvement in write safety in this instance, does this
> imply the Fsync WalMode write pathway has opportunities for data loss in
> these failure situations?
>
> Thanks,
> Raymond.
>
>
>
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Trimble Distinguished Engineer, Civil Construction Software (CCS)
> 11 Birmingham Drive | Christchurch, New Zealand
> raymond_wil...@trimble.com
>
>
> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>

