As a follow-up to this, we have produced tooling that allows us to detect
and correct the problem. We are not entirely comfortable running control.sh
on production nodes (because, well, it's production :) ).
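
For anyone wanting to build something similar, below is a minimal sketch of the kind of comparison such tooling might perform. It is not our actual tooling: the function names, the `MISSING` sentinel and the snapshot layout (node name mapped to that node's local key/value copy) are all assumptions made for illustration, and how the per-node snapshots are gathered from the grid is deliberately left out.

```python
# Hypothetical sketch: compare per-node copies of a replicated cache and
# plan repairs. Names and data layout are illustrative assumptions only.

MISSING = object()  # sentinel meaning "this node reported no value for the key"


def find_divergent_keys(snapshots):
    """Return {key: {node: value or MISSING}} for keys whose copies disagree.

    `snapshots` maps a node name to a dict holding that node's local copy.
    """
    all_keys = set()
    for entries in snapshots.values():
        all_keys.update(entries)

    divergent = {}
    for key in all_keys:
        per_node = {node: entries.get(key, MISSING)
                    for node, entries in snapshots.items()}
        values = list(per_node.values())
        has_gap = any(v is MISSING for v in values)
        has_mismatch = any(v != values[0] for v in values[1:])
        if has_gap or has_mismatch:
            divergent[key] = per_node
    return divergent


def plan_repairs(divergent):
    """Split divergent keys into safe copies and conflicts.

    A key is a safe copy when every node that has a value agrees on it and
    the remaining nodes have nothing. Anything else needs manual review.
    """
    copies, conflicts = {}, []
    for key, per_node in divergent.items():
        present = [v for v in per_node.values() if v is not MISSING]
        if all(v == present[0] for v in present):
            targets = sorted(n for n, v in per_node.items() if v is MISSING)
            copies[key] = (present[0], targets)
        else:
            conflicts.append(key)
    return copies, conflicts


# Example: key "b" exists on one node only, key "c" has conflicting values.
snapshots = {
    "node-1": {"a": 1, "b": 2, "c": 3},
    "node-2": {"a": 1, "c": 3},
    "node-3": {"a": 1, "c": 4},
    "node-4": {"a": 1, "c": 3},
}
divergent = find_divergent_keys(snapshots)
copies, conflicts = plan_repairs(divergent)
```

Key "b" matches the pattern reported earlier in this thread (one node holds the value, the other three report nothing), so the sketch proposes copying it to the nodes that lack it. Key "c" is set aside as a conflict rather than guessed at.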

We have observed dozens of cases of this kind of corruption on two separate
Ignite grid instances. I believe we have seen enough occurrences to indicate
an undiscovered consistency issue in Ignite with replicated caches (perhaps
only when using the PrimarySync cache write synchronization mode, and
possibly related to failure handling when Ignite nodes are terminated
abruptly).

Unfortunately, we do not have a reproducer.

Raymond.


On Tue, Aug 22, 2023 at 7:16 PM Николай Ижиков <nizhi...@apache.org> wrote:

> Hello, Raymond.
>
> Usually, "experimental" marks a feature that may change in the future.
> The label usually relates to the feature's public API.
>
> > Does this imply risk if run against a production environment grid?
>
> It depends.
> As for read repair, CHECK_ONLY is a read-only mode and can't harm your data.
> The other modes, which fix data inconsistencies, have been used in our
> production environment, and there are no known issues.
>
>
> On Aug 22, 2023, at 03:12, Raymond Wilson <raymond_wil...@trimble.com>
> wrote:
>
> Thanks for the pointer to the read repair facility added in Ignite 2.14.
>
> Unfortunately the .WithReadRepair() extension does not seem to be present
> in the Ignite C# client.
>
> This means we either need to use the experimental control.sh support, or
> improve our tooling to effectively do the same. I am curious why this is
> labelled as experimental? Does this imply risk if run against a production
> environment grid?
>
> Raymond.
>
>
> On Mon, Aug 21, 2023 at 5:50 PM Николай Ижиков <nizhi...@apache.org>
> wrote:
>
>> Hello.
>>
>> I don't know the cause of your issue.
>> However, we have a feature to overcome it [1].
>>
>> Consistency repair can be run from control.sh.
>>
>> ```
>> ./bin/control.sh --enable-experimental
>> ...
>>   [EXPERIMENTAL]
>>   Check/Repair cache consistency using Read Repair approach:
>>     control.(sh|bat) --consistency repair cache-name partition
>>
>>     Parameters:
>>       cache-name  - Cache to be checked/repaired.
>>       partition   - Cache's partition to be checked/repaired.
>>
>>   [EXPERIMENTAL]
>>   Cache consistency check/repair operations status:
>>     control.(sh|bat) --consistency status
>>
>>   [EXPERIMENTAL]
>>   Finalize partitions update counters:
>>     control.(sh|bat) --consistency finalize
>> ```
>>
>> It seems that the docs for this command are incomplete.
>> It also accepts a strategy argument, so you can control your repair
>> actions more precisely.
>> Try running:
>>
>> ```
>> ❯ ./bin/control.sh --enable-experimental --consistency repair --cache default --strategy CHECK_ONLY --partitions 1,2,3,…your_partitions_list...
>> ```
>>
>> The available strategies, with good descriptions, can be found in the
>> sources [2].
>>
>>
>> [1] https://ignite.apache.org/docs/latest/key-value-api/read-repair
>> [2] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cache/ReadRepairStrategy.java
>>
>>
>>
>> On Aug 21, 2023, at 07:46, Raymond Wilson <raymond_wil...@trimble.com>
>> wrote:
>>
>> [Replying onto correct thread]
>>
>> As a follow up to this email, we are starting to collect evidence that
>> replicated caches within our Ignite grid are failing to replicate values in
>> a small number of cases.
>>
>> In the cases we observe so far, with a cluster of 4 nodes participating
>> in a replicated cache, only one node reports having the correct value for a
>> key, and the other three report having no value for that key.
>>
>> The documentation is pretty opinionated about the
>> CacheWriteSynchronizationMode not being impactful with respect to
>> consistency for replicated caches. As noted below, we use PrimarySync (the
>> default) for these caches, which would suggest a potential failure mode
>> preventing the backup copies obtaining their copy once the primary copy has
>> been written.
>>
>> We are continuing to investigate and would be interested in any
>> suggestions you may have as to the likely cause.
>>
>> Thanks,
>> Raymond.
>>
>> On Thu, Jul 27, 2023 at 12:38 PM Raymond Wilson <
>> raymond_wil...@trimble.com> wrote:
>>
>>> Hi,
>>>
>>> I have a query regarding data safety of replicated caches in the case of
>>> hard failure of the compute resource but where the storage resource is
>>> available when the node returns.
>>>
>>> We are using Ignite 2.15 with the C# client.
>>>
>>> We have a number of these caches that have four nodes participating in
>>> the replicated caches, all with the default PrimarySync write
>>> synchronization mode. All data storage configurations are configured with
>>> WalMode = WalMode.Fsync.
>>>
>>> We have logic performing writes against these caches which will continue
>>> once the primary node for the replicated cache has written the data item.
>>>
>>> I am unsure of the guarantees made by Ignite at this point in the event
>>> of failure. Specifically, hard/red-button failure of compute hardware
>>> resources and/or abrupt (but recoverable) detachment of storage resources.
>>>
>>> Scenario one: Primary node returns "OK", then immediately fails (before
>>> checkpoint). When the primary node returns, should I expect the replicated
>>> value to be in the primary, and to appear in all other nodes too?
>>>
>>> Scenario two: Primary node returns "OK", then a secondary node
>>> immediately fails (before achieving the write and so before any
>>> checkpoint). When the secondary node returns, should I expect the
>>> replicated value to be in the recovered secondary node?
>>>
>>> In relation to these scenarios, does setting the cache write
>>> synchronization mode to FullSync improve the safety of the write, since
>>> all nodes must then acknowledge the write before it returns?
>>>
>>> If there is an improvement in write safety in this instance, does this
>>> imply the Fsync WalMode write pathway has opportunities for data loss in
>>> these failure situations?
>>>
>>> Thanks,
>>> Raymond.
>>>
>>>
>>>
>>>
>>> --
>>> <http://www.trimble.com/>
>>> Raymond Wilson
>>> Trimble Distinguished Engineer, Civil Construction Software (CCS)
>>> 11 Birmingham Drive | Christchurch, New Zealand
>>> raymond_wil...@trimble.com
>>>
>>>
>>> <https://worksos.trimble.com/?utm_source=Trimble&utm_medium=emailsign&utm_campaign=Launch>
>>>
>>
>>
>
