Re: Another replicated cache oddity

2023-11-22 Thread Raymond Wilson
Hi Jeremy,

Thanks for the insightful reply.

To cover your points (and fill in some info I should have provided):

- We use Ignite persistence
- We use the FSYNC WAL mode
- We do not use the FULL_SYNC cache write synchronization mode for our
REPLICATED caches (we use PRIMARY_SYNC) ... and because we use persistence
we have to rebuild the entire grid to be able to change it, which is a pain
when you have terabytes of data. I discovered this when investigating the
other REPLICATED cache oddities we have seen (which was an annoying
oversight). If we do ever rebuild I will definitely be using the FULL_SYNC
mode (BTW, the Ignite documentation is fairly unequivocal about this 'not
mattering' in terms of updating all the nodes, as long as you are not after
dependable low-latency read-after-write consistency for those caches, which
we aren't right now).

We do have good exception tracking around Put etc operations (but the 'Put'
event occurred so long ago any logging related to it would have been
retired, and the operation would have failed at a user level too in this
instance).

ConfigureAwait() can be a vexed topic. We use .Net 7, which has a different
async context model from .Net Framework; we're not running async
operations in UX contexts (this is all web-services back-end
implementation), and we don't care which .Net thread pool thread
execution resumes on. The async calls are awaited, so we get exposure to
exceptions and have strong reporting and handling of those.
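To illustrate that pattern, here is a minimal sketch only (the helper name, key/value types, and error reporting are hypothetical, not from the actual codebase):

```csharp
using System;
using System.Threading.Tasks;
using Apache.Ignite.Core.Cache;

// Sketch of an awaited put with exception exposure. ConfigureAwait(false)
// only skips synchronization-context capture; it does not change the
// cache's write-synchronization guarantees.
public static async Task PutTrackedAsync<TK, TV>(ICache<TK, TV> cache, TK key, TV value)
{
    try
    {
        await cache.PutAsync(key, value).ConfigureAwait(false);
    }
    catch (CacheException ex)
    {
        // Placeholder reporting; substitute the real logging framework.
        Console.Error.WriteLine($"PutAsync failed for key {key}: {ex}");
        throw;
    }
}
```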

This particular issue has been observed only once. In terms of workload, it
is very bursty, with relatively low (to no) baseline workload on the read
side but a fairly dependable base load on the ingest side.

I am familiar with the WAL model, and we have chosen the FSYNC mode to
favour consistency/durability of our data. I hope the "the door is left
open in many places to allow things to slip between the cracks" caveat does
not apply in that scenario :)

Raymond.


On Thu, Nov 23, 2023 at 11:45 AM Jeremy McMillan <
jeremy.mcmil...@gridgain.com> wrote:

> Do you do ConfigureAwait(false) in the code that does *ICache<TK, TV>.PutAsync()*, or do you have some kind of handler to track whether
> there were problems with any particular put operation?
> https://ignite.apache.org/docs/latest/net-specific/net-async
> https://devblogs.microsoft.com/dotnet/configureawait-faq/
>
> The fastest way to do this basically says "don't know; don't care" whether
> any particular operation failed for any reason.
>
> There is also cache configuration which governs how the server
> acknowledges put operations.
>
> https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/CacheWriteSynchronizationMode.html
>
> The fastest cache sync mode is FULL_ASYNC, which, like .Net
> ConfigureAwait(False), says the cache can acknowledge the put as done, but
> we don't know and don't care if any or all of the replicas have been
> completely written. The mode PRIMARY_SYNC only guarantees the primary
> partition has been updated. FULL_SYNC waits until all replicas have been
> successfully updated. Even if you have callbacks in .Net to track whether
> anything got missed in the PutAsync(), it only knows what the cache's sync
> mode can assert.
>
> If you are using persistence, Ignite writes changes to a partition first
> into a Write Ahead Log (WAL). Data in the WAL is not yet in the cache, but
> the cluster has it, and can finish an interrupted cache mutation if the WAL
> record for that operation is complete. If you write to the cache with an
> async call, and then you attempt to read the write from the cache
> immediately, you may be racing against the WAL sync. The WAL itself has
> multiple consistency modes. The default mode is LOG_ONLY, which allows the
> OS to acknowledge that the WAL records containing your updates have been
> buffered before they have been written to disk. It is possible for Ignite
> to think something has been written but then be killed immediately, before
> the OS has flushed the MMAP buffer to disk.
>
> https://ignite.apache.org/docs/2.11.1/persistence/native-persistence#wal-modes
>
> https://superuser.com/questions/1288890/will-buffer-be-automatically-flushed-to-disk-when-a-process-exits
>
> There are lots of tunables which allow any implementation to shift the
> priority from consistency to performance. If consistency is sacrificed for
> performance, especially under heavy load, some data loss can be possible
> because the door is left open in many places to allow things to slip
> between the cracks. I invite you to experiment with FULL_SYNC cache mode and
> FSYNC WAL mode, and use synchronous puts in special threads when writing
> critical data to the cache, or at least use callbacks to detect whether
> something got interrupted, which also provides an opportunity to recover
> in case this is a corner case that doesn't justify a consistency-maximized
> cache setup.
>
> It's entirely possible your cache system is only set up to provide "best
> effort" consistency, while your compute tasks need more.

Re: Another replicated cache oddity

2023-11-22 Thread Jeremy McMillan
Do you do ConfigureAwait(false) in the code that does *ICache<TK, TV>.PutAsync()*, or do you have some kind of handler to track whether
there were problems with any particular put operation?
https://ignite.apache.org/docs/latest/net-specific/net-async
https://devblogs.microsoft.com/dotnet/configureawait-faq/

The fastest way to do this basically says "don't know; don't care" whether
any particular operation failed for any reason.

There is also cache configuration which governs how the server acknowledges
put operations.
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/cache/CacheWriteSynchronizationMode.html

The fastest cache sync mode is FULL_ASYNC, which, like .Net
ConfigureAwait(False), says the cache can acknowledge the put as done, but
we don't know and don't care if any or all of the replicas have been
completely written. The mode PRIMARY_SYNC only guarantees the primary
partition has been updated. FULL_SYNC waits until all replicas have been
successfully updated. Even if you have callbacks in .Net to track whether
anything got missed in the PutAsync(), it only knows what the cache's sync
mode can assert.
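For reference, a hedged Ignite.NET sketch of pinning a replicated cache to FULL_SYNC (the cache name is illustrative):

```csharp
using Apache.Ignite.Core.Cache.Configuration;

// Sketch: a REPLICATED cache that acknowledges a put only after every
// replica has been updated.
var cacheCfg = new CacheConfiguration("my-replicated-cache")
{
    CacheMode = CacheMode.Replicated,
    WriteSynchronizationMode = CacheWriteSynchronizationMode.FullSync
};
```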

If you are using persistence, Ignite writes changes to a partition first
into a Write Ahead Log (WAL). Data in the WAL is not yet in the cache, but
the cluster has it, and can finish an interrupted cache mutation if the WAL
record for that operation is complete. If you write to the cache with an
async call, and then you attempt to read the write from the cache
immediately, you may be racing against the WAL sync. The WAL itself has
multiple consistency modes. The default mode is LOG_ONLY, which allows the
OS to acknowledge that the WAL records containing your updates have been
buffered before they have been written to disk. It is possible for Ignite
to think something has been written but then be killed immediately, before
the OS has flushed the MMAP buffer to disk.
https://ignite.apache.org/docs/2.11.1/persistence/native-persistence#wal-modes
https://superuser.com/questions/1288890/will-buffer-be-automatically-flushed-to-disk-when-a-process-exits
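A hedged sketch of selecting the FSYNC WAL mode alongside native persistence in Ignite.NET (the region name is illustrative):

```csharp
using Apache.Ignite.Core;
using Apache.Ignite.Core.Configuration;

// Sketch: persistence with the strictest WAL mode, so an update is only
// acknowledged after its WAL record has been fsync'd to disk.
var igniteCfg = new IgniteConfiguration
{
    DataStorageConfiguration = new DataStorageConfiguration
    {
        WalMode = WalMode.Fsync,
        DefaultDataRegionConfiguration = new DataRegionConfiguration
        {
            Name = "default-region",
            PersistenceEnabled = true
        }
    }
};
```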

There are lots of tunables which allow any implementation to shift the
priority from consistency to performance. If consistency is sacrificed for
performance, especially under heavy load, some data loss can be possible
because the door is left open in many places to allow things to slip
between the cracks. I invite you to experiment with FULL_SYNC cache mode and
FSYNC WAL mode, and use synchronous puts in special threads when writing
critical data to the cache, or at least use callbacks to detect whether
something got interrupted, which also provides an opportunity to recover
in case this is a corner case that doesn't justify a consistency-maximized
cache setup.
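One way the "synchronous puts for critical data" suggestion could be sketched (the helper and its read-back check are illustrative, and the check is only meaningful under FULL_SYNC):

```csharp
using System;
using Apache.Ignite.Core.Cache;

// Sketch: blocking Put for critical data, plus a cheap read-back sanity
// check. Under weaker sync modes the read-back can race replica updates.
public static void PutCritical<TK, TV>(ICache<TK, TV> cache, TK key, TV value)
{
    cache.Put(key, value);  // blocks until the cache's sync-mode guarantee holds

    if (!cache.ContainsKey(key))
        throw new InvalidOperationException($"Critical put not visible for key {key}");
}
```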

It's entirely possible your cache system is only set up to provide "best
effort" consistency, while your compute tasks need more. At GridGain, this
is often felt by our commercial clients when their clusters accrete new
workloads that were not a part of the original design. I've only been with
GridGain a short while, but I've heard this question raised more than once,
and the tradeoffs and consistency design are well documented, so my guess is
that it's popular subject matter for reasons similar to your experience.

Are there any signs of strain in any of the Ignite servers' monitoring
metrics when these events happen? What kind of accumulated workload is on
the cluster at those times, and how often does this happen?


On Wed, Nov 22, 2023 at 1:50 PM Raymond Wilson 
wrote:

> Hi Jeremy,
>
> My initial query was to see if this had been observed by others.
>
> To answer some of your questions:
>
> Do we specifically check that an element is added to all nodes
> participating in a replicated cache: No, we do not (we take it on trust
> Ignite sorts that out ;) )
>
> Do we think it is a race condition? No, for three reasons: (1) The grid
> was restarted in the interval between initial addition of the element and
> the time the three nodes were failing to perform the Get(), (2) This
> particular element failed on the same three nodes over many invocations of
> the request over a substantial time period and (3) a subsequent grid
> restart fixed the problem.
>
> From our logs we don't see delays, timeouts or Ignite logged errors
> relating to the Get().
>
> In terms of troubleshooting this has been a bit tricky. In this instance
> only this one element (of many thousands of similar elements with similar
> cluster compute requests being made across them) failed. And only within
> the duration between a pair of grid restarts.
>
> The replicated cache update is just a simple ICache<TK, TV>.PutAsync() with
> a key struct and a byte[] array as payload. In terms of the distributed
> compute code it is just performing a simple ICache<TK, TV>.GetAsync() with
> the key struct.
>
> So far it seems like the three failing nodes just temporarily 'forgot'
> they had this element, and remembered it again after the restart.
>

Re: Another replicated cache oddity

2023-11-22 Thread Raymond Wilson
Hi Jeremy,

My initial query was to see if this had been observed by others.

To answer some of your questions:

Do we specifically check that an element is added to all nodes
participating in a replicated cache: No, we do not (we take it on trust
Ignite sorts that out ;) )

Do we think it is a race condition? No, for three reasons: (1) The grid was
restarted in the interval between initial addition of the element and the
time the three nodes were failing to perform the Get(), (2) This particular
element failed on the same three nodes over many invocations of the request
over a substantial time period and (3) a subsequent grid restart fixed
the problem.

From our logs we don't see delays, timeouts or Ignite logged errors
relating to the Get().

In terms of troubleshooting this has been a bit tricky. In this instance
only this one element (of many thousands of similar elements with similar
cluster compute requests being made across them) failed. And only within
the duration between a pair of grid restarts.

The replicated cache update is just a simple ICache<TK, TV>.PutAsync() with a
key struct and a byte[] array as payload. In terms of the distributed
compute code it is just performing a simple ICache<TK, TV>.GetAsync() with
the key struct.

So far it seems like the three failing nodes just temporarily 'forgot' they
had this element, and remembered it again after the restart.

For context, this is the first time we have seen this specific issue on a
system that has been running in production for 2+ years now. We have seen
numerous instances with replicated caches where Ignite has (permanently)
failed to write at least one, but not all, copies of the element where grid
restarts do not correct the issue. This does not feel the same though.

Raymond.





On Thu, Nov 23, 2023 at 6:50 AM Jeremy McMillan <
jeremy.mcmil...@gridgain.com> wrote:

> I suspect a race condition with async mode caches. This is a naive guess
> though, as we don't have enough details. I'll assume this is a plea for
> help in troubleshooting methodology and the question is really "what should
> we look at next?"
>
> The real answer comes from tracing the insert of element E and subsequent
> cache get() failures. Do we know if E was completely inserted into each
> replicated cache backup partition prior to the get()? Do we know if the
> reported cache get() failure was actually a fully functioning cache lookup
> and retrieval that failed during lookup, or were there timeouts or
> exceptions indicating something abnormal was happening?
>
> What steps did you take to troubleshoot this issue, and what is the
> cluster and cache configuration in play? What does the code look like for
> the updates to the replicated cache, and what does the code look like for
> the distributed compute operation?
>
> On Tue, Nov 21, 2023 at 5:21 PM Raymond Wilson 
> wrote:
>
>> Hi,
>>
>> We have been triaging an odd issue we encountered in a system using
>> Ignite v2.15 and the C# client.
>>
>> We have a replicated cache across four nodes, let's call them P0, P1, P2 &
>> P3. Because the cache is replicated every item added to the cache is
>> present in each of P0, P1, P2 and P3.
>>
>> Some time ago an element (E) was added to this cache (among many others).
>> A number of system restarts have occurred since that time.
>>
>> We started observing an issue where a query running across P0/P1/P2/P3 as
>> a cluster compute operation needed to load element E on each of the nodes
>> to perform that query. Node P0 succeeded, while nodes P1, P2 & P3 all
>> reported that element E did not exist.
>>
>> This situation persisted until the cluster was restarted, after which the
>> same query that had been failing now succeeded as all four 'P' nodes were
>> able to read element E.
>>
>> There were no Ignite errors reported in the context of these
>> failing queries to indicate unhappiness in the Ignite nodes.
>>
>> This seems like very strange behaviour. Are there any suggestions as to
>> what could be causing this failure to read the replicated value on the
>> three failing nodes, especially as the element 'came back' after a cluster
>> restart?
>>
>> Thanks,
>> Raymond.
>>
>>
>>
>>
>> --
>> 
>> Raymond Wilson
>> Trimble Distinguished Engineer, Civil Construction Software (CCS)
>> 11 Birmingham Drive | Christchurch, New Zealand
>> raymond_wil...@trimble.com
>>
>>
>> 
>>
>

-- 

Raymond Wilson
Trimble Distinguished Engineer, Civil Construction Software (CCS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wil...@trimble.com




Re: Another replicated cache oddity

2023-11-22 Thread Jeremy McMillan
I suspect a race condition with async mode caches. This is a naive guess
though, as we don't have enough details. I'll assume this is a plea for
help in troubleshooting methodology and the question is really "what should
we look at next?"

The real answer comes from tracing the insert of element E and subsequent
cache get() failures. Do we know if E was completely inserted into each
replicated cache backup partition prior to the get()? Do we know if the
reported cache get() failure was actually a fully functioning cache lookup
and retrieval that failed during lookup, or were there timeouts or
exceptions indicating something abnormal was happening?

What steps did you take to troubleshoot this issue, and what is the cluster
and cache configuration in play? What does the code look like for the
updates to the replicated cache, and what does the code look like for the
distributed compute operation?

On Tue, Nov 21, 2023 at 5:21 PM Raymond Wilson 
wrote:

> Hi,
>
> We have been triaging an odd issue we encountered in a system using Ignite
> v2.15 and the C# client.
>
> We have a replicated cache across four nodes, let's call them P0, P1, P2 &
> P3. Because the cache is replicated every item added to the cache is
> present in each of P0, P1, P2 and P3.
>
> Some time ago an element (E) was added to this cache (among many others).
> A number of system restarts have occurred since that time.
>
> We started observing an issue where a query running across P0/P1/P2/P3 as
> a cluster compute operation needed to load element E on each of the nodes
> to perform that query. Node P0 succeeded, while nodes P1, P2 & P3 all
> reported that element E did not exist.
>
> This situation persisted until the cluster was restarted, after which the
> same query that had been failing now succeeded as all four 'P' nodes were
> able to read element E.
>
> There were no Ignite errors reported in the context of these
> failing queries to indicate unhappiness in the Ignite nodes.
>
> This seems like very strange behaviour. Are there any suggestions as to
> what could be causing this failure to read the replicated value on the
> three failing nodes, especially as the element 'came back' after a cluster
> restart?
>
> Thanks,
> Raymond.
>
>
>
>
> --
> 
> Raymond Wilson
> Trimble Distinguished Engineer, Civil Construction Software (CCS)
> 11 Birmingham Drive | Christchurch, New Zealand
> raymond_wil...@trimble.com
>
>
> 
>


Re: Another replicated cache oddity

2023-11-21 Thread Zhenya Stanilovsky via user

 
>Hi,
 
Hi, this really looks very strange.
First of all, you need to check the consistency of your data: [1]
 
> Some time ago an element (E) was added to this cache (among many others)
And at some point everything was OK there? Are you sure that this element was
properly written?
What kind of cache are you talking about? How was the data populated there? What
API is used to «load element E» on each node?
If you are talking about restarts, I assume you are dealing with a persistent
store, aren't you? Is it native Ignite persistence or some 3rd-party
DB?
Thanks!
 
[1]  
https://ignite.apache.org/docs/latest/tools/control-script#verifying-partition-checksums
 
> 
>We have been triaging an odd issue we encountered in a system using Ignite 
>v2.15 and the C# client.
> 
>We have a replicated cache across four nodes, let's call them P0, P1, P2 & P3. 
>Because the cache is replicated every item added to the cache is present in 
>each of P0, P1, P2 and P3.
> 
>Some time ago an element (E) was added to this cache (among many others). A 
>number of system restarts have occurred since that time.
> 
>We started observing an issue where a query running across P0/P1/P2/P3 as a 
>cluster compute operation needed to load element E on each of the nodes to 
>perform that query. Node P0 succeeded, while nodes P1, P2 & P3 all reported 
>that element E did not exist. 
> 
>This situation persisted until the cluster was restarted, after which the same 
>query that had been failing now succeeded as all four 'P' nodes were able to 
>read element E.
> 
>There were no Ignite errors reported in the context of these failing queries 
>to indicate unhappiness in the Ignite nodes.
> 
>This seems like very strange behaviour. Are there any suggestions as to what 
>could be causing this failure to read the replicated value on the three 
>failing nodes, especially as the element 'came back' after a cluster restart?
> 
>Thanks,
>Raymond.
> 
> 
> 
>  -- 
>
>Raymond Wilson
>Trimble Distinguished Engineer, Civil Construction Software (CCS)
>11 Birmingham Drive |  Christchurch, New Zealand
>raymond_wil...@trimble.com
>         
> 
 
 
 
 

Another replicated cache oddity

2023-11-21 Thread Raymond Wilson
Hi,

We have been triaging an odd issue we encountered in a system using Ignite
v2.15 and the C# client.

We have a replicated cache across four nodes, let's call them P0, P1, P2 &
P3. Because the cache is replicated every item added to the cache is
present in each of P0, P1, P2 and P3.

Some time ago an element (E) was added to this cache (among many others). A
number of system restarts have occurred since that time.

We started observing an issue where a query running across P0/P1/P2/P3 as a
cluster compute operation needed to load element E on each of the nodes to
perform that query. Node P0 succeeded, while nodes P1, P2 & P3 all reported
that element E did not exist.

This situation persisted until the cluster was restarted, after which the
same query that had been failing now succeeded as all four 'P' nodes were
able to read element E.

There were no Ignite errors reported in the context of these
failing queries to indicate unhappiness in the Ignite nodes.

This seems like very strange behaviour. Are there any suggestions as to
what could be causing this failure to read the replicated value on the
three failing nodes, especially as the element 'came back' after a cluster
restart?

Thanks,
Raymond.




-- 

Raymond Wilson
Trimble Distinguished Engineer, Civil Construction Software (CCS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wil...@trimble.com