I suspect a race condition involving caches in async mode. That is a naive
guess, though, since we don't have enough details. I'll assume this is a
request for help with troubleshooting methodology and that the question is
really "what should we look at next?"

The real answer will come from tracing the insert of element E and the
subsequent cache get() failures. Do we know whether E was completely
inserted into each replicated cache backup partition prior to the get()? Do
we know whether the reported get() failure was a normally functioning lookup
that simply did not find E, or whether there were timeouts or exceptions
indicating that something abnormal was happening?
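
To make that distinction concrete, here is a minimal sketch in plain Java of
how a per-node read could be logged so that "lookup worked but found nothing"
is not confused with a timeout or an exception. Everything in it (the
LookupProbe class, the Outcome enum, classify, probeAll) is hypothetical and
is not part of the Ignite API; each Callable stands in for whatever read your
compute job actually performs on that node.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical diagnostic helper; an illustration only, not Ignite code. */
public class LookupProbe {

    /** How a single per-node lookup ended. */
    public enum Outcome { FOUND, MISSING, TIMED_OUT, FAILED }

    /**
     * Runs one lookup with a deadline and classifies the result.
     * Assumption: a null return means the lookup ran normally and the key
     * was absent, which is the case worth separating from real errors.
     */
    public static Outcome classify(Callable<?> lookup, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> pending = pool.submit(lookup);
            Object value = pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return value != null ? Outcome.FOUND : Outcome.MISSING;
        } catch (TimeoutException e) {
            return Outcome.TIMED_OUT;   // lookup did not answer in time
        } catch (ExecutionException e) {
            return Outcome.FAILED;      // lookup itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        } finally {
            pool.shutdownNow();         // interrupts a lookup that is still running
        }
    }

    /** Classifies one lookup per node, keeping node order for the log. */
    public static Map<String, Outcome> probeAll(
            Map<String, Callable<?>> lookupsByNode, long timeoutMillis) {
        Map<String, Outcome> report = new LinkedHashMap<>();
        lookupsByNode.forEach(
            (node, lookup) -> report.put(node, classify(lookup, timeoutMillis)));
        return report;
    }
}
```

If P1, P2 and P3 all came back as MISSING rather than TIMED_OUT or FAILED,
that would point at the data actually being absent from those nodes' view of
the cache, rather than at a communication problem.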

What steps did you take to troubleshoot this issue, and what is the cluster
and cache configuration in play? What does the code look like for the
updates to the replicated cache, and what does the code look like for the
distributed compute operation?

On Tue, Nov 21, 2023 at 5:21 PM Raymond Wilson <raymond_wil...@trimble.com>
wrote:

> Hi,
>
> We have been triaging an odd issue we encountered in a system using Ignite
> v2.15 and the C# client.
>
> We have a replicated cache across four nodes, let's call them P0, P1, P2 &
> P3. Because the cache is replicated, every item added to the cache is
> present in each of P0, P1, P2 and P3.
>
> Some time ago an element (E) was added to this cache (among many others).
> A number of system restarts have occurred since that time.
>
> We started observing an issue where a query running across P0/P1/P2/P3 as
> a cluster compute operation needed to load element E on each of the nodes
> to perform that query. Node P0 succeeded, while nodes P1, P2 & P3 all
> reported that element E did not exist.
>
> This situation persisted until the cluster was restarted, after which the
> same query that had been failing now succeeded as all four 'P' nodes were
> able to read element E.
>
> There were no Ignite errors reported in the context of these
> failing queries to indicate unhappiness in the Ignite nodes.
>
> This seems like very strange behaviour. Are there any suggestions as to
> what could be causing this failure to read the replicated value on the
> three failing nodes, especially as the element 'came back' after a cluster
> restart?
>
> Thanks,
> Raymond.
>
> --
> <http://www.trimble.com/>
> Raymond Wilson
> Trimble Distinguished Engineer, Civil Construction Software (CCS)
> 11 Birmingham Drive | Christchurch, New Zealand
> raymond_wil...@trimble.com
