Using Ignite 2.10.0. We had a frustrating series of issues with Ignite the other day. We're running a 4-node cluster with 1 backup per table, cacheMode set to Partitioned, and write-behind enabled. One client inserts data into the caches and another client listens for new data in those caches. (Apologies, I can't paste our actual logs or configuration due to firm policy.)
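Since I can't share the real configuration, the setup is roughly equivalent to this sanitized sketch (the cache name, key/value types, and stub cache store are placeholders, not our actual code):

import javax.cache.Cache;
import javax.cache.configuration.FactoryBuilder;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClusterSetupSketch {
    /** Stub standing in for our real write-behind store (a database in our case). */
    public static class StubCacheStore extends CacheStoreAdapter<Long, String> {
        @Override public String load(Long key) { return null; }
        @Override public void write(Cache.Entry<? extends Long, ? extends String> entry) { /* no-op stub */ }
        @Override public void delete(Object key) { /* no-op stub */ }
    }

    public static void main(String[] args) {
        // One cache as an example: partitioned, 1 backup, write-behind to an external store.
        CacheConfiguration<Long, String> cacheCfg = new CacheConfiguration<>("someCache");
        cacheCfg.setCacheMode(CacheMode.PARTITIONED);
        cacheCfg.setBackups(1);
        cacheCfg.setCacheStoreFactory(FactoryBuilder.factoryOf(StubCacheStore.class));
        cacheCfg.setWriteThrough(true);        // write-behind sits on top of write-through
        cacheCfg.setWriteBehindEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCacheConfiguration(cacheCfg);

        // Each of the 4 server nodes starts with this configuration; the inserting and
        // listening clients connect to the same cluster in client mode.
        Ignite ignite = Ignition.start(cfg);
    }
}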
What happened:

1. Our insertion client was not working after startup; every 20 seconds it logged 'Still awaiting for initial partition map exchange.' This continued until we restarted the node it was trying to connect to, at which point the client connected to another node and the warning stopped. Possible Bug #1 - why didn't the client automatically try a different node? Or, if it would have hit the same issue connecting to any node, why couldn't the cluster print an error and keep functioning anyway?

2. After rebooting bad node #1, the insertion client still didn't work. It started printing a completely different warning, 'First 10 long running cache futures [total=1]' (whatever that means), followed by the ID of a node. We killed that referenced node, and then everything started working. Again, why didn't the client switch to a good node automatically? Or is there a way to configure that kind of failover that I don't know about (roughly the sort of thing I sketch in the P.S. below)?

3. In terms of root cause, it seems bad node #1 had a 'blocked system-critical thread', which according to the stack trace was blocked at CheckpointReadWriteLock.java line 69. Is there a way to recover from this automatically or handle it more gracefully (see the second sketch below)? If not, I will probably disable the WAL (which I understand will disable checkpointing). Possible Bug #2 - why couldn't the node recover from this lock on its own, given that restarting it fixed the problem?

Regards, and thanks in advance for any advice!
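P.S. To clarify what I mean by failover in point 2: for the thin client API I believe you can list several server addresses and the client will try the others if one is down, as in the sketch below (hostnames and cache name are placeholders). Is there an equivalent for the thick clients we're using, or is that failover supposed to happen automatically via discovery?

import org.apache.ignite.Ignition;
import org.apache.ignite.client.IgniteClient;
import org.apache.ignite.configuration.ClientConfiguration;

public class ThinClientFailoverSketch {
    public static void main(String[] args) {
        // Thin client given every server endpoint; if one node is unreachable
        // it can connect to (or fail over to) another address in the list.
        ClientConfiguration clientCfg = new ClientConfiguration()
            .setAddresses("node1.example.com:10800",
                          "node2.example.com:10800",
                          "node3.example.com:10800",
                          "node4.example.com:10800");

        try (IgniteClient client = Ignition.startClient(clientCfg)) {
            client.getOrCreateCache("someCache").put(1L, "some value");
        }
    }
}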
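P.P.S. For point 3, is something along these lines the intended way to make a node fail fast on a blocked system-critical thread (so our service supervisor can restart it) rather than hanging? This is only a guess from reading the failure-handling docs, not something we've deployed, and the timeout values are made up:

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandlingSketch {
    public static void main(String[] args) {
        // Stop the node (or halt the JVM if a graceful stop takes longer than 10s)
        // when a failure is detected, instead of leaving it half-alive.
        StopNodeOrHaltFailureHandler failureHnd = new StopNodeOrHaltFailureHandler(true, 10_000L);

        // As I understand it, SYSTEM_WORKER_BLOCKED is ignored (only logged) by default;
        // clearing the ignored set should make the handler act on blocked system threads too.
        failureHnd.setIgnoredFailureTypes(Collections.emptySet());

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setFailureHandler(failureHnd)
            // How long a system worker may be unresponsive before it counts as blocked.
            .setSystemWorkerBlockedTimeout(30_000L);

        Ignite ignite = Ignition.start(cfg);
    }
}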