Using Ignite 2.10.0. We had a frustrating series of issues with Ignite the other day. We're running a 4-node cluster with 1 backup per table, cacheMode set to Partitioned, and write-behind enabled. One client inserts data into the caches and another client listens for new data in those caches. (Apologies, I can't paste our actual logs or configuration due to firm policy.)
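Since I can't share the real configuration, the setup is roughly equivalent to this sanitized sketch (the cache name, key/value types, and stub cache store are placeholders, not our actual code):

import javax.cache.Cache;
import javax.cache.configuration.FactoryBuilder;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClusterSetupSketch {
    /** Stub standing in for our real write-behind store (a database in our case). */
    public static class StubCacheStore extends CacheStoreAdapter<Long, String> {
        @Override public String load(Long key) { return null; }
        @Override public void write(Cache.Entry<? extends Long, ? extends String> entry) { /* no-op stub */ }
        @Override public void delete(Object key) { /* no-op stub */ }
    }

    public static void main(String[] args) {
        // One cache as an example: partitioned, 1 backup, write-behind to an external store.
        CacheConfiguration<Long, String> cacheCfg = new CacheConfiguration<>("someCache");
        cacheCfg.setCacheMode(CacheMode.PARTITIONED);
        cacheCfg.setBackups(1);
        cacheCfg.setCacheStoreFactory(FactoryBuilder.factoryOf(StubCacheStore.class));
        cacheCfg.setWriteThrough(true);        // write-behind sits on top of write-through
        cacheCfg.setWriteBehindEnabled(true);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCacheConfiguration(cacheCfg);

        // Each of the 4 server nodes starts with this configuration; the inserting and
        // listening clients connect to the same cluster in client mode.
        Ignite ignite = Ignition.start(cfg);
    }
}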
What happened:

1. Our insertion client was not working after startup; every 20 seconds it logged 'Still awaiting for initial partition map exchange.' This continued until we restarted the node it was trying to connect to, at which point the client connected to another node and the warning stopped. Possible Bug #1 - why didn't the client automatically try a different node? Or, if it would have hit the same issue connecting to any node, why couldn't the cluster print an error and keep functioning anyway?

2. After rebooting bad node #1, the insertion client still didn't work. It started printing a completely different warning, 'First 10 long running cache futures [total=1]' (whatever that means), followed by the ID of a node. We killed that referenced node, and then everything started working. Again, why didn't the client switch to a good node automatically? Or is there a way to configure that kind of failover that I don't know about (roughly the sort of thing I sketch in the P.S. below)?

3. In terms of root cause, it seems bad node #1 had a 'blocked system-critical thread', which according to the stack trace was blocked at CheckpointReadWriteLock.java line 69. Is there a way to recover from this automatically or handle it more gracefully (see the second sketch below)? If not, I will probably disable the WAL (which I understand will disable checkpointing). Possible Bug #2 - why couldn't the node recover from this lock on its own, given that restarting it fixed the problem?

Regards, and thanks in advance for any advice!
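P.S. To clarify what I mean by failover in point 2: for the thin client API I believe you can list several server addresses and the client will try the others if one is down, as in the sketch below (hostnames and cache name are placeholders). Is there an equivalent for the thick clients we're using, or is that failover supposed to happen automatically via discovery?

import org.apache.ignite.Ignition;
import org.apache.ignite.client.IgniteClient;
import org.apache.ignite.configuration.ClientConfiguration;

public class ThinClientFailoverSketch {
    public static void main(String[] args) {
        // Thin client given every server endpoint; if one node is unreachable
        // it can connect to (or fail over to) another address in the list.
        ClientConfiguration clientCfg = new ClientConfiguration()
            .setAddresses("node1.example.com:10800",
                          "node2.example.com:10800",
                          "node3.example.com:10800",
                          "node4.example.com:10800");

        try (IgniteClient client = Ignition.startClient(clientCfg)) {
            client.getOrCreateCache("someCache").put(1L, "some value");
        }
    }
}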
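P.P.S. For point 3, is something along these lines the intended way to make a node fail fast on a blocked system-critical thread (so our service supervisor can restart it) rather than hanging? This is only a guess from reading the failure-handling docs, not something we've deployed, and the timeout values are made up:

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailureHandlingSketch {
    public static void main(String[] args) {
        // Stop the node (or halt the JVM if a graceful stop takes longer than 10s)
        // when a failure is detected, instead of leaving it half-alive.
        StopNodeOrHaltFailureHandler failureHnd = new StopNodeOrHaltFailureHandler(true, 10_000L);

        // As I understand it, SYSTEM_WORKER_BLOCKED is ignored (only logged) by default;
        // clearing the ignored set should make the handler act on blocked system threads too.
        failureHnd.setIgnoredFailureTypes(Collections.emptySet());

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setFailureHandler(failureHnd)
            // How long a system worker may be unresponsive before it counts as blocked.
            .setSystemWorkerBlockedTimeout(30_000L);

        Ignite ignite = Ignition.start(cfg);
    }
}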