Sorry for the long delay in responding to this issue. I will work on replicating it in a more controlled test environment and try to grab thread dumps from there.
In a previous post you mentioned that the blocking in this thread dump should only happen when a data node is affected, which is usually a server node, and you also said that near cache consistency is observed continuously. If we have near caching enabled, does that mean clients become data nodes? If so, would that explain why we see blocking when a client crashes or hangs? Assuming this is related to near caching, is there any configuration that adjusts this behavior to favor availability over perfect consistency?
Having a failure on one client ripple across the entire system and effectively take down all other clients of that cluster is a major problem. We obviously want to avoid problems like an OOM error or a long GC pause in the client application, but if they do happen we need to absorb them gracefully and limit the blast radius to just that client node.