We've run into a problem recently where it appears our cache is deadlocking during loading. What I mean by "loading" is that we start up a new cluster in AWS, unconnected to any existing cluster, and then shove a bunch of data into it from Kafka. During this process it's not taking any significant traffic - just healthchecks, ingesting data, and me clicking around in it.
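For context, the loading path is basically a Kafka consumer loop feeding an IgniteDataStreamer that has our custom receiver attached. The sketch below is simplified and the names (CacheLoader, MyEntity, "myCache") are placeholders rather than our real classes, but it's the shape of what runs during loading (the receiver itself is sketched further down):

import java.time.Duration;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CacheLoader {
    private volatile boolean running = true;

    /** Drains the Kafka topic into the cache through an IgniteDataStreamer. */
    public void load(Ignite ignite, KafkaConsumer<String, MyEntity> consumer) {
        try (IgniteDataStreamer<String, MyEntity> streamer = ignite.dataStreamer("myCache")) {
            // Our custom receiver that does a version check before writing (see sketch below).
            streamer.receiver(new VersionCheckingStreamReceiver());

            while (running) {
                ConsumerRecords<String, MyEntity> records = consumer.poll(Duration.ofMillis(500));

                for (ConsumerRecord<String, MyEntity> rec : records)
                    streamer.addData(rec.key(), rec.value());
            }
        }
    }
}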
We've had several deployments in a row fail, apparently due to deadlocking in the loading process. We're typically seeing a number of threads blocked with stack traces like this:

"data-streamer-stripe-3-#20" id=124 state=WAITING
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:140)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.invoke(GridDhtAtomicCache.java:786)
    at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.invoke(IgniteCacheProxyImpl.java:1359)
    at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.invoke(IgniteCacheProxyImpl.java:1405)
    at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.invoke(GatewayProtectedCacheProxy.java:1362)
    at com.mycompany.myapp.myPackage.dao.ignite.cache.streamer.VersionCheckingStreamReceiver.receive(VersionCheckingStreamReceiver.java:33)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamerUpdateJob.call(DataStreamerUpdateJob.java:137)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.localUpdate(DataStreamProcessor.java:397)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.processRequest(DataStreamProcessor.java:302)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.access$000(DataStreamProcessor.java:59)
    at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor$1.onMessage(DataStreamProcessor.java:89)
    at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1555)
    at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1183)
    at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:126)
    at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1090)
    at org.apache.ignite.internal.util.StripedExecutor$Stripe.run(StripedExecutor.java:505)
    at java.lang.Thread.run(Thread.java:748)

The machines seem to go into a moderate-CPU loop (~70% usage).
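The receive() call in that trace is our own code. Roughly, it calls cache.invoke() with an EntryProcessor for each streamed entry so that we only overwrite a value if the incoming version is newer. The sketch below is paraphrased, not the exact source (MyEntity and its version() accessor are placeholders), but the invoke() call is the one the data-streamer-stripe threads are parked under:

import java.util.Collection;
import java.util.Map;

import javax.cache.processor.EntryProcessorException;
import javax.cache.processor.MutableEntry;

import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.CacheEntryProcessor;
import org.apache.ignite.stream.StreamReceiver;

/** Simplified sketch of our receiver: only write an entry if its version is newer. */
public class VersionCheckingStreamReceiver implements StreamReceiver<String, MyEntity> {
    @Override public void receive(IgniteCache<String, MyEntity> cache,
        Collection<Map.Entry<String, MyEntity>> entries) {
        for (Map.Entry<String, MyEntity> e : entries)
            // This is the cache.invoke() that blocks in the stack trace above.
            cache.invoke(e.getKey(), new VersionCheckProcessor(), e.getValue());
    }

    /** Stores the incoming value unless the cached value already has a higher version. */
    private static class VersionCheckProcessor implements CacheEntryProcessor<String, MyEntity, Void> {
        @Override public Void process(MutableEntry<String, MyEntity> entry, Object... args)
            throws EntryProcessorException {
            MyEntity incoming = (MyEntity)args[0];
            MyEntity existing = entry.getValue();

            if (existing == null || existing.version() < incoming.version())
                entry.setValue(incoming);

            return null;
        }
    }
}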
My best guess is that all of that CPU is going to threads like this:

"exchange-worker-#62" id=177 state=RUNNABLE
    at org.apache.ignite.internal.util.tostring.SBLimitedLength.toString(SBLimitedLength.java:283)
    at org.apache.ignite.internal.util.tostring.GridToStringBuilder.toStringImpl(GridToStringBuilder.java:1012)
    at org.apache.ignite.internal.util.tostring.GridToStringBuilder.toString(GridToStringBuilder.java:826)
    at org.apache.ignite.internal.util.tostring.GridToStringBuilder.toString(GridToStringBuilder.java:783)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicAbstractUpdateFuture.toString(GridDhtAtomicAbstractUpdateFuture.java:588)
    at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicSingleUpdateFuture.toString(GridDhtAtomicSingleUpdateFuture.java:134)
    at java.lang.String.valueOf(String.java:2994)
    at java.lang.StringBuilder.append(StringBuilder.java:131)
    at java.util.AbstractCollection.toString(AbstractCollection.java:462)
    at java.lang.String.valueOf(String.java:2994)
    at java.lang.StringBuilder.append(StringBuilder.java:131)
    at org.apache.ignite.internal.processors.cache.CacheObjectsReleaseFuture.toString(CacheObjectsReleaseFuture.java:58)
    at java.lang.String.valueOf(String.java:2994)
    at java.lang.StringBuilder.append(StringBuilder.java:131)
    at java.util.AbstractCollection.toString(AbstractCollection.java:462)
    at java.lang.String.valueOf(String.java:2994)
    at java.lang.StringBuilder.append(StringBuilder.java:131)
    at org.apache.ignite.internal.processors.cache.CacheObjectsReleaseFuture.toString(CacheObjectsReleaseFuture.java:58)
    at java.lang.String.valueOf(String.java:2994)
    at org.apache.ignite.internal.util.GridStringBuilder.a(GridStringBuilder.java:101)
    at org.apache.ignite.internal.util.tostring.SBLimitedLength.a(SBLimitedLength.java:88)
    at org.apache.ignite.internal.util.tostring.GridToStringBuilder.toString(GridToStringBuilder.java:939)
    at org.apache.ignite.internal.util.tostring.GridToStringBuilder.toStringImpl(GridToStringBuilder.java:1005)
    at org.apache.ignite.internal.util.tostring.GridToStringBuilder.toString(GridToStringBuilder.java:685)
    at org.apache.ignite.internal.util.tostring.GridToStringBuilder.toString(GridToStringBuilder.java:621)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.toString(GridDhtPartitionsExchangeFuture.java:3555)
    at java.lang.String.valueOf(String.java:2994)
    at java.lang.StringBuilder.append(StringBuilder.java:131)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager.dumpDebugInfo(GridCachePartitionExchangeManager.java:1569)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2359)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
    at java.lang.Thread.run(Thread.java:748)

I've seen elsewhere that putAll()/getAll() can cause deadlocks, but we're not using those, and I don't believe a slow network is the problem. What else can I look at or try in order to resolve this? Are we just throwing data into the caches too fast? Could a weird pattern in the data (e.g., large entities) cause this? I've attached a full thread dump in case that helps.

Thanks in advance,
BKR

IgniteStackTrace_redacted.txt <http://apache-ignite-users.70518.x6.nabble.com/file/t1824/IgniteStackTrace_redacted.txt>