Hi,

I've been having a couple of issues with a shared IgniteQueue which seem to occur sporadically. I have a single publisher process which pushes data onto a queue, and several worker processes that pull data from the queue and perform some processing. However, these nodes are part of a larger grid containing other nodes that submit and process IgniteCallables. I partition these using an environment variable (clusterName=xxx) so that the IgniteCallables are only executed on a particular cluster group. This seemed like the best way of doing things, as I am using the TcpDiscoveryJdbcIpFinder and didn't want to set up a different database for each discrete grid.
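For context, here is a minimal sketch of the partitioning scheme described above, assuming the attribute-based approach (Ignite exposes environment variables and system properties as node attributes, so ClusterGroup.forAttribute can match on them; the attribute name "clusterName", the value "xxx", and the callable body are placeholders, not my actual code):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterGroup;

public class ClusterGroupSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Restrict compute to nodes that were started with the
            // attribute clusterName=xxx (e.g. from an environment
            // variable, which Ignite includes in node attributes).
            ClusterGroup group = ignite.cluster().forAttribute("clusterName", "xxx");

            // IgniteCallables submitted through this compute instance
            // run only on nodes in the matching cluster group.
            String result = ignite.compute(group).call(() -> "processed");
        }
    }
}
```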
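For reference, this is roughly how the queue is created, including the backups setting on CollectionConfiguration discussed below (the queue name matches the logs; the capacity of 0, i.e. unbounded, and the single backup are illustrative assumptions):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteQueue;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CollectionConfiguration;

public class QueueSetupSketch {
    public static void main(String[] args) throws Exception {
        try (Ignite ignite = Ignition.start()) {
            CollectionConfiguration colCfg = new CollectionConfiguration();
            // Keep a backup copy of each queue entry so the backing
            // cache survives the loss of a single node.
            colCfg.setBackups(1);

            // Capacity 0 = unbounded; the configuration is applied
            // only when the queue is first created.
            IgniteQueue<String> queue =
                ignite.queue("cross-fx-curves", 0, colCfg);

            queue.put("curve-update");   // publisher side
            String item = queue.take();  // worker side
        }
    }
}
```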
Several times I have found that the publishers and workers accessing the IgniteQueue stop processing, and there seem to be two separate problems occurring.

The first was that I would get an exception such as the following when a node was stopped:

java.lang.IllegalStateException: Cache has been stopped: datastructures_0
        at org.apache.ignite.internal.processors.cache.GridCacheGateway.checkStatus(GridCacheGateway.java:85)
        at org.apache.ignite.internal.processors.cache.GridCacheGateway.enter(GridCacheGateway.java:68)
        at org.apache.ignite.internal.processors.datastructures.GridCacheQueueProxy.contains(GridCacheQueueProxy.java:160)
        at my.com.CrossFxCurvePubService.addToQueue(CrossFxCurvePubService.java:267)
        ...

I believe I solved this (i.e. it hasn't happened since) by ensuring that the CollectionConfiguration was initialised with backups, though confirmation of that would be helpful.

The second problem (which also causes the queue publisher and workers to stop processing) is accompanied by repeated blocks of messages such as the following:

2017-04-28 14:08:05,468 WARN [grid-nio-worker-2-#11%null%] java.JavaLogger (JavaLogger.java:278) - Failed to process selector key (will close): GridSelectorNioSessionImpl [selectorIdx=2, queueSize=0, writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], recovery=GridNioRecoveryDescriptor [acked=9, resendCnt=0, rcvCnt=8, sentCnt=9, reserved=true, lastAck=8, nodeLeft=false, node=TcpDiscoveryNode [id=c91ce074-964e-4497-ac77-a3828b301ed3, addrs=[0:0:0:0:0:0:0:1, 10.127.197.150, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:0, /10.127.197.150:0, /127.0.0.1:0], discPort=0, order=161, intOrder=84, lastExchangeTime=1493384810687, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe, isClient=true], connected=true, connectCnt=1, queueLimit=5120, reserveCnt=1], super=GridNioSessionImpl [locAddr=/10.127.246.164:60985, rmtAddr=/10.127.197.150:47100, createTime=1493384812272, closeTime=0, bytesSent=73469, bytesRcvd=1053, sndSchedTime=1493384869270, lastSndTime=1493384831058, lastRcvTime=1493384869270, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser@1b4d47c, directMode=true], GridConnectionBytesVerifyFilter], accepted=false]]

2017-04-28 14:08:05,470 WARN [grid-nio-worker-2-#11%null%] java.JavaLogger (JavaLogger.java:278) - Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=An existing connection was forcibly closed by the remote host]

2017-04-28 14:08:14,279 WARN [disco-event-worker-#20%null%] java.JavaLogger (JavaLogger.java:278) - Node FAILED: TcpDiscoveryNode [id=c91ce074-964e-4497-ac77-a3828b301ed3, addrs=[0:0:0:0:0:0:0:1, 10.127.197.150, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:0, /10.127.197.150:0, /127.0.0.1:0], discPort=0, order=161, intOrder=84, lastExchangeTime=1493384810687, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe, isClient=true]

2017-04-28 14:08:14,287 INFO [disco-event-worker-#20%null%] java.JavaLogger (JavaLogger.java:273) - Topology snapshot [ver=162, servers=6, clients=0, CPUs=24, heap=3.5GB]

2017-04-28 14:08:14,295 INFO [exchange-worker-#24%null%] java.JavaLogger (JavaLogger.java:273) - Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=162, minorTopVer=0], evt=NODE_FAILED, node=c91ce074-964e-4497-ac77-a3828b301ed3]

2017-04-28 14:08:35,853 WARN [grid-timeout-worker-#7%null%] java.JavaLogger (JavaLogger.java:278) - Found long running cache future [startTime=14:06:52.182, curTime=14:08:35.828, fut=GridPartitionedSingleGetFuture [topVer=AffinityTopologyVersion [topVer=161, minorTopVer=0], key=UserKeyCacheObjectImpl [val=GridCacheQueueItemKey [queueId=9c0396aab51-f5c26da7-4123-4ba7-aa40-857ccd042342, queueName=cross-fx-curves, idx=519195], hasValBytes=true], readThrough=true, forcePrimary=false, futId=c8ca4a5cb51-f5c26da7-4123-4ba7-aa40-857ccd042342, trackable=true, subjId=efe9e46d-6dbd-4ca1-b7fb-7ace46d37571, taskName=null, deserializeBinary=true, skipVals=false, expiryPlc=null, canRemap=true, needVer=false, keepCacheObjects=false, node=TcpDiscoveryNode [id=1bbf12c4-74b7-490d-b3fc-d8b3ef713ac0, addrs=[0:0:0:0:0:0:0:1, 10.127.246.74, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:47500, /127.0.0.1:47500, LNAPLN343PRD.ldn.emea.cib/10.127.246.74:47500], discPort=47500, order=159, intOrder=83, lastExchangeTime=1493213394382, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe, isClient=false]]]

When this happens the only solution seems to be to restart all the nodes on the grid. The key to this, I think, is the "long running cache future", since it is accessing my queue (named cross-fx-curves), but I've no idea what this future is doing or why it should be stuck. These events do seem to coincide with restarts of nodes on the grid, but I have been unable to reproduce the issue - I have tried killing each of the nodes individually with no impact on the rest of the grid.

Can anyone provide any feedback on what could be causing this and how best to rectify it?

Thanks

Jon

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Nodes-hanging-when-accessing-queue-tp12343.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.