Hi,

1. Yes, adding backups for the underlying datastructures cache (via CollectionConfiguration) should help.

2. It seems one or more of your nodes registers with an IPv6 address. Try using the same JVM option on all nodes: either -Djava.net.preferIPv4Stack=true or -Djava.net.preferIPv6Stack=true.
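For (1), a minimal sketch of what I mean. This is an assumption of how your queue is created, not your actual code: the class name and element type are illustrative, and the queue name is taken from your log.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteQueue;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CollectionConfiguration;

    public class QueueBackupExample {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                // Back the queue's datastructures cache with 1 backup copy,
                // so queue items survive the loss of any single node.
                CollectionConfiguration colCfg = new CollectionConfiguration();
                colCfg.setCacheMode(CacheMode.PARTITIONED);
                colCfg.setBackups(1);

                // cap = 0 means an unbounded queue. Note the configuration is
                // only applied when the queue is first created.
                IgniteQueue<String> queue = ignite.queue("cross-fx-curves", 0, colCfg);

                queue.put("example-item");
            }
        }
    }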
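For (2), the property has to be passed on the JVM command line of every node, clients and servers alike. For a standalone node started with plain java it would look something like this (classpath and main class are illustrative):

    java -Djava.net.preferIPv4Stack=true -cp ignite-core.jar:app.jar my.com.CrossFxCurvePubService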
On Tue, May 2, 2017 at 5:17 PM, jpmoore40 <jonathan.mo...@ca-cib.com> wrote:
> Hi,
>
> I've been having a couple of issues with a shared IgniteQueue which seem
> to occur sporadically. I have a single publisher process which pushes
> data onto a queue, and several worker processes that pull the data from
> the queue and perform some processing. However these nodes are part of a
> larger grid containing other nodes that submit and process
> IgniteCallables. I partition these using an environment variable
> clusterName=xxx so the IgniteCallables are only computed on a particular
> cluster group. This seemed like the best way of doing things as I am
> using the TcpDiscoveryJdbcIpFinder and didn't want to set up a different
> database for each discrete grid.
>
> Several times I have found that the publishers and workers accessing the
> IgniteQueue stop processing, and there seem to be two separate problems
> occurring.
>
> The first was that I would get an exception such as the following when a
> node was stopped:
>
> java.lang.IllegalStateException: Cache has been stopped: datastructures_0
>     at org.apache.ignite.internal.processors.cache.GridCacheGateway.checkStatus(GridCacheGateway.java:85)
>     at org.apache.ignite.internal.processors.cache.GridCacheGateway.enter(GridCacheGateway.java:68)
>     at org.apache.ignite.internal.processors.datastructures.GridCacheQueueProxy.contains(GridCacheQueueProxy.java:160)
>     at my.com.CrossFxCurvePubService.addToQueue(CrossFxCurvePubService.java:267)
>     ...
>
> This I think I solved (i.e. it hasn't happened since) by ensuring that
> the CollectionConfiguration was initialised with backups, though if
> anyone can confirm that would be helpful.
>
> However the second problem (which also causes the queue publisher and
> workers to stop processing) is accompanied by repeated blocks of
> messages such as the following:
>
> 2017-04-28 14:08:05,468 WARN [grid-nio-worker-2-#11%null%] java.JavaLogger (JavaLogger.java:278) -
> Failed to process selector key (will close): GridSelectorNioSessionImpl [selectorIdx=2, queueSize=0,
> writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
> readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
> recovery=GridNioRecoveryDescriptor [acked=9, resendCnt=0, rcvCnt=8, sentCnt=9, reserved=true,
> lastAck=8, nodeLeft=false, node=TcpDiscoveryNode [id=c91ce074-964e-4497-ac77-a3828b301ed3,
> addrs=[0:0:0:0:0:0:0:1, 10.127.197.150, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:0,
> /10.127.197.150:0, /127.0.0.1:0], discPort=0, order=161, intOrder=84,
> lastExchangeTime=1493384810687, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe, isClient=true],
> connected=true, connectCnt=1, queueLimit=5120, reserveCnt=1],
> super=GridNioSessionImpl [locAddr=/10.127.246.164:60985, rmtAddr=/10.127.197.150:47100,
> createTime=1493384812272, closeTime=0, bytesSent=73469, bytesRcvd=1053,
> sndSchedTime=1493384869270, lastSndTime=1493384831058, lastRcvTime=1493384869270,
> readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter
> [parser=o.a.i.i.util.nio.GridDirectParser@1b4d47c, directMode=true],
> GridConnectionBytesVerifyFilter], accepted=false]]
>
> 2017-04-28 14:08:05,470 WARN [grid-nio-worker-2-#11%null%] java.JavaLogger (JavaLogger.java:278) -
> Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException,
> msg=An existing connection was forcibly closed by the remote host]
>
> 2017-04-28 14:08:14,279 WARN [disco-event-worker-#20%null%] java.JavaLogger (JavaLogger.java:278) -
> Node FAILED: TcpDiscoveryNode [id=c91ce074-964e-4497-ac77-a3828b301ed3,
> addrs=[0:0:0:0:0:0:0:1, 10.127.197.150, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:0,
> /10.127.197.150:0, /127.0.0.1:0], discPort=0, order=161, intOrder=84,
> lastExchangeTime=1493384810687, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe, isClient=true]
>
> 2017-04-28 14:08:14,287 INFO [disco-event-worker-#20%null%] java.JavaLogger (JavaLogger.java:273) -
> Topology snapshot [ver=162, servers=6, clients=0, CPUs=24, heap=3.5GB]
>
> 2017-04-28 14:08:14,295 INFO [exchange-worker-#24%null%] java.JavaLogger (JavaLogger.java:273) -
> Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=162, minorTopVer=0],
> evt=NODE_FAILED, node=c91ce074-964e-4497-ac77-a3828b301ed3]
>
> 2017-04-28 14:08:35,853 WARN [grid-timeout-worker-#7%null%] java.JavaLogger (JavaLogger.java:278) -
> Found long running cache future [startTime=14:06:52.182, curTime=14:08:35.828,
> fut=GridPartitionedSingleGetFuture [topVer=AffinityTopologyVersion [topVer=161, minorTopVer=0],
> key=UserKeyCacheObjectImpl [val=GridCacheQueueItemKey
> [queueId=9c0396aab51-f5c26da7-4123-4ba7-aa40-857ccd042342, queueName=cross-fx-curves, idx=519195],
> hasValBytes=true], readThrough=true, forcePrimary=false,
> futId=c8ca4a5cb51-f5c26da7-4123-4ba7-aa40-857ccd042342, trackable=true,
> subjId=efe9e46d-6dbd-4ca1-b7fb-7ace46d37571, taskName=null, deserializeBinary=true,
> skipVals=false, expiryPlc=null, canRemap=true, needVer=false, keepCacheObjects=false,
> node=TcpDiscoveryNode [id=1bbf12c4-74b7-490d-b3fc-d8b3ef713ac0, addrs=[0:0:0:0:0:0:0:1,
> 10.127.246.74, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:47500,
> /127.0.0.1:47500, LNAPLN343PRD.ldn.emea.cib/10.127.246.74:47500], discPort=47500, order=159,
> intOrder=83, lastExchangeTime=1493213394382, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe,
> isClient=false]]]
>
> When this happens the only solution seems to be to restart all nodes on
> the grid. The key to this I think is the "long running cache future", as
> this is accessing my queue (named cross-fx-curves), but I've no idea what
> this is doing or why it should be stuck. These events do seem to coincide
> with restarts of nodes on the grid, but I have been unable to reproduce
> the issue - I have tried killing each of the nodes individually with no
> impact on the rest of the grid.
>
> Can anyone provide any feedback on what could be causing this and how
> best to rectify it?
>
> Thanks
> Jon
>
> --
> View this message in context:
> http://apache-ignite-users.70518.x6.nabble.com/Nodes-hanging-when-accessing-queue-tp12343.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.

--
Best regards,
Andrey V. Mashenkov