Hi,

I've been having a couple of issues with a shared IgniteQueue which seem to
occur sporadically. I have a single publisher process which pushes data onto
a queue, and several worker processes that pull the data from the queue and
perform some processing. However, these nodes are part of a larger grid
containing other nodes that submit and process IgniteCallables. I partition
these using an environment variable clusterName=xxx so that the
IgniteCallables are only computed on a particular cluster group. This seemed
like the best way of doing things, as I am using the TcpDiscoveryJdbcIpFinder
and didn't want to set up a different database for each discrete grid.

Several times I have found that the publisher and workers accessing the
IgniteQueue stop processing, and there seem to be two separate problems
occurring.

The first was that I would get an exception such as the following when a
node was stopped:

java.lang.IllegalStateException: Cache has been stopped: datastructures_0
      at
org.apache.ignite.internal.processors.cache.GridCacheGateway.checkStatus(GridCacheGateway.java:85)
      at
org.apache.ignite.internal.processors.cache.GridCacheGateway.enter(GridCacheGateway.java:68)
      at
org.apache.ignite.internal.processors.datastructures.GridCacheQueueProxy.contains(GridCacheQueueProxy.java:160)
      at
my.com.CrossFxCurvePubService.addToQueue(CrossFxCurvePubService.java:267)
      ...

I think I have solved this (i.e. it hasn't happened since) by ensuring that
the CollectionConfiguration was initialised with backups, though if anyone
can confirm that this is the right fix it would be helpful.
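
For reference, the queue is now created roughly like this (the cache mode,
capacity and backup count shown are just to illustrate what I mean by
"initialised with backups", not necessarily the right values):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteQueue;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CollectionConfiguration;

public class QueueSetupSketch {

    public static IgniteQueue<String> createQueue(Ignite ignite) {
        CollectionConfiguration colCfg = new CollectionConfiguration();
        colCfg.setCacheMode(CacheMode.PARTITIONED);
        colCfg.setBackups(1); // keep one backup copy of the queue's cache

        // Capacity 0 means an unbounded queue; the element type here is
        // just a placeholder for my actual payload.
        return ignite.queue("cross-fx-curves", 0, colCfg);
    }
}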

However the second problem (which also causes the queue publisher and
workers to stop processing) is accompanied by repeated blocks of messages
such as the following:

2017-04-28 14:08:05,468 WARN  [grid-nio-worker-2-#11%null%] java.JavaLogger
(JavaLogger.java:278) - Failed to process selector key (will close):
GridSelectorNioSessionImpl [selectorIdx=2, queueSize=0,
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
recovery=GridNioRecoveryDescriptor [acked=9, resendCnt=0, rcvCnt=8,
sentCnt=9, reserved=true, lastAck=8, nodeLeft=false, node=TcpDiscoveryNode
[id=c91ce074-964e-4497-ac77-a3828b301ed3, addrs=[0:0:0:0:0:0:0:1,
10.127.197.150, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:0,
/10.127.197.150:0, /127.0.0.1:0], discPort=0, order=161, intOrder=84,
lastExchangeTime=1493384810687, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe,
isClient=true], connected=true, connectCnt=1, queueLimit=5120,
reserveCnt=1], super=GridNioSessionImpl [locAddr=/10.127.246.164:60985,
rmtAddr=/10.127.197.150:47100, createTime=1493384812272, closeTime=0,
bytesSent=73469, bytesRcvd=1053, sndSchedTime=1493384869270,
lastSndTime=1493384831058, lastRcvTime=1493384869270, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@1b4d47c, directMode=true],
GridConnectionBytesVerifyFilter], accepted=false]]

2017-04-28 14:08:05,470 WARN  [grid-nio-worker-2-#11%null%] java.JavaLogger
(JavaLogger.java:278) - Closing NIO session because of unhandled exception
[cls=class o.a.i.i.util.nio.GridNioException, msg=An existing connection was
forcibly closed by the remote host]

2017-04-28 14:08:14,279 WARN  [disco-event-worker-#20%null%] java.JavaLogger
(JavaLogger.java:278) - Node FAILED: TcpDiscoveryNode
[id=c91ce074-964e-4497-ac77-a3828b301ed3, addrs=[0:0:0:0:0:0:0:1,
10.127.197.150, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:0,
/10.127.197.150:0, /127.0.0.1:0], discPort=0, order=161, intOrder=84,
lastExchangeTime=1493384810687, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe,
isClient=true]
2017-04-28 14:08:14,287 INFO  [disco-event-worker-#20%null%] java.JavaLogger
(JavaLogger.java:273) - Topology snapshot [ver=162, servers=6, clients=0,
CPUs=24, heap=3.5GB]
2017-04-28 14:08:14,295 INFO  [exchange-worker-#24%null%] java.JavaLogger
(JavaLogger.java:273) - Skipping rebalancing (nothing scheduled)
[top=AffinityTopologyVersion [topVer=162, minorTopVer=0], evt=NODE_FAILED,
node=c91ce074-964e-4497-ac77-a3828b301ed3]

2017-04-28 14:08:35,853 WARN  [grid-timeout-worker-#7%null%] java.JavaLogger
(JavaLogger.java:278) - Found long running cache future
[startTime=14:06:52.182, curTime=14:08:35.828,
fut=GridPartitionedSingleGetFuture [topVer=AffinityTopologyVersion
[topVer=161, minorTopVer=0], key=UserKeyCacheObjectImpl
[val=GridCacheQueueItemKey
[queueId=9c0396aab51-f5c26da7-4123-4ba7-aa40-857ccd042342,
queueName=cross-fx-curves, idx=519195], hasValBytes=true], readThrough=true,
forcePrimary=false, futId=c8ca4a5cb51-f5c26da7-4123-4ba7-aa40-857ccd042342,
trackable=true, subjId=efe9e46d-6dbd-4ca1-b7fb-7ace46d37571, taskName=null,
deserializeBinary=true, skipVals=false, expiryPlc=null, canRemap=true,
needVer=false, keepCacheObjects=false, node=TcpDiscoveryNode
[id=1bbf12c4-74b7-490d-b3fc-d8b3ef713ac0, addrs=[0:0:0:0:0:0:0:1,
10.127.246.74, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:47500,
/127.0.0.1:47500, LNAPLN343PRD.ldn.emea.cib/10.127.246.74:47500],
discPort=47500, order=159, intOrder=83, lastExchangeTime=1493213394382,
loc=false, ver=1.8.0#20161205-sha1:9ca40dbe, isClient=false]]]

When this happens the only solution seems to be to restart all nodes on the
grid. I think the key to this is the "long running cache future", as it is
accessing my queue (named cross-fx-curves), but I've no idea what it is
doing or why it should be stuck. These events do seem to coincide with
restarts of nodes on the grid, but I have been unable to reproduce the issue
- I have tried killing each of the nodes individually with no impact on the
rest of the grid.

Can anyone provide any feedback on what could be causing this and how best
to rectify it?

Thanks 
Jon


