Hi,

1. Yes, adding backups for the underlying datastructures cache (via CollectionConfiguration) should help.

2. It seems one or more of your nodes registers with an IPv6 address. Try using the same JVM option on all nodes: either -Djava.net.preferIPv4Stack=true or -Djava.net.preferIPv6Stack=true.
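For (1), a minimal sketch of what I mean. This is an assumption of how your queue is created, not your actual code: the class name and element type are illustrative, and the queue name is taken from your log.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteQueue;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CollectionConfiguration;

    public class QueueBackupExample {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                // Back the queue's datastructures cache with 1 backup copy,
                // so queue items survive the loss of any single node.
                CollectionConfiguration colCfg = new CollectionConfiguration();
                colCfg.setCacheMode(CacheMode.PARTITIONED);
                colCfg.setBackups(1);

                // cap = 0 means an unbounded queue. Note the configuration is
                // only applied when the queue is first created.
                IgniteQueue<String> queue = ignite.queue("cross-fx-curves", 0, colCfg);

                queue.put("example-item");
            }
        }
    }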
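For (2), the property has to be passed on the JVM command line of every node, clients and servers alike. For a standalone node started with plain java it would look something like this (classpath and main class are illustrative):

    java -Djava.net.preferIPv4Stack=true -cp ignite-core.jar:app.jar my.com.CrossFxCurvePubService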
On Tue, May 2, 2017 at 5:17 PM, jpmoore40 <jonathan.mo...@ca-cib.com> wrote:
> Hi,
>
> I've been having a couple of issues with a shared IgniteQueue which seem
> to occur sporadically. I have a single publisher process which pushes
> data onto a queue, and several worker processes that pull the data from
> the queue and perform some processing. However these nodes are part of a
> larger grid containing other nodes that submit and process
> IgniteCallables. I partition these using an environment variable
> clusterName=xxx so the IgniteCallables are only computed on a particular
> cluster group. This seemed like the best way of doing things as I am
> using the TcpDiscoveryJdbcIpFinder and didn't want to set up a different
> database for each discrete grid.
>
> Several times I have found that the publishers and workers accessing the
> IgniteQueue stop processing, and there seem to be two separate problems
> occurring.
>
> The first was that I would get an exception such as the following when a
> node was stopped:
>
> java.lang.IllegalStateException: Cache has been stopped: datastructures_0
>     at org.apache.ignite.internal.processors.cache.GridCacheGateway.checkStatus(GridCacheGateway.java:85)
>     at org.apache.ignite.internal.processors.cache.GridCacheGateway.enter(GridCacheGateway.java:68)
>     at org.apache.ignite.internal.processors.datastructures.GridCacheQueueProxy.contains(GridCacheQueueProxy.java:160)
>     at my.com.CrossFxCurvePubService.addToQueue(CrossFxCurvePubService.java:267)
>     ...
>
> This I think I solved (i.e. it hasn't happened since) by ensuring that
> the CollectionConfiguration was initialised with backups, though if
> anyone can confirm that would be helpful.
>
> However the second problem (which also causes the queue publisher and
> workers to stop processing) is accompanied by repeated blocks of
> messages such as the following:
>
> 2017-04-28 14:08:05,468 WARN [grid-nio-worker-2-#11%null%] java.JavaLogger (JavaLogger.java:278) -
> Failed to process selector key (will close): GridSelectorNioSessionImpl [selectorIdx=2, queueSize=0,
> writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
> readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
> recovery=GridNioRecoveryDescriptor [acked=9, resendCnt=0, rcvCnt=8, sentCnt=9, reserved=true,
> lastAck=8, nodeLeft=false, node=TcpDiscoveryNode [id=c91ce074-964e-4497-ac77-a3828b301ed3,
> addrs=[0:0:0:0:0:0:0:1, 10.127.197.150, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:0,
> /10.127.197.150:0, /127.0.0.1:0], discPort=0, order=161, intOrder=84,
> lastExchangeTime=1493384810687, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe, isClient=true],
> connected=true, connectCnt=1, queueLimit=5120, reserveCnt=1],
> super=GridNioSessionImpl [locAddr=/10.127.246.164:60985, rmtAddr=/10.127.197.150:47100,
> createTime=1493384812272, closeTime=0, bytesSent=73469, bytesRcvd=1053,
> sndSchedTime=1493384869270, lastSndTime=1493384831058, lastRcvTime=1493384869270,
> readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter
> [parser=o.a.i.i.util.nio.GridDirectParser@1b4d47c, directMode=true],
> GridConnectionBytesVerifyFilter], accepted=false]]
>
> 2017-04-28 14:08:05,470 WARN [grid-nio-worker-2-#11%null%] java.JavaLogger (JavaLogger.java:278) -
> Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException,
> msg=An existing connection was forcibly closed by the remote host]
>
> 2017-04-28 14:08:14,279 WARN [disco-event-worker-#20%null%] java.JavaLogger (JavaLogger.java:278) -
> Node FAILED: TcpDiscoveryNode [id=c91ce074-964e-4497-ac77-a3828b301ed3,
> addrs=[0:0:0:0:0:0:0:1, 10.127.197.150, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:0,
> /10.127.197.150:0, /127.0.0.1:0], discPort=0, order=161, intOrder=84,
> lastExchangeTime=1493384810687, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe, isClient=true]
>
> 2017-04-28 14:08:14,287 INFO [disco-event-worker-#20%null%] java.JavaLogger (JavaLogger.java:273) -
> Topology snapshot [ver=162, servers=6, clients=0, CPUs=24, heap=3.5GB]
>
> 2017-04-28 14:08:14,295 INFO [exchange-worker-#24%null%] java.JavaLogger (JavaLogger.java:273) -
> Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=162, minorTopVer=0],
> evt=NODE_FAILED, node=c91ce074-964e-4497-ac77-a3828b301ed3]
>
> 2017-04-28 14:08:35,853 WARN [grid-timeout-worker-#7%null%] java.JavaLogger (JavaLogger.java:278) -
> Found long running cache future [startTime=14:06:52.182, curTime=14:08:35.828,
> fut=GridPartitionedSingleGetFuture [topVer=AffinityTopologyVersion [topVer=161, minorTopVer=0],
> key=UserKeyCacheObjectImpl [val=GridCacheQueueItemKey
> [queueId=9c0396aab51-f5c26da7-4123-4ba7-aa40-857ccd042342, queueName=cross-fx-curves, idx=519195],
> hasValBytes=true], readThrough=true, forcePrimary=false,
> futId=c8ca4a5cb51-f5c26da7-4123-4ba7-aa40-857ccd042342, trackable=true,
> subjId=efe9e46d-6dbd-4ca1-b7fb-7ace46d37571, taskName=null, deserializeBinary=true,
> skipVals=false, expiryPlc=null, canRemap=true, needVer=false, keepCacheObjects=false,
> node=TcpDiscoveryNode [id=1bbf12c4-74b7-490d-b3fc-d8b3ef713ac0, addrs=[0:0:0:0:0:0:0:1,
> 10.127.246.74, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:47500,
> /127.0.0.1:47500, LNAPLN343PRD.ldn.emea.cib/10.127.246.74:47500], discPort=47500, order=159,
> intOrder=83, lastExchangeTime=1493213394382, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe,
> isClient=false]]]
>
> When this happens the only solution seems to be to restart all nodes on
> the grid. The key to this I think is the "long running cache future", as
> this is accessing my queue (named cross-fx-curves), but I've no idea what
> this is doing or why it should be stuck. These events do seem to coincide
> with restarts of nodes on the grid, but I have been unable to reproduce
> the issue - I have tried killing each of the nodes individually with no
> impact on the rest of the grid.
>
> Can anyone provide any feedback on what could be causing this and how
> best to rectify it?
>
> Thanks
> Jon
>
> --
> View this message in context:
> http://apache-ignite-users.70518.x6.nabble.com/Nodes-hanging-when-accessing-queue-tp12343.html
> Sent from the Apache Ignite Users mailing list archive at Nabble.com.

--
Best regards,
Andrey V. Mashenkov