Hi,

1. Yes, configuring backups for the data structures cache should help. For an
IgniteQueue the backups are set on the CollectionConfiguration that you pass
when creating the queue.
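For example, a minimal sketch (the queue name is taken from your log; the
backup count and the MyItem element type are placeholders):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteQueue;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CollectionConfiguration;

    // "ignite" is an already-started Ignite instance.
    CollectionConfiguration colCfg = new CollectionConfiguration();
    colCfg.setCacheMode(CacheMode.PARTITIONED); // partitioned cache, so backups apply
    colCfg.setBackups(1);                       // keep one backup copy of each entry

    // 0 = unbounded capacity; creates the queue if it does not already exist.
    IgniteQueue<MyItem> queue = ignite.queue("cross-fx-curves", 0, colCfg);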

2. It looks like one or more of your nodes registers with an IPv6 address.
Try setting the same JVM network option on all nodes: either
-Djava.net.preferIPv4Stack=true (to force IPv4) or
-Djava.net.preferIPv6Addresses=true (if you want IPv6 everywhere).
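If your standalone nodes are started with ignite.sh, the flag can go into
JVM_OPTS (assuming the default startup scripts, which pass JVM_OPTS through
to the JVM):

    export JVM_OPTS="$JVM_OPTS -Djava.net.preferIPv4Stack=true"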

On Tue, May 2, 2017 at 5:17 PM, jpmoore40 <jonathan.mo...@ca-cib.com> wrote:

> Hi,
>
> I've been having a couple of issues with a shared IgniteQueue which seem to
> occur sporadically. I have a single publisher process which pushes data onto
> a queue, and several worker processes that pull the data from the queue and
> perform some processing. However these nodes are part of a larger grid
> containing other nodes that submit and process IgniteCallables. I partition
> these using an environment variable clusterName=xxx so the IgniteCallables
> are only computed on a particular cluster group. This seemed like the best
> way of doing things as I am using the TcpDiscoveryJdbcIpFinder and didn't
> want to set up a different database for each discrete grid.
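> To illustrate the routing (a sketch only: "xxx" and myCallable are
> placeholders, and clusterName is the environment variable mentioned above):
>
>     import org.apache.ignite.cluster.ClusterGroup;
>
>     // Environment variables are exposed as node attributes by default,
>     // so work can be restricted to the nodes whose clusterName matches.
>     ClusterGroup workers = ignite.cluster().forAttribute("clusterName", "xxx");
>     ignite.compute(workers).call(myCallable);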
>
> Several times I have found that the publishers and workers accessing the
> IgniteQueue stop processing and there seem to be two separate problems
> occurring.
>
> The first was that I would get an exception such as the following when a
> node was stopped:
>
> java.lang.IllegalStateException: Cache has been stopped: datastructures_0
>       at org.apache.ignite.internal.processors.cache.GridCacheGateway.checkStatus(GridCacheGateway.java:85)
>       at org.apache.ignite.internal.processors.cache.GridCacheGateway.enter(GridCacheGateway.java:68)
>       at org.apache.ignite.internal.processors.datastructures.GridCacheQueueProxy.contains(GridCacheQueueProxy.java:160)
>       at my.com.CrossFxCurvePubService.addToQueue(CrossFxCurvePubService.java:267)
>       ...
>
> I think I solved this (it hasn't happened since) by ensuring that the
> CollectionConfiguration was initialised with backups, though it would be
> helpful if anyone could confirm that.
>
> However the second problem (which also causes the queue publisher and
> workers to stop processing) is accompanied by repeated blocks of messages
> such as the following:
>
> 2017-04-28 14:08:05,468 WARN  [grid-nio-worker-2-#11%null%] java.JavaLogger (JavaLogger.java:278) - Failed to process selector key (will close): GridSelectorNioSessionImpl [selectorIdx=2, queueSize=0, writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], recovery=GridNioRecoveryDescriptor [acked=9, resendCnt=0, rcvCnt=8, sentCnt=9, reserved=true, lastAck=8, nodeLeft=false, node=TcpDiscoveryNode [id=c91ce074-964e-4497-ac77-a3828b301ed3, addrs=[0:0:0:0:0:0:0:1, 10.127.197.150, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:0, /10.127.197.150:0, /127.0.0.1:0], discPort=0, order=161, intOrder=84, lastExchangeTime=1493384810687, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe, isClient=true], connected=true, connectCnt=1, queueLimit=5120, reserveCnt=1], super=GridNioSessionImpl [locAddr=/10.127.246.164:60985, rmtAddr=/10.127.197.150:47100, createTime=1493384812272, closeTime=0, bytesSent=73469, bytesRcvd=1053, sndSchedTime=1493384869270, lastSndTime=1493384831058, lastRcvTime=1493384869270, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=o.a.i.i.util.nio.GridDirectParser@1b4d47c, directMode=true], GridConnectionBytesVerifyFilter], accepted=false]]
>
> 2017-04-28 14:08:05,470 WARN  [grid-nio-worker-2-#11%null%] java.JavaLogger (JavaLogger.java:278) - Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=An existing connection was forcibly closed by the remote host]
>
> 2017-04-28 14:08:14,279 WARN  [disco-event-worker-#20%null%] java.JavaLogger (JavaLogger.java:278) - Node FAILED: TcpDiscoveryNode [id=c91ce074-964e-4497-ac77-a3828b301ed3, addrs=[0:0:0:0:0:0:0:1, 10.127.197.150, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:0, /10.127.197.150:0, /127.0.0.1:0], discPort=0, order=161, intOrder=84, lastExchangeTime=1493384810687, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe, isClient=true]
> 2017-04-28 14:08:14,287 INFO  [disco-event-worker-#20%null%] java.JavaLogger (JavaLogger.java:273) - Topology snapshot [ver=162, servers=6, clients=0, CPUs=24, heap=3.5GB]
> 2017-04-28 14:08:14,295 INFO  [exchange-worker-#24%null%] java.JavaLogger (JavaLogger.java:273) - Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=162, minorTopVer=0], evt=NODE_FAILED, node=c91ce074-964e-4497-ac77-a3828b301ed3]
>
> 2017-04-28 14:08:35,853 WARN  [grid-timeout-worker-#7%null%] java.JavaLogger (JavaLogger.java:278) - Found long running cache future [startTime=14:06:52.182, curTime=14:08:35.828, fut=GridPartitionedSingleGetFuture [topVer=AffinityTopologyVersion [topVer=161, minorTopVer=0], key=UserKeyCacheObjectImpl [val=GridCacheQueueItemKey [queueId=9c0396aab51-f5c26da7-4123-4ba7-aa40-857ccd042342, queueName=cross-fx-curves, idx=519195], hasValBytes=true], readThrough=true, forcePrimary=false, futId=c8ca4a5cb51-f5c26da7-4123-4ba7-aa40-857ccd042342, trackable=true, subjId=efe9e46d-6dbd-4ca1-b7fb-7ace46d37571, taskName=null, deserializeBinary=true, skipVals=false, expiryPlc=null, canRemap=true, needVer=false, keepCacheObjects=false, node=TcpDiscoveryNode [id=1bbf12c4-74b7-490d-b3fc-d8b3ef713ac0, addrs=[0:0:0:0:0:0:0:1, 10.127.246.74, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1:47500, /127.0.0.1:47500, LNAPLN343PRD.ldn.emea.cib/10.127.246.74:47500], discPort=47500, order=159, intOrder=83, lastExchangeTime=1493213394382, loc=false, ver=1.8.0#20161205-sha1:9ca40dbe, isClient=false]]]
>
> When this happens the only solution seems to be to restart all nodes on the
> grid. The key to this, I think, is the "long running cache future", as it is
> accessing my queue (named cross-fx-curves), but I've no idea what it is
> doing or why it should be stuck. These events do seem to coincide with
> restarts of nodes on the grid, but I have been unable to reproduce the issue
> - I have tried killing each of the nodes individually with no impact on the
> rest of the grid.
>
> Can anyone provide any feedback on what could be causing this and how best
> to rectify it?
>
> Thanks
> Jon
>
>
>
>



-- 
Best regards,
Andrey V. Mashenkov
