I have a 3-node Ignite cluster with 20+ clients, running in a Spark context. Initially everything works fine, but the cluster randomly becomes inoperative whenever a new node (i.e. a client) tries to connect. I captured the following logs while it was stuck. If I explicitly restart any Ignite server, the cluster is released and works fine again. I am using Ignite 2.4.0; the same issue is reproduced on Ignite 2.5.0 as well.
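For context, the clients join the topology with a standard client-mode configuration roughly along these lines (a minimal sketch assuming the default TcpDiscoverySpi with a static IP finder; the addresses below are placeholders, not my exact production config):

```xml
<!-- Hypothetical Spring XML client configuration; server addresses are placeholders. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Join the topology as a client node rather than a server. -->
    <property name="clientMode" value="true"/>
    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
            <property name="ipFinder">
                <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                    <property name="addresses">
                        <list>
                            <!-- Placeholder server host; real cluster hosts differ. -->
                            <value>10.13.10.179:47500..47509</value>
                        </list>
                    </property>
                </bean>
            </property>
        </bean>
    </property>
</bean>
```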
Client-side logs:
Failed to wait for partition map exchange [topVer=AffinityTopologyVersion
[topVer=44, minorTopVer=0], node=4d885cfd-45ed-43a2-8088-f35c9469797f].
Dumping pending objects that might be the cause:
GridDhtPartitionsExchangeFuture
[topVer=AffinityTopologyVersion [topVer=44, minorTopVer=0],
evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=4d885cfd-45ed-43a2-8088-f35c9469797f, addrs=[0:0:0:0:0:0:0:1%lo,
10.13.10.179, 127.0.0.1], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0,
/127.0.0.1:0, hdn6.mstorm.com/10.13.10.179:0], discPort=0, order=44,
intOrder=0, lastExchangeTime=1527651620413, loc=true,
ver=2.4.0#20180305-sha1:aa342270, isClient=true], done=false]
Failed to wait for initial partition map exchange. Possible reasons are:
  ^-- Transactions in deadlock.
  ^-- Long running transactions (ignore if this is the case).
  ^-- Unreleased explicit locks.
Still waiting for initial partition map exchange
[fut=GridDhtPartitionsExchangeFuture [firstDiscoEvt=DiscoveryEvent
[evtNode=TcpDiscoveryNode [id=4d885cfd-45ed-43a2-8088-f35c9469797f, addrs=
Server-side logs:
Possible starvation in striped pool. Thread name: sys-stripe-0-#1 Queue:
[Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8,
ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtTxPrepareResponse
[nearEvicted=null, futId=869dd4ca361-fe7e167d-4d80-4f57-b004-13359a9f2c11,
miniId=1, super=GridDistributedTxPrepareResponse [txState=null, part=-1,
err=null, super=GridDistributedBaseMessage [ver=GridCacheVersion
[topVer=139084030, order=1527604094903, nodeOrder=1], committedVers=null,
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=0]],
Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8,
ordered=false, timeout=0, skipOnTimeout=false,
msg=GridDhtAtomicSingleUpdateRequest [key=KeyCacheObjectImpl [part=984,
val=null, hasValBytes=true], val=BinaryObjectImpl [arr= true, ctx=false,
start=0], prevVal=null, super=GridDhtAtomicAbstractUpdateRequest
[onRes=false, nearNodeId=null, nearFutId=0, flags=,
o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$DeferredUpdateTimeout@2735c674,
Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8,
ordered=false, timeout=0, skipOnTimeout=false, msg=GridDhtTxPrepareRequest
[nearNodeId=628e3078-17fd-4e49-b9ae-ad94ad97a2f1,
futId=6576e4ca361-6e7cdac2-d5a3-4624-9ad3-b93f25546cc3, miniId=1,
topVer=AffinityTopologyVersion [topVer=20, minorTopVer=0],
invalidateNearEntries={}, nearWrites=null, owned=null,
nearXidVer=GridCacheVersion [topVer=139084030, order=1527604094933,
nodeOrder=2], subjId=628e3078-17fd-4e49-b9ae-ad94ad97a2f1, taskNameHash=0,
preloadKeys=null, super=GridDistributedTxPrepareRequest [threadId=86,
concurrency=OPTIMISTIC, isolation=READ_COMMITTED, writeVer=GridCacheVersion
[topVer=139084030, order=1527604094935, nodeOrder=2], timeout=0,
reads=null, writes=[IgniteTxEntry [key=BinaryObjectImpl [arr= true,
ctx=false, start=0], cacheId=-1755241537, txKey=null, val=[op=UPDATE,
val=BinaryObjectImpl [arr= true, ctx=false, start=0]], prevVal=[op=NOOP,
val=null], oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
filters=null, filtersPassed=false, filtersSet=false, entry=null,
prepared=0, locked=false, nodeId=null, locMapped=false, expiryPlc=null,
transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null,
xidVer=null]], dhtVers=null, txSize=0, plc=2, txState=null,
flags=onePhase|last, super=GridDistributedBaseMessage [ver=GridCacheVersion
[topVer=139084030, order=1527604094933, nodeOrder=2], committedVers=null,
rolledbackVers=null, cnt=0, super=GridCacheIdMessage [cacheId=0]],
Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE, topicOrd=8,
ordered=false, timeout=0, skipOnTimeout=false,
msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=2,
arr=[65774,65775], Messag