Hi Aaron,

I think that the main problem is here:

GridServiceProcessor - Error when executing service: null

diagnostic - Pending transactions:
[WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%]
[ig] diagnostic - >>> [txVer=AffinityTopologyVersion [topVer=15,
minorTopVer=0], exchWait=true, tx=GridDhtTxRemote
nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313,
nodeOrder=1], storeWriteThrough=false, super=GridDistributedTxRemoteAdapter
[explicitVers=null, started=true, commitAllowed=0,
txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry
[key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
[name=CRS_com_tophold_trade_product_command], hasValBytes=true],
cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=72,
val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command],
hasValBytes=true], cacheId=-2100569601], val=[op=UPDATE,
val=CacheObjectImpl [val=GridServiceAssignments
[nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15,
svcCls=, nodeFilterCls=CommandServiceNodeFilter],
assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true]],
prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null],
entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null,
explicitVer=null, dhtVer=null, filters=[], filtersPassed=false,
filtersSet=false, entry=GridDhtCacheEntry [rdrs=[], part=72,
super=GridDistributedCacheEntry [super=GridCacheMapEntry
[key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
[name=CRS_com_tophold_trade_product_command], hasValBytes=true],
val=CacheObjectImpl [val=GridServiceAssignments
[nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=13,
svcCls=, nodeFilterCls=CommandServiceNodeFilter],
assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true],
startVer=1516183996434, ver=GridCacheVersion [topVer=127663998,
order=1516184119343, nodeOrder=10], hash=-1440463172,
extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null,
rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00,
ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10],
threadId=585, id=82, topVer=AffinityTopologyVersion [topVer=-1,
minorTopVer=0], reentry=null,
otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, otherVer=null,
mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null,
key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
[name=CRS_com_tophold_trade_product_command], hasValBytes=true],
prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false,
nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false,
flags=0, partUpdateCntr=0, serReadVer=null, xidVer=null]],
super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=127664000,
order=1516193727420, nodeOrder=10], writeVer=GridCacheVersion
[topVer=127664000, order=1516193727421, nodeOrder=10], implicit=false,
loc=false, threadId=585, startTime=1516186483489,
nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, startVer=GridCacheVersion
[topVer=127664000, order=1516193739547, nodeOrder=5], endVer=null,
isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0,
sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE,
invalidParts=null, state=PREPARED, timedOut=false,
topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
duration=36138ms, onePhaseCommit=false]]]]

The logs show a pending transaction related to service deployment.
Most likely your service threw an NPE in its init (or another) method and
wasn't deployed. Could you check whether your service can throw
an NPE?
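
One way to check this outside the cluster is to run the body of the service's init logic through a small harness that captures the full stack trace instead of only the exception message. The following is a minimal sketch in plain Java, not Ignite API: `ServiceInitCheck`, `InitStep` and `runAndCapture` are hypothetical names introduced for illustration, and the `dependency` field only simulates a resource that was never set.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Objects;

public class ServiceInitCheck {
    /** Hypothetical stand-in for the body of a service's init (or other) method. */
    interface InitStep {
        void run() throws Exception;
    }

    /**
     * Runs the step and returns null on success, or the full stack trace
     * as text if the step threw an exception.
     */
    static String runAndCapture(InitStep step) {
        try {
            step.run();
            return null;
        } catch (Exception e) {
            StringWriter sw = new StringWriter();
            e.printStackTrace(new PrintWriter(sw, true));
            return sw.toString();
        }
    }

    public static void main(String[] args) {
        // Assumption: simulates a dependency that was never injected.
        String dependency = null;

        String trace = runAndCapture(
            () -> Objects.requireNonNull(dependency, "dependency not set").length());

        System.out.println(trace == null ? "init ok" : "init failed:\n" + trace);
    }
}
```

If the real init logic fails in such a harness, the captured trace should point at the exact line, which is usually easier to read than the shortened message in the node log.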


2018-01-17 15:40 GMT+03:00

> Hi Evgenii,
> What's more interesting: if we reboot them within a very short time, like one hour,
> our monitor log shows NODE_LEFT and NODE_JOIN events and
> everything moves smoothly.
> But after several hours, the problem below is sure to happen if you try to
> reboot any node in the cluster.
> Regards
> Aaron
Aaron.Kuai
*From:* aa...@tophold.com
*Date:* 2018-01-17 20:05
*To:* user <user@ignite.apache.org>
*Subject:* Re: Re: Nodes can not join the cluster after reboot
> Hi Evgenii,
> Thanks! We collected some logs: one from the server that was rebooted, two
> from servers that stayed up, and one from a client-only node. After the reboot:
> 1. The rebooted node is never fully brought up; it waits forever.
> 2. The other server nodes, after being notified that the rebooted node went down, soon
> hang as well.
> 3. The pure client node, which only calls a remote service on the rebooted node,
> also hangs.
> We rebooted the node at around 2018-01-17 10:54. From the log we can find:
> [WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig]
> ExchangeDiscoveryEvents - All server nodes for the
> following caches have left the cluster: 'PortfolioCommandService_SVC_
> CO_DUM_CACHE', 'PortfolioSnapshotGenericDomainEventEntry', '
> PortfolioGenericDomainEventEntry'
> Soon after, an ERROR log (it seems to be the only ERROR-level log):
> [ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig]
> GridServiceProcessor - Error when executing service: null java.lang.
> IllegalStateException: Getting affinity for topology
> version earlier than affinity is calculated
> Then a lot of WARN messages like
> "Failed to wait for partition release future........................."
> Then this loops forever. Nothing in the diagnostics seems suspicious;
> all nodes eventually output something very similar.
> [WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig]
>  diagnostic - Pending explicit locks:
> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig]
>  diagnostic - Pending cache futures:
> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig]
>  diagnostic - Pending atomic cache futures:
> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig]
>  diagnostic - Pending data streamer futures:
> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig]
>  diagnostic - Pending transaction deadlock detection futures:
> Some details of our environment:
> 1. Peer class loading is enabled, but in fact we use a fat jar, so every
> class is shared.
> 2. Some nodes deploy services; we use them for RPC.
> 3. Most caches are in fact LOCAL; we make them shared only when we must.
> 4. We use JDBC to persist important caches.
> 5. TcpDiscoveryJdbcIpFinder is the IP finder.
> All other configuration follows the standard setup.
> Thanks for your time!
> Regards
> Aaron
Aaron.Kuai
*From:* Evgenii Zhuravlev <e.zhuravlev...@gmail.com>
*Date:* 2018-01-16 20:32
*To:* user <user@ignite.apache.org>
*Subject:* Re: Nodes can not join the cluster after reboot
> Hi,
> Most likely one of the nodes has a hung
> transaction/future/lock or even a deadlock; that's why new nodes can't join the
> cluster: they can't perform the exchange due to the pending operation. Please
> share full logs from all nodes with thread dumps; it will help to find the
> root cause.
> Evgenii
2018-01-16 5:35 GMT+03:00
>> Hi All,
>> We have an Ignite cluster running 20+ nodes. To guard against JVM
>> memory issues, we schedule a reboot of those nodes in the middle of the night.
>> To keep the service available, we reboot them one by one: for nodes
>> A, B, C, D we reboot with a 5-minute delay between each. But when we do so, the
>> rebooted nodes can never join the cluster again.
>> Eventually the entire cluster stops working, waiting forever to
>> join the topology; we need to kill all nodes and restart from scratch, which
>> sounds incredible.
>> I am not sure whether anyone has met this issue before, or whether we made some
>> mistake; the Ignite log is attached.
>> Thanks for your time!
>> Regards
>> Aaron
Aaron.Kuai

