Aaron, could you share the code of com.tophold.trade.ignite.service.CommandRemoteService?
Thanks,
Evgenii

2018-01-18 16:43 GMT+03:00 Evgenii Zhuravlev <e.zhuravlev...@gmail.com>:

> Hi Aaron,
>
> I think that the main problem is here:
>
> GridServiceProcessor - Error when executing service: null
>
> diagnostic - Pending transactions:
> [WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%] [ig] diagnostic -
> >>> [txVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], exchWait=true,
> tx=GridDhtTxRemote [nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb,
> rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1,
> nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313, nodeOrder=1],
> storeWriteThrough=false,
> super=GridDistributedTxRemoteAdapter [explicitVers=null, started=true, commitAllowed=0,
> txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry [key=KeyCacheObjectImpl [part=72,
> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
> cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=72,
> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
> cacheId=-2100569601], val=[op=UPDATE, val=CacheObjectImpl [val=GridServiceAssignments
> [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15,
> cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService,
> svcCls=, nodeFilterCls=CommandServiceNodeFilter],
> assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true]],
> prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
> conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=[],
> filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=[], part=72,
> super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=72,
> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
> val=CacheObjectImpl [val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed,
> topVer=13, cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService,
> svcCls=, nodeFilterCls=CommandServiceNodeFilter],
> assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true],
> startVer=1516183996434, ver=GridCacheVersion [topVer=127663998, order=1516184119343, nodeOrder=10],
> hash=-1440463172, extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null,
> rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00,
> ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10], threadId=585, id=82,
> topVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], reentry=null,
> otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, otherVer=null, mappedDhtNodes=null,
> mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=72,
> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
> masks=local=0|owner=0|ready=0|reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0,
> prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false, nodeId=null, locMapped=false,
> expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=null]],
> super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10],
> writeVer=GridCacheVersion [topVer=127664000, order=1516193727421, nodeOrder=10],
> implicit=false, loc=false, threadId=585, startTime=1516186483489,
> nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, startVer=GridCacheVersion [topVer=127664000,
> order=1516193739547, nodeOrder=5], endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC,
> timeout=0, sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE, invalidParts=null,
> state=PREPARED, timedOut=false, topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
> duration=36138ms, onePhaseCommit=false]]]]
>
> You have a pending transaction in the logs related to the service deployment. Most likely your
> service threw an NPE in its init() (or another) method and was never deployed. Could you check
> whether it is possible that your service throws an NPE?
>
> Evgenii
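(For reference, below is a minimal sketch of a defensively written Ignite service. Aaron's actual
CommandRemoteService was not posted, so the field, the cache name and the guard logic are
assumptions; the point is simply that everything init() depends on should be checked explicitly,
because - as Evgenii suggests above - an exception such as an NPE thrown from init() can leave the
service-deployment transaction pending.)

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.resources.IgniteInstanceResource;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;

    /** Hypothetical sketch only; the real CommandRemoteService was not shared. */
    public class CommandRemoteService implements Service {
        /** Injected by Ignite before init() is called. */
        @IgniteInstanceResource
        private transient Ignite ignite;

        /** Hypothetical cache the service works with. */
        private transient IgniteCache<String, Object> commandCache;

        @Override public void init(ServiceContext ctx) throws Exception {
            // Guard everything that could legitimately be absent at deployment time,
            // so init() fails with a clear message instead of an NPE.
            if (ignite == null)
                throw new IllegalStateException("Ignite instance was not injected");

            commandCache = ignite.cache("commandCache"); // returns null if the cache does not exist
            if (commandCache == null)
                throw new IllegalStateException("Cache 'commandCache' is not available on this node");
        }

        @Override public void execute(ServiceContext ctx) throws Exception {
            // Service body: a real implementation would loop until ctx.isCancelled() (omitted here).
        }

        @Override public void cancel(ServiceContext ctx) {
            // Release resources acquired in init(); nothing to do in this sketch.
        }
    }

Whether init() throws an NPE or any other exception, the deployment fails either way; the guards
above only make the failure visible and easier to diagnose in the logs.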
> 2018-01-17 15:40 GMT+03:00 aa...@tophold.com <aa...@tophold.com>:
>
>> Hi Evgenii,
>>
>> What's more interesting: if we reboot them within a short time, like one hour, our monitor log
>> shows the expected NODE_LEFT and NODE_JOIN events and everything moves smoothly.
>>
>> But after several hours, the problem below is sure to happen if we try to reboot any node in the
>> cluster.
>>
>> Regards
>> Aaron
>> ------------------------------
>> Aaron.Kuai
>>
>> *From:* aa...@tophold.com
>> *Date:* 2018-01-17 20:05
>> *To:* user <user@ignite.apache.org>
>> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>>
>> Hi Evgenii,
>>
>> Thanks! We collected some logs: one from the server that was rebooted, two from servers that
>> stayed up, and one from a client-only node. After the reboot:
>>
>> 1. The rebooted node never comes fully up; it waits forever.
>> 2. The other server nodes, soon after being notified that the rebooted node went down, hang as well.
>> 3. The pure client node, which only calls a remote service on the rebooted node, also hangs.
>>
>> At around 2018-01-17 10:54 we rebooted the node. From the log we can find:
>>
>> [WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig] ExchangeDiscoveryEvents - All server nodes for
>> the following caches have left the cluster: 'PortfolioCommandService_SVC_CO_DUM_CACHE',
>> 'PortfolioSnapshotGenericDomainEventEntry', 'PortfolioGenericDomainEventEntry'
>>
>> Soon after, an ERROR log (seemingly the only ERROR-level entry):
>>
>> [ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig] GridServiceProcessor - Error when
>> executing service: null java.lang.IllegalStateException: Getting affinity for topology version
>> earlier than affinity is calculated
>>
>> Then a lot of WARNs of
>>
>> "Failed to wait for partition release future........................."
>>
>> Then it loops there forever. From the diagnostics nothing seems suspicious; all nodes eventually
>> output very similar messages.
>>
>> [WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig] diagnostic - Pending explicit locks:
>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending cache futures:
>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending atomic cache futures:
>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending data streamer futures:
>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending transaction deadlock detection futures:
>>
>> Some details of our environment:
>>
>> 1. We enable the peer class loading flag, but in fact we use a fat jar, so every class is already shared.
>> 2. Some nodes deploy services, which we use as an RPC mechanism.
>> 3. Most caches are in fact LOCAL; we only make them shared when we must.
>> 4. We use JDBC to persist the important caches.
>> 5. TcpDiscoveryJdbcIpFinder is used as the IP finder.
>>
>> All other configuration follows the standard settings.
>>
>> Thanks for your time!
>>
>> Regards
>> Aaron
>> ------------------------------
>> Aaron.Kuai
>>
>> *From:* Evgenii Zhuravlev <e.zhuravlev...@gmail.com>
>> *Date:* 2018-01-16 20:32
>> *To:* user <user@ignite.apache.org>
>> *Subject:* Re: Nodes can not join the cluster after reboot
>>
>> Hi,
>>
>> Most likely one of the nodes has a hung transaction/future/lock, or even a deadlock; that is why
>> new nodes can't join the cluster - they can't complete the exchange while an operation is
>> pending. Please share full logs from all nodes together with thread dumps; it will help to find
>> the root cause.
>>
>> Evgenii
>>
>> 2018-01-16 5:35 GMT+03:00 aa...@tophold.com <aa...@tophold.com>:
>>
>>> Hi All,
>>>
>>> We have an Ignite cluster running about 20+ nodes. To guard against any JVM memory issues we
>>> schedule a reboot of those nodes in the middle of the night.
>>>
>>> In order to keep the service available we reboot them one by one - nodes A, B, C, D with a
>>> 5-minute delay between them - but when we do so, the rebooted nodes can never join the cluster
>>> again.
>>>
>>> Eventually the entire cluster stops working, waiting forever for nodes to join the topology; we
>>> have to kill everything and restart from scratch, which seems incredible.
>>>
>>> I am not sure whether anyone has met this issue before, or what mistake we may have made;
>>> attached is the Ignite log.
>>>
>>> Thanks for your time!
>>>
>>> Regards
>>> Aaron
>>> ------------------------------
>>> Aaron.Kuai
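(Also for reference, a sketch of a node startup matching the environment described above: peer
class loading enabled, TcpDiscoveryJdbcIpFinder for discovery, and a service deployed through a
ServiceConfiguration with a node filter. The instance name and service name are taken from the
logs in this thread; the DataSource, the singleton count and the placeholder node filter are
assumptions.)

    import javax.sql.DataSource;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cluster.ClusterNode;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.lang.IgnitePredicate;
    import org.apache.ignite.services.ServiceConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.jdbc.TcpDiscoveryJdbcIpFinder;

    public class ClusterStartup {
        /** Placeholder for the CommandServiceNodeFilter named in the logs (real implementation not shared). */
        public static class CommandServiceNodeFilter implements IgnitePredicate<ClusterNode> {
            @Override public boolean apply(ClusterNode node) {
                return true; // a real filter would select only the nodes meant to host the service
            }
        }

        public static Ignite start(DataSource ds) {
            // JDBC-backed IP finder, as in the environment description above.
            TcpDiscoveryJdbcIpFinder ipFinder = new TcpDiscoveryJdbcIpFinder();
            ipFinder.setDataSource(ds);

            TcpDiscoverySpi discovery = new TcpDiscoverySpi();
            discovery.setIpFinder(ipFinder);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setIgniteInstanceName("PortfolioEventIgnite"); // instance name seen in the log prefix
            cfg.setPeerClassLoadingEnabled(true);              // peer class loading enabled in this setup
            cfg.setDiscoverySpi(discovery);

            Ignite ignite = Ignition.start(cfg);

            // Deploy the command service used as an RPC endpoint.
            ServiceConfiguration svcCfg = new ServiceConfiguration();
            svcCfg.setName("CRS_com_tophold_trade_product_command"); // service name from the pending-tx dump
            svcCfg.setService(new CommandRemoteService());           // sketch class shown earlier in the thread
            svcCfg.setTotalCount(1);                                 // single instance (assumption)
            svcCfg.setNodeFilter(new CommandServiceNodeFilter());
            ignite.services().deploy(svcCfg);

            return ignite;
        }
    }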