Aaron, could you share the code of com.tophold.trade.ignite.service.CommandRemoteService?
Thanks,
Evgenii

2018-01-18 16:43 GMT+03:00 Evgenii Zhuravlev <e.zhuravlev...@gmail.com>:

> Hi Aaron,
>
> I think that the main problem is here:
>
> GridServiceProcessor - Error when executing service: null
>
> diagnostic - Pending transactions:
> [WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%] [ig] diagnostic -
> >>> [txVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], exchWait=true,
> tx=GridDhtTxRemote [nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb,
> rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1,
> nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313, nodeOrder=1],
> storeWriteThrough=false,
> super=GridDistributedTxRemoteAdapter [explicitVers=null, started=true, commitAllowed=0,
> txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry [key=KeyCacheObjectImpl [part=72,
> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
> cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=72,
> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
> cacheId=-2100569601], val=[op=UPDATE, val=CacheObjectImpl [val=GridServiceAssignments
> [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15,
> cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService,
> svcCls=, nodeFilterCls=CommandServiceNodeFilter],
> assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true]],
> prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null], entryProcessorsCol=null, ttl=-1,
> conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=[],
> filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=[], part=72,
> super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=72,
> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
> val=CacheObjectImpl [val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed,
> topVer=13, cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService,
> svcCls=, nodeFilterCls=CommandServiceNodeFilter],
> assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true],
> startVer=1516183996434, ver=GridCacheVersion [topVer=127663998, order=1516184119343, nodeOrder=10],
> hash=-1440463172, extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null,
> rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00,
> ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10], threadId=585, id=82,
> topVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], reentry=null,
> otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, otherVer=null, mappedDhtNodes=null,
> mappedNearNodes=null, ownerVer=null, serOrder=null, key=KeyCacheObjectImpl [part=72,
> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
> masks=local=0|owner=0|ready=0|reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0,
> prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false, nodeId=null, locMapped=false,
> expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=null]],
> super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10],
> writeVer=GridCacheVersion [topVer=127664000, order=1516193727421, nodeOrder=10],
> implicit=false, loc=false, threadId=585, startTime=1516186483489,
> nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, startVer=GridCacheVersion [topVer=127664000,
> order=1516193739547, nodeOrder=5], endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC,
> timeout=0, sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE, invalidParts=null,
> state=PREPARED, timedOut=false, topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0],
> duration=36138ms, onePhaseCommit=false]]]]
>
> You have a pending transaction in the logs related to the service deployment. Most likely your
> service threw an NPE in its init() (or another) method and was never deployed. Could you check
> whether it is possible that your service throws an NPE?
>
> Evgenii
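(For reference, below is a minimal sketch of a defensively written Ignite service. Aaron's actual
CommandRemoteService was not posted, so the field, the cache name and the guard logic are
assumptions; the point is simply that everything init() depends on should be checked explicitly,
because - as Evgenii suggests above - an exception such as an NPE thrown from init() can leave the
service-deployment transaction pending.)

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.resources.IgniteInstanceResource;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;

    /** Hypothetical sketch only; the real CommandRemoteService was not shared. */
    public class CommandRemoteService implements Service {
        /** Injected by Ignite before init() is called. */
        @IgniteInstanceResource
        private transient Ignite ignite;

        /** Hypothetical cache the service works with. */
        private transient IgniteCache<String, Object> commandCache;

        @Override public void init(ServiceContext ctx) throws Exception {
            // Guard everything that could legitimately be absent at deployment time,
            // so init() fails with a clear message instead of an NPE.
            if (ignite == null)
                throw new IllegalStateException("Ignite instance was not injected");

            commandCache = ignite.cache("commandCache"); // returns null if the cache does not exist
            if (commandCache == null)
                throw new IllegalStateException("Cache 'commandCache' is not available on this node");
        }

        @Override public void execute(ServiceContext ctx) throws Exception {
            // Service body: a real implementation would loop until ctx.isCancelled() (omitted here).
        }

        @Override public void cancel(ServiceContext ctx) {
            // Release resources acquired in init(); nothing to do in this sketch.
        }
    }

Whether init() throws an NPE or any other exception, the deployment fails either way; the guards
above only make the failure visible and easier to diagnose in the logs.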
> 2018-01-17 15:40 GMT+03:00 aa...@tophold.com <aa...@tophold.com>:
>
>> Hi Evgenii,
>>
>> What's more interesting: if we reboot them within a short time, like one hour, our monitor log
>> shows the expected NODE_LEFT and NODE_JOIN events and everything moves smoothly.
>>
>> But after several hours, the problem below is sure to happen if we try to reboot any node in the
>> cluster.
>>
>> Regards
>> Aaron
>> ------------------------------
>> Aaron.Kuai
>>
>> *From:* aa...@tophold.com
>> *Date:* 2018-01-17 20:05
>> *To:* user <user@ignite.apache.org>
>> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>>
>> Hi Evgenii,
>>
>> Thanks! We collected some logs: one from the server that was rebooted, two from servers that
>> stayed up, and one from a client-only node. After the reboot:
>>
>> 1. The rebooted node never comes fully up; it waits forever.
>> 2. The other server nodes, soon after being notified that the rebooted node went down, hang as well.
>> 3. The pure client node, which only calls a remote service on the rebooted node, also hangs.
>>
>> At around 2018-01-17 10:54 we rebooted the node. From the log we can find:
>>
>> [WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig] ExchangeDiscoveryEvents - All server nodes for
>> the following caches have left the cluster: 'PortfolioCommandService_SVC_CO_DUM_CACHE',
>> 'PortfolioSnapshotGenericDomainEventEntry', 'PortfolioGenericDomainEventEntry'
>>
>> Soon after, an ERROR log (seemingly the only ERROR-level entry):
>>
>> [ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig] GridServiceProcessor - Error when
>> executing service: null java.lang.IllegalStateException: Getting affinity for topology version
>> earlier than affinity is calculated
>>
>> Then a lot of WARNs of
>>
>> "Failed to wait for partition release future........................."
>>
>> Then it loops there forever. From the diagnostics nothing seems suspicious; all nodes eventually
>> output very similar messages.
>>
>> [WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig] diagnostic - Pending explicit locks:
>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending cache futures:
>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending atomic cache futures:
>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending data streamer futures:
>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending transaction deadlock detection futures:
>>
>> Some details of our environment:
>>
>> 1. We enable the peer class loading flag, but in fact we use a fat jar, so every class is already shared.
>> 2. Some nodes deploy services, which we use as an RPC mechanism.
>> 3. Most caches are in fact LOCAL; we only make them shared when we must.
>> 4. We use JDBC to persist the important caches.
>> 5. TcpDiscoveryJdbcIpFinder is used as the IP finder.
>>
>> All other configuration follows the standard settings.
>>
>> Thanks for your time!
>>
>> Regards
>> Aaron
>> ------------------------------
>> Aaron.Kuai
>>
>> *From:* Evgenii Zhuravlev <e.zhuravlev...@gmail.com>
>> *Date:* 2018-01-16 20:32
>> *To:* user <user@ignite.apache.org>
>> *Subject:* Re: Nodes can not join the cluster after reboot
>>
>> Hi,
>>
>> Most likely one of the nodes has a hung transaction/future/lock, or even a deadlock; that is why
>> new nodes can't join the cluster - they can't complete the exchange while an operation is
>> pending. Please share full logs from all nodes together with thread dumps; it will help to find
>> the root cause.
>>
>> Evgenii
>>
>> 2018-01-16 5:35 GMT+03:00 aa...@tophold.com <aa...@tophold.com>:
>>
>>> Hi All,
>>>
>>> We have an Ignite cluster running about 20+ nodes. To guard against any JVM memory issues we
>>> schedule a reboot of those nodes in the middle of the night.
>>>
>>> In order to keep the service available we reboot them one by one - nodes A, B, C, D with a
>>> 5-minute delay between them - but when we do so, the rebooted nodes can never join the cluster
>>> again.
>>>
>>> Eventually the entire cluster stops working, waiting forever for nodes to join the topology; we
>>> have to kill everything and restart from scratch, which seems incredible.
>>>
>>> I am not sure whether anyone has met this issue before, or what mistake we may have made;
>>> attached is the Ignite log.
>>>
>>> Thanks for your time!
>>>
>>> Regards
>>> Aaron
>>> ------------------------------
>>> Aaron.Kuai
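(Also for reference, a sketch of a node startup matching the environment described above: peer
class loading enabled, TcpDiscoveryJdbcIpFinder for discovery, and a service deployed through a
ServiceConfiguration with a node filter. The instance name and service name are taken from the
logs in this thread; the DataSource, the singleton count and the placeholder node filter are
assumptions.)

    import javax.sql.DataSource;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cluster.ClusterNode;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.lang.IgnitePredicate;
    import org.apache.ignite.services.ServiceConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.jdbc.TcpDiscoveryJdbcIpFinder;

    public class ClusterStartup {
        /** Placeholder for the CommandServiceNodeFilter named in the logs (real implementation not shared). */
        public static class CommandServiceNodeFilter implements IgnitePredicate<ClusterNode> {
            @Override public boolean apply(ClusterNode node) {
                return true; // a real filter would select only the nodes meant to host the service
            }
        }

        public static Ignite start(DataSource ds) {
            // JDBC-backed IP finder, as in the environment description above.
            TcpDiscoveryJdbcIpFinder ipFinder = new TcpDiscoveryJdbcIpFinder();
            ipFinder.setDataSource(ds);

            TcpDiscoverySpi discovery = new TcpDiscoverySpi();
            discovery.setIpFinder(ipFinder);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setIgniteInstanceName("PortfolioEventIgnite"); // instance name seen in the log prefix
            cfg.setPeerClassLoadingEnabled(true);              // peer class loading enabled in this setup
            cfg.setDiscoverySpi(discovery);

            Ignite ignite = Ignition.start(cfg);

            // Deploy the command service used as an RPC endpoint.
            ServiceConfiguration svcCfg = new ServiceConfiguration();
            svcCfg.setName("CRS_com_tophold_trade_product_command"); // service name from the pending-tx dump
            svcCfg.setService(new CommandRemoteService());           // sketch class shown earlier in the thread
            svcCfg.setTotalCount(1);                                 // single instance (assumption)
            svcCfg.setNodeFilter(new CommandServiceNodeFilter());
            ignite.services().deploy(svcCfg);

            return ignite;
        }
    }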