Please let us know if this helped you.

Evgenii
2018-01-19 11:35 GMT+03:00 aa...@tophold.com <aa...@tophold.com>:

> Hi Evgenii,
>
> I am trying to remove this part and use @LoggerResource; I will give it a
> try! Thanks for your time.
>
> Regards
> Aaron
> ------------------------------
> Aaron.Kuai
>
> *From:* Evgenii Zhuravlev <e.zhuravlev...@gmail.com>
> *Date:* 2018-01-19 16:28
> *To:* user <user@ignite.apache.org>
> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>
> Aaron,
>
> This Service instance will be serialized after creation and deserialized
> on the target nodes, so the Logger field will be serialized too. I don't
> think it can be serialized properly with the possibility of
> deserialization, since it holds context internally. It's not recommended
> to use such fields in a Service; you should use
>
> @LoggerResource
> private IgniteLogger log;
>
> instead. I'm not sure if it's the root cause, but it definitely could
> cause some problems.
>
> Evgenii
>
> 2018-01-19 4:51 GMT+03:00 aa...@tophold.com <aa...@tophold.com>:
>
>> Hi Evgenii,
>>
>> Sure, thanks for your time! This service works as a delegate: all
>> requests are routed to a bean in our Spring context.
>>
>> Thanks again!
>>
>> Regards
>> Aaron
>> ------------------------------
>> Aaron.Kuai
>>
>> *From:* Evgenii Zhuravlev <e.zhuravlev...@gmail.com>
>> *Date:* 2018-01-18 21:59
>> *To:* user <user@ignite.apache.org>
>> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>>
>> Aaron, could you share the code of
>> com.tophold.trade.ignite.service.CommandRemoteService?
>>
>> Thanks,
>> Evgenii
>>
>> 2018-01-18 16:43 GMT+03:00 Evgenii Zhuravlev <e.zhuravlev...@gmail.com>:
>>
>>> Hi Aaron,
>>>
>>> I think that the main problem is here:
>>>
>>> GridServiceProcessor - Error when executing service: null
>>>
>>> diagnostic - Pending transactions:
>>> [WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%] [ig] diagnostic -
>>> [txVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], exchWait=true,
>>> tx=GridDhtTxRemote [nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb,
>>> rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1,
>>> nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313, nodeOrder=1],
>>> storeWriteThrough=false, super=GridDistributedTxRemoteAdapter [explicitVers=null,
>>> started=true, commitAllowed=0, txState=IgniteTxRemoteSingleStateImpl
>>> [entry=IgniteTxEntry [key=KeyCacheObjectImpl [part=72,
>>> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command],
>>> hasValBytes=true], cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl
>>> [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command],
>>> hasValBytes=true], cacheId=-2100569601], val=[op=UPDATE, val=CacheObjectImpl
>>> [val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15,
>>> cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService,
>>> svcCls=, nodeFilterCls=CommandServiceNodeFilter],
>>> assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true]],
>>> prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null], entryProcessorsCol=null,
>>> ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
>>> filters=[], filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=[],
>>> part=72, super=GridDistributedCacheEntry [super=GridCacheMapEntry
>>> [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
>>> [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
>>> val=CacheObjectImpl [val=GridServiceAssignments
>>> [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=13,
>>> cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService,
>>> svcCls=, nodeFilterCls=CommandServiceNodeFilter],
>>> assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true],
>>> startVer=1516183996434, ver=GridCacheVersion [topVer=127663998,
>>> order=1516184119343, nodeOrder=10], hash=-1440463172,
>>> extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null,
>>> rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00,
>>> ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10],
>>> threadId=585, id=82, topVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0],
>>> reentry=null, otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, otherVer=null,
>>> mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null,
>>> key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
>>> [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
>>> masks=local=0|owner=0|ready=0|reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0,
>>> prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false, nodeId=null,
>>> locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0,
>>> serReadVer=null, xidVer=null]], super=IgniteTxAdapter [xidVer=GridCacheVersion
>>> [topVer=127664000, order=1516193727420, nodeOrder=10], writeVer=GridCacheVersion
>>> [topVer=127664000, order=1516193727421, nodeOrder=10], implicit=false, loc=false,
>>> threadId=585, startTime=1516186483489, nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00,
>>> startVer=GridCacheVersion [topVer=127664000, order=1516193739547, nodeOrder=5],
>>> endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0,
>>> sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE,
>>> invalidParts=null, state=PREPARED, timedOut=false, topVer=AffinityTopologyVersion
>>> [topVer=15, minorTopVer=0], duration=36138ms, onePhaseCommit=false]]]]
>>>
>>> You have a pending transaction in the logs related to the service
>>> deployment. Most probably your service threw an NPE in init() (or
>>> another method) and wasn't deployed. Could you check whether it's
>>> possible for your service to throw an NPE?
>>>
>>> Evgenii
>>>
>>> 2018-01-17 15:40 GMT+03:00 aa...@tophold.com <aa...@tophold.com>:
>>>
>>>> Hi Evgenii,
>>>>
>>>> What's more interesting: if we reboot them within a very short time,
>>>> like one hour, we can find NODE_LEFT and NODE_JOIN events in our
>>>> monitoring log and everything moves smoothly.
>>>>
>>>> But after several hours, the problem below is sure to happen if you
>>>> try to reboot any node in the cluster.
>>>>
>>>> Regards
>>>> Aaron
>>>> ------------------------------
>>>> Aaron.Kuai
>>>>
>>>> *From:* aa...@tophold.com
>>>> *Date:* 2018-01-17 20:05
>>>> *To:* user <user@ignite.apache.org>
>>>> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>>>>
>>>> Hi Evgenii,
>>>>
>>>> Thanks! We collected some logs: one from the server that was rebooted,
>>>> two from existing servers, and one from a client-only node. After the
>>>> reboot:
>>>>
>>>> 1. The rebooted node is never brought up completely; it waits forever.
>>>> 2. The other server nodes, soon after being notified that the rebooted
>>>> node went down, hang as well.
>>>> 3. The pure client node, which only calls a remote service on the
>>>> rebooted node, also hangs.
>>>>
>>>> At around 2018-01-17 10:54 we rebooted the node. In the log we can
>>>> find:
>>>>
>>>> [WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig] ExchangeDiscoveryEvents -
>>>> All server nodes for the following caches have left the cluster:
>>>> 'PortfolioCommandService_SVC_CO_DUM_CACHE',
>>>> 'PortfolioSnapshotGenericDomainEventEntry',
>>>> 'PortfolioGenericDomainEventEntry'
>>>>
>>>> Soon after, an ERROR log (seemingly the only ERROR-level entry):
>>>>
>>>> [ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig] GridServiceProcessor -
>>>> Error when executing service: null
>>>> java.lang.IllegalStateException: Getting affinity for topology version
>>>> earlier than affinity is calculated
>>>>
>>>> Then a lot of WARNs of
>>>>
>>>> "Failed to wait for partition release future........................."
>>>>
>>>> Then it loops there forever. Nothing in the diagnostics seems
>>>> suspicious; all nodes eventually output something very similar:
>>>>
>>>> [WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig] diagnostic - Pending explicit locks:
>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending cache futures:
>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending atomic cache futures:
>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending data streamer futures:
>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending transaction deadlock detection futures:
>>>>
>>>> Some details about our environment:
>>>>
>>>> 1. We enabled the peer class loading flag, but in fact we use a fat
>>>> jar, so every class is shared.
>>>> 2. Some nodes deploy services, which we use as an RPC mechanism.
>>>> 3. Most caches are in fact LOCAL; we share them only when we must.
>>>> 4. We use JDBC to persist important caches.
>>>> 5. TcpDiscoveryJdbcIpFinder is the IP finder.
>>>>
>>>> All other configuration follows the standard.
>>>>
>>>> Thanks for your time!
>>>>
>>>> Regards
>>>> Aaron
>>>> ------------------------------
>>>> Aaron.Kuai
>>>>
>>>> *From:* Evgenii Zhuravlev <e.zhuravlev...@gmail.com>
>>>> *Date:* 2018-01-16 20:32
>>>> *To:* user <user@ignite.apache.org>
>>>> *Subject:* Re: Nodes can not join the cluster after reboot
>>>>
>>>> Hi,
>>>>
>>>> Most probably, on one of the nodes you have a hanging
>>>> transaction/future/lock, or even a deadlock; that's why new nodes
>>>> can't join the cluster: they can't perform the exchange due to the
>>>> pending operation. Please share full logs from all nodes with thread
>>>> dumps; it will help to find the root cause.
>>>>
>>>> Evgenii
>>>>
>>>> 2018-01-16 5:35 GMT+03:00 aa...@tophold.com <aa...@tophold.com>:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We have an Ignite cluster running about 20+ nodes. To guard against
>>>>> any JVM memory issues, we schedule a reboot of those nodes in the
>>>>> middle of the night.
>>>>>
>>>>> In order to keep the service available, we reboot them one by one:
>>>>> nodes A, B, C, D are rebooted with a 5-minute delay between them.
>>>>> But when we do so, the rebooted nodes can never join the cluster
>>>>> again.
>>>>>
>>>>> Eventually the entire cluster stops working, waiting forever for
>>>>> nodes to join the topology; we need to kill everything and restart
>>>>> from scratch, which sounds incredible.
>>>>>
>>>>> I'm not sure whether anyone has met this issue before, or what
>>>>> mistake we may have made; attached is the Ignite log.
>>>>>
>>>>> Thanks for your time!
>>>>>
>>>>> Regards
>>>>> Aaron
>>>>> ------------------------------
>>>>> Aaron.Kuai
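
For readers landing on this thread: a minimal sketch of the @LoggerResource
pattern Evgenii recommends. Only the injected-logger idiom comes from the
thread; the class name, method bodies, and the Spring-delegate comment are
hypothetical stand-ins for CommandRemoteService, which was never posted.

    import org.apache.ignite.IgniteLogger;
    import org.apache.ignite.resources.LoggerResource;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;

    // Hypothetical stand-in for the unshared CommandRemoteService.
    public class CommandRemoteServiceSketch implements Service {
        // Injected by Ignite on the target node after deserialization, so no
        // logger (and no internal context it holds) travels with the
        // serialized service instance.
        @LoggerResource
        private IgniteLogger log;

        @Override public void init(ServiceContext ctx) throws Exception {
            // Keep init() defensive: an exception thrown here (e.g. the NPE
            // Evgenii suspects) leaves the service undeployed.
            log.info("Initializing service: " + ctx.name());
        }

        @Override public void execute(ServiceContext ctx) throws Exception {
            // In Aaron's setup, incoming calls are delegated to a Spring bean.
        }

        @Override public void cancel(ServiceContext ctx) {
            log.info("Cancelling service: " + ctx.name());
        }
    }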
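Aaron also mentions using deployed services "as RPC". A sketch of that
pattern under stated assumptions: the CommandService interface, the handle()
method, and the cluster-singleton deployment mode are illustrative choices,
not taken from the thread (the thread's log shows a node-filter-based
deployment instead).

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;

    // Hypothetical RPC-style interface; the real service API was not shared.
    interface CommandService {
        String handle(String command);
    }

    class CommandServiceImpl implements Service, CommandService {
        @Override public void init(ServiceContext ctx) { /* no-op */ }
        @Override public void execute(ServiceContext ctx) { /* no-op for a call-style service */ }
        @Override public void cancel(ServiceContext ctx) { /* no-op */ }

        @Override public String handle(String command) {
            // A real implementation would delegate to a Spring bean here.
            return "handled: " + command;
        }
    }

    public class ServiceRpcSketch {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                // Deploy one instance cluster-wide.
                ignite.services().deployClusterSingleton("commandService", new CommandServiceImpl());

                // Call it from any node through a typed proxy; 'false' means
                // non-sticky, so calls may be balanced across instances.
                CommandService svc = ignite.services()
                    .serviceProxy("commandService", CommandService.class, false);
                System.out.println(svc.handle("ping"));
            }
        }
    }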
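Finally, since item 5 of Aaron's environment list names
TcpDiscoveryJdbcIpFinder, a minimal configuration sketch. The DataSource
parameter is an assumption (any shared JDBC DataSource all nodes can reach);
the peer-class-loading line mirrors item 1 of the list.

    import javax.sql.DataSource;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.jdbc.TcpDiscoveryJdbcIpFinder;

    public class JdbcIpFinderSketch {
        public static Ignite start(DataSource dataSource) {
            // The JDBC IP finder keeps node addresses in a shared database
            // table, so every node must point at the same database.
            TcpDiscoveryJdbcIpFinder ipFinder = new TcpDiscoveryJdbcIpFinder();
            ipFinder.setDataSource(dataSource);

            TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
            discoverySpi.setIpFinder(ipFinder);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDiscoverySpi(discoverySpi);
            // Aaron's item 1: peer class loading enabled, even though a fat
            // jar makes all classes available everywhere anyway.
            cfg.setPeerClassLoadingEnabled(true);

            return Ignition.start(cfg);
        }
    }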