Please let us know if this helped you.

Evgenii
2018-01-19 11:35 GMT+03:00 aa...@tophold.com <aa...@tophold.com>:

> Hi Evgenii,
>
> I am trying to remove this part and use @LoggerResource; I will give it a
> try! Thanks for your time.
>
> Regards
> Aaron
> ------------------------------
> Aaron.Kuai
>
> *From:* Evgenii Zhuravlev <e.zhuravlev...@gmail.com>
> *Date:* 2018-01-19 16:28
> *To:* user <user@ignite.apache.org>
> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>
> Aaron,
>
> This Service instance will be serialized after creation and deserialized
> on the target nodes, so the Logger field will be serialized too. I don't
> think it can be serialized properly with the possibility of
> deserialization, since it holds context internally. It's not recommended
> to use such fields in a Service; you should use
>
> @LoggerResource
> private IgniteLogger log;
>
> instead. I'm not sure if it's the root cause, but it definitely could
> cause some problems.
>
> Evgenii
>
> 2018-01-19 4:51 GMT+03:00 aa...@tophold.com <aa...@tophold.com>:
>
>> Hi Evgenii,
>>
>> Sure, thanks for your time! This service works as a delegate: all
>> requests are routed to a bean in our Spring context.
>>
>> Thanks again!
>>
>> Regards
>> Aaron
>> ------------------------------
>> Aaron.Kuai
>>
>> *From:* Evgenii Zhuravlev <e.zhuravlev...@gmail.com>
>> *Date:* 2018-01-18 21:59
>> *To:* user <user@ignite.apache.org>
>> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>>
>> Aaron, could you share the code of
>> com.tophold.trade.ignite.service.CommandRemoteService?
>>
>> Thanks,
>> Evgenii
>>
>> 2018-01-18 16:43 GMT+03:00 Evgenii Zhuravlev <e.zhuravlev...@gmail.com>:
>>
>>> Hi Aaron,
>>>
>>> I think that the main problem is here:
>>>
>>> GridServiceProcessor - Error when executing service: null
>>>
>>> diagnostic - Pending transactions:
>>> [WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%] [ig] diagnostic -
>>> [txVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], exchWait=true,
>>> tx=GridDhtTxRemote [nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb,
>>> rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1,
>>> nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313, nodeOrder=1],
>>> storeWriteThrough=false, super=GridDistributedTxRemoteAdapter [explicitVers=null,
>>> started=true, commitAllowed=0, txState=IgniteTxRemoteSingleStateImpl
>>> [entry=IgniteTxEntry [key=KeyCacheObjectImpl [part=72,
>>> val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command],
>>> hasValBytes=true], cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl
>>> [part=72, val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command],
>>> hasValBytes=true], cacheId=-2100569601], val=[op=UPDATE, val=CacheObjectImpl
>>> [val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=15,
>>> cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService,
>>> svcCls=, nodeFilterCls=CommandServiceNodeFilter],
>>> assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true]],
>>> prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null], entryProcessorsCol=null,
>>> ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null,
>>> filters=[], filtersPassed=false, filtersSet=false, entry=GridDhtCacheEntry [rdrs=[],
>>> part=72, super=GridDistributedCacheEntry [super=GridCacheMapEntry
>>> [key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
>>> [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
>>> val=CacheObjectImpl [val=GridServiceAssignments
>>> [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=13,
>>> cfg=LazyServiceConfiguration [srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService,
>>> svcCls=, nodeFilterCls=CommandServiceNodeFilter],
>>> assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true],
>>> startVer=1516183996434, ver=GridCacheVersion [topVer=127663998,
>>> order=1516184119343, nodeOrder=10], hash=-1440463172,
>>> extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null,
>>> rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00,
>>> ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10],
>>> threadId=585, id=82, topVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0],
>>> reentry=null, otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, otherVer=null,
>>> mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null,
>>> key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey
>>> [name=CRS_com_tophold_trade_product_command], hasValBytes=true],
>>> masks=local=0|owner=0|ready=0|reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0,
>>> prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false, nodeId=null,
>>> locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0,
>>> serReadVer=null, xidVer=null]], super=IgniteTxAdapter [xidVer=GridCacheVersion
>>> [topVer=127664000, order=1516193727420, nodeOrder=10], writeVer=GridCacheVersion
>>> [topVer=127664000, order=1516193727421, nodeOrder=10], implicit=false, loc=false,
>>> threadId=585, startTime=1516186483489, nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00,
>>> startVer=GridCacheVersion [topVer=127664000, order=1516193739547, nodeOrder=5],
>>> endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0,
>>> sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE,
>>> invalidParts=null, state=PREPARED, timedOut=false, topVer=AffinityTopologyVersion
>>> [topVer=15, minorTopVer=0], duration=36138ms, onePhaseCommit=false]]]]
>>>
>>> You have a pending transaction in the logs related to the service
>>> deployment. Most probably your service threw an NPE in init() (or
>>> another method) and wasn't deployed. Could you check whether it's
>>> possible for your service to throw an NPE?
>>>
>>> Evgenii
>>>
>>> 2018-01-17 15:40 GMT+03:00 aa...@tophold.com <aa...@tophold.com>:
>>>
>>>> Hi Evgenii,
>>>>
>>>> What's more interesting: if we reboot them within a very short time,
>>>> like one hour, we can find NODE_LEFT and NODE_JOIN events in our
>>>> monitoring log and everything moves smoothly.
>>>>
>>>> But after several hours, the problem below is sure to happen if you
>>>> try to reboot any node in the cluster.
>>>>
>>>> Regards
>>>> Aaron
>>>> ------------------------------
>>>> Aaron.Kuai
>>>>
>>>> *From:* aa...@tophold.com
>>>> *Date:* 2018-01-17 20:05
>>>> *To:* user <user@ignite.apache.org>
>>>> *Subject:* Re: Re: Nodes can not join the cluster after reboot
>>>>
>>>> Hi Evgenii,
>>>>
>>>> Thanks! We collected some logs: one from the server that was rebooted,
>>>> two from existing servers, and one from a client-only node. After the
>>>> reboot:
>>>>
>>>> 1. The rebooted node is never brought up completely; it waits forever.
>>>> 2. The other server nodes, soon after being notified that the rebooted
>>>> node went down, hang as well.
>>>> 3. The pure client node, which only calls a remote service on the
>>>> rebooted node, also hangs.
>>>>
>>>> At around 2018-01-17 10:54 we rebooted the node. In the log we can
>>>> find:
>>>>
>>>> [WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig] ExchangeDiscoveryEvents -
>>>> All server nodes for the following caches have left the cluster:
>>>> 'PortfolioCommandService_SVC_CO_DUM_CACHE',
>>>> 'PortfolioSnapshotGenericDomainEventEntry',
>>>> 'PortfolioGenericDomainEventEntry'
>>>>
>>>> Soon after, an ERROR log (seemingly the only ERROR-level entry):
>>>>
>>>> [ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig] GridServiceProcessor -
>>>> Error when executing service: null
>>>> java.lang.IllegalStateException: Getting affinity for topology version
>>>> earlier than affinity is calculated
>>>>
>>>> Then a lot of WARNs of
>>>>
>>>> "Failed to wait for partition release future........................."
>>>>
>>>> Then it loops there forever. Nothing in the diagnostics seems
>>>> suspicious; all nodes eventually output something very similar:
>>>>
>>>> [WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig] diagnostic - Pending explicit locks:
>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending cache futures:
>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending atomic cache futures:
>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending data streamer futures:
>>>> [WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending transaction deadlock detection futures:
>>>>
>>>> Some details about our environment:
>>>>
>>>> 1. We enabled the peer class loading flag, but in fact we use a fat
>>>> jar, so every class is shared.
>>>> 2. Some nodes deploy services, which we use as an RPC mechanism.
>>>> 3. Most caches are in fact LOCAL; we share them only when we must.
>>>> 4. We use JDBC to persist important caches.
>>>> 5. TcpDiscoveryJdbcIpFinder is the IP finder.
>>>>
>>>> All other configuration follows the standard.
>>>>
>>>> Thanks for your time!
>>>>
>>>> Regards
>>>> Aaron
>>>> ------------------------------
>>>> Aaron.Kuai
>>>>
>>>> *From:* Evgenii Zhuravlev <e.zhuravlev...@gmail.com>
>>>> *Date:* 2018-01-16 20:32
>>>> *To:* user <user@ignite.apache.org>
>>>> *Subject:* Re: Nodes can not join the cluster after reboot
>>>>
>>>> Hi,
>>>>
>>>> Most probably, on one of the nodes you have a hanging
>>>> transaction/future/lock, or even a deadlock; that's why new nodes
>>>> can't join the cluster: they can't perform the exchange due to the
>>>> pending operation. Please share full logs from all nodes with thread
>>>> dumps; it will help to find the root cause.
>>>>
>>>> Evgenii
>>>>
>>>> 2018-01-16 5:35 GMT+03:00 aa...@tophold.com <aa...@tophold.com>:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We have an Ignite cluster running about 20+ nodes. To guard against
>>>>> any JVM memory issues, we schedule a reboot of those nodes in the
>>>>> middle of the night.
>>>>>
>>>>> In order to keep the service available, we reboot them one by one:
>>>>> nodes A, B, C, D are rebooted with a 5-minute delay between them.
>>>>> But when we do so, the rebooted nodes can never join the cluster
>>>>> again.
>>>>>
>>>>> Eventually the entire cluster stops working, waiting forever for
>>>>> nodes to join the topology; we need to kill everything and restart
>>>>> from scratch, which sounds incredible.
>>>>>
>>>>> I'm not sure whether anyone has met this issue before, or what
>>>>> mistake we may have made; attached is the Ignite log.
>>>>>
>>>>> Thanks for your time!
>>>>>
>>>>> Regards
>>>>> Aaron
>>>>> ------------------------------
>>>>> Aaron.Kuai
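
For readers landing on this thread: a minimal sketch of the @LoggerResource
pattern Evgenii recommends. Only the injected-logger idiom comes from the
thread; the class name, method bodies, and the Spring-delegate comment are
hypothetical stand-ins for CommandRemoteService, which was never posted.

    import org.apache.ignite.IgniteLogger;
    import org.apache.ignite.resources.LoggerResource;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;

    // Hypothetical stand-in for the unshared CommandRemoteService.
    public class CommandRemoteServiceSketch implements Service {
        // Injected by Ignite on the target node after deserialization, so no
        // logger (and no internal context it holds) travels with the
        // serialized service instance.
        @LoggerResource
        private IgniteLogger log;

        @Override public void init(ServiceContext ctx) throws Exception {
            // Keep init() defensive: an exception thrown here (e.g. the NPE
            // Evgenii suspects) leaves the service undeployed.
            log.info("Initializing service: " + ctx.name());
        }

        @Override public void execute(ServiceContext ctx) throws Exception {
            // In Aaron's setup, incoming calls are delegated to a Spring bean.
        }

        @Override public void cancel(ServiceContext ctx) {
            log.info("Cancelling service: " + ctx.name());
        }
    }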
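Aaron also mentions using deployed services "as RPC". A sketch of that
pattern under stated assumptions: the CommandService interface, the handle()
method, and the cluster-singleton deployment mode are illustrative choices,
not taken from the thread (the thread's log shows a node-filter-based
deployment instead).

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.services.Service;
    import org.apache.ignite.services.ServiceContext;

    // Hypothetical RPC-style interface; the real service API was not shared.
    interface CommandService {
        String handle(String command);
    }

    class CommandServiceImpl implements Service, CommandService {
        @Override public void init(ServiceContext ctx) { /* no-op */ }
        @Override public void execute(ServiceContext ctx) { /* no-op for a call-style service */ }
        @Override public void cancel(ServiceContext ctx) { /* no-op */ }

        @Override public String handle(String command) {
            // A real implementation would delegate to a Spring bean here.
            return "handled: " + command;
        }
    }

    public class ServiceRpcSketch {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                // Deploy one instance cluster-wide.
                ignite.services().deployClusterSingleton("commandService", new CommandServiceImpl());

                // Call it from any node through a typed proxy; 'false' means
                // non-sticky, so calls may be balanced across instances.
                CommandService svc = ignite.services()
                    .serviceProxy("commandService", CommandService.class, false);
                System.out.println(svc.handle("ping"));
            }
        }
    }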
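Finally, since item 5 of Aaron's environment list names
TcpDiscoveryJdbcIpFinder, a minimal configuration sketch. The DataSource
parameter is an assumption (any shared JDBC DataSource all nodes can reach);
the peer-class-loading line mirrors item 1 of the list.

    import javax.sql.DataSource;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.jdbc.TcpDiscoveryJdbcIpFinder;

    public class JdbcIpFinderSketch {
        public static Ignite start(DataSource dataSource) {
            // The JDBC IP finder keeps node addresses in a shared database
            // table, so every node must point at the same database.
            TcpDiscoveryJdbcIpFinder ipFinder = new TcpDiscoveryJdbcIpFinder();
            ipFinder.setDataSource(dataSource);

            TcpDiscoverySpi discoverySpi = new TcpDiscoverySpi();
            discoverySpi.setIpFinder(ipFinder);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDiscoverySpi(discoverySpi);
            // Aaron's item 1: peer class loading enabled, even though a fat
            // jar makes all classes available everywhere anyway.
            cfg.setPeerClassLoadingEnabled(true);

            return Ignition.start(cfg);
        }
    }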