Re: Re: Nodes can not join the cluster after reboot

[email protected] Fri, 19 Jan 2018 00:36:12 -0800

Hi Evgenii, 

I will try removing this part and using @LoggerResource instead.
Thanks for your time.

Regards
Aaron

Aaron.Kuai

From: Evgenii Zhuravlev
Date: 2018-01-19 16:28
To: user
Subject: Re: Re: Nodes can not join the cluster after reboot
Aaron, 

After it is created, this Service instance is serialized and then deserialized
on the target nodes. The Logger field is therefore serialized too, and I don't
think it can be serialized in a way that allows proper deserialization, since
it holds context internally. It's not recommended to use such fields in a
Service; you should use

@LoggerResource
private IgniteLogger log;

instead. I'm not sure it's the root cause, but it could definitely cause some
problems.

Evgenii
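
To illustrate the point about a context-holding logger field, here is a minimal
sketch. It assumes plain java.io serialization as a stand-in for Ignite's own
marshaller, which may behave differently. ContextLogger and both service
classes are hypothetical names, not taken from CommandRemoteService.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class LoggerFieldDemo {
    /** Hypothetical logger that holds runtime context and is not Serializable. */
    static class ContextLogger {
        final Thread owner = Thread.currentThread();

        void info(String msg) {
            System.out.println("[" + owner.getName() + "] " + msg);
        }
    }

    /** Service-like object that keeps the logger as an ordinary field. */
    static class ServiceWithLoggerField implements Serializable {
        private static final long serialVersionUID = 1L;
        private final ContextLogger log = new ContextLogger();
    }

    /** Same shape, but the logger is excluded from the serialized state. */
    static class ServiceWithTransientLogger implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient ContextLogger log = new ContextLogger();
    }

    /** Returns true if plain Java serialization of obj succeeds. */
    static boolean canSerialize(Object obj) {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj);
            return true;
        } catch (NotSerializableException e) {
            // A field in the object graph cannot be written.
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("ordinary field:  " + canSerialize(new ServiceWithLoggerField()));
        System.out.println("transient field: " + canSerialize(new ServiceWithTransientLogger()));
    }
}
```

With an ordinary field the logger travels with the service and serialization
fails. With the logger kept out of the serialized state, the object can be
shipped, and the logger has to be supplied again on the target node, which is
the role @LoggerResource injection plays in an Ignite Service.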

2018-01-19 4:51 GMT+03:00 [email protected] <[email protected]>:
Hi Evgenii,

Sure, thanks for your time! This service works as a delegate, and all requests
are routed to a bean in our Spring context.

Thanks again!

Regards
Aaron

Aaron.Kuai

From: Evgenii Zhuravlev
Date: 2018-01-18 21:59
To: user
Subject: Re: Re: Nodes can not join the cluster after reboot
Aaron, could you share code of 
com.tophold.trade.ignite.service.CommandRemoteService ? 

Thanks,
Evgenii

2018-01-18 16:43 GMT+03:00 Evgenii Zhuravlev <[email protected]>:
Hi Aaron,

I think that the main problem is here: 

GridServiceProcessor - Error when executing service: null

diagnostic - Pending transactions:
[WARN ] 2018-01-17 10:55:19.632 [exchange-worker-#97%PortfolioEventIgnite%] 
[ig] diagnostic - >>> [txVer=AffinityTopologyVersion [topVer=15, 
minorTopVer=0], exchWait=true, tx=GridDhtTxRemote 
[nearNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, 
rmtFutId=14d5c930161-e4bd34f6-8b10-40b7-8f30-d243ec91c3f1, 
nearXidVer=GridCacheVersion [topVer=127664000, order=1516193727313, 
nodeOrder=1], storeWriteThrough=false, super=GridDistributedTxRemoteAdapter 
[explicitVers=null, started=true, commitAllowed=0, 
txState=IgniteTxRemoteSingleStateImpl [entry=IgniteTxEntry 
[key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey 
[name=CRS_com_tophold_trade_product_command], hasValBytes=true], 
cacheId=-2100569601, txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=72, 
val=GridServiceAssignmentsKey [name=CRS_com_tophold_trade_product_command], 
hasValBytes=true], cacheId=-2100569601], val=[op=UPDATE, val=CacheObjectImpl 
[val=GridServiceAssignments [nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, 
topVer=15, cfg=LazyServiceConfiguration 
[srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService, svcCls=, 
nodeFilterCls=CommandServiceNodeFilter], 
assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true]], 
prevVal=[op=NOOP, val=null], oldVal=[op=NOOP, val=null], 
entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null, 
explicitVer=null, dhtVer=null, filters=[], filtersPassed=false, 
filtersSet=false, entry=GridDhtCacheEntry [rdrs=[], part=72, 
super=GridDistributedCacheEntry [super=GridCacheMapEntry 
[key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey 
[name=CRS_com_tophold_trade_product_command], hasValBytes=true], 
val=CacheObjectImpl [val=GridServiceAssignments 
[nodeId=014f536a-3ce6-419e-8cce-bee44b1a73ed, topVer=13, 
cfg=LazyServiceConfiguration 
[srvcClsName=com.tophold.trade.ignite.service.CommandRemoteService, svcCls=, 
nodeFilterCls=CommandServiceNodeFilter], 
assigns={014f536a-3ce6-419e-8cce-bee44b1a73ed=1}], hasValBytes=true], 
startVer=1516183996434, ver=GridCacheVersion [topVer=127663998, 
order=1516184119343, nodeOrder=10], hash=-1440463172, 
extras=GridCacheMvccEntryExtras [mvcc=GridCacheMvcc [locs=null, 
rmts=[GridCacheMvccCandidate [nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, 
ver=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10], 
threadId=585, id=82, topVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], 
reentry=null, otherNodeId=2a34fe34-d02f-4bf4-b404-c2701f456bfb, otherVer=null, 
mappedDhtNodes=null, mappedNearNodes=null, ownerVer=null, serOrder=null, 
key=KeyCacheObjectImpl [part=72, val=GridServiceAssignmentsKey 
[name=CRS_com_tophold_trade_product_command], hasValBytes=true], 
masks=local=0|owner=0|ready=0|reentry=0|used=0|tx=1|single_implicit=0|dht_local=0|near_local=0|removed=0|read=0,
 prevVer=null, nextVer=null]]]], flags=2]]], prepared=1, locked=false, 
nodeId=null, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, 
partUpdateCntr=0, serReadVer=null, xidVer=null]], super=IgniteTxAdapter 
[xidVer=GridCacheVersion [topVer=127664000, order=1516193727420, nodeOrder=10], 
writeVer=GridCacheVersion [topVer=127664000, order=1516193727421, 
nodeOrder=10], implicit=false, loc=false, threadId=585, 
startTime=1516186483489, nodeId=0a4fc43c-0495-4f3d-8f77-bbb569de5c00, 
startVer=GridCacheVersion [topVer=127664000, order=1516193739547, nodeOrder=5], 
endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0, 
sysInvalidate=false, sys=true, plc=5, commitVer=null, finalizing=NONE, 
invalidParts=null, state=PREPARED, timedOut=false, 
topVer=AffinityTopologyVersion [topVer=15, minorTopVer=0], duration=36138ms, 
onePhaseCommit=false]]]]

There is a pending transaction in the logs related to the service deployment.
Most likely your service threw an NPE in init() (or another method) and wasn't
deployed. Could you check whether your service can throw an NPE?

Evgenii

2018-01-17 15:40 GMT+03:00 [email protected] <[email protected]>:
Hi Evgenii, 

What's more interesting: if we reboot the nodes within a very short time, like
one hour, our monitoring log shows the usual NODE_LEFT and NODE_JOIN events and
everything goes smoothly.

But after several hours, the problem below is sure to happen if you try to
reboot any node in the cluster.

Regards
Aaron

Aaron.Kuai

From: [email protected]
Date: 2018-01-17 20:05
To: user
Subject: Re: Re: Nodes can not join the cluster after reboot
Hi Evgenii,

Thanks! We collected some logs: one from the server that was rebooted, two from
servers that stayed up, and one from a client-only node. After the reboot:

1. The rebooted node is never fully brought up; it waits forever.
2. The other server nodes, after being notified that the rebooted node went
down, soon hang as well.
3. The pure client node, which only calls a remote service on the rebooted
node, also hangs.

At around 2018-01-17 10:54 we rebooted the node. From the log we can find:

[WARN ] 2018-01-17 10:54:43.277 [sys-#471] [ig] ExchangeDiscoveryEvents - All 
server nodes for the following caches have left the cluster: 
'PortfolioCommandService_SVC_CO_DUM_CACHE', 
'PortfolioSnapshotGenericDomainEventEntry', 'PortfolioGenericDomainEventEntry' 

Soon after, an ERROR log appears (it seems to be the only ERROR-level entry):

[ERROR] 2018-01-17 10:54:43.280 [srvc-deploy-#143] [ig] GridServiceProcessor - 
Error when executing service: null java.lang.IllegalStateException: Getting 
affinity for topology version earlier than affinity is calculated

Then a lot of WARN entries like

"Failed to wait for partition release future........................."

Then this loops forever. Nothing in the diagnostics seems suspicious; all nodes
eventually produce very similar output.

[WARN ] 2018-01-17 10:55:19.608 [exchange-worker-#97] [ig] diagnostic - Pending 
explicit locks:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending 
cache futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending 
atomic cache futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending 
data streamer futures:
[WARN ] 2018-01-17 10:55:19.609 [exchange-worker-#97] [ig] diagnostic - Pending 
transaction deadlock detection futures:

Some details of our environment:

1. We enabled the peer class loading flag, but in fact we use a fat jar, so
every class is shared.
2. Some nodes deploy services; we use them for RPC.
3. Most caches are in fact LOCAL; we make them shared only when we must.
4. We use JDBC to persist important caches.
5. We use TcpDiscoveryJdbcIpFinder as the IP finder.

All other configuration follows the standard.

Thanks for your time!

Regards
Aaron

Aaron.Kuai

From: Evgenii Zhuravlev
Date: 2018-01-16 20:32
To: user
Subject: Re: Nodes can not join the cluster after reboot
Hi,

Most likely one of the nodes has a hung transaction/future/lock or even a
deadlock. That's why new nodes can't join the cluster: they can't perform the
exchange due to the pending operation. Please share full logs from all nodes
along with thread dumps; that will help find the root cause.
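
If running jstack against each node is inconvenient, a thread dump and a
deadlock check can also be taken from inside the JVM. The following is only a
sketch using the standard java.lang.management API; the class and method names
are hypothetical, and it does not use or inspect any Ignite internals.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class ThreadDumpSketch {
    /** Builds a text dump of all live threads, with lock details where the JVM supports them. */
    static String dumpThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        ThreadInfo[] infos = threads.dumpAllThreads(
            threads.isObjectMonitorUsageSupported(),
            threads.isSynchronizerUsageSupported());

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            // ThreadInfo.toString() prints the state, lock info and a shortened stack trace.
            sb.append(info);
        }
        return sb.toString();
    }

    /** Returns how many threads the JVM currently reports as deadlocked (0 if none). */
    static int deadlockedThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.isSynchronizerUsageSupported()
            ? threads.findDeadlockedThreads()          // monitors and ownable synchronizers
            : threads.findMonitorDeadlockedThreads();  // monitors only
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + deadlockedThreadCount());
        System.out.print(dumpThreads());
    }
}
```

Note that this only detects Java-level lock cycles within one JVM. A
transaction or future that is merely pending, like the one discussed in this
thread, would not be reported as a deadlock, so the full dump and the logs are
still needed.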

Evgenii

2018-01-16 5:35 GMT+03:00 [email protected] <[email protected]>:
Hi All, 

We have an Ignite cluster running 20+ nodes. To guard against JVM memory
issues, we schedule reboots of those nodes in the middle of the night.

To keep the service available, we reboot them one by one: for nodes A, B, C
and D, we reboot each with a 5-minute delay. But when we do so, the rebooted
nodes can never join the cluster again.

Eventually the entire cluster stops working, waiting forever to join the
topology; we need to kill all nodes and restart from scratch, which sounds
incredible.

I'm not sure whether anyone has met this issue before, or whether we made a
mistake; attached is the Ignite log.

Thanks for your time!

Regards
Aaron

Aaron.Kuai
