[ https://issues.apache.org/jira/browse/JCS-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lukas Doros updated JCS-242:
----------------------------
    Description:
When the Lateral Cache starts up while a remote node is not available, the connection to that node is never retried.

*Scenario:*
2 nodes, A and B. Both are shut down.
A starts; B is not available yet, so A's connection attempt fails.
B starts and can connect to A.
A never tries to connect to B again.

*Reason/Problem:*
LateralTCPCacheFactory (line 278 following)
{code:java}
    newService = new LateralTCPService<>(lca, elementSerializer);
}
catch ( final IOException ex )
{
    // Failed to connect to the lateral server.
    // Configure this LateralCacheManager instance to use the
    // "zombie" services.
    log.error( "Failure, lateral instance will use zombie service", ex );
    newService = new ZombieCacheServiceNonLocal<>(lca.getZombieQueueMaxSize());
    // Notify the cache monitor about the error, and kick off
    // the recovery process.
    monitor.notifyError();
}
{code}
When new LateralTCPService fails, the monitor is notified about the issue and is expected to retry the connection.
BUT when the monitor immediately tries to reconnect, it finds nothing to repair.

LateralCacheMonitor (line 113 following)
{code:java}
caches.forEach((cacheName, cache) -> {
    if (cache.getStatus() == CacheStatus.ERROR)
    {
        log.info( "Found LateralCacheNoWait in error, " + cacheName );
        final ITCPLateralCacheAttributes lca =
            (ITCPLateralCacheAttributes) cache.getAuxiliaryCacheAttributes();
        // Get service instance
        final ICacheServiceNonLocal<Object, Object> cacheService =
            factory.getCSNLInstance(lca, cache.getElementSerializer());
        // If we can't fix them, just skip and re-try in the
        // next round.
        if (!(cacheService instanceof ZombieCacheServiceNonLocal))
        {
            cache.fixCache(cacheService);
        }
    }
});
{code}
At this point "caches" is still empty, so nothing is done and 'allright' is set to true.

Back to LateralTCPCacheFactory (line 111 following).
At line 114 'caches' is populated, but that is too late.
{code:java}
final LateralCacheNoWait<K, V> lateralNoWait =
    createCacheNoWait(lacClone, cacheEventLogger, elementSerializer); // <-- inside here the exception is caught and the monitor is notified
addListenerIfNeeded( lacClone, cacheMgr, elementSerializer );
monitorCache(lateralNoWait); // <-- here 'caches' is populated.
noWaits.add( lateralNoWait );
{code}

*Possible Solution:*
{code:java}
final LateralCacheNoWait<K, V> lateralNoWait =
    createCacheNoWait(lacClone, cacheEventLogger, elementSerializer);
addListenerIfNeeded( lacClone, cacheMgr, elementSerializer );
monitorCache(lateralNoWait);
noWaits.add( lateralNoWait );
// CHANGE START
if (lateralNoWait.getStatus() == CacheStatus.ERROR) {
    monitor.notifyError();
}
// CHANGE END
{code}
This notifies the monitor after 'caches' has been populated.

*Addendum:*
I've attached a project with a test case for this problem.
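
The ordering problem described above can be reproduced in isolation. The following is only a minimal sketch, not the JCS code: the class names (LateralInitTimingSketch, Monitor, Cache), the synchronous runPass() method and the remoteAvailable flag are hypothetical stand-ins, and it assumes that once 'allright' is true the monitor stays idle until the next notifyError() call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of the factory/monitor interaction.
final class LateralInitTimingSketch {

    enum Status { ALIVE, ERROR }

    /** Stand-in for a LateralCacheNoWait whose initial connect failed. */
    static final class Cache {
        final String name;
        Status status = Status.ERROR;
        Cache(String name) { this.name = name; }
    }

    /** Stand-in for the monitor; passes are run by hand instead of by a thread. */
    static final class Monitor {
        private final Map<String, Cache> caches = new ConcurrentHashMap<>();
        private final AtomicBoolean remoteAvailable;
        private boolean allRight = true;

        Monitor(AtomicBoolean remoteAvailable) { this.remoteAvailable = remoteAvailable; }

        void addCache(Cache cache) { caches.put(cache.name, cache); }

        void notifyError() { allRight = false; }

        /** One recovery pass: repair registered caches if the remote is reachable. */
        void runPass() {
            if (allRight) {
                return; // assumption: nothing to do until the next notifyError()
            }
            boolean[] everythingFixed = { true };
            caches.forEach((name, cache) -> {
                if (cache.status == Status.ERROR) {
                    if (remoteAvailable.get()) {
                        cache.status = Status.ALIVE;
                    } else {
                        everythingFixed[0] = false; // retry in the next round
                    }
                }
            });
            // With an empty 'caches' map this becomes true immediately.
            allRight = everythingFixed[0];
        }
    }

    /** Returns the cache status after the remote node has come up. */
    static Status simulate(boolean notifyAfterRegistration) {
        AtomicBoolean remoteAvailable = new AtomicBoolean(false);
        Monitor monitor = new Monitor(remoteAvailable);
        Cache cache = new Cache("testCache");

        if (!notifyAfterRegistration) {
            monitor.notifyError(); // reported ordering: notify first ...
            monitor.runPass();     // ... pass sees no caches, sets allRight = true
        }
        monitor.addCache(cache);   // 'caches' is populated here
        if (notifyAfterRegistration) {
            monitor.notifyError(); // proposed ordering: notify after registration
            monitor.runPass();     // remote still down, cache stays in ERROR
        }
        remoteAvailable.set(true); // node B starts
        monitor.runPass();
        return cache.status;
    }

    public static void main(String[] args) {
        System.out.println("notify before registration: " + simulate(false));
        System.out.println("notify after registration:  " + simulate(true));
    }
}
```

In this model the cache stays in ERROR when the monitor is notified before registration and recovers when it is notified afterwards, which matches the behaviour reported in the scenario. The real monitor runs in its own thread, so the actual interleaving may differ.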
> Lateral Cache init timing bug
> -----------------------------
>
>                 Key: JCS-242
>                 URL: https://issues.apache.org/jira/browse/JCS-242
>             Project: Commons JCS
>          Issue Type: Bug
>          Components: TCP Lateral Cache
>    Affects Versions: jcs-3.2.1
>            Reporter: Lukas Doros
>            Priority: Major
>         Attachments: lateral-bug.zip

--
This message was sent by Atlassian Jira (v8.20.10#820010)