[jira] [Updated] (JCS-242) Lateral Cache init timing bug

Lukas Doros (Jira) Mon, 02 Jun 2025 08:23:08 -0700


     [ 
https://issues.apache.org/jira/browse/JCS-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lukas Doros updated JCS-242:
----------------------------
    Description: 
When the Lateral Cache starts up and a remote node is not available, the 
connection to that node is never retried.

*Scenario:*

Two nodes, A and B. Both are shut down.
A starts while B is not available yet, so A's connection attempt fails.
B starts and can connect to A.

A never tries to connect to B again.

*Reason/Problem:*
LateralTCPCacheFactory (line 278 following)
{code:java}
    newService = new LateralTCPService<>(lca, elementSerializer);
}
catch ( final IOException ex )
{
    // Failed to connect to the lateral server.
    // Configure this LateralCacheManager instance to use the
    // "zombie" services.
    log.error( "Failure, lateral instance will use zombie service", ex );

    newService = new ZombieCacheServiceNonLocal<>(lca.getZombieQueueMaxSize());

    // Notify the cache monitor about the error, and kick off
    // the recovery process.
    monitor.notifyError();
} {code}
The LateralTCPService constructor fails, the monitor is notified about the 
issue and is expected to retry the connection.

BUT when the monitor immediately tries to reconnect, it fails.

 

LateralCacheMonitor (line 113 following)
{code:java}
caches.forEach((cacheName, cache) -> {

    if (cache.getStatus() == CacheStatus.ERROR)
    {
        log.info( "Found LateralCacheNoWait in error, " + cacheName );

        final ITCPLateralCacheAttributes lca =
                (ITCPLateralCacheAttributes) cache.getAuxiliaryCacheAttributes();

        // Get service instance
        final ICacheServiceNonLocal<Object, Object> cacheService =
                factory.getCSNLInstance(lca, cache.getElementSerializer());

        // If we can't fix them, just skip and re-try in the
        // next round.
        if (!(cacheService instanceof ZombieCacheServiceNonLocal))
        {
            cache.fixCache(cacheService);
        }
    }
}); {code}
At this time, 'caches' is still empty, so nothing is done and 'allright' is set to true.

 

Back to LateralTCPCacheFactory (line 111 following).
At line 114 'caches' is populated, but that's too late.
{code:java}
final LateralCacheNoWait<K, V> lateralNoWait = createCacheNoWait(lacClone,
        cacheEventLogger, elementSerializer);  // <-- the exception is caught and the monitor notified inside here

addListenerIfNeeded( lacClone, cacheMgr, elementSerializer );
monitorCache(lateralNoWait); // <-- here 'caches' is populated.
noWaits.add( lateralNoWait ); {code}
 

*Possible Solution:*
{code:java}
final LateralCacheNoWait<K, V> lateralNoWait = createCacheNoWait(lacClone, 
cacheEventLogger, elementSerializer);

addListenerIfNeeded( lacClone, cacheMgr, elementSerializer );
monitorCache(lateralNoWait);
noWaits.add( lateralNoWait );

// CHANGE START
if (lateralNoWait.getStatus() == CacheStatus.ERROR) {
    monitor.notifyError();
} 
// CHANGE END{code}
This notifies the monitor after 'caches' is populated, so the recovery pass can 
see the cache that is in ERROR state.
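
To illustrate the ordering problem in isolation, here is a minimal, self-contained sketch. It is not JCS code: Monitor, NoWait and Status are hypothetical stand-ins for LateralCacheMonitor, LateralCacheNoWait and CacheStatus, and the monitor is modeled as a synchronous call instead of a separate thread. It also assumes that a reconnect always succeeds once the monitor sees the cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LateralInitOrderSketch {

    /** Hypothetical stand-in for CacheStatus. */
    public enum Status { ALIVE, ERROR }

    /** Hypothetical stand-in for LateralCacheNoWait. */
    static final class NoWait {
        final String name;
        Status status;

        NoWait(String name, Status status) {
            this.name = name;
            this.status = status;
        }
    }

    /** Hypothetical, synchronous stand-in for LateralCacheMonitor. */
    static final class Monitor {
        private final Map<String, NoWait> caches = new ConcurrentHashMap<>();

        void addCache(NoWait cache) {
            caches.put(cache.name, cache);
        }

        /** One recovery pass over whatever is registered right now. */
        void notifyError() {
            caches.forEach((name, cache) -> {
                if (cache.status == Status.ERROR) {
                    // Assumption: the reconnect succeeds.
                    cache.status = Status.ALIVE;
                }
            });
        }
    }

    /** Current order: notify during creation, register afterwards. */
    public static boolean recoversWithCurrentOrder() {
        Monitor monitor = new Monitor();
        NoWait cache = new NoWait("lateral", Status.ERROR); // connect failed
        monitor.notifyError();   // 'caches' is empty, nothing to fix
        monitor.addCache(cache); // registered too late
        return cache.status == Status.ALIVE;
    }

    /** Proposed order: register first, then notify if in ERROR. */
    public static boolean recoversWithProposedOrder() {
        Monitor monitor = new Monitor();
        NoWait cache = new NoWait("lateral", Status.ERROR); // connect failed
        monitor.addCache(cache);
        if (cache.status == Status.ERROR) {
            monitor.notifyError(); // now the monitor sees the cache
        }
        return cache.status == Status.ALIVE;
    }

    public static void main(String[] args) {
        System.out.println("current order recovers:  " + recoversWithCurrentOrder());
        System.out.println("proposed order recovers: " + recoversWithProposedOrder());
    }
}
```

In this simplified model the first line prints false and the second prints true. The real monitor runs asynchronously, so the actual timing may differ, but the dependency on 'caches' being populated before the notification is the same.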

 

*Addendum:*

I've attached a project with a test case for this problem.



> Lateral Cache init timing bug
> -----------------------------
>
>                 Key: JCS-242
>                 URL: https://issues.apache.org/jira/browse/JCS-242
>             Project: Commons JCS
>          Issue Type: Bug
>          Components: TCP Lateral Cache
>    Affects Versions: jcs-3.2.1
>            Reporter: Lukas Doros
>            Priority: Major
>         Attachments: lateral-bug.zip



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
