[ https://issues.apache.org/jira/browse/ARTEMIS-2069?focusedWorklogId=189469&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-189469 ]
ASF GitHub Bot logged work on ARTEMIS-2069:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 24/Jan/19 13:33
            Start Date: 24/Jan/19 13:33
    Worklog Time Spent: 10m
      Work Description: michaelandrepearce commented on pull request #2287: ARTEMIS-2069 Backup doesn't activate after shared store is reconnected
URL: https://github.com/apache/activemq-artemis/pull/2287#discussion_r250605929

 ##########
 File path: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java
 ##########
 @@ -299,44 +301,52 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     protected FileLock lock(final long lockPosition) throws Exception {
        long start = System.currentTimeMillis();
 +      boolean isRecurringFailure = false;

        while (!interrupted) {
 -         FileLock lock = tryLock(lockPosition);
 -
 -         if (lock == null) {
 -            try {
 -               Thread.sleep(500);
 -            } catch (InterruptedException e) {
 -               return null;
 +         try {
 +            FileLock lock = tryLock(lockPosition);
 +            isRecurringFailure = false;
 +
 +            if (lock == null) {
 +               try {
 +                  Thread.sleep(500);
 +               } catch (InterruptedException e) {
 +                  return null;
 +               }
 +
 +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
 +                  throw new Exception("timed out waiting for lock");
 +               }
 +            } else {
 +               return lock;
              }
 -
 -         if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
 -            throw new Exception("timed out waiting for lock");
 +         } catch (IOException e) {
 +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
 +
 +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
 +                       "Failure when accessing a lock file", e);
 +            isRecurringFailure = true;
 +
 +            long waitTime = LOCK_ACCESS_FAILURE_WAIT_TIME;
 +            if (lockAcquisitionTimeout != -1) {
 +               final long remainingTime = lockAcquisitionTimeout - (System.currentTimeMillis() - start);
 +               if (remainingTime <= 0) {
 +                  throw new Exception("timed out waiting for lock");

 Review comment:
   This exception is a little too generic; it should throw something more specific.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

    Worklog Id:     (was: 189469)
    Time Spent: 1h  (was: 50m)

> Backup doesn't activate after shared store is reconnected
> ---------------------------------------------------------
>
>                 Key: ARTEMIS-2069
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2069
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.6.2
>            Reporter: Tomas Hofman
>            Priority: Major
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> *Scenario*
>  # Start a live/backup server pair in a dedicated topology with shared store HA,
> with the journal located on NFS
>  # The NFS mount on the backup server fails
>  # Reconnect NFS on the backup server
>  # Try to shut down the live EAP server
>  # The backup doesn't activate
> *What happens*
> The backup waits for the live server to fail by checking its file lock. If
> the connection to shared storage fails, the backup logs the following error.
>
> 05:50:57,896 ERROR [org.apache.activemq.artemis.core.server] (AMQ119000: Activation for server ActiveMQServerImpl::serverUUID=836c9b1e-f067-11e7-8763-001b21862475) AMQ224000: Failure in initialisation: java.io.IOException: Input/output error
>     at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) [rt.jar:1.8.0_151]
>     at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:90) [rt.jar:1.8.0_151]
>     at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1115) [rt.jar:1.8.0_151]
>     at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.tryLock(FileLockNodeManager.java:299) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>     at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:316) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>     at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:127) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>     at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>     at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:2496) [artemis-server-1.5.5.008-redhat-1.jar:1.5.5.008-redhat-1]
>
> The exception is caught in {{SharedStoreBackupActivation.run}} and terminates
> the backup activation process.
> If the NFS is reconnected later, the backup server does not resume the
> activation process and no longer waits for the live server to fail. If the
> live server then fails, the backup doesn't activate, even though it has a
> connection to shared storage.
> The backup should keep retrying the live lock check even while the storage
> is unavailable. It should log warning/error messages saying that the storage
> is unavailable, but it should not terminate the activation process. This
> would allow the backup to continue its duties once the storage is reconnected.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
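
The retry behaviour requested in the issue, combined with the review remark about throwing something more specific than a bare Exception, could be sketched roughly as follows. This is a self-contained illustration under stated assumptions, not the code from PR #2287: LockRetrySketch, LockAttempt, LockTimeoutException and the parameters of acquire() are hypothetical names, and the LockAttempt callback only stands in for FileLockNodeManager.tryLock().

```java
import java.io.IOException;

/** Illustrative sketch only; every name here is hypothetical, not Artemis API. */
class LockRetrySketch {

    /** Hypothetical dedicated exception, used instead of a bare Exception. */
    static class LockTimeoutException extends Exception {
        LockTimeoutException(String message) {
            super(message);
        }
    }

    /** Stand-in for tryLock(): null means "held by someone else", IOException means "storage unreachable". */
    @FunctionalInterface
    interface LockAttempt<T> {
        T tryOnce() throws IOException;
    }

    /**
     * Retries the attempt until it yields a lock. An IOException is treated as a
     * possibly temporary condition (for example an unreachable NFS volume) and does
     * not abort the loop. A timeout of -1 means "wait forever".
     */
    static <T> T acquire(LockAttempt<T> attempt, long timeoutMillis, long retryDelayMillis)
            throws LockTimeoutException, InterruptedException {
        final long start = System.currentTimeMillis();
        boolean recurringFailure = false;

        while (true) {
            try {
                T lock = attempt.tryOnce();
                recurringFailure = false;
                if (lock != null) {
                    return lock;
                }
            } catch (IOException e) {
                // Report only the first failure of a streak to avoid flooding the log.
                if (!recurringFailure) {
                    System.err.println("Failure when accessing a lock file: " + e);
                }
                recurringFailure = true;
            }

            long waitTime = retryDelayMillis;
            if (timeoutMillis != -1) {
                long remaining = timeoutMillis - (System.currentTimeMillis() - start);
                if (remaining <= 0) {
                    throw new LockTimeoutException(
                            "timed out waiting for lock after " + timeoutMillis + " ms");
                }
                // Never sleep past the deadline.
                waitTime = Math.min(waitTime, remaining);
            }
            Thread.sleep(waitTime);
        }
    }
}
```

A caller would pass something like `() -> channel.tryLock(pos, 1, false)` as the attempt. Because the timeout has its own exception type, the activation code can tell it apart from other failures, and propagating InterruptedException leaves the shutdown decision to the caller.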