NFS problems can cause broker to hang on shutdown requiring a hard kill
------------------------------------------------------------------------
Key: AMQ-2595
URL: https://issues.apache.org/activemq/browse/AMQ-2595
Project: ActiveMQ
Issue Type: New Feature
Components: Broker
Affects Versions: 5.3.0
Reporter: Bruce Snyder
When the ActiveMQ data directory exists on a NFS mounted volume and that volume
experiences some type of failure, it can cause the broker to hang and only a
manual kill will shut it down. Below is a sequence of errors demonstrating this:
{code}
ERROR | Failed to fill batch
java.io.IOException: I/O error
at java.io.RandomAccessFile.readBytes(Native Method)
at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:381)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
at
org.apache.kahadb.journal.DataFileAccessor.readRecord(DataFileAccessor.java:90)
at org.apache.kahadb.journal.Journal.read(Journal.java:573)
at
org.apache.activemq.store.kahadb.MessageDatabase.load(MessageDatabase.java:656)
at
org.apache.activemq.store.kahadb.KahaDBStore.loadMessage(KahaDBStore.java:534)
at
org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore$4.execute(KahaDBStore.java:229)
at org.apache.kahadb.page.Transaction.execute(Transaction.java:728)
at
org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore.recoverNextMessages(KahaDBStore.java:222)
at
org.apache.activemq.broker.region.cursors.QueueStorePrefetch.doFillBatch(QueueStorePrefetch.java:81)
at
org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:227)
at
org.apache.activemq.broker.region.cursors.AbstractStoreCursor.reset(AbstractStoreCursor.java:99)
at
org.apache.activemq.broker.region.cursors.StoreQueueCursor.reset(StoreQueueCursor.java:157)
at org.apache.activemq.broker.region.Queue.doPageIn(Queue.java:1363)
at org.apache.activemq.broker.region.Queue.pageInMessages(Queue.java:1503)
at org.apache.activemq.broker.region.Queue.iterate(Queue.java:1178)
at
org.apache.activemq.thread.DeterministicTaskRunner.runTask(DeterministicTaskRunner.java:84)
at
org.apache.activemq.thread.DeterministicTaskRunner$1.run(DeterministicTaskRunner.java:41)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
ERROR | Failed to fill batch
java.lang.RuntimeException: java.io.IOException: I/O error
at
org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:230)
at
org.apache.activemq.broker.region.cursors.AbstractStoreCursor.reset(AbstractStoreCursor.java:99)
at
org.apache.activemq.broker.region.cursors.StoreQueueCursor.reset(StoreQueueCursor.java:157)
at org.apache.activemq.broker.region.Queue.doPageIn(Queue.java:1363)
at org.apache.activemq.broker.region.Queue.pageInMessages(Queue.java:1503)
at org.apache.activemq.broker.region.Queue.iterate(Queue.java:1178)
at
org.apache.activemq.thread.DeterministicTaskRunner.runTask(DeterministicTaskRunner.java:84)
at
org.apache.activemq.thread.DeterministicTaskRunner$1.run(DeterministicTaskRunner.java:41)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: I/O error
at java.io.RandomAccessFile.readBytes(Native Method)
at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:381)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
at
org.apache.kahadb.journal.DataFileAccessor.readRecord(DataFileAccessor.java:90)
at org.apache.kahadb.journal.Journal.read(Journal.java:573)
at
org.apache.activemq.store.kahadb.MessageDatabase.load(MessageDatabase.java:656)
at
org.apache.activemq.store.kahadb.KahaDBStore.loadMessage(KahaDBStore.java:534)
at
org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore$4.execute(KahaDBStore.java:229)
at org.apache.kahadb.page.Transaction.execute(Transaction.java:728)
at
org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore.recoverNextMessages(KahaDBStore.java:222)
at
org.apache.activemq.broker.region.cursors.QueueStorePrefetch.doFillBatch(QueueStorePrefetch.java:81)
at
org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:227)
... 10 more
ERROR | Failed to page in more queue messages
java.lang.RuntimeException: java.lang.RuntimeException:
java.io.IOException: I/O error
at
org.apache.activemq.broker.region.cursors.AbstractStoreCursor.reset(AbstractStoreCursor.java:102)
at
org.apache.activemq.broker.region.cursors.StoreQueueCursor.reset(StoreQueueCursor.java:157)
at org.apache.activemq.broker.region.Queue.doPageIn(Queue.java:1363)
at org.apache.activemq.broker.region.Queue.pageInMessages(Queue.java:1503)
at org.apache.activemq.broker.region.Queue.iterate(Queue.java:1178)
at
org.apache.activemq.thread.DeterministicTaskRunner.runTask(DeterministicTaskRunner.java:84)
at
org.apache.activemq.thread.DeterministicTaskRunner$1.run(DeterministicTaskRunner.java:41)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: java.io.IOException: I/O error
at
org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:230)
at
org.apache.activemq.broker.region.cursors.AbstractStoreCursor.reset(AbstractStoreCursor.java:99)
... 9 more
Caused by: java.io.IOException: I/O error
at java.io.RandomAccessFile.readBytes(Native Method)
at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:381)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
at
org.apache.kahadb.journal.DataFileAccessor.readRecord(DataFileAccessor.java:90)
at org.apache.kahadb.journal.Journal.read(Journal.java:573)
at
org.apache.activemq.store.kahadb.MessageDatabase.load(MessageDatabase.java:656)
at
org.apache.activemq.store.kahadb.KahaDBStore.loadMessage(KahaDBStore.java:534)
at
org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore$4.execute(KahaDBStore.java:229)
at org.apache.kahadb.page.Transaction.execute(Transaction.java:728)
at
org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore.recoverNextMessages(KahaDBStore.java:222)
at
org.apache.activemq.broker.region.cursors.QueueStorePrefetch.doFillBatch(QueueStorePrefetch.java:81)
at
org.apache.activemq.broker.region.cursors.AbstractStoreCursor.fillBatch(AbstractStoreCursor.java:227)
... 10 more
java.io.IOException: I/O error
at java.io.RandomAccessFile.write(Native Method)
at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:967)
at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:974)
at org.apache.kahadb.page.PageFile.flush(PageFile.java:443)
at
org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:947)
at
org.apache.activemq.store.kahadb.MessageDatabase.access$200(MessageDatabase.java:79)
at
org.apache.activemq.store.kahadb.MessageDatabase$9.execute(MessageDatabase.java:587)
at org.apache.kahadb.page.Transaction.execute(Transaction.java:728)
at
org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:585)
at
org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:256)
Exception in thread "ActiveMQ Journal Checkpoint Worker"
java.lang.NullPointerException
at java.util.zip.Adler32.update(Adler32.java:44)
at org.apache.kahadb.page.PageFile.writeBatch(PageFile.java:962)
at org.apache.kahadb.page.PageFile.flush(PageFile.java:443)
at
org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:947)
at
org.apache.activemq.store.kahadb.MessageDatabase.access$200(MessageDatabase.java:79)
at
org.apache.activemq.store.kahadb.MessageDatabase$9.execute(MessageDatabase.java:587)
at org.apache.kahadb.page.Transaction.execute(Transaction.java:728)
at
org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:585)
at
org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:256)
^C INFO | ActiveMQ Message Broker (dynamic-broker1,
ID:tssol10s-32974-1265131581698-0:0) is shutting down
INFO | Connector vm://dynamic-broker1 Stopped
INFO | dynamic-broker1 bridge to dynamic-broker2 stopped
INFO | Network Connector localhost Stopped
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
INFO | The connection to '/127.0.0.1:32982' is taking a long time to shutdown.
{code}
In this case, a problem occurred with the NFS mount point where the data
directory lives which resulted in the I/O error. After sending a ctrl-c to shut
down the broker, it would not shut down on it's own so it had to be killed.
Only then could failover take place allowing another broker could grab the lock.
Given that the broker cannot recover from such a problem, it should probably
force itself to shut down so that failover can take place. Also, such a failure
should be a FATAL and not an ERROR in the logging.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.