ActiveMQ 5.16.4 Data Corruption

Nordstrom, Karl Tue, 14 Jun 2022 12:48:36 -0700

Hello,

We have activemq-5.16.4 and java-1.8.0-openjdk.x86_64 1:1.8.0.332.b09-1.el7_9 
running on rhel7.


The following was done on our acceptance cluster.

I check activemq.log for messages to determine if activemq has corrupt data 
files:

[kxn2@amq-a02 scheduler]$ sudo grep "Failed to start job scheduler store" /opt/local/activemq/data/activemq.log | head -1
2022-06-03 16:00:46,670 | ERROR | Failed to start job scheduler store: 
JobSchedulerStore: 
/opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler | 
org.apache.activemq.broker.BrokerService | main

Then I move scheduleDB files after stopping activemq.service on both brokers.

cd /opt/local/activemq/data/kahadb/scheduler

sudo mv scheduleDB.data scheduleDB.data.`date +%Y%m%d`
sudo mv scheduleDB.redo scheduleDB.redo.`date +%Y%m%d`

After starting ActiveMQ, 7,500,000 entries were recovered, but it failed with 
ERROR | Failed to start job scheduler store.

There was a corrupt journal file.

[kxn2@amq-a02 data]$ grep Corrupt activemq.log*

2022-06-02 07:55:40,066 | WARN  | Corrupt journal records found in 
'/opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler/db-1179.log'
 between offsets: 11558626..11559784 | 
org.apache.activemq.store.kahadb.disk.journal.Journal | main
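
To pull the journal file name and the two offsets out of WARN lines like the one above, something along these lines might help. It is only a sketch: parse_corrupt is a made-up helper name, and it assumes each log entry sits on a single line in activemq.log (the wrapping above comes from this mail).

```shell
# parse_corrupt [FILE...]
# Print "<journal file> <start offset> <end offset>" for every
# "Corrupt journal records found" line in the given files (or stdin).
# Hypothetical helper; assumes one log entry per line.
parse_corrupt() {
    sed -n "s/.*Corrupt journal records found in '\([^']*\)' *between offsets: \([0-9]*\)\.\.\([0-9]*\).*/\1 \2 \3/p" "$@"
}

# Example call (not run here):
#   parse_corrupt /opt/local/activemq/data/activemq.log
```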

We tried starting activemq without the db-1179.log file, and again with an empty 
db-1179.log file. ActiveMQ complained in both cases.

We eventually stopped activemq, renamed the scheduler/ directory and started 
activemq.

After the restart, we have one db-*.log file with 50K messages.

[kxn2@amq-a02 scheduler]$ wc -l db-1.log
50,067 db-1.log

Before, we had 125 log files and 8,697,209 messages!

[kxn2@amq-a02 scheduler.bkup]$ wc -l db-*.log
...
8,697,209 total

So, we have millions of messages that we probably do not need. It took 2.5 
hours to recover 7.5M entries before it failed, likely because of the corrupt 
record.

How can I get activemq to clean up these logs, so this recovery doesn't take so 
long?

How can I correct the data corruption?

As a test, I removed the byte range between offsets 11558626..11559784 from 
the file. I used the "head -c" command, grep and vi to do that. 
ActiveMQ did start.
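
Roughly, that kind of byte-range removal can be sketched with head and tail alone. This is not the exact sequence of commands I ran; cut_range is a made-up helper name, and it assumes the offsets are 0-based with the end offset exclusive, which I have not confirmed against how ActiveMQ reports them. It writes to a new file and leaves the original untouched.

```shell
# cut_range SRC DST START END
# Copy SRC to DST, leaving out the bytes from offset START up to,
# but not including, offset END.
# Assumption: offsets are 0-based and END is exclusive.
cut_range() {
    src=$1
    dst=$2
    start=$3
    end=$4
    {
        # Everything before the range: the first START bytes.
        head -c "$start" "$src"
        # Everything after the range: tail -c +N starts at byte N (1-based).
        tail -c "+$((end + 1))" "$src"
    } > "$dst"
}

# Example call (not run here):
#   cut_range db-1179.log db-1179.log.trimmed 11558626 11559784
```

The resulting file should be shorter than the original by END minus START bytes, which is an easy check before swapping it in.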

I am hoping that this doesn't happen in production, because it won't be 
acceptable to lose messages to get activemq to start up.

---

Karl Nordström

Systems Administrator

Penn State IT | Application Platforms
