Hi Karl,

IMHO, the problem is related to the storage. The only time I saw such
corruption was due to a script that zipped all the log files in kahadb (the
team thought they were plain log files, not journal files).
Double-check your storage, and maybe add a lease-locker config; I think the
problem is more around that.

Regards,
JB

On Thu, Jun 16, 2022 at 7:58 PM Nordstrom, Karl <k...@psu.edu> wrote:
>
> Matt, et al.,
>
> KahaDB is on shared NFS scale-out NAS storage. Sometimes ActiveMQ loses its
> NFS mounts when the Storage Team upgrades the OS on the storage nodes. They
> upgrade one node at a time, so the NFS mount must migrate to another storage
> node. Supposedly, it can take up to 30 seconds to migrate. The IP address of
> the new storage node is different from the original storage node's. We avoid
> data corruption by stopping activemq.service on the broker that is in slave
> mode during the storage upgrade.
>
> Unfortunately, I did not check for I/O errors earlier. I don't have
> /var/log/messages before 2022/06/10. If this happens again I will certainly
> follow your advice.
>
> ---
> Karl Nordström
> Systems Administrator
> Penn State IT | Application Platforms
>
> From: Matt Pavlovich <mattr...@gmail.com>
> Sent: Wednesday, June 15, 2022 6:24 PM
> To: users@activemq.apache.org <users@activemq.apache.org>
> Subject: Re: ActiveMQ 5.16.4 Data Corruption
>
> Karl-
>
> Is this on a local disk, RAID, SAN, or NAS? The first step is to confirm
> there was no disk corruption -- check your syslog and dmesg output for any
> I/O error messages.
>
> Thanks,
> Matt
>
> > On Jun 14, 2022, at 2:48 PM, Nordstrom, Karl <k...@psu.edu> wrote:
> >
> > Hello,
> >
> > We have activemq-5.16.4 and java-1.8.0-openjdk.x86_64
> > 1:1.8.0.332.b09-1.el7_9 running on RHEL 7.
> >
> > The following was done on our acceptance cluster.
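[A sketch of what JB's lease-locker suggestion might look like in
activemq.xml. This fragment is not from the thread: the lease-database-locker
keeps the master lock in a JDBC table instead of an NFS file lock, so it
needs a JDBC datasource; the "#lock-ds" bean name and the interval value
below are placeholders.]

```xml
<persistenceAdapter>
  <kahaDB directory="${activemq.data}/kahadb">
    <locker>
      <!-- Lease is held in a JDBC table rather than an NFS file lock,
           so an NFS-mount failover cannot leave a stale lock behind.
           "#lock-ds" must reference a dataSource bean you define. -->
      <lease-database-locker dataSource="#lock-ds"
                             lockAcquireSleepInterval="10000"/>
    </locker>
  </kahaDB>
</persistenceAdapter>
```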
> >
> > I check activemq.log for messages to determine if ActiveMQ has corrupt
> > data files:
> >
> > [kxn2@amq-a02 scheduler]$ sudo grep "Failed to start job scheduler store" /opt/local/activemq/data/activemq.log | head -1
> > 2022-06-03 16:00:46,670 | ERROR | Failed to start job scheduler store: JobSchedulerStore: /opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler | org.apache.activemq.broker.BrokerService | main
> >
> > Then I move the scheduleDB files after stopping activemq.service on both
> > brokers:
> >
> > cd /opt/local/activemq/data/kahadb/scheduler
> > sudo mv scheduleDB.data scheduleDB.data.`date +%Y%m%d`; sudo mv scheduleDB.redo scheduleDB.redo.`date +%Y%m%d`
> >
> > After starting ActiveMQ, 7,500,000 entries were recovered, but it failed
> > with "ERROR | Failed to start job scheduler store".
> >
> > There was a corrupt journal file:
> >
> > [kxn2@amq-a02 data]$ grep Corrupt activemq.log*
> > 2022-06-02 07:55:40,066 | WARN | Corrupt journal records found in '/opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler/db-1179.log' between offsets: 11558626..11559784 | org.apache.activemq.store.kahadb.disk.journal.Journal | main
> >
> > We tried starting ActiveMQ without the db-1179.log file, and with an
> > empty db-1179.log file. ActiveMQ complained about both.
> >
> > We eventually stopped ActiveMQ, renamed the scheduler/ directory, and
> > started ActiveMQ again.
> >
> > After the restart, we have one db-*.log file with 50K messages:
> >
> > [kxn2@amq-a02 scheduler]$ wc -l db-1.log
> > 50,067 db-1.log
> >
> > Before, we had 125 log files and 8,697,209 messages!
> >
> > [kxn2@amq-a02 scheduler.bkup]$ wc -l db-*.log
> > ...
> > 8,697,209 total
> >
> > So we have millions of messages that we probably do not need. It took
> > 2.5 hours to recover 7.5M entries before it failed, likely due to the
> > corrupt record.
> >
> > How can I get ActiveMQ to clean up these logs, so this recovery doesn't
> > take so long?
> >
> > How can I correct the data corruption?
> >
> > As a test, I did remove the range of the file between offsets
> > 11558626..11559784. I used the "head -c" command, grep, and vi to do
> > that. ActiveMQ did start.
> >
> > I am hoping that this doesn't happen in production, because it won't be
> > acceptable to lose messages just to get ActiveMQ to start up.
> >
> > ---
> > Karl Nordström
> > Systems Administrator
> > Penn State IT | Application Platforms
>
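[The head -c / tail -c surgery Karl describes in his June 14 mail can be
scripted. This is a sketch, not from the thread: it assumes the start offset
in the "Corrupt journal records found ... between offsets: START..END"
warning is the first bad byte and the end offset is one past the last bad
byte -- verify that against your own file, keep a backup, and only run it
with the broker stopped.]

```shell
#!/bin/sh
# Excise the byte range [start, end) from a file, writing the result
# to "$file.fixed" so the original is left untouched.
excise_range() {
  file=$1 start=$2 end=$3
  head -c "$start" "$file" > "$file.fixed"          # bytes before the range
  tail -c "+$((end + 1))" "$file" >> "$file.fixed"  # bytes from offset $end on
}

# Demo on a scratch file: remove "BAD" (offsets 5..8) from "helloBADworld".
printf 'helloBADworld' > /tmp/demo.log
excise_range /tmp/demo.log 5 8
cat /tmp/demo.log.fixed    # -> helloworld
```

[In Karl's case the call would be something like
`excise_range db-1179.log 11558626 11559784`, followed by a manual check
and a `mv` of the .fixed file into place.]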