Hi Karl,

IMHO, the problem is related to the storage. The only time I saw this kind
of corruption, it was caused by a script that zipped up all the files in
kahadb (the team assumed they were plain log files, not journal files).

Double-check your storage, and maybe add a lease locker config; I think the
problem is more around that.
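For reference, a lease locker would look roughly like the sketch below in activemq.xml. This is a hedged example based on the ActiveMQ pluggable-locker docs, not a drop-in config: `lease-database-locker` needs a JDBC `dataSource` bean (here the assumed name `#lockDataSource`) even when KahaDB holds the messages, so verify the element names and attributes against your 5.16.x distribution.

```xml
<persistenceAdapter>
  <kahaDB directory="${activemq.data}/kahadb">
    <locker>
      <!-- Lease-based lock held in a database row; tolerates NFS failover
           better than the default shared-file-locker, which relies on
           file-system locking semantics that NFS migration can break. -->
      <lease-database-locker dataSource="#lockDataSource"
                             lockAcquireSleepInterval="10000"/>
    </locker>
  </kahaDB>
</persistenceAdapter>
```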

Regards
JB

On Thu, Jun 16, 2022 at 7:58 PM Nordstrom, Karl <k...@psu.edu> wrote:
>
> Matt, et al.,
>
> Kahadb is on shared NFS scaled out NAS storage. Sometimes, ActiveMQ loses its 
> NFS mounts when the Storage Team upgrades the OS on the storage nodes. They 
> upgrade one node at a time. The NFS mount must migrate to another storage 
> node. Supposedly, it can take up to 30 seconds to migrate. The IP address of 
> the new storage node is different from that of the original storage node. We avoid 
> data corruption by stopping activemq.service on the broker that is in slave 
> mode during the storage upgrade.
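Since a hung NFS mount blocks rather than fails, a quick liveness probe like the sketch below (path taken from this thread; the 5-second cap is an arbitrary choice) can tell you whether the kahadb mount survived a storage-node migration:

```shell
# Minimal sketch: probe whether the shared KahaDB directory is responsive.
# A hung NFS mount makes "ls" block indefinitely, so cap it at 5 seconds.
check_mount() {
    # prints "ok" if the directory lists within 5 seconds, "bad" otherwise
    if timeout 5 ls "$1" >/dev/null 2>&1; then
        echo ok
    else
        echo bad
    fi
}

# Typical use on the broker host:
#   check_mount /opt/local/activemq/data/kahadb
```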
>
> Unfortunately, I did not check for I/O errors earlier. I don't have 
> /var/log/messages from before 2022/06/10. If this happens again, I will certainly 
> follow your advice.
>
>
> ---
>
> Karl Nordström
>
> Systems Administrator
>
> Penn State IT | Application Platforms
>
> ________________________________
> From: Matt Pavlovich <mattr...@gmail.com>
> Sent: Wednesday, June 15, 2022 6:24 PM
> To: users@activemq.apache.org <users@activemq.apache.org>
> Subject: Re: ActiveMQ 5.16.4 Data Corruption
>
> Karl-
>
> Is this on a local disk, RAID, SAN, or NAS? The first step is to confirm there was 
> no disk corruption -- check your syslog and dmesg output for any I/O error 
> messages.
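A sketch of that log check is below. The grep patterns are illustrative only, not an exhaustive list of kernel I/O error messages; extend them for your storage stack.

```shell
# Sketch: scan a log file for disk/NFS I/O error messages.
# Pattern list is illustrative, not exhaustive.
scan_io_errors() {
    grep -Ei 'i/o error|nfs: server .* not responding|journal (abort|commit error)' "$1"
}

# Typical use on the RHEL 7 hosts from this thread:
#   scan_io_errors /var/log/messages
#   dmesg | scan_io_errors /dev/stdin
```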
>
> Thanks,
> Matt
>
> > On Jun 14, 2022, at 2:48 PM, Nordstrom, Karl <k...@psu.edu> wrote:
> >
> > Hello,
> >
> > We have activemq-5.16.4 and java-1.8.0-openjdk.x86_64 
> > 1:1.8.0.332.b09-1.el7_9 running on rhel7.
> >
> > The following was done on our acceptance cluster.
> >
> > I check activemq.log for messages to determine if activemq has corrupt data 
> > files:
> >
> > [kxn2@amq-a02 scheduler]$ sudo grep "Failed to start job scheduler store" 
> > /opt/local/activemq/data/activemq.log | head -1
> > 2022-06-03 16:00:46,670 | ERROR | Failed to start job scheduler store: 
> > JobSchedulerStore: 
> > /opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler | 
> > org.apache.activemq.broker.BrokerService | main
> >
> > Then I move scheduleDB files after stopping activemq.service on both 
> > brokers.
> >
> > cd /opt/local/activemq/data/kahadb/scheduler
> >
> > sudo mv scheduleDB.data scheduleDB.data.`date +%Y%m%d`; sudo mv 
> > scheduleDB.redo scheduleDB.redo.`date +%Y%m%d`
> >
> > After starting ActiveMQ, 7,500,000 entries were recovered, but it failed 
> > with ERROR | Failed to start job scheduler store.
> >
> > There was a corrupt journal file.
> >
> > [kxn2@amq-a02 data]$ grep Corrupt activemq.log*
> >
> > 2022-06-02 07:55:40,066 | WARN  | Corrupt journal records found in 
> > '/opt/local/apache-activemq-5.16.4/data/amq-acceptance-cluster/scheduler/db-1179.log'
> >  between offsets: 11558626..11559784 | 
> > org.apache.activemq.store.kahadb.disk.journal.Journal | main
> >
> > We tried starting ActiveMQ both without the db-1179.log file and with an 
> > empty db-1179.log file. ActiveMQ complained in both cases.
> >
> > We eventually stopped activemq, renamed the scheduler/ directory and started 
> > activemq.
> >
> > After the restart, we had one db-*.log file with 50K messages.
> >
> > [kxn2@amq-a02 scheduler]$ wc -l db-1.log
> > 50,067 db-1.log
> >
> > Before, we had 125 log files and 8,697,209 messages!
> >
> > [kxn2@amq-a02 scheduler.bkup]$ wc -l db-*.log
> > ...
> > 8,697,209 total
> >
> > So, we have millions of messages that we probably do not need. It took 2.5 
> > hours to recover 7.5M entries before it failed, likely due to the corrupt 
> > record.
> >
> > How can I get activemq to clean up these logs, so this recovery doesn't 
> > take so long?
> >
> > How can I correct the data corruption?
> >
> > For a test, I did remove the range of the file between offsets: 
> > 11558626..11559784. I used the "head -c" command, grep and vi to do that. 
> > ActiveMQ did start.
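That head/tail excision can be scripted as below. This is a hedged sketch that assumes the offsets reported by the "Corrupt journal records" warning are 0-based and inclusive; if they are interpreted differently in your version, adjust the arithmetic, and always operate on a copy of the journal, never the original.

```shell
# Sketch: cut the byte range [START, END] (0-based, inclusive) out of FILE,
# writing the result to stdout. Work on a copy of the journal file.
excise_range() {   # usage: excise_range FILE START END > FIXED_FILE
    head -c "$2" "$1"                # keep bytes 0 .. START-1
    tail -c "+$(( $3 + 2 ))" "$1"    # keep bytes END+1 .. EOF (tail -c +N is 1-based)
}

# e.g.: excise_range db-1179.log 11558626 11559784 > db-1179.log.fixed
```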
> >
> > I am hoping that this doesn't happen in production, because it won't be 
> > acceptable to lose messages to get activemq to start up.
> >
> > ---
> >
> > Karl Nordström
> >
> > Systems Administrator
> >
> > Penn State IT | Application Platforms
>
