[ https://issues.apache.org/jira/browse/AMQ-7080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16660840#comment-16660840 ]
Alan Protasio commented on AMQ-7080:
------------------------------------

[~gtully] Cool! Thanks!! I saw that it's already fixed! That was fast! :D

Did you have a chance to look at this PR?

Cheers..

> Keep track of free pages - Update db.free file during checkpoints
> -----------------------------------------------------------------
>
>                 Key: AMQ-7080
>                 URL: https://issues.apache.org/jira/browse/AMQ-7080
>             Project: ActiveMQ
>          Issue Type: Improvement
>          Components: KahaDB
>    Affects Versions: 5.15.6
>            Reporter: Alan Protasio
>            Assignee: Jean-Baptiste Onofré
>            Priority: Major
>             Fix For: 5.16.0, 5.15.7
>
>         Attachments: AMQ-7080-freeList-update.diff
>
>
> In the event of an unclean shutdown, ActiveMQ loses the information about the
> free pages in the index. In order to recover this information, ActiveMQ reads
> the whole index during shutdown searching for free pages and then saves the
> db.free file. This operation can take a long time, making the failover
> slower (during the shutdown, ActiveMQ will still hold the lock).
> From http://activemq.apache.org/shared-file-system-master-slave.html
> {quote}"If you have a SAN or shared file system it can be used to provide
> high availability such that if a broker is killed, another broker can take
> over immediately."
> {quote}
> It is important to note that if the shutdown takes more than
> ACTIVEMQ_KILL_MAXSECONDS seconds, any following shutdown will be unclean.
> The broker will stay in this state unless the index is deleted (this state
> means that every failover will take more than ACTIVEMQ_KILL_MAXSECONDS, so
> if you increase this time to 5 minutes, your failover can take more than
> 5 minutes).
>
> In order to prevent ActiveMQ from reading the whole index file to search for
> free pages, we can keep track of those on every checkpoint. In order to do
> that we need to be sure that db.data and db.free are in sync. To achieve that
> we can have an attribute in the db.free page that is referenced by the
> db.data.
> So during the checkpoint we have:
> 1 - Save db.free and give it a freePageUniqueId
> 2 - Save this freePageUniqueId in the db.data (metadata)
> In a crash, we can see if the db.data has the same freePageUniqueId as the
> db.free. If this is the case we can safely use the free page information
> contained in the db.free.
> Now, the only case in which the whole index file has to be read again is if
> the crash happens between steps 1 and 2 (which is very unlikely).
> The drawback of this implementation is that we will have to save db.free
> during the checkpoint, which can increase the checkpoint time.
> It is also important to note that we CAN (and should) have stale data in
> db.free, as it is referencing stale db.data:
> Imagine the timeline:
> T0 -> P1, P2 and P3 are free.
> T1 -> Checkpoint
> T2 -> P1 got occupied.
> T3 -> Crash
> In the current scenario, after the Pagefile#load the P1 will be free and then
> the replay will mark P1 as occupied or will occupy another page (now that
> the recovery of free pages is done on shutdown).
> This change only makes sure that db.data and db.free are in sync and showing
> the reality at T1 (checkpoint). If they are in sync we can trust the db.free.
> This is a really fast draft of what I'm suggesting... If you guys agree, I
> can create the proper patch after:
> [https://github.com/alanprot/activemq/commit/18036ef7214ef0eaa25c8650f40644dd8b4632a5]
>
> This is related to https://issues.apache.org/jira/browse/AMQ-6590

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
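
For readers following the proposal, here is a minimal sketch of the two-step checkpoint and the crash-time check described in the issue. It is an in-memory model only: the class FreeListSyncSketch and its methods are hypothetical names chosen for illustration, not KahaDB's PageFile API and not the code from the linked commit. The only identifier taken from the issue is freePageUniqueId.

```java
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory model of the db.free / db.data sync idea (not KahaDB code). */
public class FreeListSyncSketch {
    // Stand-in for the db.free file: the saved free-page set and the id it was saved with.
    private Set<Long> savedFreePages = new TreeSet<>();
    private long freeFileUniqueId = -1;

    // Stand-in for the metadata kept in db.data.
    private long metadataFreePageUniqueId = -1;

    private long nextId = 0;

    /** Step 1: save db.free and give it a fresh freePageUniqueId. */
    public long saveFreeFile(Set<Long> currentFreePages) {
        savedFreePages = new TreeSet<>(currentFreePages);
        freeFileUniqueId = ++nextId;
        return freeFileUniqueId;
    }

    /** Step 2: record that freePageUniqueId in the db.data metadata. */
    public void saveMetadata(long freePageUniqueId) {
        metadataFreePageUniqueId = freePageUniqueId;
    }

    /** A full checkpoint runs step 1 and then step 2. */
    public void checkpoint(Set<Long> currentFreePages) {
        saveMetadata(saveFreeFile(currentFreePages));
    }

    /**
     * After a crash: trust db.free only if both ids match.
     * An empty result means the whole index would have to be scanned.
     */
    public Optional<Set<Long>> recoverFreePages() {
        if (freeFileUniqueId != -1 && freeFileUniqueId == metadataFreePageUniqueId) {
            return Optional.of(new TreeSet<>(savedFreePages));
        }
        return Optional.empty();
    }
}
```

In the T0..T3 timeline from the description, recoverFreePages() after the crash would return P1, P2 and P3 as they were at T1; the journal replay is then expected to mark P1 as occupied again. Calling saveFreeFile() without saveMetadata() models a crash between steps 1 and 2, where the ids differ and the sketch falls back to the full scan.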