So I have a problem.  User renames are

a) not atomic (in fact, any rename of a folder with subfolders is not atomic)
b) bandwidth wasteful with replication if a sync_client picks up the wrong set 
of folder names too early (it winds up deleting the old user, then having to 
copy all the messages again)

(a) is a general problem.  There are failure modes which can leave manual 
cleanup required.  I hate that.

(b) is what's causing me issues RIGHT NOW.

I sat down with Rob Mueller last Friday to talk this through, and we've come up 
with what I believe is a good solution to both these problems.  It's 
crash-safe, atomic, and cleans up after itself automatically (immediately if 
an affected folder is opened, otherwise at the next cyr_expire run).

This change requires the flexible mboxlist format now present in the master 
tree.  This format is a key-value format allowing arbitrary items to be stored 
in the mboxlist file.

Consider the following rename:

user.foo.sub         => user.foo.new
user.foo.sub.A       => user.foo.new.A
user.foo.sub.B       => user.foo.new.B

(it will work for all other cases as well)

First we take an exclusive namelock on user.foo.sub.  This namelock will be 
held for the entire time.

Next we take an exclusive lock on mailboxes.db, and insert/replace the 
following records:

user.foo.sub         %(NAMELOCK user.foo.sub TYPE RENAMELOCKED)
user.foo.sub.A       %(NAMELOCK user.foo.sub TYPE RENAMELOCKED)
user.foo.sub.B       %(NAMELOCK user.foo.sub TYPE RENAMELOCKED)
user.foo.new         %(NAMELOCK user.foo.sub TYPE RENAMETEMP)
user.foo.new.A       %(NAMELOCK user.foo.sub TYPE RENAMETEMP)
user.foo.new.B       %(NAMELOCK user.foo.sub TYPE RENAMETEMP)

And we release the mailboxes.db lock, allowing the rest of the server to run 
happily.
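
To make the first transaction concrete, here is a rough sketch in Python 
rather than the real C.  The function name and the use of plain dicts for 
records are just for illustration, not anything in the tree:

```python
def initial_rename_records(mailboxes, old_root, new_root):
    """Build the records written in the first mailboxes.db transaction.

    Hypothetical helper: records are modelled as plain dicts, and the
    hierarchy separator is assumed to be '.' as in the example above.
    """
    records = {}
    for name in mailboxes:
        # Only the root being renamed and its subfolders are affected.
        if name != old_root and not name.startswith(old_root + "."):
            continue
        new_name = new_root + name[len(old_root):]
        records[name] = {"NAMELOCK": old_root, "TYPE": "RENAMELOCKED"}
        records[new_name] = {"NAMELOCK": old_root, "TYPE": "RENAMETEMP"}
    return records
```

Every record points at the same namelock (the source root), which is what 
lets any other process find the one lock it has to wait on.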

We then create the on-disk directories for user.foo.new and friends, and we 
copy EVERYTHING into the new locations, including building the cyrus.index 
files, linking the spool, etc.

Any other process which tries to open any of these mailboxes will see the 'TYPE 
RENAMELOCKED' field and block on the NAMELOCK field's lock file until the 
rename is either finished or aborted.  Because they block, there are no 
spurious errors returned to clients during the rename.

A successful rename - after all the files are in place, we take an exclusive 
lock on the mailboxes.db again and in an atomic transaction we make the 
following changes:

user.foo.sub         %(NAMELOCK user.foo.sub TYPE RENAMETEMP)
user.foo.sub.A       %(NAMELOCK user.foo.sub TYPE RENAMETEMP)
user.foo.sub.B       %(NAMELOCK user.foo.sub TYPE RENAMETEMP)
user.foo.new         %()
user.foo.new.A       %()
user.foo.new.B       %()

So the new folders are now ready to use, but this process is still holding the 
old folders, and still holding the old namelock.
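
Sketched the same way (Python, records as dicts, function name made up), the 
second transaction is just a flip of the two record types for everything 
tagged with this namelock:

```python
def commit_rename(db, old_root):
    """Return the contents of mailboxes.db after the second transaction.

    Hypothetical model: db maps mailbox name -> record dict.  A copy is
    returned so the caller can swap it in as one step, standing in for
    the single atomic transaction described above.
    """
    updated = dict(db)
    for name, record in db.items():
        if record.get("NAMELOCK") != old_root:
            continue  # not part of this rename
        if record.get("TYPE") == "RENAMELOCKED":
            # Old folders become the temporary ones, pending file deletion.
            updated[name] = {"NAMELOCK": old_root, "TYPE": "RENAMETEMP"}
        elif record.get("TYPE") == "RENAMETEMP":
            # New folders become ordinary, usable mailboxes.
            updated[name] = {}
    return updated
```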

At this point, we go through all the old folders and delete the on-disk files.  
Once that's done, we can do a single update in the mailboxes.db:

user.foo.sub         %(TYPE DELETED)

(we keep DELETED tombstones in mailboxes.db now to ensure that UIDVALIDITY 
never gets reused, but also to detect the difference between "folder created on 
A" and "folder deleted on B" in a multi-master replication setup)

-------------

A failed rename - if, at any point before the atomic rename updates happen, 
there is an error or the process doing the rename crashes, there may be files 
left on disk in the destination folders, and the initial records will still be 
in the mailboxes.db.

If any other process tries to open one of those folders (including cyr_expire, 
which visits every folder) - it will attempt to get the namelock on the source 
root as per the NAMELOCK field in mailboxes.db.  When it obtains that lock, it 
will check to see if another process finished the cleanup first.  If not, it 
will either:

a) for RENAMELOCKED - just update the mailboxes.db to say that the folder is 
no longer renamelocked, then go about its business.
b) for RENAMETEMP - delete all the files on disk, and remove the record from 
the mailboxes.db (or restore the old DELETED record if there was one)

At which point, cleanup is completed.
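
The cleanup decision could look something like the following Python sketch.  
The function, the remove_files callback and the saved_tombstones argument are 
all hypothetical; in particular, where the old DELETED record gets stashed so 
that it can be restored is not decided above, so it is simply passed in here:

```python
def cleanup_after_failed_rename(db, name, remove_files, saved_tombstones=None):
    """Clean up one mailbox left behind by a failed rename.

    Assumes the caller already holds the namelock named in the record.
    db maps mailbox name -> record dict; remove_files(name) deletes the
    on-disk files.  Returns a string describing what was done.
    """
    record = db.get(name)
    if record is None or record.get("TYPE") not in ("RENAMELOCKED", "RENAMETEMP"):
        return "nothing"  # another process finished the cleanup first
    if record["TYPE"] == "RENAMELOCKED":
        db[name] = {}  # case (a): just clear the rename markers
        return "unlocked"
    # Case (b): RENAMETEMP, so drop the files and the temporary record.
    remove_files(name)
    tombstone = (saved_tombstones or {}).get(name)
    if tombstone is not None:
        db[name] = tombstone  # restore the old DELETED record
    else:
        del db[name]
    return "removed"
```

Because the record is re-read after the lock is obtained, two processes racing 
to clean up the same folder are harmless: the second one finds nothing to do.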

This will work no matter what the rename (though there might need to be some 
extra magic added for cross-partition renames to ensure we can clean them up 
safely too, since they don't have a second mailboxes.db entry).

In the interests of replication speed, we MAY convert sync_client to trylock 
rather than locking in these cases, and if it fails it will just insert the 
mailbox into the next synclog and then continue with other mailboxes.  I 
definitely need to make it run the 'USER' sync earlier and add all the 
mailboxes found in that into the general pool of mailboxes so that it detects 
user renames better anyway.
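
For the trylock idea, a minimal Python sketch, assuming the namelock is an 
flock()-style lock on a per-name file.  The lock_file_for and sync_one 
callbacks and both function names are made up for the example, and taking the 
lock exclusively is a guess at this stage:

```python
import fcntl
import os

def try_namelock(lock_file):
    """Return an fd holding the lock, or None if another process has it."""
    fd = os.open(lock_file, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # LOCK_NB makes flock() fail straight away instead of blocking.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd

def sync_mailboxes(names, records, lock_file_for, sync_one):
    """Sync what can be synced now; return names for the next sync log."""
    deferred = []
    for name in names:
        record = records.get(name, {})
        if record.get("TYPE") in ("RENAMELOCKED", "RENAMETEMP"):
            fd = try_namelock(lock_file_for(record["NAMELOCK"]))
            if fd is None:
                deferred.append(name)  # rename in progress, retry later
                continue
            # Lock was free, so no rename is running.  A real version
            # would do the failed-rename cleanup here before syncing.
            os.close(fd)  # closing the fd releases the lock
        sync_one(name)
    return deferred
```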

Bron.

-- 
  Bron Gondwana
  br...@fastmail.fm
