Re: [Dovecot] How to get rid of locks

2007-05-14 Thread Timo Sirainen
On Sat, 2007-04-07 at 22:30 +0300, Timo Sirainen wrote:
 I just figured out that O_APPEND is pretty great. If the operating  
 system updates seek position after writing to a file opened with  
 O_APPEND, writes to Dovecot's transaction log file can be made  
 lockless.

Well, almost. Log rotation isn't possible without some sort of locking.
But the locks could still be reduced:

 - Normally keep the .log file read-locked all the time (multiple
processes can hold a read lock at once)
 - Write to it with O_APPEND
 - If you notice that the log is going to be rotated soon, drop the read
lock and acquire it only for the duration of each append
 - When the log needs to be rotated, try to get a write lock. If that
fails, try again later. If it succeeds, it's safe to rotate the log.
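
The locking scheme in the list above could be sketched roughly as follows. This is only an illustration under assumptions: it uses whole-file fcntl() locks, and the function names (log_lock, log_append_short_lock, log_try_rotate_lock) are made up for the example, not taken from Dovecot. Note that fcntl() locks are per process, so they only exclude other processes.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Set a whole-file fcntl() lock of the given type (F_RDLCK, F_WRLCK or
   F_UNLCK) without blocking. Returns true if the lock call succeeded. */
static bool log_lock(int fd, short type)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET; /* l_start = 0, l_len = 0: the whole file */
    return fcntl(fd, F_SETLK, &fl) == 0;
}

/* Appender while a rotation is pending: hold the read lock only for the
   duration of one append. Returns bytes written, or -1 if the lock could
   not be taken (a rotator holds the write lock) or the write failed. */
static ssize_t log_append_short_lock(int fd, const void *data, size_t len)
{
    if (!log_lock(fd, F_RDLCK))
        return -1;
    ssize_t ret = write(fd, data, len);
    log_lock(fd, F_UNLCK);
    return ret;
}

/* Rotator: rotation is safe only if no other process holds a read lock,
   which is exactly when a non-blocking write lock succeeds. */
static bool log_try_rotate_lock(int fd)
{
    return log_lock(fd, F_WRLCK);
}
```

In the normal case a process would call log_lock(fd, F_RDLCK) once after opening the log and keep it; only when rotation is near would it switch to log_append_short_lock().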





Re: [Dovecot] How to get rid of locks

2007-04-08 Thread Daniel L. Miller

This is probably a scary thought, but . . . what would it take for the 
indexing part of Dovecot to be implemented via an API/plug-in model?  
I'm curious about the effect of using an external SQL engine (my vote 
would be Firebird) for processing these, and using an open plug-in method 
would allow for that without binding Dovecot to a particular implementation.


--
Daniel



Re: [Dovecot] How to get rid of locks

2007-04-08 Thread Miquel van Smoorenburg
On Sat, 2007-04-07 at 22:30 +0300, Timo Sirainen wrote:
 Although Dovecot is already read-lockless and it uses only
 short-lived write locks, it'd be really nice to just get rid of the
 locking completely. :)
 
 I just figured out that O_APPEND is pretty great. If the operating  
 system updates seek position after writing to a file opened with  
 O_APPEND, writes to Dovecot's transaction log file can be made  
 lockless.

Does this mean there's even less chance of indexes working on NFS (where
O_APPEND doesn't really work)?

That's a pity, as a lot of larger sites use NFS. They are already forced
to use indexes on local disk - and dovecot seems to rely on indexes more
and more (fulltext index, shared mailboxes, etc).

And mailbox formats like dbox don't even work without an index.

I had hoped that the existing index code would be made more reliable and
network-filesystem safe first.

Mike.



Re: [Dovecot] How to get rid of locks

2007-04-08 Thread Timo Sirainen

On 8.4.2007, at 12.41, Miquel van Smoorenburg wrote:


On Sat, 2007-04-07 at 22:30 +0300, Timo Sirainen wrote:

Although Dovecot is already read-lockless and it uses only
short-lived write locks, it'd be really nice to just get rid of the
locking completely. :)

I just figured out that O_APPEND is pretty great. If the operating
system updates seek position after writing to a file opened with
O_APPEND, writes to Dovecot's transaction log file can be made
lockless.


Does this mean there's even less chance of indexes working on NFS
(where O_APPEND doesn't really work)?


No. I haven't forgotten NFS users. You missed this part:

The O_APPEND at least doesn't work with NFS, so it'll have to be  
optional anyway.


I'm now trying to think of ways to simplify the index file handling.  
That allows me to then implement NFS workarounds more easily, such as  
forcing attribute cache flushing when it's needed.




[Dovecot] How to get rid of locks

2007-04-07 Thread Timo Sirainen
Although Dovecot is already read-lockless and it uses only
short-lived write locks, it'd be really nice to just get rid of the
locking completely. :)


I just figured out that O_APPEND is pretty great. If the operating
system updates the seek position after writing to a file opened with
O_APPEND, writes to Dovecot's transaction log file can be made
lockless. I see that this works with Linux and Solaris, but not with
OS X. Could you BSD people try whether it works there? Run
http://dovecot.org/tmp/append.c and see if it says offset = 0 (bad)
or non-zero (yay). O_APPEND at least doesn't work with NFS, so it'll
have to be optional anyway.
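
For readers who can't fetch the tester, the check it performs can be approximated like this. This is not the actual append.c, just a minimal sketch of the same idea; the function name append_and_tell is made up for the example.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Append `len` bytes through a descriptor opened with O_APPEND and return
   the descriptor's seek position afterwards, or -1 on error. On a system
   where O_APPEND writes update the seek position, the result is the file
   offset just past the data we wrote. */
static off_t append_and_tell(int fd, const void *data, size_t len)
{
    if (write(fd, data, len) != (ssize_t)len)
        return -1;
    return lseek(fd, 0, SEEK_CUR);
}
```

Calling it repeatedly with a 5-byte buffer on a freshly truncated file opened with O_WRONLY | O_APPEND should give growing offsets on a system where this works, and an offset that stays at 0 where it doesn't.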


Currently Dovecot always updates the dovecot.index file after it has
made any changes. This isn't really necessary, because the changes are
already in the transaction log, so the dovecot.index file can be read
into memory and the new changes applied on top of it from the
transaction log (this is pretty much how mmap_disable=yes works). So
I'm going to change this so that dovecot.index is updated only if a)
there are enough changes in the transaction log (e.g. 8kB or so) and
b) it can be write-locked without waiting.
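
The two conditions could be checked along these lines. This is a sketch under assumptions: the 8 kB threshold is the "or so" figure from above, the lock is a whole-file fcntl() lock, and index_try_begin_rewrite is a hypothetical name.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed threshold, taken from the "8kB or so" estimate. */
#define INDEX_REWRITE_MIN_LOG_BYTES (8 * 1024)

/* Returns true if dovecot.index should be rewritten now: a) enough
   unsynced data has accumulated in the transaction log and b) the index
   can be write-locked without waiting. On success the write lock is
   held and the caller must release it after rewriting. */
static bool index_try_begin_rewrite(int index_fd, off_t unsynced_log_bytes)
{
    if (unsynced_log_bytes < INDEX_REWRITE_MIN_LOG_BYTES)
        return false;

    struct flock fl = {0};
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET; /* l_start = 0, l_len = 0: the whole file */
    /* F_SETLK fails immediately instead of blocking if someone else
       holds a conflicting lock. */
    return fcntl(index_fd, F_SETLK, &fl) == 0;
}
```

If the function returns false, the caller simply keeps applying the log on top of the in-memory index and tries again after a later change.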


Maildir then. It has this annoying problem that readdir() can skip
files if another process is rename()ing them, causing Dovecot to
think that the message was expunged. The only way I could avoid this
was by locking the maildir while synchronizing it. Today I noticed
that this doesn't happen with OS X. I'm not sure if I was just lucky
or if there really is something special implemented in it, because it
doesn't work anywhere else. I'm not sure if this is tied to HFS+, or
if it will work with zfs also (Solaris+zfs didn't work). So perhaps
the locking could be disabled when running on OS X.


More importantly I figured out that it can also be avoided with
Linux+inotify. As long as the inotify event buffer doesn't overflow,
the full list of files can be read by combining the readdir() output
and files listed by inotify events. If the inotify buffer overflows
(highly unlikely), the operation can just be retried and it most
likely works the next time.


So with these changes in place, changing a message flag or expunging  
a message would usually result in:


 - lockless write() call to dovecot.index.log
 - lockless read()ing of (or looking into the mmap()ed)
dovecot.index.log to see if there's some new data besides what we just
wrote that needs to be synchronized to maildir
 - rename() or unlink() calls to maildir. If a call returns ENOENT,
the maildir needs to be readdir()ed with inotify enabled to find the
new filename.
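
The ENOENT handling in the last step can be reduced to a small helper like the one below. It is a sketch only; maildir_try_rename is a hypothetical name, and the rescan itself is left to the caller.

```c
#include <errno.h>
#include <stdio.h>

/* Try to rename a maildir file, e.g. to change the flags in its name.
   Returns 1 on success, 0 if the old name no longer exists (someone else
   renamed or expunged it, so the maildir must be rescanned to find the
   current filename), and -1 on any other error. */
static int maildir_try_rename(const char *old_path, const char *new_path)
{
    if (rename(old_path, new_path) == 0)
        return 1;
    return errno == ENOENT ? 0 : -1;
}
```

Only the 0 case triggers the readdir()+inotify rescan, so the common path stays a single system call with no locks.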


Not a single lock in the operation, assuming that dovecot.index file  
wasn't updated.


Assigning UIDs to newly delivered mails would require locking though.  
dovecot-uidlist needs to be locked, and the UIDs need to be written  
to dovecot.index.log file in the correct order, which can also be  
done with dovecot-uidlist locking.


Actually a single write() to dovecot.index.log isn't enough. I think  
there needs to be some kind of a flag written to the beginning of the  
transaction which marks the transaction as truly finished. If the  
flag isn't there, any reader knows to stop and wait until the flag is  
set. So this means that the writer needs to:


1. Do a single O_APPENDed write() call writing the whole transaction
2. Get the current offset with lseek(fd, 0, SEEK_CUR) (this is what
the append.c tester checks)
3. pwrite() the finished-flag to the beginning of the transaction

Except at least with Linux pwrite() doesn't work if O_APPEND is
enabled (the data is appended to the end of the file regardless of the
given offset). There are two ways to work around this:

 a) fcntl(disable O_APPEND) + pwrite() + fcntl(enable O_APPEND)
 b) Keep two file descriptors open for the transaction log. First  
with O_APPEND flag and second without. pwrite() to the second one.


a) is probably better because it doesn't waste file descriptors.
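
The three steps with workaround a) could be sketched like this. The record layout is an assumption made for the example: the first byte of each transaction is treated as the finished-flag (0 = unfinished, 1 = finished), and the function names are hypothetical. Note that F_SETFL changes the flags of the open file description, so any descriptor sharing it (via dup() or fork()) sees O_APPEND toggled as well.

```c
#define _XOPEN_SOURCE 700 /* for pread()/pwrite() */
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

enum { TXN_UNFINISHED = 0, TXN_FINISHED = 1 };

/* Append one transaction to a log opened with O_APPEND and then mark it
   finished. `txn` must begin with TXN_UNFINISHED. Returns the offset
   where the transaction starts, or -1 on error. */
static off_t log_append_transaction(int log_fd, const unsigned char *txn,
                                    size_t len)
{
    if (len == 0 || txn[0] != TXN_UNFINISHED)
        return -1;
    /* 1. a single write() for the whole transaction */
    if (write(log_fd, txn, len) != (ssize_t)len)
        return -1;
    /* 2. the seek position now points just past our own data */
    off_t end = lseek(log_fd, 0, SEEK_CUR);
    if (end == (off_t)-1 || end < (off_t)len)
        return -1;
    off_t start = end - (off_t)len;

    /* 3. workaround a): clear O_APPEND, pwrite() the flag, restore it */
    int flags = fcntl(log_fd, F_GETFL);
    if (flags == -1 || fcntl(log_fd, F_SETFL, flags & ~O_APPEND) == -1)
        return -1;
    const unsigned char finished = TXN_FINISHED;
    bool ok = pwrite(log_fd, &finished, 1, start) == 1;
    if (fcntl(log_fd, F_SETFL, flags) == -1)
        ok = false;
    return ok ? start : -1;
}

/* Reader side: a transaction may be processed only once its flag is set. */
static bool log_transaction_is_finished(int fd, off_t start)
{
    unsigned char flag;
    return pread(fd, &flag, 1, start) == 1 && flag == TXN_FINISHED;
}
```

Workaround b) would look the same except that step 3 becomes a plain pwrite() on a second descriptor opened without O_APPEND, with no fcntl() calls.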




Re: [Dovecot] How to get rid of locks

2007-04-07 Thread Mark E. Mallett
On Sat, Apr 07, 2007 at 10:30:25PM +0300, Timo Sirainen wrote:
 
 I just figured out that O_APPEND is pretty great. If the operating  
 system updates seek position after writing to a file opened with  
 O_APPEND, writes to Dovecot's transaction log file can be made  
 lockless. I see that this works with Linux and Solaris, but not with  
 OS X. Could you BSD people try if it works there?
 http://dovecot.org/tmp/append.c and see if it says offset = 0 (bad)
 or non-zero (yay).

FreeBSD 5.2: 5,10,15 etc, so yay
ancient BSD/OS:  ditto
[my FreeBSD 6.2 system is unavailable at the moment, but I can't imagine
that they broke it there]

mm


Re: [Dovecot] How to get rid of locks

2007-04-07 Thread Nils Vogels
Timo Sirainen wrote on 7-4-2007 21:30:
 Could you BSD people try if it works there?
 http://dovecot.org/tmp/append.c and see if it says offset = 0
 (bad) or non-zero (yay). The O_APPEND at least doesn't work with
 NFS, so it'll have to be optional anyway.
5.4-RELEASE-p6: yay
6.0-RELEASE: yay

Greets,

Nils