Re: [Dovecot] dbox redesign

2009-02-17 Thread Steffen Kaiser


On Thu, 12 Feb 2009, Allen Belletti wrote:


 I would add that having fewer, larger files should make backups much
 more feasible.  There's a certain amount of overhead for each file

That's true for full backups. I don't defend Maildir, especially because
it changes the filename, which makes it a new file for any backup software
(which is usually not Maildir-aware in my case).

 operation (especially for us GFS people!) and reducing the number of
 files will reduce that overhead.

Also, it makes partial recoveries problematic.

 Right now our backups (done via rsync) take a pretty scary amount of

Yep.

Bye,

-- 
Steffen Kaiser



Re: [Dovecot] dbox redesign

2009-02-12 Thread Mikkel

Hi Timo

I have a few comments. Please just disregard them if I have 
misunderstood your design.


Regarding your storage plan
I find it very important that users can be stored in different locations 
because:
1. Discount users could be placed on cheap storage while others are 
offered premium service on expensive hardware
2. It's easy to scale if you just add another LUN from your SAN or mount 
from NAS
3. In order to avoid huge directories you can put users into subdirs 
with each subdir containing only say 1000 users each
All this is very easy to achieve in 1.1 because you can return 
individual storage dirs for indexes and data from the user db.
I'm not sure from reading your post whether this will still be possible 
but I believe it’s a very important thing.



Regarding 7.
I am very much for all the self healing you describe.
There is nothing worse than huge complex systems that fail just because 
of some minor error that could easily be fixed without manual intervention.

But also I'm a little worried in this regard.

Maildir is so robust that nothing can really go wrong. But here you have 
index files and data files located in different places.
Imagine the index file being on one NFS mount whilst the data resides on 
another.
Or if the administrator is purposely loading a different index file or 
data file from a backup.


Worst case scenario is that the self healing mistakes a manual operation 
for a failure and breaks something.
It should be very resilient to temporarily losing access to all files in 
this operation (could happen very often on NFS mounts).


Also I imagine the self-healing going into loops if it doesn't 
understand what’s going on.
If the data changes due to manual intervention or part of the file 
system can't be accessed you could imagine the self healing process trying 
again and again to fix something that isn't its job to fix.

In that case it would be better if it just skipped the apparent failures.


Timo wrote:
 I'm also wondering if it's better for each mailbox to have its separate
 dovecot.index.cache file or if there should be one cache file for the
 map index.

I think you should consider more files as the general choice (not only 
regarding cache files).
Imagine many dovecot servers accessing the same storage simultaneously. 
I figure it would be a lot easier if they weren’t all trying to 
read/update one essential file at the same time (with only one file, 
load can’t be spread across multiple mounts and everything goes down if 
the mount with the essential file is inaccessible).
If there is serious data corruption and you have only one file then all 
operations are paused while the self healing is trying to figure out 
what went wrong (and what happens if different servers decide to do 
self-healing on this one file at the same time?).
With one file per maildir only a small portion of the users are 
affected, the load is spread and really bad file corruption doesn’t 
break everything for thousands of users.


Other than that I’m just really glad that dbox is progressing. I 
consider it the future.
Dbox is the email administrator’s wet dream. I’m already dreaming of 
completely avoiding the scalability issues of large Maildirs (which is 
the biggest challenge today in my opinion) and reducing the IO. Buying 
more IO is an order of magnitude more expensive than getting more RAM or 
CPU power (and dovecot barely needs any RAM and CPU anyway).


Best wishes, Mikkel



Re: [Dovecot] dbox redesign

2009-02-12 Thread Allen Belletti

I would add that having fewer, larger files should make backups much
more feasible.  There's a certain amount of overhead for each file
operation (especially for us GFS people!) and reducing the number of
files will reduce that overhead.

Right now our backups (done via rsync) take a pretty scary amount of
time, only to get worse as the size of the mailstore (currently 200G) grows.

Personally I'm pretty excited about dbox.

Allen

Timo Sirainen wrote:
 On Wed, 2009-02-11 at 14:32 -0800, Seth Mattinen wrote:
  Timo Sirainen wrote:
   This is about how to implement multiple msgs/file dbox format. The
   current v1.1's one msg/file design would stay pretty much the same and
   it would be compatible with this new design.

  Out of curiosity, what's the advantage to going to multiple messages per
  file? Wouldn't this have the same problems as mbox?

 Multiple per file, not everything in one file. As long as the file size
 is set right, it's probably faster than one per file. We'll see :)


--
Allen Belletti
al...@isye.gatech.edu 404-894-6221 Phone
Industrial and Systems Engineering    404-385-2988 Fax
Georgia Institute of Technology


Re: [Dovecot] dbox redesign

2009-02-12 Thread Timo Sirainen
On Thu, 2009-02-12 at 11:29 +0100, Mikkel wrote:
 Hi Timo
 
 I have a few comments. Please just disregard them if I have 
 misunderstood your design.
 
 Regarding your storage plan
 I find it very important that users can be stored in different locations 
 because:

This you misunderstood. The mails of a single user are stored in one
dbox directory, not all users.

 Regarding 7.
 I am very much for all the self healing you describe.
 There is nothing worse than huge complex systems that fail just because 
 of some minor error that could easily be fixed without manual intervention.
 But also I'm a little worried in this regard.
 
 Maildir is so robust that nothing can really go wrong.

Yes. If you don't care that much about performance Maildir is going to
be more reliable, especially when recovering from filesystem corruption.

 It should be very resilient to temporarily losing access to all files in 
 this operation (could happen very often on NFS mounts).

I/O errors and such are treated differently than corrupted/missing
files. So as long as reading gives an error it doesn't try to repair
anything.

 Also I imagine the self-healing going into loops if it doesn't 
 understand what’s going on.
 If the data changes due to manual intervention or part of the file 
 system can't be accessed you could imagine the self healing process trying 
 again and again to fix something that isn't its job to fix.
 In that case it would be better if it just skipped the apparent failures.

I'm not really sure what you're thinking about here. Assuming there
aren't bugs in the fixup code, it should be able to fix things. If
someone manually goes and breaks things again, then sure it fixes them
again later, but there's really no automatic looping. Also Dovecot
already does index file fixing if it notices corruption, so this won't
be all that much different.

 If there is serious data corruption and you have only one file then all 
 operations are paused while the self healing is trying to figure out 
 what went wrong

There will be multiple files even per user, but yes, if corruption is
noticed then the user is blocked until the corruption is fixed.

  (and what happens if different servers decide to do 
 self-healing on this one file at the same time?).

The same as if two processes in one server decide to self-heal: Locking
prevents it from happening.





Re: [Dovecot] dbox redesign

2009-02-11 Thread Seth Mattinen
Timo Sirainen wrote:
 This is about how to implement multiple msgs/file dbox format. The
 current v1.1's one msg/file design would stay pretty much the same and
 it would be compatible with this new design.
 

Out of curiosity, what's the advantage to going to multiple messages per
file? Wouldn't this have the same problems as mbox?

~Seth


Re: [Dovecot] dbox redesign

2009-02-11 Thread Timo Sirainen
On Wed, 2009-02-11 at 14:32 -0800, Seth Mattinen wrote:
 Timo Sirainen wrote:
  This is about how to implement multiple msgs/file dbox format. The
  current v1.1's one msg/file design would stay pretty much the same and
  it would be compatible with this new design.
  
 
 Out of curiosity, what's the advantage to going to multiple messages per
 file? Wouldn't this have the same problems as mbox?

Multiple per file, not everything in one file. As long as the file size
is set right, it's probably faster than one per file. We'll see :)





Re: [Dovecot] dbox redesign

2009-02-11 Thread Timo Sirainen
On Wed, 2009-02-11 at 17:35 -0500, Timo Sirainen wrote:
 On Wed, 2009-02-11 at 14:32 -0800, Seth Mattinen wrote:
  Timo Sirainen wrote:
   This is about how to implement multiple msgs/file dbox format. The
   current v1.1's one msg/file design would stay pretty much the same and
   it would be compatible with this new design.
   
  
  Out of curiosity, what's the advantage to going to multiple messages per
  file? Wouldn't this have the same problems as mbox?
 
 Multiple per file, not everything in one file. As long as the file size
 is set right, it's probably faster than one per file. We'll see :)

Also there are no locking issues since reading doesn't require locking
and write locks are very short lived. Corruption isn't possible because
data is never copied within a file. A crash can happen at any point and
Dovecot will be able to recover from it 100%. The worst that can happen
is that some extra garbage is left lying around for some time wasting
disk space.





Re: [Dovecot] dbox redesign

2007-05-18 Thread Timo Sirainen
On Wed, 2007-05-16 at 20:27 +0200, Gunter Ohrner wrote:
 On Wednesday, 16 May 2007, Timo Sirainen wrote:
    Yes, I think treating mailboxes similarly to keywords is ideal.  There
   Except if you want to handle some mailboxes in a special way it's
   easier if they're separated on disk. Such as renaming or deleting
   mailboxes is a lot easier.

   They're based on filtering rules. I don't think they support copying
   messages. So the virtual folders are easily rebuilt by just re-applying
   the filters to all the messages.
 
 Not necessarily if you add one level of indirection, simply numbering the 
 mailboxes by index numbers internally and providing a number/name mapping 
 somewhere. This way, a mailbox can be renamed easily simply by updating 
 the map, and might be deleted by removing the map entry. Stale index 
 numbers may be left in the messages and might be cleaned up the next time a 
 message's folder list is updated or messages are expunged.

Right. This would also make it use less space inside the dbox files.
There already exists a mailbox list index in v1.1 which contains mailbox
ID - name mappings. But I'm still a bit concerned about its stability.
There are two things that could be done:

1) Have another human readable mailbox ID - name mapping file which is
used if the binary index is corrupted. If mailboxes are
created/deleted/renamed often, this would just slow things down. Might
be a good idea optionally though.

2) If the ID - name mapping is lost, the mailboxes could be created
using those IDs as their names. That would be a lot better than just
having all the mails merged into a single mailbox. As additional help,
there could be a couple of built-in mailbox IDs for INBOX, Trash and
Drafts. Perhaps that could be admin-configurable, but then again adding
new IDs could make it conflict with existing ones. Perhaps just a single
1=INBOX would be enough..

The mailbox IDs could have a validity number as well, similar to
UIDVALIDITY for message UIDs. That would make sure that it's safe to use
the validity+ID combination to uniquely and permanently identify a
mailbox, even if the mailbox list mapping was completely rebuilt (in
that case it would get a new validity).
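
To make the validity idea concrete, here is a minimal Python sketch. It is only an illustration: `MailboxKey` and `MailboxList` are hypothetical names, not Dovecot structures, and the in-memory dict stands in for the real mailbox list index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MailboxKey:
    """Permanent identifier: (validity, mailbox ID). Hypothetical type."""
    validity: int
    mailbox_id: int


class MailboxList:
    """Hypothetical in-memory stand-in for the mailbox list index."""

    def __init__(self, validity: int):
        self.validity = validity
        self.names = {}      # mailbox ID -> current name
        self.next_id = 1

    def create(self, name: str) -> MailboxKey:
        mailbox_id = self.next_id
        self.next_id += 1
        self.names[mailbox_id] = name
        return MailboxKey(self.validity, mailbox_id)

    def rename(self, key: MailboxKey, new_name: str) -> None:
        # Renaming only touches the map; the key stored with the mails
        # stays the same.
        if self.lookup(key) is None:
            raise KeyError(key)
        self.names[key.mailbox_id] = new_name

    def lookup(self, key: MailboxKey):
        # A key issued before a full rebuild carries the old validity and
        # must not be matched against a reused ID.
        if key.validity != self.validity:
            return None
        return self.names.get(key.mailbox_id)
```

After a complete rebuild the list gets a new validity, so an old key that happens to share an ID with a new mailbox no longer resolves to it.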




Re: [Dovecot] dbox redesign

2007-05-18 Thread Gunter Ohrner
On Saturday, 19 May 2007, Timo Sirainen wrote:
 1) Have another human readable mailbox ID - name mapping file which
 is used if the binary index is corrupted. If mailboxes are
 created/deleted/renamed often, this would just slow things down. Might
 be a good idea optionally though.

The Mailbox structure usually is not changed that often. Maybe just 
provide a way to dump/export the current mapping to a specially formatted 
text file and a way to manually load/import a provided dump file.

This way, administrators can configure daily cron jobs to dump the current 
mailbox state and if a mapping really gets lost, a pretty good mapping 
could be reconstructed without any runtime penalty.
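
A rough Python sketch of such a dump/import pair follows. The one-record-per-line `ID<TAB>name` format and the function names are assumptions made for illustration; the thread does not define a dump format. `restore_names` also shows the fallback from option 2: any ID missing from the dump is named after its ID.

```python
import os


def dump_mapping(mapping, path):
    """Write the ID -> name mapping as "ID<TAB>name" lines (assumed format)."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        for mailbox_id in sorted(mapping):
            f.write(f"{mailbox_id}\t{mapping[mailbox_id]}\n")
    # Replace the previous dump atomically so a cron job never leaves a
    # half-written file behind.
    os.replace(tmp_path, path)


def load_mapping(path):
    """Read a dump written by dump_mapping back into a dict."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            mailbox_id, name = line.split("\t", 1)
            mapping[int(mailbox_id)] = name
    return mapping


def restore_names(found_ids, dumped):
    """Name every mailbox ID found on disk, preferring the dumped names."""
    return {i: dumped.get(i, str(i)) for i in found_ids}
```

A mailbox created after the last dump simply keeps its numeric name until the administrator renames it.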

 2) If the ID - name mapping is lost, the mailboxes could be created
 using those IDs as their names.

Yes, for example, with the option to overwrite this synthesized mapping 
with the latest dump.

Greetings,

  Gunter

-- 
*** Powered by AudioScrobbler -- http://www.last.fm/user/Interneci/ ***
15:30 | Within Temptation - The Promise
15:24 | Within Temptation - Mother Earth
15:19 | Within Temptation - Ice Queen
14:21 | Within Temptation - What Have You Done (Rock Mix)
*** PGP encryption for emails welcome :-) *** PGP: 0x1128F25F ***




Re: [Dovecot] dbox redesign

2007-05-16 Thread Bill Boebel
On Sat, May 12, 2007 9:10 am, Timo Sirainen [EMAIL PROTECTED] said:

 Fast copying
 
 
 Would be nice if copying a message from one mailbox to another wouldn't
 require actually reading+writing the whole message contents. But I can't
 really figure out how to implement this without requiring that there is
 only a single dbox storage which contains the mails for all the
 mailboxes, and the mailboxes themselves are just Dovecot's index files
 containing pointers to the dbox storage.
 
 The problem with having everything in one storage is that if the index
 files are broken, the messages can't be placed into correct mailboxes
 anymore.
 
 Although one possibility would be to treat mailboxes a bit similarly to
 keywords. So that when a message is copied to another mailbox, the
 message in dbox file is updated to contain information that it exists in
 such and such mailboxes. Hmm. Perhaps that would be good enough, yes.
 

Yes, I think treating mailboxes similarly to keywords is ideal.  There really is 
no reason to physically separate mailboxes on disk.  All that is needed is this 
logical separation if it can be done in a reliable way.

Or maybe track this in mailbox-specific index files, and also have a 
corresponding text file that stores a list of messages that are contained in 
that mailbox... similar to maildir's dovecot-uidlist file.  Then if you lose 
the index you can rebuild the index from the text file.

Bill



Re: [Dovecot] dbox redesign

2007-05-16 Thread Timo Sirainen
On Wed, 2007-05-16 at 06:40 -0400, Bill Boebel wrote:
  Although one possibility would be to treat mailboxes a bit similarly to
  keywords. So that when a message is copied to another mailbox, the
  message in dbox file is updated to contain information that it exists in
  such and such mailboxes. Hmm. Perhaps that would be good enough, yes.
  
 
 Yes, I think treating mailboxes similarly to keywords is ideal.  There
 really is no reason to physically separate mailboxes on disk.  All
 that is needed is this logical separation if it can be done in a
 reliable way.

Except if you want to handle some mailboxes in a special way it's easier
if they're separated on disk. Such as renaming or deleting mailboxes is
a lot easier.

 Or maybe track this in mailbox-specific index files, and also have a
 corresponding text file that stores a list of messages that are
 contained in that mailbox... similar to maildir's dovecot-uidlist
 file.  Then if you lose the index you can rebuild the index from the
 text file.

Except that such mailbox-messagelist file could also be counted as
index file, and losing it again loses the messages :) That's why I
thought saving the mailbox name in the message file's headers would be
better. If you then lose the mailbox name, you most likely have lost the
message itself as well. Also it makes it easier to restore individual
messages from backups.





Re: [Dovecot] dbox redesign

2007-05-16 Thread Charles Marcus
 Would be nice if copying a message from one mailbox to another 
 wouldn't require actually reading+writing the whole message
 contents. But I can't really figure out how to implement this
 without requiring that there is only a single dbox storage which
 contains the mails for all the mailboxes, and the mailboxes
 themselves are just Dovecot's index files containing pointers to
 the dbox storage.

 The problem with having everything in one storage is that if the 
 index files are broken, the messages can't be placed into correct 
 mailboxes anymore.

 Although one possibility would be to treat mailboxes a bit similarly 
 to keywords. So that when a message is copied to another mailbox,
 the message in dbox file is updated to contain information that it 
 exists in such and such mailboxes. Hmm. Perhaps that would be good
 enough, yes.

 Yes, I think treating mailboxes similarly to keywords is ideal.  There 
 really is no reason to physically separate mailboxes on disk.  All 
 that is needed is this logical separation if it can be done in a 
 reliable way.

 Or maybe track this in mailbox-specific index files, and also have a 
 corresponding text file that stores a list of messages that are 
 contained in that mailbox... similar to maildir's dovecot-uidlist 
 file.  Then if you lose the index you can rebuild the index from the 
 text file.


This sounds suspiciously like 'virtual folders', that are supported by 
both Evolution and Thunderbird... how do they do it?


--

Best regards,

Charles


Re: [Dovecot] dbox redesign

2007-05-16 Thread Timo Sirainen
On Wed, 2007-05-16 at 07:47 -0400, Charles Marcus wrote:
  Although one possibility would be to treat mailboxes a bit similarly 
  to keywords. So that when a message is copied to another mailbox,
  the message in dbox file is updated to contain information that it 
  exists in such and such mailboxes. Hmm. Perhaps that would be good
  enough, yes.
 
  Yes, I think treating mailboxes similarly to keywords is ideal.  There 
  really is no reason to physically separate mailboxes on disk.  All 
  that is needed is this logical separation if it can be done in a 
  reliable way.
  
  Or maybe track this in mailbox-specific index files, and also have a 
  corresponding text file that stores a list of messages that are 
  contained in that mailbox... similar to maildir's dovecot-uidlist 
  file.  Then if you lose the index you can rebuild the index from the 
  text file.
 
 This sounds suspiciously like 'virtual folders', that are supported by 
 both Evolution and Thunderbird... how do they do it?

They're based on filtering rules. I don't think they support copying
messages. So the virtual folders are easily rebuilt by just re-applying
the filters to all the messages.





[Dovecot] dbox redesign

2007-05-12 Thread Timo Sirainen
I don't think anyone uses dbox currently, so the whole format could
still be redesigned. So I was thinking about doing two major changes:

1. Rely on index files a lot more. The flags are already stored in index
files, so there's no need to waste I/O updating them to dbox files all
the time. They could still be updated (if indexes get deleted, the flags
aren't all gone), but less often.

2. Require fcntl() locking. Currently dbox uses dotlocks, which are slow.

Cydir could be a good alternative also once index file code is made a
bit more robust. Perhaps I could implement single instance attachments
for cydir too..

Locking
=======

The current dbox index file would be gone. It's pretty useless.
Replace it with a whole new index file. Or perhaps it should be called
locks file or something.

The locks file would contain records: <file id> <lock timestamp>
<status>. So something like:

header
1 4645a60f 0
2  N
3  D

That would mean that the first file is locked by some process, either
for appending or expunging. The 2nd and 3rd files aren't locked. New
messages can't be appended to 2nd file anymore. 3rd file is already
deleted and this record needs to be removed when rewriting the file.

Locking a dbox file for either appends or expunges is done like:

1. See if timestamp is zero
   - If not, see if it's older than .. let's say a day or so ..
     a) Yes: Continue to 2.
     b) No: Assume the file is locked
2. Do fcntl() byte range lock over the record line in the locks file.
   - If it failed, the record is locked
3. Write the timestamp.
4. Compare stat() and fstat() inodes to see if the file was rebuilt
   - If yes, reopen the file and goto 1
5. File is now locked. Do the append/expunge.
6. Reset the timestamp to zero.
7. Unlock the byte range.
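
The seven steps above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not Dovecot code: the fixed-width record layout (`RECORD_LEN`, `TS_OFFSET`), the exact staleness limit and the function names are all invented for the example, since the proposal does not pin down the on-disk format.

```python
import fcntl
import os

RECORD_LEN = 20              # assumed "%08x %08x %c\n" layout, not from the thread
TS_OFFSET = 9                # assumed position of the timestamp inside a record
STALE_AFTER = 24 * 60 * 60   # "a day or so", in seconds


def lock_record(path, index, now):
    """Return an open fd holding the record lock, or None if the record is busy."""
    start = index * RECORD_LEN
    while True:
        fd = os.open(path, os.O_RDWR)
        record = os.pread(fd, RECORD_LEN, start)
        timestamp = int(record[TS_OFFSET:TS_OFFSET + 8], 16)
        # Step 1: a recent non-zero timestamp means someone holds the record.
        if timestamp != 0 and now - timestamp < STALE_AFTER:
            os.close(fd)
            return None
        # Step 2: non-blocking byte-range lock over this record only.
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, RECORD_LEN, start)
        except OSError:
            os.close(fd)
            return None
        # Step 3: publish the timestamp.
        os.pwrite(fd, b"%08x" % now, start + TS_OFFSET)
        # Step 4: if the locks file was rebuilt (renamed over), start again.
        if os.stat(path).st_ino == os.fstat(fd).st_ino:
            return fd        # Step 5: the caller may now append/expunge.
        os.close(fd)         # closing the fd also drops the lock


def unlock_record(fd, index):
    start = index * RECORD_LEN
    os.pwrite(fd, b"%08x" % 0, start + TS_OFFSET)        # Step 6
    fcntl.lockf(fd, fcntl.LOCK_UN, RECORD_LEN, start)    # Step 7
    os.close(fd)
```

Note that fcntl() record locks belong to the process, so two callers inside one process never block each other on step 2; within a single process only the timestamp check distinguishes them.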

If a file is locked, append will try another file and expunge will mark
the message as expunged instead of actually expunging it yet.

Note that the locks file is read without locking. This is safe because
data is never moved within the file, and it doesn't matter if the
timestamp isn't read correctly always. The timestamp check is only an
optimization. Actually I'm not sure if it would be better not to have
the timestamp at all.

Deleted records will stay in the file until the file is rebuilt. If a
deleted record is noticed in the file, the process tries to lock the
whole locks file. If it succeeds, it proceeds with writing the
non-deleted records to a temporary file and rename()ing it over the
locks file.
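
As an illustration of that rebuild, here is a small Python sketch. It assumes a text locks file whose first line is the header and whose records end in a status field, with `D` marking a deleted record; the function name and the `.tmp` suffix are made up for the example.

```python
import fcntl
import os


def rebuild_locks_file(path):
    """Drop records whose status is "D"; return False if the file is busy."""
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # With the default len=0 the lock covers the whole file.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        # Read through the locked fd: opening and closing a second fd for
        # the same file would release this process's fcntl() locks.
        text = os.pread(fd, os.fstat(fd).st_size, 0).decode("ascii")
        lines = text.splitlines()
        header, records = lines[0], lines[1:]
        kept = [r for r in records if r.split() and r.split()[-1] != "D"]
        tmp_path = path + ".tmp"
        with open(tmp_path, "w", encoding="ascii") as tmp:
            tmp.write("\n".join([header] + kept) + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())
        # rename() swaps in the new file atomically; processes still
        # holding the old inode notice via the stat()/fstat() comparison.
        os.rename(tmp_path, path)
        return True
    finally:
        os.close(fd)
```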

Appending
=========

1. Find the first file in the locks file that has appendable=0
   - If no such file was found, go to create a new file logic as
     described below
2. Lock the file record
3. Verify from the file's headers that this file can actually be
   appended to
   - If messages have been expunged from a dbox file, it can't be safely
     appended to anymore.
   - Other reasons include eg. configurable max. file size and daily
     rotations
4. Write the mail
5. Lock locks file's header
6. Get the UID from next uid field and update it
7. Unlock the header
8. Unlock the file record
9. Update index file

Create a new file logic:

1. Create a temporary file
2. Write the messages there
3. Lock locks file's header (including fstat() / stat() rebuild check)
4. See what the latest file ID is in the file
5. rename() temp file to msg.<latest file id+1>
6. Lock locks file for the range of the to-be-written record below
7. Write the new record to locks file
8. Go to step 6 in the original append logic
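
Steps 1, 2, 4 and 5 of the create-a-new-file logic could look roughly like this in Python. It is a sketch with assumptions: locking (steps 3 and 6) and the locks-file record (step 7) are left out, and the latest file ID is found by scanning the directory for `msg.<id>` names as a stand-in for reading it from the locks file. The helper names and the `temp.` prefix are hypothetical.

```python
import os
import re
import tempfile

MSG_NAME = re.compile(r"^msg\.(\d+)$")


def latest_file_id(directory):
    """Highest <id> among msg.<id> files, or 0 if there are none."""
    latest = 0
    for name in os.listdir(directory):
        match = MSG_NAME.match(name)
        if match:
            latest = max(latest, int(match.group(1)))
    return latest


def create_new_file(directory, messages):
    """Write messages to a temp file, then rename it to msg.<latest id + 1>."""
    # Steps 1-2: the temp file lives in the same directory so that the
    # later rename() stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(prefix="temp.", dir=directory)
    with os.fdopen(fd, "wb") as f:
        for message in messages:
            f.write(message)
        f.flush()
        os.fsync(f.fileno())
    # Steps 4-5: in the real design this runs with the locks file's header
    # locked, so two writers cannot pick the same ID.
    new_id = latest_file_id(directory) + 1
    os.rename(tmp_path, os.path.join(directory, "msg.%d" % new_id))
    return new_id
```

Because the mail is fully written before the rename, a crash leaves either a stray temp file or a complete msg file, never a partial one under its final name.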

Syncing / expunging
===================

If locks file's header's next uid doesn't match the one currently in
index file, the appending crashed between steps 6 and 9. Find the new
message(s) and append them to index file.
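
A tiny Python sketch of that check, using plain integers and sets in place of the real index and dbox structures (the function and parameter names are invented for the example):

```python
def find_unindexed_uids(index_next_uid, locks_next_uid, stored_uids):
    """Return UIDs that were saved to dbox files but never reached the index.

    index_next_uid: next uid recorded in the index file
    locks_next_uid: next uid recorded in the locks file's header
    stored_uids:    UIDs actually found in the dbox files
    """
    if locks_next_uid == index_next_uid:
        return []    # the append finished normally, nothing to recover
    # UIDs handed out in step 6 but not indexed in step 9 fall in
    # [index_next_uid, locks_next_uid).
    return sorted(u for u in stored_uids
                  if index_next_uid <= u < locks_next_uid)
```

The returned UIDs are the messages to append to the index file, after which the index's next uid is brought up to the locks file's value.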

If the locks file is completely gone, rebuild it by going through all
the msg.* files in the directory.

If expunge counter (see below) doesn't match in locks file's header
vs. index file header, go through all the msg.* files to see if a
message exists in multiple files. If found, remove the duplicates.

Typically neither of the above happens, so the only thing to do here is
to write changes from index file to dbox files. This may mean flag
changes once in a while, but most importantly expunges will always be
synced.

Initially figure out what files require expunging. Try to create a
single lock range that includes all of them. If there are non-zero lock
timestamps in that range, create multiple ranges.

If a file couldn't be locked, the expunge is done by updating
expunge-flag in the file. This can be done without locking (see below
for flag updates).

If a file was successfully locked, the expunging is done by:

1. Update expunge offset in file header to the offset of the first
expunged message. If expunge offset is non-zero, the file is treated as
non-appendable. Also when rebuilding and finding the same message from
multiple files, this field is used to figure out which file should be
truncated.

2. Copy the rest of the non-expunged messages to a new temporary file