Re: [Dovecot] Slightly more intelligent way of handling issues in sdbox?

2012-02-08 Thread Timo Sirainen
On 7.2.2012, at 14.08, Mark Zealey wrote:

>> http://hg.dovecot.org/dovecot-2.1/rev/a765e0a895a9 fixes this.
> 
> I've not actually tried this patch yet, but looking at it, it is perhaps 
> useful for the situation I described below when the index is corrupt. In this 
> case I am describing however, the index is NOT corrupt - it is simply an older 
> version (ie it only thinks there are the first 2 mails in the directory, not 
> the 3rd). This could happen for example when mails are being stored on 
> different storage than indexes; say for example you have 2 servers with 
> remote NFS stored mails but local indexes that rsync between the servers 
> every hour. You manually fail over one server to the other and you then have 
> a copy of the correct indexes but only from an hour ago. The mails are all 
> there on the shared storage but because the indexes are out of date, when a 
> new message comes in it will be automatically overwritten.

I don't recommend using local indexes with dbox, since there is actual data 
loss if they're not up to date (flags, and with mdbox the user may have 
copied/moved the mail elsewhere). Still, better to catch this situation than 
not:
http://hg.dovecot.org/dovecot-2.1/rev/09db0f7aa6ce
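
A minimal sketch of one way such a stale index could be noticed. It is an
illustration only and makes no claim about what the changeset above actually
does. It assumes mail files are named u.<uid> as in the thread; the helper
names and the index_next_uid parameter are hypothetical.

```python
import os
import re

# Assumption: sdbox mail files are named "u.<uid>", as in the thread.
_MAIL_FILE = re.compile(r"^u\.(\d+)$")


def highest_uid_on_disk(mailbox_dir: str) -> int:
    """Return the largest N for which a file u.N exists, or 0 if none."""
    highest = 0
    for name in os.listdir(mailbox_dir):
        match = _MAIL_FILE.match(name)
        if match:
            highest = max(highest, int(match.group(1)))
    return highest


def index_looks_stale(mailbox_dir: str, index_next_uid: int) -> bool:
    """True if the index would hand out a UID whose file already exists.

    index_next_uid is the UID the (possibly outdated) index plans to
    assign to the next delivered message.
    """
    return highest_uid_on_disk(mailbox_dir) >= index_next_uid
```

With u.1 to u.3 on disk and an index restored from before u.3 was delivered,
index_next_uid is 3 and the check reports a stale index.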

>>> (speaking of which, it would be great if force-resync also rebuilt the 
>>> cache files if there are valid cache files around, rather than just doing 
>>> away with them)
>> Well, ideally there shouldn't be so much corruption that this matters..
> 
> That's true, but in our experience we usually get corruption in batches 
> rather than a one-off occurrence. Our most common case is something like 
> this: Say for example there's an issue with the NFS server (assuming we are 
> storing indexes on there as well now) and so we have to killall -9 dovecot 
> processes or similar. In that case you get a number of corrupted indexes on 
> the server. Rebuilding the indexes generates an IO storm (say via lmtp or a 
> pop3 access); then the clients log in via imap and we have to re-read all the 
> messages to generate the cache files which is a second IO storm. If the 
> caches were rebuilt at least semi-intelligently (ie you could extract from 
> the cache files a list of things that had previously been cached) that would 
> reduce the effects of rare storage level issues such as this.

Well, the decisions are now remembered: 
http://hg.dovecot.org/dovecot-2.1/rev/d8d214cc1936

That can't really be improved .. If nothing is deleted from cache, it might 
contain invalid data and doveadm force-resync wouldn't be doing its job right. 
If anything is added to cache, it would require reading and parsing the mail 
contents during rebuild, and that's not in any way better than letting the imap 
processes do it later when the mailbox isn't locked.

Re: [Dovecot] Slightly more intelligent way of handling issues in sdbox?

2012-02-07 Thread Mark Moseley
On Tue, Feb 7, 2012 at 4:08 AM, Mark Zealey wrote:
> 06-02-2012 22:47, Timo Sirainen wrote:
>
>> On 3.2.2012, at 16.16, Mark Zealey wrote:
>>
>>> I was doing some testing on sdbox yesterday. Basically I did the
>>> following procedure:
>>>
>>> 1) Create new sdbox; deliver 2 messages into it (u.1, u.2)
>>> 2) Create a copy of the index file (no cache file created yet)
>>> 3) deliver another message to the mailbox (u.3)
>>> 4) copy back index file from stage (2)
>>> 5) deliver new mail
>>>
>>> Then the message delivered in stage 3 ie u.3 gets replaced with the
>>> message delivered in (5) also called u.3.
>>
>> http://hg.dovecot.org/dovecot-2.1/rev/a765e0a895a9 fixes this.
>
>
> I've not actually tried this patch yet, but looking at it, it is perhaps
> useful for the situation I described below when the index is corrupt. In
> this case I am describing however, the index is NOT corrupt - it is simply an
> older version (ie it only thinks there are the first 2 mails in the
> directory, not the 3rd). This could happen for example when mails are being
> stored on different storage than indexes; say for example you have 2 servers
> with remote NFS stored mails but local indexes that rsync between the
> servers every hour. You manually fail over one server to the other and you
> then have a copy of the correct indexes but only from an hour ago. The mails
> are all there on the shared storage but because the indexes are out of date,
> when a new message comes in it will be automatically overwritten.
>
>>> (speaking of which, it would be great if force-resync also rebuilt the
>>> cache files if there are valid cache files around, rather than just doing
>>> away with them)
>>
>> Well, ideally there shouldn't be so much corruption that this matters..
>
>
> That's true, but in our experience we usually get corruption in batches
> rather than a one-off occurrence. Our most common case is something like
> this: Say for example there's an issue with the NFS server (assuming we are
> storing indexes on there as well now) and so we have to killall -9 dovecot
> processes or similar. In that case you get a number of corrupted indexes on
> the server. Rebuilding the indexes generates an IO storm (say via lmtp or a
> pop3 access); then the clients log in via imap and we have to re-read all
> the messages to generate the cache files which is a second IO storm. If the
> caches were rebuilt at least semi-intelligently (ie you could extract from
> the cache files a list of things that had previously been cached) that would
> reduce the effects of rare storage level issues such as this.
>
> Mark

What about something like: a writer to an index/cache file checks for
the existence of .1. If it doesn't exist or is over a day
old, if the current index/cache file is not corrupt, take a snapshot
of it as .1. Then if an index/cache file is corrupt, it can
check for .1 and use that as the basis for a rebuild, so at
least only a day's worth of email is reverted to its previous state
(instead of all of it), assuming it's been modified in less than a
day. Clearly it'd take up a bit more disk space, though the various
dovecot.* files are pretty modest in size, even for big mailboxes.
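
A rough sketch of the snapshot idea above, purely as an illustration. The
function name, the is_corrupt callback and the one-day constant are
hypothetical, and nothing here reflects how Dovecot writes its index or cache
files.

```python
import os
import shutil
import time
from typing import Callable, Optional

SNAPSHOT_MAX_AGE = 24 * 60 * 60  # one day, as suggested in the message


def maybe_snapshot(path: str,
                   is_corrupt: Callable[[str], bool],
                   now: Optional[float] = None) -> bool:
    """Copy path to path + ".1" if the snapshot is missing or over a day old.

    Returns True if a new snapshot was taken. is_corrupt is a
    caller-supplied check; a corrupt file is never snapshotted.
    """
    snapshot = path + ".1"
    now = time.time() if now is None else now
    try:
        age = now - os.stat(snapshot).st_mtime
    except FileNotFoundError:
        age = None  # no snapshot yet
    if age is not None and age <= SNAPSHOT_MAX_AGE:
        return False  # snapshot is recent enough
    if is_corrupt(path):
        return False  # never replace a good snapshot with bad data
    # Copy to a temporary name first, then rename over the old snapshot,
    # so a crash mid-copy does not leave a truncated snapshot behind.
    tmp = snapshot + ".tmp"
    shutil.copyfile(path, tmp)
    os.replace(tmp, snapshot)
    return True
```

A rebuild would then start from the ".1" copy when the live file fails its
corruption check, at the cost of one extra copy per file per day.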

Or it might be a decent use case for some sort of journaling, so that
the actual index/cache files don't ever get written to, except during
a consolidation, to roll up journals once they've reached some
threshold. There'd definitely be a performance price to pay though,
not to mention breaking backwards compatibility.
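
A toy sketch of the journaling idea, under loud assumptions: the state is a
plain JSON dictionary of UID to flags, and the class name, file suffixes and
threshold are all made up for illustration. It is not Dovecot's on-disk
format.

```python
import json
import os


class JournaledIndex:
    """Base file is only rewritten during consolidation; updates are appended."""

    def __init__(self, base_path: str, threshold: int = 100) -> None:
        self.base_path = base_path
        self.journal_path = base_path + ".journal"  # hypothetical name
        self.threshold = threshold

    def _load_base(self) -> dict:
        try:
            with open(self.base_path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {}

    def _journal_entries(self) -> list:
        try:
            with open(self.journal_path) as f:
                return [json.loads(line) for line in f if line.strip()]
        except FileNotFoundError:
            return []

    def load(self) -> dict:
        """Current state = base file with every journal entry replayed."""
        state = self._load_base()
        for uid, flags in self._journal_entries():
            state[uid] = flags
        return state

    def set_flags(self, uid: int, flags: list) -> None:
        """Append one update; consolidate once the journal is long enough."""
        with open(self.journal_path, "a") as f:
            f.write(json.dumps([str(uid), flags]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        if len(self._journal_entries()) >= self.threshold:
            self.consolidate()

    def consolidate(self) -> None:
        """Roll the journal into a new base file, then drop the journal."""
        state = self.load()
        tmp = self.base_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.base_path)
        # If we crash before this remove, replaying the old journal over
        # the new base gives the same state, so the replay is harmless.
        try:
            os.remove(self.journal_path)
        except FileNotFoundError:
            pass
```

Re-reading the journal on every update keeps the sketch short; it also shows
where the performance price mentioned above would come from.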

And I'm just throwing stuff out to see if any of it sticks, so don't
mistake this for even remotely well thought-out suggestions :)


Re: [Dovecot] Slightly more intelligent way of handling issues in sdbox?

2012-02-07 Thread Mark Zealey

06-02-2012 22:47, Timo Sirainen wrote:

> On 3.2.2012, at 16.16, Mark Zealey wrote:
>
>> I was doing some testing on sdbox yesterday. Basically I did the following
>> procedure:
>>
>> 1) Create new sdbox; deliver 2 messages into it (u.1, u.2)
>> 2) Create a copy of the index file (no cache file created yet)
>> 3) deliver another message to the mailbox (u.3)
>> 4) copy back index file from stage (2)
>> 5) deliver new mail
>>
>> Then the message delivered in stage 3 ie u.3 gets replaced with the message
>> delivered in (5) also called u.3.
>
> http://hg.dovecot.org/dovecot-2.1/rev/a765e0a895a9 fixes this.


I've not actually tried this patch yet, but looking at it, it is perhaps 
useful for the situation I described below when the index is corrupt. In 
this case I am describing however, the index is NOT corrupt - it is simply 
an older version (ie it only thinks there are the first 2 mails in the 
directory, not the 3rd). This could happen for example when mails are 
being stored on different storage than indexes; say for example you have 
2 servers with remote NFS stored mails but local indexes that rsync 
between the servers every hour. You manually fail over one server to the 
other and you then have a copy of the correct indexes but only from an 
hour ago. The mails are all there on the shared storage but because the 
indexes are out of date, when a new message comes in it will be 
automatically overwritten.

>> (speaking of which, it would be great if force-resync also rebuilt the cache
>> files if there are valid cache files around, rather than just doing away with
>> them)
>
> Well, ideally there shouldn't be so much corruption that this matters..


That's true, but in our experience we usually get corruption in batches 
rather than a one-off occurrence. Our most common case is something like 
this: Say for example there's an issue with the NFS server (assuming we 
are storing indexes on there as well now) and so we have to killall -9 
dovecot processes or similar. In that case you get a number of corrupted 
indexes on the server. Rebuilding the indexes generates an IO storm (say 
via lmtp or a pop3 access); then the clients log in via imap and we have 
to re-read all the messages to generate the cache files which is a 
second IO storm. If the caches were rebuilt at least semi-intelligently 
(ie you could extract from the cache files a list of things that had 
previously been cached) that would reduce the effects of rare storage 
level issues such as this.


Mark


Re: [Dovecot] Slightly more intelligent way of handling issues in sdbox?

2012-02-06 Thread Timo Sirainen
On 3.2.2012, at 16.16, Mark Zealey wrote:

> I was doing some testing on sdbox yesterday. Basically I did the following 
> procedure:
> 
> 1) Create new sdbox; deliver 2 messages into it (u.1, u.2)
> 2) Create a copy of the index file (no cache file created yet)
> 3) deliver another message to the mailbox (u.3)
> 4) copy back index file from stage (2)
> 5) deliver new mail
> 
> Then the message delivered in stage 3 ie u.3 gets replaced with the message 
> delivered in (5) also called u.3.

http://hg.dovecot.org/dovecot-2.1/rev/a765e0a895a9 fixes this.

> Is it possible to try an open/access call on the mail file before overwriting 
> it with the new message in case we have an issue where an older version of 
> the index file is present (eg due to nfs latencies) ? I notice when you are 
> expunging files you very carefully open them and read the header contents to 
> make sure the guid is the same as in the index - any reason that this is not 
> done when delivering? This is with lmtp on dovecot 2.0.16.

Hm. Yes, I guess there should be a check to avoid overwriting files.
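
As an illustration of what such a check could look like in general (not a
description of the Dovecot patch), here is a minimal sketch using an exclusive
create. The function name is hypothetical. Behaviour of exclusive creates over
NFS depends on the NFS version and client, so treat this as a local-filesystem
example.

```python
import os


def write_new_mail(path: str, data: bytes) -> None:
    """Create the mail file, failing instead of overwriting an existing one."""
    # O_CREAT together with O_EXCL makes the open fail with
    # FileExistsError if something already exists at path.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
```

A delivery agent that hits FileExistsError for u.3 would know its index is
behind the directory contents and could resync before retrying, rather than
replacing the existing message.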

> I also noticed that index corruption in sdbox does not get automatically 
> repaired. I know this is because the flags are stored in the index files so 
> you'd get some loss of flags, but in many situations for us this auto-repair 
> with flag loss would be better than having the mailbox locked out until we 
> manually do a force-resync on it.

I'm not entirely sure what you mean by this. Does the above patch help with 
this problem also?

> (speaking of which, it would be great if force-resync also rebuilt the cache 
> files if there are valid cache files around, rather than just doing away with 
> them)

Well, ideally there shouldn't be so much corruption that this matters..

[Dovecot] Slightly more intelligent way of handling issues in sdbox?

2012-02-03 Thread Mark Zealey

Hi there,

I was doing some testing on sdbox yesterday. Basically I did the 
following procedure:


1) Create new sdbox; deliver 2 messages into it (u.1, u.2)
2) Create a copy of the index file (no cache file created yet)
3) deliver another message to the mailbox (u.3)
4) copy back index file from stage (2)
5) deliver new mail

Then the message delivered in stage 3 ie u.3 gets replaced with the 
message delivered in (5) also called u.3. Is it possible to try an 
open/access call on the mail file before overwriting it with the new 
message in case we have an issue where an older version of the index 
file is present (eg due to nfs latencies) ? I notice when you are 
expunging files you very carefully open them and read the header 
contents to make sure the guid is the same as in the index - any reason 
that this is not done when delivering? This is with lmtp on dovecot 2.0.16.
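
The five steps above can be mimicked with a toy model. This is only a sketch
of the failure mode: the "index" here is a text file holding the next UID,
which is an assumption made for illustration and has nothing to do with the
real sdbox index format. The function names are hypothetical.

```python
import os
import shutil
import tempfile


def deliver(mailbox: str, body: str) -> str:
    """Toy delivery: trust the index's next-UID counter and write u.<uid>."""
    index_path = os.path.join(mailbox, "index")
    with open(index_path) as f:
        uid = int(f.read())
    mail_path = os.path.join(mailbox, "u.%d" % uid)
    # Mode "w" silently truncates an existing file; this is the problem
    # being described, reproduced on purpose.
    with open(mail_path, "w") as f:
        f.write(body)
    with open(index_path, "w") as f:
        f.write(str(uid + 1))
    return mail_path


def reproduce() -> str:
    """Run steps 1-5 and return what ends up stored in u.3."""
    mailbox = tempfile.mkdtemp()
    try:
        index_path = os.path.join(mailbox, "index")
        with open(index_path, "w") as f:
            f.write("1")
        deliver(mailbox, "first")                      # step 1: u.1
        deliver(mailbox, "second")                     # step 1: u.2
        shutil.copy(index_path, index_path + ".bak")   # step 2: copy index
        deliver(mailbox, "third")                      # step 3: u.3
        shutil.copy(index_path + ".bak", index_path)   # step 4: restore index
        deliver(mailbox, "fourth")                     # step 5: u.3 again
        with open(os.path.join(mailbox, "u.3")) as f:
            return f.read()
    finally:
        shutil.rmtree(mailbox)
```

In this model reproduce() returns the body of the message delivered in step 5,
so the message from step 3 is gone.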


I also noticed that index corruption in sdbox does not get automatically 
repaired. I know this is because the flags are stored in the index files 
so you'd get some loss of flags, but in many situations for us this 
auto-repair with flag loss would be better than having the mailbox 
locked out until we manually do a force-resync on it. (speaking of 
which, it would be great if force-resync also rebuilt the cache files if 
there are valid cache files around, rather than just doing away with them)


Thanks,

Mark