On 12.4.2012, at 15.10, Ed W wrote:

> On 12/04/2012 12:09, Timo Sirainen wrote:
>> On 12.4.2012, at 13.58, Ed W wrote:
>>
>>> The claim by ZFS/BTRFS authors and others is that data silently "bit rots"
>>> on its own. The claim is therefore that you can have a raid1 pair where
>>> neither drive reports a hardware failure, but each gives you different data?
>> That's one reason why I planned on adding a checksum to each message in
>> dbox. But I forgot to actually do that. I guess I could add it for new
>> messages in some upcoming version. Then Dovecot could optionally verify the
>> checksum before returning the message to client, and if it detects
>> corruption perhaps automatically read it from some alternative location
>> (e.g. if dsync replication is enabled ask from another replica). And Dovecot
>> index files really should have had some small (8/16/32bit) checksums of
>> stuff as well..
>>
>
> I have to say - I haven't actually seen this happen... Do any of your big
> mailstore contacts observe this, eg rackspace, etc?

I haven't heard of it happening. But then again, people don't necessarily notice if it has.

> Things I might like to do *if* there were some suitable "checksums" available:
> - Use the checksum as some kind of guid either for the whole message, the
> message minus the headers, or individual mime sections

Messages already have a GUID. And the rest of that is kind of done with the single instance storage stuff..

I was thinking of using SHA1 of the entire message, headers included, as the checksum, and saving it into a dbox metadata field. I also thought about checksumming the metadata fields, but that would need a second checksum, since the first one can have other uses besides verifying message integrity.

> - Use the checksums to assist with replication speed/efficiency (dsync or
> custom imap commands)

It would be of some use with dbox index rebuilding. I don't think it would help with dsync.

> - File RFCs for new imap features along the "lemonade" lines which allow
> clients to have faster recovery from corrupted offline states...

Too much trouble, no one would implement it :)

> - Storage backends where emails are redundantly stored and might not ALL be
> on a single server (find me the closest copy of email X) - derivations of
> this might be interesting for compliance archiving of messages?
> - Fancy key-value storage backends might use checksums as part of the key
> value (either for the whole or parts of the message)

GUID would work for these as well, without the possibility of a hash collision.
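
To make the checksum idea above a bit more concrete, here is a rough sketch in Python (not Dovecot's C, and not anything that exists in dbox today). It stores a SHA1 of the whole message, headers included, next to the message, verifies it on read, and falls back to another copy when the check fails. The dict-based stores, the `sha1` metadata key, and the `save_message` / `read_verified` helpers are all made up for illustration. Messages are keyed by GUID rather than by the hash, as suggested above.

```python
import hashlib


def message_checksum(raw: bytes) -> str:
    """SHA1 over the entire message, headers included, as a hex string."""
    return hashlib.sha1(raw).hexdigest()


def save_message(store: dict, guid: str, raw: bytes) -> None:
    # Hypothetical storage layout: the checksum sits in a metadata field
    # next to the message body. The key is the GUID, not the hash.
    store[guid] = {"data": raw, "metadata": {"sha1": message_checksum(raw)}}


def read_verified(guid: str, primary: dict, replicas=()) -> bytes:
    """Return the first copy whose data matches its stored checksum.

    The primary is tried first, then each replica in order. Each copy is
    checked against its own metadata; the metadata itself is not protected
    here, which is why a second checksum would be needed for that.
    """
    for store in (primary, *replicas):
        entry = store.get(guid)
        if entry is None:
            continue
        if message_checksum(entry["data"]) == entry["metadata"]["sha1"]:
            return entry["data"]
    raise IOError(f"no intact copy of message {guid} found")


if __name__ == "__main__":
    primary, replica = {}, {}
    raw = b"Subject: test\r\n\r\nhello\r\n"
    save_message(primary, "guid-1", raw)
    save_message(replica, "guid-1", raw)

    # Simulate silent corruption on the primary copy only.
    primary["guid-1"]["data"] = b"Subject: test\r\n\r\nhellp\r\n"

    recovered = read_verified("guid-1", primary, [replica])
    print(recovered == raw)
```

Verification costs one extra hash per read, which is why it would be optional.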