Re: De-duping attachments
Hi,

Shuvam Misra wrote on 15.09.2010 at 03:40:
> How difficult or easy would it be to modify Cyrus to strip all attachments from emails and store them separately in files? In the message file, replace the attachment with a special tag which will point to the attachment file. Whenever the message is fetched for any reason, the original MIME-encoded message will be re-constructed and delivered.

Mh, OK, I'm not sure if I should post this here, but there is another IMAP server (some say: the new kid on the block), and as I came across it a few days ago, I noticed they are working on the same thing:

http://blog.dovecot.org/2010/07/single-instance-attachment-storage.html

I did not follow the discussion, but at least there is a discussion about this feature elsewhere.

Marc

Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
Re: De-duping attachments
On Wed, Sep 15, 2010 at 08:40:59AM +0530, Shuvam Misra wrote:
> Dear Rob, I had reservations about some of these things too. :( In particular, I was wondering about having to remember and recreate the exact transfer-encoding. If both of us forward the same attachment in two emails, and one encodes in quoted-printable, the other in base64, Cyrus had better be able to recreate them exactly or have some other workarounds. I wasn't aware of the mmap() usage and the direct seeking into the middle of the message body. But the bigger problem is what you've described about reproducing the message byte-identically. If that can be solved, then we can make Cyrus re-create the message while loading from disk and stick it into RAM.

There's not actually THAT much parsing of the message body. I would guess it's about 9 places:

imap/cyrdump.c
250: r = mailbox_map_message(state->mailbox, uids[i], &base, &len);

imap/index.c
1013: if (mailbox_map_message(mailbox, im->record.uid,
1535: if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size))
2441: if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size)) {
2716: if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size))
3152: if (mailbox_map_message(mailbox, im->record.uid,
3337: if (mailbox_map_message(mailbox, uid, &msgfile.base, &msgfile.size)) {
5112: if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size))

(those 8 plus one in imap/message.c where it gets parsed originally)

> Can we just brainstorm with you and others in this thread... how do we re-create a byte-identical attachment from a disk file? What is the list of attributes we will need to store per stripped attachment to allow an exact re-creation?

I did a bunch of work on this a while back. Basically, for the byte-identical reverse, as I said - keep a list of the most common mapping functions and try to figure out which one it is algorithmically. In theory we can work out what the common ones are pretty fast.
> - file name/reference
> - full MIME header of the attachment block
> - separator string (this will be retained in the message body anyway)
> - transfer encoding

All this stuff I'd keep as a binary diff from the nearly-right re-encoding.

> - if encoding = base64, then base64 line length

Yeah, that's an interesting one. Assuming it's not totally pathological, there will be some base64 pattern you can find quickly.

> - checksum of encoded attachment (as a sanity check in case the re-encoding fails to recreate exactly the same image as the original)

We like sha1s.

> If encoding = quoted-printable or uuencode, then don't strip the attachment at all.

Makes sense. There might be some size-based logic here too - only bother applying this on messages over 20k, and where the attachment is at least 20k in size. Anything smaller than that is pretty pointless.

> What other conditions may we need to look for to bypass attachment stripping? Can we just tap into all of you to get the ideas on paper, even if it's not being implemented by anyone right now? It'll at least help us understand the system's internals better.

Sure. Ideas are good :) I don't think I'm sold on the value though. And given that Rob is actually the one who argued me down from implementing this years ago ;) But maybe our use case isn't the same as yours.

Bron.
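The width-detection and sha1 sanity check discussed above can be sketched in a few lines. This is a minimal illustration, not Cyrus code: the candidate width list and the CRLF-only line-ending assumption are simplifications, and a real implementation would have to cope with a ragged final line, LF endings, and trailing whitespace.

```python
import base64
import hashlib

def encode_with_width(data: bytes, width: int) -> bytes:
    """Re-encode binary data as base64 with a given line width and CRLF endings."""
    b64 = base64.b64encode(data)
    lines = [b64[i:i + width] for i in range(0, len(b64), width)]
    return b"\r\n".join(lines) + b"\r\n"

def analyse_part(encoded: bytes):
    """Return (decoded bytes, detected line width, sha1 of the encoded form),
    or None if no candidate width reproduces the original bytes exactly.
    The sha1 is what gets stored as the sanity check against the re-encoding."""
    sha1 = hashlib.sha1(encoded).hexdigest()
    decoded = base64.b64decode(encoded)
    # Candidate widths: the length of the first line, plus some common defaults.
    first = encoded.split(b"\r\n", 1)[0]
    for width in {len(first), 76, 72, 64, 60}:
        if width and encode_with_width(decoded, width) == encoded:
            return decoded, width, sha1
    return None  # pathological layout: keep the attachment inline, don't strip

raw = b"some CAD file contents" * 100
encoded = encode_with_width(raw, 76)
decoded, width, sha1 = analyse_part(encoded)
assert decoded == raw and width == 76
```

If `analyse_part` returns None, the attachment simply isn't stripped, which matches the quoted-printable/uuencode bail-out above.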
Re: De-duping attachments
On Wed, Sep 15, 2010 at 09:15:13AM +0530, Shuvam Misra wrote:
> Dear Bron,
>
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16822148413
>> 2TB - US $109.
>
> Don't want to nit-pick here, but the effective price we pay is about ten times this.

Yeah, so? It's going down. That's a large number of attachments we're talking about there.

> To set up a mail server with a few TB of disk space, we usually land up deploying a separate chassis with RAID controllers and a RAID array, with FC connections from servers, etc, etc. All this adds up to about $1,000/TB of usable space if you're using something like the low-end IBM DS3400 box or Dell/EMC equivalent. This is even with inexpensive 7200RPM SATA-II drives, not 15KRPM SAS drives.

Hmm... our storage units with metadata on SSD come in at about $1200/TB, so yes, that sounds about right. That's including hot spares, RAID1 on everything (including the SSDs), and scads of processor and memory. Obviously multiply that by two for replication, and add in a bit extra for backups, and I'm happy to arrive at a figure of approximately $3000 per terabyte of actual email.

> And most of our customers actually double this cost because they keep two physically identical chassis for redundancy. (We recommend this too, because we can't trust a single RAID 5 array to withstand controller or PSU failures.) In that case, it's $2000/TB.

And because it's nice not to have downtime when you're doing maintenance. I replaced an entire drive unit today, including about 4 hours downtime on one of our servers as the system was swamped with IO creating new filesystems and initialising the drives. The users didn't see a thing, and replication is now fully operational again.

> And you do reach 5-10 TB of email store quite rapidly --- our company has many corporate clients (> 500 email users) whose IMAP store has reached 4TB. No one wants to enforce disk quotas (corporate policy), and most users don't want to delete emails on their own.

So you save, what, 50%. Does that sound about right? Do you have statistics on how much space you'd save with this theoretical patch?

> We keep hearing the logic that storage is cheap, and stories of cloud storage through Amazon, unlimited mailboxes on Gmail, are reinforcing the belief. But at the ground level in mid-market corporate IT budgets, storage costs in data centres (as against inside desktops) are still too high to be trivial, and their prices have only little to do with the prices of raw SATA-II drives. A fully-loaded DS3400 costs a little over $12,000 in India, with a full set of 1TB SATA-II drives from IBM, but even with the high cost of IBM drives, the drives themselves contribute less than 30% of the total cost.

You're buying a few months. Usage grows to fill the available storage, whatever it is. And you can only pull this piece of magic once.

> If we really want to put our collective money where our mouth is, and deliver the storage-is-cheap promise at the ground level, we need to rearchitect every file server and IMAP server to work in map-reduce mode and use disks inside desktops. Anyone game for this project? :)

You could buy as much benefit much more quickly by gzipping the individual email files. Either a filesystem that stores files compressed, or a Cyrus patch to do that and unpack files on the fly if the body was read. Along with most/all headers in the cyrus.cache file, the body doesn't get opened very often. Man, a body search would hurt though!

Now de-duping messages on copy is valuable, not so much because of the space it saves, but because of the IO it saves. Copying the file around is expensive. De-duping components of messages and then reconstructing? Not so much. You'll be causing MORE IO in general looking for the message, finding the parts.

> I agree. My aim was not to reduce IOPS but to cut disk space usage.

IOPS matter too. Depending on your usage patterns, obviously. If you don't ever get body searches on your server, they probably matter less.

> A 500-user company can easily acquire an email archive of 2-5TB. I don't care how much the IO load of that archive server increases, but I'd like to reduce disk space utilisation. If the customer can stick to 2TB of space requirements, he can use a desktop with two 2TB drives in RAID 1, and get a real cheap archive server. If this figure reaches 3-4TB, he goes into a separate RAID chassis --- the hardware cost goes up 5-10 times. These are tradeoffs a lot of small to mid-sized companies in my market fuss about.

Sounds like a case for a cheaper RAID chassis to me. Or actually cleaning up a little. While I appreciate the tradeoff, I think they'll still fill up pretty quickly even with this. It's a short-term stop-gap measure.

> And in a more generic context, I am seeing that all kinds of intelligent de-duping of infrequently-accessed data is going to become the crying need of every mid-sized and large company. Data
Re: De-duping attachments
> You could buy as much benefit much more quickly by gzipping the individual email files. Either a filesystem that stores files compressed, or a cyrus patch to do that and unpack files on the fly if the body was read.

I guess much more efficient than a compressing filesystem would be a compressing and de-duping filesystem or disk storage in this case. Has anyone tried this with a Cyrus message store with lots of corporate message data stored on it?

Simon
Re: De-duping attachments
On Wed, September 15, 2010 10:01 am, Simon Matter wrote:
> I guess much more efficient than a compressing filesystem would be a compressing and de-duping filesystem or disk storage in this case. Has anyone tried this with a Cyrus message store with lots of corporate message data stored on it?

Simon,

The Cyrus server I hope to get online tomorrow evening holds 4.2 TB of mail and uses ZFS with maximal compression (gzip-9) for the message files. (OS: Solaris 10)

ZFS reports a compressratio of between 1.95 and 1.97 (we have nine partitions). A series of tests revealed our metadata can actually be compressed by a factor of 3.76 (!)

Perhaps a two-university environment with 60,000+ users doesn't quite qualify as corporate enough, but here you have our figures :-)

Regards,
Eric.
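For reference, a setup like Eric's uses standard ZFS properties; the pool and filesystem names below are hypothetical, and this is a sketch rather than his exact configuration:

```shell
# gzip-9 trades CPU for the best ratio; lighter settings (lzjb) cost far less per byte.
zfs set compression=gzip-9 mail/spool1

# After data has been written, inspect the achieved ratio:
zfs get compressratio mail/spool1

# Note: the property only affects blocks written after it is set, so
# pre-existing message files must be rewritten (or re-synced) to benefit.
```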
Re: De-duping attachments
Dear Bron,

> So you save, what, 50%. Does that sound about right? Do you have statistics on how much space you'd save with this theoretical patch?

No, and this is the first thing I want to do. I'm getting some simple utilities developed which will run all week (niced suitably) and extract and MD5sum each attachment. I'll then count how many unique message-IDs have the same unique document, and I'll get a report. This has been under discussion in our group for some time --- let me get this done and I'll let all of you know.

> You're buying a few months. Usage grows to fill the available storage, whatever it is. And you can only pull this piece of magic once.

Unfortunately, you're totally right. The junk will keep growing.

Shuvam
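A scan of the kind Shuvam describes could look roughly like this. It is only a sketch: the "Cyrus message files are named <uid>." convention and the "any part with a filename is an attachment" heuristic are simplifying assumptions, and a production scan would stream rather than load whole messages.

```python
import hashlib
import os
from collections import defaultdict
from email import message_from_binary_file

def attachment_digests(spool_root):
    """Walk a Cyrus-style spool and yield (md5, size) for each attachment part.
    Assumes message files are named '<uid>.' (a number followed by a dot)."""
    for dirpath, _dirs, files in os.walk(spool_root):
        for name in files:
            if not name.endswith(".") or not name.rstrip(".").isdigit():
                continue
            with open(os.path.join(dirpath, name), "rb") as fh:
                msg = message_from_binary_file(fh)
            for part in msg.walk():
                if part.get_filename():  # crude: named part == attachment
                    payload = part.get_payload(decode=True) or b""
                    yield hashlib.md5(payload).hexdigest(), len(payload)

def duplicate_report(spool_root):
    """Count copies of each unique attachment and the bytes a dedup would save."""
    counts = defaultdict(lambda: [0, 0])  # md5 -> [copies, size]
    for md5, size in attachment_digests(spool_root):
        counts[md5][0] += 1
        counts[md5][1] = size
    wasted = sum((n - 1) * size for n, size in counts.values())
    return counts, wasted
```

The `wasted` figure is exactly the number this thread keeps asking for: how many bytes single-instance storage would actually reclaim.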
Re: De-duping attachments
The sparse file idea is brilliant! Never occurred to me. :) We'd have to store the reference-pointer in the message file, so we would omit the actual attachment but eat up perhaps 50 bytes to keep the reference to the file.

Shuvam

> 1. Completely rewrite the message file, removing the attachments and adding any extra metadata you want in its place
> 2. Leave the message file exactly the same size, just don't write out the attachment content, and assume your filesystem supports sparse files (http://en.wikipedia.org/wiki/Sparse_file)
>
> The advantage of 2 is that it leaves the message file size correct, and all the offsets in the file are still correct. The downsides are that you must ensure your FS supports sparse files well, and there's the question of where do you actually store the information that links to the external file?
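Option 2 above can be sketched very simply: copy the message but seek over the attachment body instead of writing it, so the filesystem can leave a hole. The offsets `att_start`/`att_end` are assumed to come from the MIME parser; this is an illustration, not Cyrus code.

```python
import os

def write_sparse_copy(src_path, dst_path, att_start, att_end):
    """Copy src_path to dst_path but skip the bytes [att_start, att_end)
    (the encoded attachment body) with a seek, so the filesystem can
    store a hole there. The copy keeps the original file size, so all
    byte offsets into the message remain valid."""
    size = os.path.getsize(src_path)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(src.read(att_start))   # everything before the attachment body
        dst.seek(att_end)                # hole instead of the encoded body
        src.seek(att_end)
        dst.write(src.read())            # closing boundary and anything after
        dst.truncate(size)               # in case the attachment ran to EOF
```

Reading the hole back yields zero bytes, which is exactly why the reconstruction step must splice the real attachment back in before serving any body[] fetch that touches that range.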
Re: De-duping attachments
> Makes sense. There might be some size-based logic here too - only bother applying this on messages over 20k, and where the attachment is at least 20k in size. Anything smaller than that is pretty pointless.

Yes, absolutely. Left to myself, I'd not have bothered with any attachment less than 100 KBytes or so. The stuff that gets my goat is seeing our customers using email to shunt 20MB CAD files back and forth across the world two dozen times. Emails are being used for the kind of work God had meant trucks to do. :(

> Sure. Ideas are good :) I don't think I'm sold on the value though. And given that Rob is actually the one who argued me down from implementing this years ago ;) But maybe our use case isn't the same as yours.

Let me get some hard data from a few of our large corporate clients' servers, and then we'll talk again. It may take a couple of weeks to get this data, because we'll need to look for a time window when the mail server is less loaded to run our scan.

Shuvam
Re: De-duping attachments
Great thread. Here are some real-world numbers based on our spools here at BU.

One of our masters has 4,800 users, 22,000 mailboxes, and is using about 374G of disk. Based on the md5 files for these users there are 6,046,363 messages. If I look at the first md5 value (md5 of the message, if I understand this) and sort and uniq, I get 5,891,974 messages, so assuming we dedup all those messages, that would be a shrink to 97.4% of the original number of messages. Assuming an even distribution of message sizes, this would mean 374G would drop down to 362.78G. Unfortunately not an obvious huge win.

But, I think the md5 of the message file includes headers, which may be more likely to be unique than the body content. (Due to legacy support for UW IMAP, we often end up routing things differently for users on the same master, so the headers for the same message sent to 2 people could be different.)

Isn't the easy hack for dedup just looking at the above md5 files and then doing appropriate hard links? This could be done by a nightly trawl of the spool space. A bigger win would be to separate the headers from the messages, but that's a lot more work.

-nik
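The "nightly trawl plus hard links" hack could be sketched as below. This is an illustration, not a production tool: it hashes whole files instead of reading Cyrus's existing md5/GUID data, ignores ownership and locking, and assumes everything lives on one filesystem (hard links cannot cross filesystems).

```python
import hashlib
import os
from collections import defaultdict

def hardlink_dupes(spool_root):
    """Nightly trawl: hash every message file and replace byte-identical
    copies with hard links to a single inode."""
    by_hash = defaultdict(list)
    for dirpath, _dirs, files in os.walk(spool_root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.md5(fh.read()).hexdigest()
            by_hash[digest].append(path)
    for paths in by_hash.values():
        keeper = paths[0]
        for dupe in paths[1:]:
            if os.path.samestat(os.stat(keeper), os.stat(dupe)):
                continue  # already the same inode from a previous run
            tmp = dupe + ".tmp"
            os.link(keeper, tmp)   # second name for the keeper's inode
            os.rename(tmp, dupe)   # atomically replace the duplicate
```

The link-then-rename dance mirrors what Bron's Perl script later in this thread does, so a reader crashing mid-run always sees either the old file or the linked one, never a missing file.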
Re: De-duping attachments
> On Wed, September 15, 2010 10:01 am, Simon Matter wrote:
>> I guess much more efficient than a compressing filesystem would be a compressing and de-duping filesystem or disk storage in this case. Has anyone tried this with a Cyrus message store with lots of corporate message data stored on it?
>
> Simon,
>
> The Cyrus server I hope to get online tomorrow evening holds 4.2 TB of mail and uses ZFS with maximal compression (gzip-9) for the message files. (OS: Solaris 10) ZFS reports a compressratio of between 1.95 and 1.97 (we have nine partitions). A series of tests revealed our metadata can actually be compressed by a factor of 3.76 (!) Perhaps a two-university environment with 60,000+ users doesn't quite qualify as corporate enough, but here you have our figures :-)

Eric, that looks of course interesting. By more corporate style I mean far fewer users but much bigger mailboxes. Enforcing quotas in the multi-GB range seems quite common these days. In such an environment I expect the compression ratio to increase. But the big question for me is how much filesystem / block-level deduping is going to shrink it? You said ZFS; did you consider testing its built-in deduping? (If it's even there in Solaris 10?)

Simon
Re: De-duping attachments
On Wed, September 15, 2010 2:12 pm, Simon Matter wrote:
> You said ZFS; did you consider testing its built-in deduping? (If it's even there in Solaris 10?)

Simon,

OpenSolaris has had it (block-level dedup) for about a year, but it is too recent an addition to the commercial Solaris 10 to start using it (IMO). Apparently (Wikipedia) it is ZFS pool feature 21, listed as 'Reserved' by 'zpool upgrade -v'. (Hmm... both 'zfs get all ...' and 'zpool get all ...' do not yield a parameter sounding like 'deduplication'; it may very well not be there yet.)

Furthermore, I'd like to repeat what has been written earlier in this thread: a message header that is different in size by even one byte will cause block boundaries to shift and, I suspect, block-level dedup to fail.

Eric Luyten.
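Eric's point about shifted block boundaries is easy to demonstrate with fixed-size block hashing, which is roughly how block-level dedup sees the data. The headers and block size below are made up for the illustration:

```python
import hashlib

def block_hashes(data: bytes, block_size: int = 4096):
    """Hash fixed-size blocks, the way block-level dedup would see them."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

# Identical attachment bytes (non-repeating, so shifted blocks never collide):
body = b"".join(i.to_bytes(4, "big") for i in range(16384))

msg_a = b"Received: by host-a\r\n\r\n" + body
msg_b = b"Received: by host-bb\r\n\r\n" + body  # header one byte longer
msg_c = b"Received: by host-c\r\n\r\n" + body   # header same length as msg_a

# One extra header byte misaligns every block after it: nothing dedups.
shared = set(block_hashes(msg_a)) & set(block_hashes(msg_b))

# Same-length headers keep the body aligned: all body blocks dedup.
aligned = set(block_hashes(msg_a)) & set(block_hashes(msg_c))
```

Here `shared` comes out empty while `aligned` contains every body block, which is exactly the failure mode Eric predicts for per-recipient header differences.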
Re: De-duping attachments
Outside the Cyrus box: the Mimedefang milter has a built-in function (optional, of course) to remove an attachment, write it to a file, and replace the attachment part with a text part giving a web link to the file. The files could be on a slower type of disk drive than you need for email storage. You could write code choosing which attachments to do this to, say by size or file extension.

A mechanism to remove the files is not provided, but it's suggested that recipients would need to download the attachment to their own computer, and that therefore the files could be deleted by a cron job based on age.

I mention this only as another way to do it. Note that this could be implemented for outgoing mail too. We have not implemented it here, so I can't say more than that it is possible.

Joseph Brennan
Columbia University Information Technology
Re: De-duping attachments
Hi,

On Wed, 15 Sep 2010, Nik Conwell wrote:
> Isn't the easy hack for dedup just looking at the above md5 files and then doing appropriate hard links? This could be done by a nightly trawl of the spool space. A bigger win would be to separate the headers from the messages but that's a lot more work.

For what it's worth, I believe the fsdup tool which is part of fslint will do this for you. http://www.pixelbeat.org/fslint/

Gavin
Re: De-duping attachments
On 09/14/2010 11:55 PM, Rob Mueller wrote:
> Eg. An architectural firm might end up sending big blueprint documents back and forth between each other a lot, so they'd gain a lot from deduplication.

Not to throw a damp towel on this discussion, but isn't this really an administrative problem rather than a technical one? I.e. shouldn't the system administrator set up a version control system, or even something like Dropbox, for file sharing rather than using email for this situation?

> If you know the same file is being sent back and forth a lot with minor changes, you might want to store the most recent version, and store binary diffs between the most recent and old versions (eg xdelta). Yes, accessing the older versions would be much slower (have to get most recent + apply N deltas), but the space savings could be huge.

My users frequently mail documents to the person in the office next door (never mind that both their home directories are on the same server!); however this content is almost always different for each attached file; i.e. without re-implementing a version control system under IMAP, as you're suggesting, there would be little benefit in keeping and hard linking to a single copy of each file. However, that seems like it fails the UNIX "do one thing, and do it well" test pretty badly.
Re: De-duping attachments
On Wed, Sep 15, 2010 at 05:24:11PM +0100, Gavin McCullagh wrote:
> Hi,
> On Wed, 15 Sep 2010, Nik Conwell wrote:
>> Isn't the easy hack for dedup just looking at the above md5 files and then doing appropriate hard links? This could be done by a nightly trawl of the spool space. A bigger win would be to separate the headers from the messages but that's a lot more work.
> For what it's worth, I believe the fsdup tool which is part of fslint will do this for you. http://www.pixelbeat.org/fslint/

Or this lovely little toy. It uses the fact that in current versions of Cyrus the GUID field is actually the sha1 of the underlying file.

Bron ( warning: may contain FastMail-specific assumptions )

#!/usr/bin/perl -w
# SETUP {{{
use strict;
use warnings;

BEGIN { do "/home/mod_perl/hm/ME/FindLibs.pm"; }

use Date::Manip;
use MailApp::Admin::Actions;
use IO::File;
use ME::Machine;
use Cyrus::HeaderFile;
use Data::Dumper;
use Cyrus::IndexFile;
use Getopt::Std;
use Digest::SHA1;
use ME::CyrusBackup;
use ME::User;
use Data::Dumper;
# }}}

my $sn = shift;

my (undef,undef,$uid,$gid) = getpwnam('cyrus');

foreach my $Slot (ME::Machine->ImapSlots()) {
  next if ($sn and $sn ne $Slot->Name());
  my $users = $Slot->AllMailboxes();
  my $conf = $Slot->ImapdConf();
  foreach my $user (sort keys %$users) {
    process($conf, $user, $users->{$user});
  }
}

sub process {
  my ($conf, $user, $folders) = @_;
  print "$user\n";
  my %ihave;
  foreach my $folder (@$folders) {
    my $meta = $conf->GetUserLocation('meta', $user, 'default', $folder);
    my $index = Cyrus::IndexFile->new_file("$meta/cyrus.index")
      || die "Failed to open $meta/cyrus.index";
    while (my $record = $index->next_record()) {
      push @{$ihave{$record->{MessageGuid}}}, [$folder, $record->{Uid}];
    }
  }
  foreach my $guid (keys %ihave) {
    next if @{$ihave{$guid}} <= 1;
    my ($inode, $srcname);
    my @others;
    foreach my $item (@{$ihave{$guid}}) {
      my $spool = $conf->GetUserLocation('spool', $user, 'default', $item->[0]);
      $spool =~ s{/$}{};
      my $file = "$spool/$item->[1].";
      my (@sd) = stat($file);
      if ($inode) {
        next if $sd[1] == $inode;
        push @others, $file;
      }
      else {
        $inode = $sd[1];
        $srcname = $file;
      }
    }
    next unless @others;
    print "fixing up files for $guid ($srcname)\n";
    foreach my $file (@others) {
      my $tmpfile = $file . "tmp";
      print "link error $tmpfile\n" unless link($srcname, $tmpfile);
      chown($uid, $gid, $tmpfile);
      chmod(0600, $tmpfile);
      print "rename error $file\n" unless rename($tmpfile, $file);
    }
  }
}
Re: De-duping attachments
> How difficult or easy would it be to modify Cyrus to strip all attachments from emails and store them separately in files? In the message file, replace the attachment with a special tag which will point to the attachment file. Whenever the message is fetched for any reason, the original MIME-encoded message will be re-constructed and delivered.

Like anything, doable, but quite a lot of work.

Cyrus likes to mmap the whole file so it can just offset into it to extract whichever part is requested. In IMAP, you can request any arbitrary byte range from the raw RFC822 message using the body[]<start.length> construct, so you have to be able to byte-accurately reconstruct the original email if you remove attachments.

Consider the problem of transfer encoding. Say you have a base64-encoded attachment (which basically all are). When storing and deduping, you'd want to base64 decode it to get the underlying binary data. But depending on the line length of the base64-encoded data, the same file can be encoded in a large number of different ways. When you reconstruct the base64 data, you have to be byte accurate in your reconstruction so your offsets are correct, and so any signing of the message (eg DKIM) isn't broken.

Once you've solved those problems, the rest is pretty straightforward :)

Rob
Re: De-duping attachments
On Wed, Sep 15, 2010 at 12:13:03PM +1000, Rob Mueller wrote:
>> How difficult or easy would it be to modify Cyrus to strip all attachments from emails and store them separately in files? In the message file, replace the attachment with a special tag which will point to the attachment file. Whenever the message is fetched for any reason, the original MIME-encoded message will be re-constructed and delivered.

http://www.newegg.com/Product/Product.aspx?Item=N82E16822148413

2TB - US $109.

> Like anything, doable, but quite a lot of work.

Now de-duping messages on copy is valuable, not so much because of the space it saves, but because of the IO it saves. Copying the file around is expensive. De-duping components of messages and then reconstructing? Not so much. You'll be causing MORE IO in general looking for the message, finding the parts.

The only real benefit I can see is something like replication, or a client that's downloading multiple of these large messages and wants to save network bandwidth. Except - there's no protocol to support this for clients, so only replication could gain.

> Cyrus likes to mmap the whole file so it can just offset into it to extract whichever part is requested. In IMAP, you can request any arbitrary byte range from the raw RFC822 message using the body[]<start.length> construct, so you have to be able to byte-accurately reconstruct the original email if you remove attachments. Consider the problem of transfer encoding. Say you have a base64 encoded attachment (which basically all are). When storing and deduping, you'd want to base64 decode it to get the underlying binary data. But depending on the line length of the base64 encoded data, the same file can be encoded in a large number of different ways. When you reconstruct the base64 data, you have to be byte accurate in your reconstruction so your offsets are correct, and so any signing of the message (eg DKIM) isn't broken.
> Once you've solved those problems, the rest is pretty straightforward :)

Yeah, they really aren't so hard to solve. I didn't actually do the research, but I have an idea what to do. Find a big corpus of emails (i.e. FastMail's one!) and figure out the 10-20 most common base64 widths and surrounding layouts. Choose one of those and store it with a single "it's this layout" marker. If none of them match exactly, store a binary diff from the closest one as well; it probably won't be very huge.

But in general, I'd say you're optimising the wrong problem. It's just not worth it: the savings are minimal and the added complexity is high. Disk space is now cheap, and fast access via a cached copy of the email will beat re-creating the original file from mime parts hands down.

Bron.
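Bron's "closest common layout plus a binary diff" idea could be sketched as below. The width list stands in for the corpus survey he describes, CRLF-only line endings are assumed, and `difflib` substitutes for a real binary-diff tool like xdelta; it is an illustration of the scheme, not an implementation of it.

```python
import base64
import difflib

COMMON_WIDTHS = (76, 72, 64, 60)  # placeholder for the "10-20 most common" survey

def closest_encoding(original: bytes, decoded: bytes):
    """Re-encode `decoded` at each candidate width, pick the candidate closest
    to `original`, and return (width, patch, candidate) where applying the
    patch to the candidate restores the original bytes exactly."""
    best = None
    for width in COMMON_WIDTHS:
        b64 = base64.b64encode(decoded)
        candidate = b"\r\n".join(b64[i:i + width]
                                 for i in range(0, len(b64), width)) + b"\r\n"
        sm = difflib.SequenceMatcher(None, candidate, original, autojunk=False)
        patch = [(tag, i1, i2, original[j1:j2])
                 for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]
        cost = sum(len(p[3]) for p in patch)  # bytes we'd have to store as diff
        if best is None or cost < best[2]:
            best = (width, patch, cost, candidate)
    width, patch, _cost, candidate = best
    return width, patch, candidate

def apply_patch(candidate: bytes, patch) -> bytes:
    """Replay the stored edits against the re-encoded candidate."""
    out, pos = [], 0
    for _tag, i1, i2, repl in patch:
        out.append(candidate[pos:i1])  # copy the part that matched
        out.append(repl)               # splice in the original bytes
        pos = i2
    out.append(candidate[pos:])
    return b"".join(out)
```

When the message used a surveyed width, the patch is empty and the stored marker alone reconstructs it; otherwise only the small diff needs storing alongside the decoded attachment.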
Re: De-duping attachments
Dear Rob,

I had reservations about some of these things too. :( In particular, I was wondering about having to remember and recreate the exact transfer encoding. If both of us forward the same attachment in two emails, and one encodes it in quoted-printable and the other in base64, Cyrus had better be able to recreate both exactly, or have some other workaround.

I wasn't aware of the mmap() usage and the direct seeking into the middle of the message body. But the bigger problem is the one you've described: reproducing the message byte-identically. If that can be solved, then we can make Cyrus re-create the message while loading it from disk and stick it into RAM.

Can we just brainstorm with you and others in this thread... how do we re-create a byte-identical attachment from a disk file? What is the list of attributes we will need to store per stripped attachment to allow an exact re-creation?

- file name/reference
- full MIME header of the attachment block
- separator string (this will be retained in the message body anyway)
- transfer encoding
- if encoding = base64, then the base64 line length
- checksum of the encoded attachment (as a sanity check in case the re-encoding fails to recreate exactly the same image as the original)

If the encoding is quoted-printable or uuencode, then don't strip the attachment at all. What other conditions may we need to look for to bypass attachment stripping?

Can we just tap into all of you to get the ideas on paper, even if it's not being implemented by anyone right now? It'll at least help us understand the system's internals better.

thanks a lot, and regards,
Shuvam

> Cyrus likes to mmap the whole file so it can just offset into it to
> extract whichever part is requested. In IMAP, you can request any
> arbitrary byte range from the raw RFC 822 message using the
> BODY[]<start.length> construct, so you have to be able to reconstruct
> the original email byte-accurately if you remove attachments. Consider
> the problem of transfer encoding.
>
> Say you have a base64-encoded attachment (which basically all are).
> When storing and de-duping, you'd want to base64-decode it to get the
> underlying binary data. But depending on the line length of the base64
> encoded data, the same file can be encoded in a large number of
> different ways. When you reconstruct the base64 data, you have to be
> byte-accurate in your reconstruction so that your offsets are correct,
> and so that any signing of the message (e.g. DKIM) isn't broken.
>
> Once you've solved those problems, the rest is pretty straightforward :)
>
> Rob

Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/
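The byte-accurate reconstruction described above can be sketched roughly as follows: when stripping a base64 part, decode the raw bytes, but also record the encoded line lengths, the line ending, and a checksum of the original encoded text; on fetch, re-wrap the re-encoded data to exactly those lengths and verify the checksum before serving it. This is an illustrative Python sketch of the idea, not Cyrus code; the function names are hypothetical.

```python
import base64
import hashlib

def strip_base64_part(encoded_text: str):
    """Decode a base64 body part, keeping the metadata needed to
    rebuild it byte-identically: per-line lengths, the line ending,
    and a checksum of the original encoded text."""
    eol = "\r\n" if "\r\n" in encoded_text else "\n"
    lines = encoded_text.split(eol)
    trailing = bool(lines) and lines[-1] == ""   # did the text end with EOL?
    if trailing:
        lines = lines[:-1]
    meta = {
        "line_lengths": [len(line) for line in lines],
        "eol": eol,
        "trailing_eol": trailing,
        "sha1": hashlib.sha1(encoded_text.encode("ascii")).hexdigest(),
    }
    data = base64.b64decode("".join(lines))
    return data, meta

def rebuild_base64_part(data: bytes, meta) -> str:
    """Re-encode and re-wrap to the recorded line lengths, then verify
    against the stored checksum (the 'sanity check' from the thread)."""
    flat = base64.b64encode(data).decode("ascii")
    out, pos = [], 0
    for n in meta["line_lengths"]:
        out.append(flat[pos:pos + n])
        pos += n
    text = meta["eol"].join(out)
    if meta["trailing_eol"]:
        text += meta["eol"]
    if hashlib.sha1(text.encode("ascii")).hexdigest() != meta["sha1"]:
        raise ValueError("re-encoding did not reproduce the original; "
                         "store this part verbatim instead of stripping it")
    return text
```

If the original encoder did anything non-canonical (stray whitespace, odd padding bits), the checksum test fails, and the safe fallback is simply not to strip that attachment at all, just as suggested above for quoted-printable and uuencode parts.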
Re: De-duping attachments
Dear Bron,

> http://www.newegg.com/Product/Product.aspx?Item=N82E16822148413
> 2TB - US $109.

Don't want to nit-pick here, but the effective price we pay is about ten times this. To set up a mail server with a few TB of disk space, we usually end up deploying a separate chassis with RAID controllers and a RAID array, with FC connections from the servers, etc, etc. All this adds up to about $1,000/TB of usable space if you're using something like the low-end IBM DS3400 box or its Dell/EMC equivalent, and this is even with inexpensive 7200 RPM SATA-II drives, not 15K RPM SAS drives.

http://www-07.ibm.com/storage/in/disk/ds3000/ds3400/

And most of our customers actually double this cost, because they keep two physically identical chassis for redundancy. (We recommend this too, because we can't trust a single RAID 5 array to withstand controller or PSU failures.) In that case, it's $2,000/TB. And you do reach 5-10 TB of email store quite rapidly --- our company has many corporate clients (> 500 email users) whose IMAP store has reached 4 TB. No one wants to enforce disk quotas (corporate policy), and most users don't want to delete emails on their own.

We keep hearing the logic that storage is cheap, and stories of cloud storage through Amazon and unlimited mailboxes on Gmail are reinforcing the belief. But at the ground level in mid-market corporate IT budgets, storage costs in data centres (as against inside desktops) are still too high to be trivial, and their prices have only a little to do with the prices of raw SATA-II drives. A fully-loaded DS3400 costs a little over $12,000 in India with a full set of 1 TB SATA-II drives from IBM, but even with the high cost of IBM drives, the drives themselves contribute less than 30% of the total cost.

If we really want to put our collective money where our mouth is, and deliver the storage-is-cheap promise at the ground level, we need to re-architect every file server and IMAP server to work in map-reduce mode and use the disks inside desktops. Anyone game for this project? :)

> Now de-duping messages on copy is valuable, not so much because of the
> space it saves, but because of the IO it saves. Copying the file around
> is expensive. De-duping components of messages and then reconstructing?
> Not so much. You'll be causing MORE IO in general looking for the
> message, finding the parts.

I agree. My aim was not to reduce IOPS but to cut disk space usage. There are two areas where we are seeing a huge increase in inactive disk utilisation for emails. One is the archive, which is being kept for security and compliance reasons. Every company we work with wants an archive with at least a few years' retention. They search the archive every few weeks to trace lost emails, not for compliance reasons but to find missing information. This means that we can't ask them to move the data out to removable storage. The second area is shared mail folders, where all communication with each client/topic/project is stored practically forever.

A 500-user company can easily acquire an email archive of 2-5 TB. I don't care how much the IO load of that archive server increases, but I'd like to reduce disk space utilisation. If the customer can stick to 2 TB of space requirements, he can use a desktop with two 2 TB drives in RAID 1 and get a really cheap archive server. If this figure reaches 3-4 TB, he goes into a separate RAID chassis --- the hardware cost goes up 5-10 times. These are tradeoffs a lot of small to mid-sized companies in my market fuss about.

And in a more generic context, I am seeing that all kinds of intelligent de-duping of infrequently-accessed data is going to become the crying need of every mid-sized and large company. Data is growing too fast, and no one wants to impose user discipline or data cleaning. When we tell the business head "This is crazy!", he turns around and tells the CTO, "But disk space is cheap! Haven't you heard of Google? What are you cribbing about? You must be doing something really inefficient here, wasting money!"

thanks and regards,
Shuvam
Re: De-duping attachments
> A 500-user company can easily acquire an email archive of 2-5 TB. I
> don't care how much the IO load of that archive server increases, but
> I'd like to reduce disk space utilisation. If the customer can stick
> to 2 TB of

It would be interesting to measure the amount of duplication that is going on with attachments in emails. While we could do that with FastMail data, I think because of the broad range of users we'd be getting just one data point, which might be quite different to a data point inside one company. E.g. an architectural firm might end up sending big blueprint documents back and forth between each other a lot, so they'd gain a lot from de-duplication.

Also, even within de-duplication, there are some interesting ideas. For instance, if you know the same file is being sent back and forth a lot with minor changes, you might want to store only the most recent version, and store binary diffs between the most recent and the older versions (e.g. xdelta). Yes, accessing the older versions would be much slower (you have to get the most recent version and apply N deltas), but the space savings could be huge.

> Can we just brainstorm with you and others in this thread... how do we
> re-create a byte-identical attachment from a disk file?

One overall implementation issue. With the message file, do you:

1. Completely rewrite the message file, removing the attachments and adding any extra metadata you want in their place; or
2. Leave the message file exactly the same size, just don't write out the attachment content, and assume your filesystem supports sparse files (http://en.wikipedia.org/wiki/Sparse_file)?

The advantage of 2 is that it leaves the message file size correct, and all the offsets in the file are still correct. The downsides are that you must ensure your FS supports sparse files well, and there's the question of where you actually store the information that links to the external file.
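Option 2 can be prototyped on any Unix filesystem with sparse-file support: seek past the region where the attachment body used to be instead of writing it, and the filesystem allocates no blocks for the hole, while the file size (and hence every byte offset after the hole) stays unchanged. An illustrative Python sketch, not Cyrus code; the file name and helper are made up for the demo.

```python
import os

def write_with_hole(path, head: bytes, hole_len: int, tail: bytes):
    """Write a file whose middle region (where the stripped attachment
    body used to be) is a hole: nothing is written there, so on a
    filesystem with sparse-file support no blocks are allocated, yet
    the logical file size and all subsequent offsets are preserved."""
    with open(path, "wb") as f:
        f.write(head)
        f.seek(hole_len, os.SEEK_CUR)   # skip the hole instead of writing it
        f.write(tail)

# Example: a 1 MiB "attachment body" replaced by a hole.
path = "/tmp/sparse_demo.msg"
head = b"Content-Transfer-Encoding: base64\r\n\r\n"
tail = b"\r\n--boundary--\r\n"
write_with_hole(path, head, 1 << 20, tail)

size = os.path.getsize(path)                # logical size includes the hole
allocated = os.stat(path).st_blocks * 512   # physical allocation is usually far less
```

Note that reads of the hole return NUL bytes, not the original base64 text, so the server still has to splice the re-encoded attachment back in at fetch time; the sparse file only keeps the on-disk layout and the offsets intact.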
> - file name/reference
> - full MIME header of the attachment block

I'd leave these intact in the actual message, and just add an extra X-Detached-File header or something like that which includes some external reference to the file. Hmmm, that'll break signing though. Not so easy...

> - separator string (this will be retained in the message body anyway)
> - transfer encoding
> - if encoding = base64 then base64 line length

Remember that every line can actually be a different length! In most cases they will all be the same length, but you can't assume it. And you do see messages that have lines in repeating groups like 76, 76, 76, 76, 74, 76, 76, 76, 76, 74, ... and so on; cases like that are a pain to deal with.

> - checksum of encoded attachment (as a sanity check in case the
> re-encoding fails to recreate exactly the same image as the original)

This is seeming a bit more tricky...

Rob
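The repeating-group line lengths mentioned above need not blow up the stored metadata: record the per-line length list and run-length encode it, so the common uniform case collapses to a single (length, count) pair, the 76, 76, 76, 76, 74 pattern becomes a short alternating list, and fully irregular parts are still representable. A hypothetical sketch:

```python
from itertools import groupby

def rle_encode(lengths):
    """Run-length encode a list of base64 line lengths.
    Uniform wrapping collapses to a single (length, count) pair."""
    return [(value, len(list(group))) for value, group in groupby(lengths)]

def rle_decode(pairs):
    """Expand (length, count) pairs back into the per-line length list."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out
```

A 10,000-line part wrapped uniformly at 76 columns stores as the single pair (76, 10000), while the metadata for a repeating-group part grows with the number of runs rather than the number of lines.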