Re: De-duping attachments

2010-09-16 Thread Marc Patermann
Hi,

Shuvam Misra wrote on 15.09.2010 at 03:40:

 How difficult or easy would it be to modify Cyrus to strip all
 attachments from emails and store them separately in files? In the
 message file, replace the attachment with a special tag which will point
 to the attachment file. Whenever the message is fetched for any reason,
 the original MIME-encoded message will be re-constructed and delivered.
Hm, OK, I'm not sure if I should post this here, but there is another IMAP 
server (some say: the new kid on the block), and as I came across it a 
few days ago, I saw they are working on the same thing:

http://blog.dovecot.org/2010/07/single-instance-attachment-storage.html

I did not follow the discussion, but at least there is a discussion 
about this feature elsewhere.


Marc

Re: De-duping attachments

2010-09-15 Thread Bron Gondwana
On Wed, Sep 15, 2010 at 08:40:59AM +0530, Shuvam Misra wrote:
 Dear Rob,
 
 I had reservations about some of these things too. :( In particular,
 I was wondering about having to remember and recreate the exact
 transfer-encoding. If both of us forward the same attachment in two
 emails, and one encodes in quoted-printable, the other in base64, Cyrus
 had better be able to recreate them exactly or have some other
 workarounds.
 
 I wasn't aware of the mmap() usage and the direct seeking into the middle
 of the message body. But the bigger problem is what you've described about
 reproducing the message byte-identically. If that can be solved, then we
 can make Cyrus re-create the message while loading from disk and stick it
 into RAM.

There's not actually THAT much parsing of the message body.  I would
guess it's about 9 places:

imap/cyrdump.c
250:r = mailbox_map_message(state->mailbox, uids[i], &base, &len);

imap/index.c
1013:if (mailbox_map_message(mailbox, im->record.uid,
1535:if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size))
2441:   if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size)) {
2716:if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size))
3152:   if (mailbox_map_message(mailbox, im->record.uid,
3337:  if (mailbox_map_message(mailbox, uid, &msgfile.base, &msgfile.size)) {
5112:   if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size))

(those 8 plus one in imap/message.c where it gets parsed originally)
 
 Can we just brainstorm with you and others in this thread...  how do we
 re-create a byte-identical attachment from a disk file?  What is the list
 of attributes we will need to store per stripped attachment to allow an
 exact re-creation?

I did a bunch of work on this a while back.  Basically, for the
byte-identical reverse, as I said: keep a list of the most common mapping
functions and try to figure out which one it is algorithmically.
In theory we can work out what the common ones are pretty fast.

   - file name/reference
 
   - full MIME header of the attachment block
 
   - separator string (this will be retained in the message body anyway)
 
   - transfer encoding

All this stuff I'd keep as a binary diff from the "nearly right"
re-encoding.

   - if encoding = base64 then
 base64 line length

Yeah, that's an interesting one.  Assuming it's not totally pathological
there will be some base64 pattern you can find quickly.
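
(A rough sketch of the detection, as standalone Perl rather than
anything in Cyrus -- the helper name is made up:

use strict;
use warnings;

# Guess the line width of an existing base64 block.  Returns the
# width if a single width covers every line except possibly the
# last, or undef for a pathological layout needing the diff fallback.
sub guess_base64_width {
    my ($encoded) = @_;
    my @lines = split /\r?\n/, $encoded;
    pop @lines while @lines && $lines[-1] eq '';
    return undef unless @lines;
    my $width = length $lines[0];
    for my $i (0 .. $#lines - 1) {      # last line may be short
        return undef if length($lines[$i]) != $width;
    }
    return $width;
}

Anything that comes back undef is a candidate for "don't strip".)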

   - checksum of encoded attachment (as a sanity check in case the re-encoding
 fails to recreate exactly the same image as the original)

We like sha1s.

 If encoding = quoted-printable or uuencode, then don't strip the
 attachment at all.

Makes sense.  There might be some size based logic here too - only
bother applying this on messages over 20k, and where the attachment
is at least 20k in size.  Anything smaller than that is pretty
pointless.

 What other conditions may we need to look for to bypass attachment
 stripping?
 
 Can we just tap into all of you to get the ideas on paper, even if
 it's not being implemented by anyone right now?  It'll at least help us
 understand the system's internals better.

Sure.  Ideas are good :)  I don't think I'm sold on the value though;
Rob is actually the one who argued me down from
implementing this years ago ;)  But maybe our use case isn't the
same as yours.

Bron.

Re: De-duping attachments

2010-09-15 Thread Bron Gondwana
On Wed, Sep 15, 2010 at 09:15:13AM +0530, Shuvam Misra wrote:
 Dear Bron,
 
  http://www.newegg.com/Product/Product.aspx?Item=N82E16822148413
  
  2TB - US $109.
 
 Don't want to nit-pick here, but the effective price we pay is about
 ten times this.

Yeah, so?  It's going down.  That's a large number of attachments
we're talking about there.

 To set up a mail server with a few TB of disk space,
 we usually land up deploying a separate chassis with RAID controllers and
 a RAID array, with FC connections from servers, etc, etc.  All this adds
 up to about $1,000/TB of usable space if you're using something like the
 low-end IBM DS3400 box or Dell/EMC equivalent. This is even with
 inexpensive 7200RPM SATA-II drives, not 15KRPM SAS drives.

Hmm... our storage units with metadata on SSD come in about $1200/TB.
Yes, that sounds about right.  That's including hot spares, RAID1 on
everything (including the SSDs), scads of processor and memory.
Obviously multiply that by two for replication, and add in a bit of
extra for backups and I'm happy to arrive at a figure of approximately
$3000 per terabyte of actual email.

 And most of our customers actually double this cost because they keep two
 physically identical chassis for redundancy. (We recommend this too,
 because we can't trust a single RAID 5 array to withstand controller or
 PSU failures.) In that case, it's $2000/TB.

And because it's nice not to have downtime when you're doing
maintenance.  I replaced an entire drive unit today, including
about 4 hours downtime on one of our servers as the system was
swamped with IO creating new filesystems and initialising the
drives.  The users didn't see a thing, and replication is now
fully operational again.

 And you do reach 5-10 TB of email store quite rapidly --- our company
 has many corporate clients (> 500 email users) whose IMAP store has
 reached 4TB. No one wants to enforce disk quotas (corporate policy),
 and most users don't want to delete emails on their own.

So you save, what, 50%.  Does that sound about right?  Do you have
statistics on how much space you'd save with this theoretical
patch?
 
 We keep hearing the logic that storage is cheap, and stories of cloud
 storage through Amazon, unlimited mailboxes on Gmail, are reinforcing
 the belief. But at the ground level in mid-market corporate IT budgets,
 storage costs in data centres (as against inside desktops) are still
 too high to be trivial, and their prices have little to do with
 the prices of raw SATA-II drives. A fully-loaded DS3400 costs a little
 over $12,000 in India, with a full set of 1TB SATA-II drives from IBM,
 but even with high cost of IBM drives, the drives themselves contribute
 less than 30% of the total cost.

You're buying a few months.  Usage grows to fill the available storage,
whatever it is.  And you can only pull this piece of magic once.

 If we really want to put our collective money where our mouth is, and
 deliver the storage-is-cheap promise at the ground level, we need to
 rearchitect every file server and IMAP server to work in map-reduce mode
 and use disks inside desktops. Anyone game for this project? :)

You could buy as much benefit much more quickly by gzipping the
individual email files.  Either a filesystem that stores files
compressed, or a cyrus patch to do that and unpack files on the
fly when the body is read.  With most/all headers in the
cyrus.cache file, the body doesn't get opened very often.  Man,
a body search would hurt though!
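
(Rough sketch of the unpack-on-the-fly half, assuming a made-up
convention where compressed spool files get a .gz suffix -- that
suffix is my invention, not a Cyrus one:

use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Return the full message text whether or not the spool file has
# been gzipped behind Cyrus's back.
sub read_message {
    my ($path) = @_;
    if (-e $path) {
        open my $fh, '<', $path or die "open $path: $!";
        local $/;
        return scalar <$fh>;
    }
    my $buf;
    gunzip("$path.gz" => \$buf)
        or die "gunzip $path.gz: $GunzipError";
    return $buf;
}

A real patch would then have to hand that buffer to everything that
currently expects an mmap'ed file, which is where the work is.)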

  Now de-duping messages on copy is valuable, not so much because of
  the space it saves, but because of the IO it saves.  Copying the file
  around is expensive.
  
  De-duping components of messages and then reconstructing?  Not so much.
  You'll be causing MORE IO in general looking for the message, finding the
  parts.
 
 I agree. My aim was not to reduce IOPS but to cut disk space usage.

IOPS matter too.  Depending on your usage patterns obviously.  If you
don't ever get body searches on your server they probably matter less.

 A 500-user company can easily acquire an email archive of 2-5TB. I don't
 care how much the IO load of that archive server increases, but I'd like
 to reduce disk space utilisation. If the customer can stick to 2TB of
 space requirements, he can use a desktop with two 2TB drives in RAID
 1, and get a real cheap archive server. If this figure reaches 3-4TB,
 he goes into a separate RAID chassis --- the hardware cost goes up 5-10
 times. These are tradeoffs a lot of small to mid-sized companies in my
 market fuss about.

Sounds like a case for a cheaper RAID chassis to me.  Or actually
cleaning up a little.  While I appreciate the tradeoff, I think
they'll still fill up pretty quickly even with this.  It's a
short term stop-gap measure.
 
 And in a more generic context, I am seeing that all kinds of intelligent
 de-duping of infrequently-accessed data is going to become the crying
 need of every mid-sized and large company. [...]

Re: De-duping attachments

2010-09-15 Thread Simon Matter
 [...]

 You could buy as much benefit much more quickly by gzipping the
 individual email files.  Either a filesystem that stores files
 compressed, or a cyrus patch to do that and unpack files on the
 fly when the body is read.  With most/all headers in the
 cyrus.cache file, the body doesn't get opened very often.

I guess a compressing and de-duping filesystem or disk storage would be
much more efficient than a merely compressing one in this case. Has
anyone tried this with a Cyrus message store with lots of corporate
message data stored on it?

Simon


Re: De-duping attachments

2010-09-15 Thread Eric Luyten
On Wed, September 15, 2010 10:01 am, Simon Matter wrote:

 I guess a compressing and de-duping filesystem or disk storage would be
 much more efficient than a merely compressing one in this case. Has anyone
 tried this with a Cyrus message store with lots of corporate message data
 stored on it?


Simon,


The Cyrus server I hope to get online tomorrow evening holds 4.2 TB of mail
and uses ZFS with maximal compression (gzip-9) for the message files.
(OS: Solaris 10)

ZFS reports a compressratio of between 1.95 and 1.97 (we have nine partitions).

A series of tests revealed our metadata can actually be compressed by a factor
of 3.76 (!)

Perhaps a two-university environment with 60,000+ users doesn't quite qualify
as corporate enough but here you have our figures :-)


Regards,
Eric.



Re: De-duping attachments

2010-09-15 Thread Shuvam Misra
Dear Bron,

 So you save, what, 50%.  Does that sound about right?  Do you have
 statistics on how much space you'd save with this theoretical
 patch?

No, and this is the first thing I want to do. I'm getting some simple
utilities developed which will run all week (niced suitably) and extract
and MD5sum each attachment. I'll then count how many unique message-IDs
share the same unique document, and get a report. This has been under
discussion in our group for some time --- let me get this done and I'll
let all of you know.
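
In case it's useful to others, the core of the scanner looks roughly
like this (MIME-tools plus Digest::MD5; the counting and reporting
around it is omitted, and attachment_md5s() is just our working name):

use strict;
use warnings;
use MIME::Parser;
use Digest::MD5 qw(md5_hex);

my $parser = MIME::Parser->new;
$parser->output_to_core(1);      # keep decoded parts in memory

# md5 of each decoded attachment in one message file
sub attachment_md5s {
    my ($msgfile) = @_;
    my $entity = $parser->parse_open($msgfile);
    my @sums;
    for my $part ($entity->parts_DFS) {
        my $bh = $part->bodyhandle or next;   # skip multipart containers
        my $dispo = $part->head->mime_attr('content-disposition') || '';
        next unless $dispo =~ /attachment/i;
        # md5 the *decoded* bytes, so transfer-encoding differences
        # between copies don't hide duplicates
        push @sums, md5_hex($bh->as_string);
    }
    return @sums;
}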

 You're buying a few months.  Usage grows to fill the available storage,
 whatever it is.  And you can only pull this piece of magic once.

Unfortunately, you're totally right. The junk will keep growing.

Shuvam

Re: De-duping attachments

2010-09-15 Thread Shuvam Misra

The sparse file idea is brilliant! Never occurred to me. :)

We'd have to store the reference-pointer in the message file, so we would
omit the actual attachment but eat up perhaps 50 bytes to keep the
reference to the file.

Shuvam

 1. Completely rewrite the message file, removing the attachments and
 adding any extra meta data you want in its place
 2. Leave the message file exactly the same size, just don't write
 out the attachment content, and assume your filesystem supports
 sparse files (http://en.wikipedia.org/wiki/Sparse_file)
 
 The advantage of 2 is that it leaves the message file size correct,
 and all the offsets in the file are still correct. The downsides are
 that you must ensure your FS supports sparse files well, and there's
 the question of where do you actually store the information that
 links to the external file?

Re: De-duping attachments

2010-09-15 Thread Shuvam Misra
 Makes sense.  There might be some size based logic here too - only
 bother applying this on messages over 20k, and where the attachment
 is at least 20k in size.  Anything smaller than that is pretty
 pointless.

Yes, absolutely. Left to myself, I'd not have bothered with any
attachment less than 100KBytes or so. The stuff that gets my goat is
seeing our customers using email to shunt 20MB CAD files back and forth
across the world two dozen times. Emails are being used for the kind of
work God had meant trucks to do. :(

 Sure.  Ideas are good :)  I don't think I'm sold on the value though.
 And given that Rob is actually the one who argued me down from
 implementing this years ago ;)  But maybe our use case isn't the
 same as yours.

Let me get some hard data from a few of our large corporate clients'
servers, and then we'll talk again. May take a couple of weeks to get
this data, because we'll need to look for a time window when the mail
server is less loaded to run our scan.

Shuvam

Re: De-duping attachments

2010-09-15 Thread Nik Conwell
Great thread.  Here are some real-world numbers based on our spools
here at BU.

One of our masters has 4,800 users, 22,000 mailboxes, and is using about 
374G of disk.

Based on the md5 files for these users there are 6,046,363 messages.  If
I look at the first md5 value (the md5 of the message file, if I understand
this correctly) and sort and uniq, I get 5,891,974 unique messages, so
assuming we dedup all those messages that would be a shrink to 97.4% of
the original number of messages.  Assuming an even distribution of message
sizes this would mean 374G would drop down to about 364G.  Unfortunately
not an obvious huge win.

But I think the md5 of the message file includes headers, which may be
more likely to be unique than the body content.  (Due to legacy support
for UW IMAP, we often end up routing things differently for users on the
same master, so the headers for the same message sent to 2 people could
be different.)

Isn't the easy hack for dedup just looking at the above md5 files and 
then doing appropriate hard links?  This could be done by a nightly 
trawl of the spool space.  A bigger win would be to separate the headers 
from the messages but that's a lot more work.
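
Something like this, I guess -- a standalone sketch that computes the
md5s directly rather than reading the existing md5 files (whose format
I won't assume here), and that trusts md5 without a byte-compare:

use strict;
use warnings;
use File::Find;
use Digest::MD5;

my %seen;    # md5 => first path seen with that content
find(sub {
    return unless -f && /^\d+\.$/;       # cyrus message files: "UID."
    my $path = $File::Find::name;
    open my $fh, '<', $_ or return;
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    if (my $first = $seen{$md5}) {
        return if (stat $first)[1] == (stat $path)[1];  # same inode already
        # not crash-safe: production code should link to a temp
        # name and rename over the duplicate instead
        unlink($path) and link($first, $path)
            or warn "relink $path: $!";
    }
    else {
        $seen{$md5} = $path;
    }
}, '/var/spool/imap');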

-nik


Re: De-duping attachments

2010-09-15 Thread Simon Matter
 On Wed, September 15, 2010 10:01 am, Simon Matter wrote:

 I guess a compressing and de-duping filesystem or disk storage would be
 much more efficient than a merely compressing one in this case. Has
 anyone tried this with a Cyrus message store with lots of corporate
 message data stored on it?


 Simon,


 The Cyrus server I hope to get online tomorrow evening holds 4.2 TB of
 mail and uses ZFS with maximal compression (gzip-9) for the message files.
 (OS: Solaris 10)

 ZFS reports a compressratio of between 1.95 and 1.97 (we have nine
 partitions).

 A series of tests revealed our metadata can actually be compressed by a
 factor of 3.76 (!)

 Perhaps a two-university environment with 60,000+ users doesn't quite
 qualify as corporate enough but here you have our figures :-)

Eric, that looks interesting, of course. By "more corporate style" I
mean far fewer users but much bigger mailboxes; enforcing quotas in the
multi-GB range seems quite common these days. In such an environment I
expect the compression ratio to increase.
But the big question for me is how much filesystem / block-level deduping
is going to shrink it. You said ZFS; did you consider testing its built-in
deduping? (If it's even there in Solaris 10?)

Simon


Re: De-duping attachments

2010-09-15 Thread Eric Luyten
On Wed, September 15, 2010 2:12 pm, Simon Matter wrote:

 You said ZFS; did you
 consider testing its built-in deduping?
 (If it's even there in Solaris 10?)

Simon,


OpenSolaris has had it (block-level dedup) for about a year,
but it is too recent an addition to the commercial Solaris 10 to
start using it (IMO). Apparently (Wikipedia) it is ZFS pool feature
21, listed as 'Reserved' by 'zpool upgrade -v'.
(Hmm... both 'zfs get all ...' and 'zpool get all ...' yield no
 parameter sounding like 'deduplication'; it may very well
 not be there yet.)

Furthermore, I'd like to repeat what has been written earlier in
this thread: a message header that differs in size by even one
byte will shift all following block boundaries and, I suspect,
cause block-level dedup to fail.


Eric Luyten.


Re: De-duping attachments

2010-09-15 Thread Joseph Brennan

Outside the Cyrus box: the MIMEDefang milter has a built-in function
(optional of course) to remove an attachment, write it to a file, and
replace the attachment part with a text part giving a web link to the
file.  The files could be on a slower type of disk drive than you need
for email storage.  You could write code choosing which attachments to
do this to, say by size or file extension.  A mechanism to remove the
files is not provided, but the suggestion is that recipients would need
to download the attachment to their own computer anyway, so the files
could be deleted by a cron job based on age.

I mention this only as another way to do it.  Note that this could be
implemented for outgoing mail too.  We have not implemented it here
so I can't say more than that it is possible.

Joseph Brennan
Columbia University Information Technology


Re: De-duping attachments

2010-09-15 Thread Gavin McCullagh
Hi,

On Wed, 15 Sep 2010, Nik Conwell wrote:

 Isn't the easy hack for dedup just looking at the above md5 files and 
 then doing appropriate hard links?  This could be done by a nightly 
 trawl of the spool space.  A bigger win would be to separate the headers 
 from the messages but that's a lot more work.

For what it's worth, I believe the fsdup tool which is part of fslint will
do this for you.

http://www.pixelbeat.org/fslint/

Gavin



Re: De-duping attachments

2010-09-15 Thread Patrick Goetz
On 09/14/2010 11:55 PM, Rob Mueller wrote:

 Eg. An architectural firm
 might end up sending big blueprint documents back and forth between each
 other a lot, so they'd gain a lot from deduplication.


Not to throw a damp towel on this discussion, but isn't this really an 
administrative problem rather than a technical one?  I.e., shouldn't the 
system administrator set up a version control system, or even something 
like Dropbox, for file sharing rather than using email for this situation?

  if you know the same file is being sent back and forth a lot with
  minor changes, you might want to store the most recent version,
  and store binary diffs between the most recent and old versions
  (eg xdelta). Yes accessing the older versions would be much
  slower (have to get most recent +
  apply N deltas), but the space savings could be huge.


My users frequently mail documents to the person in the office next door 
(never mind that both their home directories are on the same server!), but 
this content is almost always different for each attached file; i.e. 
without re-implementing a version control system under IMAP, as you're 
suggesting, there would be little benefit in keeping and hard-linking to a 
single copy of each file.  However, that seems like it fails the UNIX 
"do one thing, and do it well" test pretty badly.

Re: De-duping attachments

2010-09-15 Thread Bron Gondwana
On Wed, Sep 15, 2010 at 05:24:11PM +0100, Gavin McCullagh wrote:
 Hi,
 
 On Wed, 15 Sep 2010, Nik Conwell wrote:
 
  Isn't the easy hack for dedup just looking at the above md5 files and 
  then doing appropriate hard links?  This could be done by a nightly 
  trawl of the spool space.  A bigger win would be to separate the headers 
  from the messages but that's a lot more work.
 
 For what it's worth, I believe the fsdup tool which is part of fslint will
 do this for you.
 
   http://www.pixelbeat.org/fslint/

Or this lovely little toy.  It uses the fact that in current versions of
Cyrus the GUID field is actually the sha1 of the underlying file.

Bron (warning: may contain FastMail-specific assumptions)
#!/usr/bin/perl -w

# SETUP {{{
use strict;
use warnings;
BEGIN { do "/home/mod_perl/hm/ME/FindLibs.pm"; }
use Date::Manip;
use MailApp::Admin::Actions;
use IO::File;
use ME::Machine;
use Cyrus::HeaderFile;
use Cyrus::IndexFile;
use Getopt::Std;
use Digest::SHA1;
use ME::CyrusBackup;
use ME::User;
use Data::Dumper;
# }}}

my $sn = shift;

my (undef, undef, $uid, $gid) = getpwnam('cyrus');

foreach my $Slot (ME::Machine->ImapSlots()) {
  next if ($sn and $sn ne $Slot->Name());
  my $users = $Slot->AllMailboxes();
  my $conf = $Slot->ImapdConf();
  foreach my $user (sort keys %$users) {
    process($conf, $user, $users->{$user});
  }
}

sub process {
  my ($conf, $user, $folders) = @_;
  print "$user\n";
  my %ihave;
  foreach my $folder (@$folders) {
    my $meta = $conf->GetUserLocation('meta', $user, 'default', $folder);
    my $index = Cyrus::IndexFile->new_file("$meta/cyrus.index")
      || die "Failed to open $meta/cyrus.index";
    while (my $record = $index->next_record()) {
      push @{$ihave{$record->{MessageGuid}}}, [$folder, $record->{Uid}];
    }
  }

  foreach my $guid (keys %ihave) {
    next if @{$ihave{$guid}} <= 1;   # GUID seen only once: nothing to link
    my ($inode, $srcname);
    my @others;
    foreach my $item (@{$ihave{$guid}}) {
      my $spool = $conf->GetUserLocation('spool', $user, 'default', $item->[0]);
      $spool =~ s{/$}{};
      my $file = "$spool/$item->[1].";   # cyrus message files are named "UID."
      my (@sd) = stat($file);
      if ($inode) {
        next if $sd[1] == $inode;        # already hard-linked to the source
        push @others, $file;
      }
      else {
        $inode = $sd[1];
        $srcname = $file;
      }
    }
    next unless @others;
    print "fixing up files for $guid ($srcname)\n";
    foreach my $file (@others) {
      # link to a temp name, then rename over the duplicate (atomic)
      my $tmpfile = $file . "tmp";
      print "link error $tmpfile\n" unless link($srcname, $tmpfile);
      chown($uid, $gid, $tmpfile);
      chmod(0600, $tmpfile);
      print "rename error $file\n" unless rename($tmpfile, $file);
    }
  }
}

Re: De-duping attachments

2010-09-14 Thread Rob Mueller

 How difficult or easy would it be to modify Cyrus to strip all
 attachments from emails and store them separately in files? In the
 message file, replace the attachment with a special tag which will point
 to the attachment file. Whenever the message is fetched for any reason,
 the original MIME-encoded message will be re-constructed and delivered.

Like anything, doable, but quite a lot of work.

cyrus likes to mmap the whole file so it can just offset into it to extract 
whichever part is requested. In IMAP, you can request any arbitrary byte 
range from the raw RFC822 message using the BODY[]<start.length> construct, 
so you have to be able to byte-accurately reconstruct the original email if 
you remove attachments.

Consider the problem of transfer encoding. Say you have a base64 encoded 
attachment (which basically all are). When storing and deduping, you'd want 
to base64 decode it to get the underlying binary data. But depending on the 
line length of the base64 encoded data, the same file can be encoded in a 
large number of different ways. When you reconstruct the base64 data, you 
have to be byte accurate in your reconstruction so your offsets are correct, 
and so any signing of the message (eg DKIM) isn't broken.
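
A toy demonstration of that, in case it isn't obvious (MIME::Base64 
ships with Perl; the 1000 "x" bytes stand in for an attachment):

use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

my $data = "x" x 1000;

my $at76 = encode_base64($data);               # default: 76-char lines
(my $at60 = encode_base64($data, '')) =~ s/(.{60})/$1\n/g;
$at60 .= "\n" unless $at60 =~ /\n\z/;

# different bytes on the wire, identical decoded data
print $at76 eq $at60 ? "same wire bytes\n" : "different wire bytes\n";
print decode_base64($at76) eq decode_base64($at60)
    ? "same decoded data\n" : "different decoded data\n";

Both decode to the same file, but neither encoding can stand in for 
the other without breaking offsets and signatures.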

Once you've solved those problems, the rest is pretty straightforward :)

Rob


Re: De-duping attachments

2010-09-14 Thread Bron Gondwana
On Wed, Sep 15, 2010 at 12:13:03PM +1000, Rob Mueller wrote:
 
  How difficult or easy would it be to modify Cyrus to strip all
  attachments from emails and store them separately in files? In the
  message file, replace the attachment with a special tag which will point
  to the attachment file. Whenever the message is fetched for any reason,
  the original MIME-encoded message will be re-constructed and delivered.

http://www.newegg.com/Product/Product.aspx?Item=N82E16822148413

2TB - US $109.

 Like anything, doable, but quite a lot of work.

Now de-duping messages on copy is valuable, not so much because of
the space it saves, but because of the IO it saves.  Copying the file
around is expensive.

De-duping components of messages and then reconstructing?  Not so much.
You'll be causing MORE IO in general looking for the message, finding the
parts.

The only real benefit I can see is something like replication or a
client that's downloading multiple of these large messages and wants
to save network bandwidth.

Except there's no protocol to support this for clients, so only
replication could gain.
 
 cyrus likes to mmap the whole file so it can just offset into it to extract 
 whichever part is requested. In IMAP, you can request any arbitrary byte 
 range from the raw RFC822 message using the BODY[]<start.length> construct, 
 so you have to be able to byte-accurately reconstruct the original email if 
 you remove attachments.
 
 Consider the problem of transfer encoding. Say you have a base64 encoded 
 attachment (which basically all are). When storing and deduping, you'd want 
 to base64 decode it to get the underlying binary data. But depending on the 
 line length of the base64 encoded data, the same file can be encoded in a 
 large number of different ways. When you reconstruct the base64 data, you 
 have to be byte accurate in your reconstruction so your offsets are correct, 
 and so any signing of the message (eg DKIM) isn't broken.
 
 Once you've solved those problems, the rest is pretty straightforward :)

Yeah, they really aren't so hard to solve.  I didn't actually do the research,
but I have an idea what to do.  Find a big corpus of emails (i.e. FastMail's
one!) and figure out the 10-20 most common base64 widths and surrounding
layouts.  Choose one of those and store a single "it's this layout" marker.
If none of them match exactly, store a binary diff from the closest one as
well; it probably won't be very huge.
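
A sketch of that selection loop, with made-up candidate widths standing
in for whatever the corpus survey actually finds:

use strict;
use warnings;
use MIME::Base64 qw(encode_base64);
use Digest::SHA1 qw(sha1_hex);

# hypothetical corpus-derived layouts: line widths, most common first
my @COMMON_WIDTHS = (76, 64, 60, 72);

# Given the decoded attachment and the sha1 of the original encoded
# block, return the width that reproduces it exactly, or undef -- in
# which case the caller stores a binary diff from the closest one.
# (Real code would also try CRLF line endings, not just LF.)
sub pick_layout {
    my ($data, $orig_sha1) = @_;
    for my $w (@COMMON_WIDTHS) {
        (my $enc = encode_base64($data, '')) =~ s/(.{$w})/$1\n/g;
        $enc .= "\n" unless $enc =~ /\n\z/;
        return $w if sha1_hex($enc) eq $orig_sha1;
    }
    return undef;
}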

But in general, I'd say you're optimising the wrong problem.  It's just not
worth it, the savings are minimal and the added complexity is high.  Disk
space is now cheap, and fast access via a cached copy of the email will
beat re-creating the original file from mime parts hands down.

Bron.

Re: De-duping attachments

2010-09-14 Thread Shuvam Misra
Dear Rob,

I had reservations about some of these things too. :( In particular,
I was wondering about having to remember and recreate the exact
transfer-encoding. If both of us forward the same attachment in two
emails, and one encodes in quoted-printable, the other in base64, Cyrus
had better be able to recreate them exactly or have some other
workarounds.

I wasn't aware of the mmap() usage and the direct seeking into the middle
of the message body. But the bigger problem is what you've described about
reproducing the message byte-identically. If that can be solved, then we
can make Cyrus re-create the message while loading from disk and stick it
into RAM.

Can we just brainstorm with you and others in this thread...  how do we
re-create a byte-identical attachment from a disk file?  What is the list
of attributes we will need to store per stripped attachment to allow an
exact re-creation?

  - file name/reference

  - full MIME header of the attachment block

  - separator string (this will be retained in the message body anyway)

  - transfer encoding

  - if encoding = base64 then
base64 line length

  - checksum of encoded attachment (as a sanity check in case the re-encoding
fails to recreate exactly the same image as the original)

If encoding = quoted-printable or uuencode, then don't strip the
attachment at all.

What other conditions may we need to look for to bypass attachment
stripping?

Can we just tap into all of you to get the ideas on paper, even if
it's not being implemented by anyone right now?  It'll at least help us
understand the system's internals better.

thanks a lot, and regards,
Shuvam

 cyrus likes to mmap the whole file so it can just offset into it to
 extract whichever part is requested. In IMAP, you can request any
 arbitrary byte range from the raw RFC822 message using the
 BODY[]<start.length> construct, so you have to be able to
 byte-accurately reconstruct the original email if you remove attachments.
 
 Consider the problem of transfer encoding. Say you have a base64
 encoded attachment (which basically all are). When storing and
 deduping, you'd want to base64 decode it to get the underlying
 binary data. But depending on the line length of the base64 encoded
 data, the same file can be encoded in a large number of different
 ways. When you reconstruct the base64 data, you have to be byte
 accurate in your reconstruction so your offsets are correct, and so
 any signing of the message (eg DKIM) isn't broken.
 
 Once you've solved those problems, the rest is pretty straightforward :)
 
 Rob
 

Re: De-duping attachments

2010-09-14 Thread Shuvam Misra
Dear Bron,

 http://www.newegg.com/Product/Product.aspx?Item=N82E16822148413
 
 2TB - US $109.

Don't want to nit-pick here, but the effective price we pay is about
ten times this. To set up a mail server with a few TB of disk space,
we usually land up deploying a separate chassis with RAID controllers and
a RAID array, with FC connections from servers, etc, etc.  All this adds
up to about $1,000/TB of usable space if you're using something like the
low-end IBM DS3400 box or Dell/EMC equivalent. This is even with
inexpensive 7200RPM SATA-II drives, not 15KRPM SAS drives.

http://www-07.ibm.com/storage/in/disk/ds3000/ds3400/

And most of our customers actually double this cost because they keep two
physically identical chassis for redundancy. (We recommend this too,
because we can't trust a single RAID 5 array to withstand controller or
PSU failures.) In that case, it's $2000/TB.

And you do reach 5-10 TB of email store quite rapidly --- our company
has many corporate clients (> 500 email users) whose IMAP store has
reached 4TB. No one wants to enforce disk quotas (corporate policy),
and most users don't want to delete emails on their own.

We keep hearing the logic that storage is cheap, and stories of cloud
storage through Amazon, unlimited mailboxes on Gmail, are reinforcing
the belief. But at the ground level in mid-market corporate IT budgets,
storage costs in data centres (as against inside desktops) are still
too high to be trivial, and their prices have little to do with
the prices of raw SATA-II drives. A fully-loaded DS3400 costs a little
over $12,000 in India, with a full set of 1TB SATA-II drives from IBM,
but even with high cost of IBM drives, the drives themselves contribute
less than 30% of the total cost.

If we really want to put our collective money where our mouth is, and
deliver the storage-is-cheap promise at the ground level, we need to
rearchitect every file server and IMAP server to work in map-reduce mode
and use disks inside desktops. Anyone game for this project? :)

 Now de-duping messages on copy is valuable, not so much because of
 the space it saves, but because of the IO it saves.  Copying the file
 around is expensive.
 
 De-duping components of messages and then reconstructing?  Not so much.
 You'll be causing MORE IO in general looking for the message, finding the
 parts.

I agree. My aim was not to reduce IOPS but to cut disk space usage.

There are two areas where we are seeing a huge increase in inactive
disk utilisation for emails. One is for the archive, which is being kept
for security and compliance reasons. Every company we work with wants an
archive with at least a few years' retention. They search the archive
every few weeks to trace lost emails, not for compliance reasons but to
find missing information. This means that we can't ask them to move the
data out to removable storage.

The second area is shared mail folders where all communication with each
client/topic/project are stored practically forever.

A 500-user company can easily acquire an email archive of 2-5TB. I don't
care how much the IO load of that archive server increases, but I'd like
to reduce disk space utilisation. If the customer can stick to 2TB of
space requirements, he can use a desktop with two 2TB drives in RAID
1, and get a real cheap archive server. If this figure reaches 3-4TB,
he goes into a separate RAID chassis --- the hardware cost goes up 5-10
times. These are tradeoffs a lot of small to mid-sized companies in my
market fuss about.

And in a more generic context, I am seeing that all kinds of intelligent
de-duping of infrequently-accessed data is going to become the crying
need of every mid-sized and large company.  Data is growing too fast,
and no one wants to impose user discipline or data cleaning. When we
tell the business head "This is crazy!", he turns around and tells the
CTO "But disk space is cheap! Haven't you heard of Google? What are you
cribbing about? You must be doing something really inefficient here,
wasting money!"

thanks and regards,
Shuvam

Re: De-duping attachments

2010-09-14 Thread Rob Mueller

 A 500-user company can easily acquire an email archive of 2-5TB. I don't
 care how much the IO load of that archive server increases, but I'd like
 to reduce disk space utilisation. If the customer can stick to 2TB of

It would be interesting to measure the amount of duplication that is going 
on with attachments in emails.

While we could do that with Fastmail data, I think because of the broad 
range of users, we'd be getting one data point, which might be quite 
different to a data point inside one company. Eg. An architectural firm 
might end up sending big blueprint documents back and forth between each 
other a lot, so they'd gain a lot from deduplication.

Also even within deduplication, there's some interesting ideas as well. For 
instance, if you know the same file is being sent back and forth a lot with 
minor changes, you might want to store the most recent version, and store 
binary diffs between the most recent and old versions (eg xdelta). Yes 
accessing the older versions would be much slower (have to get most recent + 
apply N deltas), but the space savings could be huge.

 Can we just brainstorm with you and others in this thread...  how do we
 re-create a byte-identical attachment from a disk file?

One overall implementation issue. With the message file, do you:

1. Completely rewrite the message file, removing the attachments and adding 
any extra meta data you want in its place
2. Leave the message file exactly the same size, just don't write out the 
attachment content, and assume your filesystem supports sparse files 
(http://en.wikipedia.org/wiki/Sparse_file)

The advantage of 2 is that it leaves the message file size correct, and all 
the offsets in the file are still correct. The downsides are that you must 
ensure your FS supports sparse files well, and there's the question of where 
do you actually store the information that links to the external file?
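
For option 2 the mechanics are just a seek past the attachment instead 
of a write. A toy sketch, assuming a filesystem that materialises holes 
(the function name and offsets are made up):

use strict;
use warnings;

# Copy a message file, replacing the byte range of the attachment
# body with a hole.  File size and all later offsets are unchanged;
# reads of the hole return zeros.
sub copy_with_hole {
    my ($in, $out, $att_start, $att_len) = @_;
    open my $src, '<', $in  or die "open $in: $!";
    open my $dst, '>', $out or die "open $out: $!";
    binmode $_ for $src, $dst;

    read($src, my $head, $att_start) == $att_start or die "short read";
    print {$dst} $head;

    seek($src, $att_len, 1) or die "seek: $!";      # skip the attachment
    my $tail = do { local $/; <$src> };
    if (defined $tail && length $tail) {
        seek($dst, $att_len, 1) or die "seek: $!";  # hole appears here
        print {$dst} $tail;
    }
    else {
        # attachment ran to EOF: extend with a trailing hole
        truncate($dst, $att_start + $att_len) or die "truncate: $!";
    }
    close $dst or die "close: $!";
}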

  - file name/reference
  - full MIME header of the attachment block

I'd leave these intact in the actual message, and just add an extra 
X-Detached-File header or something like that which includes some external 
reference to the file. Hmmm, that'll break signing though. Not so easy...

  - separator string (this will be retained in the message body anyway)
  - transfer encoding
  - if encoding = base64 then
base64 line length

Remember, every line can actually be a different length! In most cases they 
will be the same length, but you can't assume it. And you do see messages 
that have lines in repeating groups like 76, 76, 76, 76, 74, 76, 76, 76, 76, 
74, ... repeat ..., or cases like that; a pain to deal with.
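
If no single width fits, you can always record the whole length 
sequence, run-length encoded; even the repeating-groups case above 
costs only a few bytes of metadata. A sketch:

use strict;
use warnings;

# run-length encode the base64 line lengths of an encoded block
sub rle_line_lengths {
    my ($encoded) = @_;
    my @rle;
    for my $len (map { length } split /\r?\n/, $encoded) {
        if (@rle && $rle[-1][0] == $len) { $rle[-1][1]++ }
        else                             { push @rle, [$len, 1] }
    }
    return join ',', map { "$_->[0]x$_->[1]" } @rle;
}

# 76,76,76,76,74,76,76,76,76,74 lines => "76x4,74x1,76x4,74x1"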

  - checksum of encoded attachment (as a sanity check in case the re-encoding
 fails to recreate exactly the same image as the original)

This is seeming a bit more tricky...

Rob


Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/