Sounds like the only real extra storage for 100 messages that differ only in their To: headers is the cost of storing 100 copies of the full headers instead of one. That is good news to me.
Should I change the Message-ID: header on each copy? How does DBMail use the Message-ID: header?

Thanks for the detailed explanation of how de-duplication works…

Kevin

On Apr 16, 2014, at 4:44 PM, Paul J Stevens <[email protected]> wrote:

> On 13-04-14 22:19, KT Walrus wrote:
>> DBMail already does a lot of data deduplication (headers, attachments,
>> etc.). I’m just not clear how far this goes and whether my turning a
>> message to a list of recipients into multiple copies of the message with
>> different To: and possibly different Message-Id: affects the data
>> de-duplication.
>>
>> If I should keep the headers the same for all copies of the message to get
>> maximum data deduplication, I will. I just prefer each recipient see the
>> To: as to only their address and not know about everyone else.
>>
>> As for my “app”, it is a PHP app that uses the RoundCube Framework to
>> provide an IMAP interface to the user for accessing their mailbox and some
>> public mailboxes. The user sends messages using SMTP and I have a milter to
>> send the message to a special outbox mailbox (in DBMail). Then, I have a
>> PHP cron job that checks the outbox, retrieves the queued messages,
>> preprocesses the message headers, and uses dbmail-deliver to send the
>> message to the appropriate recipients.
>>
>> I have all this working quite nicely. But, I’m trying to figure out the
>> best way to send a To: customized copy of each message to each recipient.
>>
>> I need to understand how DBMail does data deduplication.
>>
>
> De-duplication is performed at two levels:
>
> messages are split by 'mime-parts'. The whole rfc2822 header is the
> first part. If the body is a text/plain the whole body is a single,
> second and last part. If the body is multipart/* or message/rfc822 the
> process is restarted for the contained message or for each of the parts
> that constitute the multipart. This is done recursively, limited at a
> high recursion depth of 64. Or rather message de-construction is
> unlimited, but re-construction is capped.
>
> each mime-part is stored de-duplicated in what is called single-instance
> storage; keyed with a hash for faster retrieval.
>
> apart from the messages as a whole, the message-headers are also stored
> separately in two tables where both the header-name (to, from, subject)
> is stored separately from their content, the header-values. Both are
> stored as unique values which are linked to each other, and to the
> message instance where they occurred.
>
> So if you receive a 10MB message to one hundred users, where the
> messages are identical, it is fully de-duplicated and only results in a
> set of rows in the messages table - and under some circumstances the
> physmessage table.
>
> If only the To header is different, the whole rfc822 header is stored in
> its own row in the mimeparts, but the full body is *not* duplicated.
> Whether one header, or multiple headers differ between messages is not
> an issue. Any difference will lead to a separate row for the headers. Of
> course, the header-names and header-values are still stored de-duplicated.
>
> Hope that explains it a bit.
>
> --
> ________________________________________________________________
> Paul J Stevens pjstevns @ gmail, twitter, github, linkedin
> www.nfg.nl/[email protected]/+31.85.877.99.97
> _______________________________________________
> DBmail mailing list
> [email protected]
> http://mailman.fastxs.nl/cgi-bin/mailman/listinfo/dbmail
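
To check my own understanding of the explanation quoted above, here is a rough Python sketch of single-instance storage keyed by a hash. It is only a toy model, not DBMail's actual code or schema: the class SingleInstanceStore, the helper functions, the use of SHA-256, and the example.com addresses are all my own assumptions, and it only handles a plain header plus a text/plain body (no multipart recursion).

```python
import hashlib


def split_parts(raw: str) -> list[str]:
    """Split a simple, non-multipart message into [header part, body part]."""
    header, _, body = raw.partition("\n\n")
    return [header, body]


class SingleInstanceStore:
    """Toy model: every distinct part is stored once, keyed by its hash."""

    def __init__(self) -> None:
        self.parts: dict[str, str] = {}       # hash -> part content (one copy)
        self.messages: list[list[str]] = []   # per message: list of part hashes

    def deliver(self, raw: str) -> None:
        keys = []
        for part in split_parts(raw):
            # SHA-256 is an assumption; the quoted text only says "a hash".
            key = hashlib.sha256(part.encode("utf-8")).hexdigest()
            self.parts.setdefault(key, part)  # store only if not seen before
            keys.append(key)
        self.messages.append(keys)

    def stored_bytes(self) -> int:
        return sum(len(p.encode("utf-8")) for p in self.parts.values())


def make_message(to: str, body: str) -> str:
    # Hypothetical sender and subject, for illustration only.
    return f"From: sender@example.com\nTo: {to}\nSubject: hello\n\n{body}"


body = "x" * 10_000

# Case 1: one hundred copies that differ only in the To: header.
store = SingleInstanceStore()
for i in range(100):
    store.deliver(make_message(f"user{i}@example.com", body))

# Case 2: one hundred fully identical copies.
identical = SingleInstanceStore()
for _ in range(100):
    identical.deliver(make_message("list@example.com", body))

print(len(store.messages), len(store.parts))          # 100 messages, 101 parts
print(len(identical.messages), len(identical.parts))  # 100 messages, 2 parts
```

In this model the per-recipient To: costs one extra header part per message, while the body is stored a single time, which matches what I took from the explanation.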
