Sounds like the only real extra storage for 100 messages that differ only in 
their To: headers is the cost of storing 100 extra copies of the full 
headers.  That is good news to me.

Should I change the Message-ID: header on each copy?

How does DBMail use the Message-ID: header?

Thanks for the detailed explanation of how de-duplication works…

Kevin

On Apr 16, 2014, at 4:44 PM, Paul J Stevens <[email protected]> wrote:

> On 13-04-14 22:19, KT Walrus wrote:
>> DBMail already does a lot of data deduplication (headers, attachments, 
>> etc.).  I’m just not clear how far this goes and whether my turning a 
>> message to a list of recipients into multiple copies of the message with 
>> different To: and possibly different Message-Id: affects the data 
>> de-duplication.
>> 
>> If I should keep the headers the same for all copies of the message to get 
>> maximum data deduplication, I will.  I just prefer each recipient see the 
>> To: as to only their address and not know about everyone else.
>> 
>> As for my “app”, it is a PHP app that uses the RoundCube Framework to 
>> provide an IMAP interface to the user for accessing their mailbox and some 
>> public mailboxes.  The user sends messages using SMTP and I have a milter to 
>> send the message to a special outbox mailbox (in DBMail).  Then, I have a 
>> PHP cron job that checks the outbox, retrieves the queued messages, 
>> preprocesses the message headers, and uses dbmail-deliver to send the 
>> message to the appropriate recipients.  
>> 
>> I have all this working quite nicely.  But, I’m trying to figure out the 
>> best way to send a To: customized copy of each message to each recipient.
>> 
>> I need to understand how DBMail does data deduplication.
>> 
> 
> De-duplication is performed at two levels:
> 
> messages are split by 'mime-parts'. The whole rfc2822 header is the
> first part. If the body is text/plain, the whole body is a single
> second and last part. If the body is multipart/* or message/rfc822, the
> process is restarted for the contained message or for each of the parts
> that constitute the multipart. This is done recursively, with a cap at
> a high recursion depth of 64. Or rather, message de-construction is
> unlimited, but re-construction is capped.
> 
> each mime-part is stored de-duplicated in what is called single-instance
> storage; keyed with a hash for faster retrieval.
> 
> apart from the messages as a whole, the message headers are also stored
> separately in two tables: the header names (to, from, subject) in one
> and the header values in the other. Both are stored as unique values
> which are linked to each other, and to the message instance where they
> occurred.
> 
> So if you receive a 10MB message to one hundred users, where the
> messages are identical, it is fully de-duplicated and only results in a
> set of rows in the messages table - and under some circumstances the
> physmessage table.
> 
> If only the To header is different, the whole rfc822 header is stored in
> its own row in the mimeparts, but the full body is *not* duplicated.
> Whether one header, or multiple headers differ between messages is not
> an issue. Any difference will lead to a separate row for the headers. Of
> course, the header-names and header-values are still stored de-duplicated.
> 
> Hope that explains it a bit.
> 
> -- 
> ________________________________________________________________
> Paul J Stevens       pjstevns @ gmail, twitter, github, linkedin
>           www.nfg.nl/[email protected]/+31.85.877.99.97
> _______________________________________________
> DBmail mailing list
> [email protected]
> http://mailman.fastxs.nl/cgi-bin/mailman/listinfo/dbmail
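
To make the storage arithmetic in the explanation above concrete, here is a rough Python sketch of hash-keyed single-instance storage. It is not DBMail's code: the SingleInstanceStore class, the split_parts helper, the in-memory dicts, and the choice of SHA-256 are all assumptions made for illustration. It only mirrors the described behaviour, where the header block is one part, each body part is another, and a part is stored once per distinct hash.

```python
import hashlib
from email import message_from_string


def split_parts(msg):
    """Yield the header block first, then each leaf body, recursing into
    multipart/* and message/rfc822 containers (hypothetical helper)."""
    yield "\n".join(f"{name}: {value}" for name, value in msg.items())
    if msg.is_multipart():
        for sub in msg.get_payload():
            yield from split_parts(sub)
    else:
        yield msg.get_payload()


class SingleInstanceStore:
    """Toy stand-in for a mimeparts table keyed by content hash."""

    def __init__(self):
        self.parts = {}     # hash -> part content, stored once
        self.messages = []  # one list of part hashes per delivered message

    def deliver(self, raw):
        keys = []
        for part in split_parts(message_from_string(raw)):
            # SHA-256 is an assumption; any stable content hash would do.
            key = hashlib.sha256(part.encode("utf-8")).hexdigest()
            self.parts.setdefault(key, part)
            keys.append(key)
        self.messages.append(keys)


BODY = "x" * 1000  # stands in for a large body


def make_copy(rcpt):
    return f"From: sender@example.com\nTo: {rcpt}\nSubject: hello\n\n{BODY}"


# 100 copies that differ only in To:
custom = SingleInstanceStore()
for i in range(100):
    custom.deliver(make_copy(f"user{i}@example.com"))

# 100 fully identical copies
identical = SingleInstanceStore()
for _ in range(100):
    identical.deliver(make_copy("list@example.com"))

print(len(custom.parts), len(identical.parts))
```

In this sketch the customized copies end up with one stored header block per recipient plus a single shared body, while the identical copies share both the header block and the body.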
