Support (optional) CRC or long hashkey generation for bodies 
-------------------------------------------------------------

         Key: JBMAIL-36
         URL: http://jira.jboss.com/jira/browse/JBMAIL-36
     Project: JBoss Mail
        Type: Sub-task
    Versions: 1.0-M4    
    Reporter: Andrew Oliver
 Assigned to: Andrew Oliver 
    Priority: Critical
     Fix For: 1.0-M4


The M3 Message Store prevents bodies from being stored multiple times and 
allows messages to stream directly to the DB.  For large messages, a 
line-by-line hash should be calculable as the body streams in.  If it matches 
the hash of an existing body, the Mailbox entry is reassigned to the existing 
mailstore entry and the new body is deleted (this optimizes for disk size but 
costs performance).
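
A minimal sketch of the line-by-line hash, under stated assumptions: the class 
and method names (BodyHasher, hashBody) are hypothetical, SHA-256 is swapped in 
as the "long hashkey" (the issue does not pick an algorithm), and each line is 
terminated with a single '\n' before hashing so that CRLF and LF copies of the 
same body agree.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of JBoss Mail.
final class BodyHasher {

    /** Reads the body line by line and feeds each line into a running digest. */
    static String hashBody(InputStream body) {
        try {
            // Assumption: SHA-256 stands in for the unspecified "long hashkey".
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // ISO-8859-1 maps every byte to one char, so no body bytes are lost.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(body, StandardCharsets.ISO_8859_1));
            String line;
            while ((line = reader.readLine()) != null) {
                // A real store would also write the line to the DB at this point.
                digest.update(line.getBytes(StandardCharsets.ISO_8859_1));
                digest.update((byte) '\n'); // one fixed terminator per line
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Because the digest is updated as each line arrives, the 64 MB body never has to 
be held in memory just to compute the checksum.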

Example:

1. Assume that the following 64 MB stream (minus headers) comes in twice, once 
for each of two mails (meaning we're sending the same file):

body line                          CRC/checksum/whatever

XXXXXXXXX...XXXXXXXXXXXXXXXXXXX    123456
YYYYYYYYY...YYYYYYYYYYYYYYYYYYY    654321
ZZZZZZZZZ...ZZZZZZZZZZZZZZZZZZZ    321654
...............................    ......
XXXXXXXXX...XXXXXXXXXXXXXXXXXXX    123456
YYYYYYYYY...YYYYYYYYYYYYYYYYYYY    654321
ZZZZZZZZZ...ZZZZZZZZZZZZZZZZZZZ    321654

Cumulative checksum, with a collision chance of no more than 1 in 50,000,000:

12341235125132412512

If a "select body_id from bodies where checksum='12341235125132412512'" returns 
more than one result, then the new body is deleted and the mailbox entry is 
assigned to the older of the two.
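
The decision above can be sketched in memory, with a map standing in for the 
bodies table.  Everything here is an assumption for illustration: the BodyStore 
class, its store method, and the idea that body ids grow with age are 
hypothetical, and a real implementation would run the select against the DB.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the bodies table.
final class BodyStore {
    // body_id -> checksum, modelling the rows of the bodies table
    private final Map<Long, String> bodies = new HashMap<>();
    // checksum -> id of the oldest body stored with that checksum
    private final Map<String, Long> oldestByChecksum = new HashMap<>();
    private long nextId = 1;

    /** Stores a body and returns the id the mailbox entry should point at. */
    long store(String checksum) {
        long newId = nextId++;
        bodies.put(newId, checksum); // the body has already been streamed in

        Long olderId = oldestByChecksum.putIfAbsent(checksum, newId);
        if (olderId != null) {
            // More than one body has this checksum: drop the new one
            // and assign the mailbox entry to the older body.
            bodies.remove(newId);
            return olderId;
        }
        return newId;
    }

    int bodyCount() {
        return bodies.size();
    }
}
```

Storing the same checksum twice returns the first id both times and leaves a 
single body behind, which is the disk-size saving this issue is after.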

So the idea above is what matters; the algorithmic and method suggestions do 
not (I don't know my posterior from my elbow when it comes to efficient binary 
similarity detection -- I'm just pretty sure that it's not to be done by direct 
matching on content!).  

It is important that minor revisions not cause collisions.  The 1/50,000,000 
target for collisions should not be taken to mean that if you send me a doc and 
I edit it and send it back, it is okay for my edits to be dropped.  It means 
that, for this to be a viable algorithm, if I upload the text of a speech and 
you upload a completely different speech, there is at most a 1/50,000,000 
chance that the two very different documents get the same checksum, and a minor 
revision to either should fix it.
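
As a rough check on the numbers, here is a sketch using java.util.zip.CRC32 as 
one candidate checksum (the issue leaves the algorithm open, so CRC-32 is only 
an assumption, and the CollisionCheck class is hypothetical).  A checksum needs 
at least 26 bits for its value space to exceed 50,000,000, so a 32-bit CRC 
clears the target for a single pair of unrelated documents; with many stored 
bodies the chance that some pair collides is higher, which is an argument for 
a longer hashkey.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper for reasoning about the collision target.
final class CollisionCheck {

    /** CRC-32 of the text's UTF-8 bytes, as an unsigned 32-bit value. */
    static long crc32(String text) {
        CRC32 crc = new CRC32();
        crc.update(text.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Smallest checksum width in bits with at least `target` distinct values. */
    static int minBits(long target) {
        int bits = 0;
        while ((1L << bits) < target) {
            bits++;
        }
        return bits;
    }
}
```

For example, CollisionCheck.crc32("speech, draft 1") and 
CollisionCheck.crc32("speech, draft 2") differ: CRC-32 always detects a change 
confined to a single byte, so this kind of minor revision cannot collide with 
its original.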

It is also important that proper boundaries be created (no chance that one time 
we include fuzz surrounding the body and another time we don't).
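
One way to pin the boundary down, sketched under assumptions: take the body to 
be everything after the first empty line (the CRLF CRLF that separates headers 
from body in an Internet message) and always checksum exactly that.  The 
BodyBoundary class and bodyOf method are hypothetical, and what else counts as 
"fuzz" is not specified in this issue.

```java
// Hypothetical helper: fixes where the checksummed region starts.
final class BodyBoundary {
    // Headers and body are separated by an empty line.
    private static final String SEPARATOR = "\r\n\r\n";

    /** Returns everything after the first empty line, or "" if there is none. */
    static String bodyOf(String rawMessage) {
        int split = rawMessage.indexOf(SEPARATOR);
        return split < 0 ? "" : rawMessage.substring(split + SEPARATOR.length());
    }
}
```

Two mails with different headers but the same attachment then yield the same 
body string, and therefore the same checksum.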


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://jira.jboss.com/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira


