There has already been quite a bit of discussion on this topic, and I know the purpose of the comparison is to guard against collisions, but has any thought been given to a configuration option that skips the blob comparison and assumes the blobs are the same? Perhaps a size threshold that has to be exceeded before the check is skipped. My concern is a large attachment, say 100-200 MB, which has to be read from the db server into memory for the comparison. That would be very taxing on the network and memory of the whole system. For small shops with the db on the same box it's not that big of an issue, but for larger shops with a lot of email traffic it could become one. A value of "0" should mean check every part, while a value of "16777216" would mean check all parts smaller than 16 MB. As file sizes increase, the likelihood of a size and hash collision is going to decrease, especially since larger attachments are rare compared to the 1.5 MB jpeg.
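
To make the proposed semantics concrete, here is a minimal sketch in Python. The function name and parameters are made up for illustration; nothing like this exists in DBMail as far as I know.

```python
def should_compare_blobs(part_size: int, threshold: int) -> bool:
    """Decide whether a full blob comparison should run for a mime-part.

    Hypothetical helper: `threshold` stands for the proposed config value.
    0 means always compare; otherwise only parts smaller than the
    threshold (in bytes) are compared.
    """
    if threshold == 0:
        return True
    return part_size < threshold
```

With a threshold of 16777216, a 1.5 MB jpeg would still be compared, while a 200 MB attachment would be assumed identical once its hash matches.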

-Jon


Paul J Stevens wrote:
Matija Grabnar wrote:
I re-iterate: regardless of which digest algorithm is chosen, the code
MUST be able to
detect and correctly handle collisions. Collisions WILL occur,
regardless of the algorithm
chosen. It is a mathematically provable fact.

For those of you who have been following this discussion: I've done this
thing.

- we now use the cryptographic hash only to quickly locate possibly
duplicate mime-parts. If the hash doesn't occur yet, a new mimepart is
stored using the hash, but generating an auto-increment bigint as its
primary key. If the hash does occur, the insertion code compares the
blobs to make sure no hash collision occurs on different blobs.

- I've added support for a whole dumpload of hashes: we now support md5,
sha1, sha256, sha512, tiger and whirlpool. Since I'm relying on mhash
for this, it would be trivial to add other hashes like gost, but I'm
currently restricting things to the ones documented on the NESSIE (EU)
pages. Looking back, adding all these was probably not really necessary
for single-instance storage, but libmhash is rock-solid and widely
available, and I have a hunch they might come in handy along the road.
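
The two points above can be sketched roughly as follows. This is an in-memory Python illustration, not the actual C code: MimePartStore, insert and the digest parameter are hypothetical names, hashlib stands in for mhash, and storing a colliding blob as a separate row under the same hash is an assumption about how a detected collision gets handled.

```python
import hashlib
from itertools import count


class MimePartStore:
    """Toy in-memory single-instance store (hypothetical, not DBMail code)."""

    def __init__(self, algorithm="sha256", digest=None):
        # hashlib stands in for mhash here; tiger and whirlpool may not be
        # available in a given Python build, so sha256 is the default.
        self._digest = digest or (
            lambda blob: hashlib.new(algorithm, blob).hexdigest()
        )
        self._next_id = count(1)  # stands in for the auto-increment bigint key
        self._parts = {}          # id -> blob
        self._ids_by_hash = {}    # hash -> ids of parts stored under that hash

    def insert(self, blob):
        """Return the id of the stored part, reusing an identical blob."""
        key = self._digest(blob)
        # The hash only narrows the search; the blobs decide equality.
        for part_id in self._ids_by_hash.get(key, []):
            if self._parts[part_id] == blob:
                return part_id
        # Unknown hash, or same hash with different content (a collision):
        # store a new part under a fresh id.
        part_id = next(self._next_id)
        self._parts[part_id] = blob
        self._ids_by_hash.setdefault(key, []).append(part_id)
        return part_id
```

Because the primary key is independent of the hash, two different blobs that happen to share a hash can coexist in this sketch, and the choice of digest algorithm only affects how often the comparison loop has more than one candidate.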





_______________________________________________
DBmail mailing list
DBmail@dbmail.org
https://mailman.fastxs.nl/mailman/listinfo/dbmail
