On Sun, 2011-01-23 at 14:08 +0100, Dirk Meyer wrote: > Add file hash based on calculations from > http://trac.opensubtitles.org/projects/opensubtitles/wiki/HashSourceCodes
Based on the description in the URL, this code is wrong: > + filehash = 0 > + if filesize >= 65536 * 2: > + filehash = filesize > + [...] That site says: In natural language it calculates: size + 64bit chksum of the first and last 64k (even if they overlap because the file is smaller than 128k) Your implementation computes a 0 hash if the file size is less than 128k. But then the Python implementation at that page (which makes me bleed from the eyeballs) returns "SizeError" (!!!) in this case too. Skimming through the various implementations on that page makes it pretty obvious that many of these are incompatible with each other when it comes to the corner cases. The first C implementation at least comes close to the description, ambiguous as it is. But this implementation fails to do the second iteration on files less than 64k, because fseek() fails because MAX(0, fsize - 65536) returns a large value when fsize < 65536 because fsize is an unsigned value. Sigh. The second C implementation is incompatible with the first for such files, and in fact could generate random hashes because it fails to memset the buffer. The C++ implementation has the same failing fseek() problem as the first C implementation. This one even goes out of its way to cast fsize to a uint64_t, which is _already is_. In fact, casting to int64_t is what it should be doing. But then MAX() accepts and returns int values, rather than int64_t, so even if the broken cast was fixed, on 32-bit systems this will be incorrect for files larger than 4G. So I find myself more than a little irritated having read this page, trying to understand how to implement this. I mean, I realize they're not writing software to launch shuttles into space here, but bloody hell, a _little_ more rigor would be nice for people like us who are trying to make a compatible implementation. It's like reading the user contributions on php.net, where most of them take the Darwinian approach to coding, randomly mutating their code until it appears to be correct for the obvious cases, and then proudly submitting their code as recipes for others to copy. Ok, I'm done ranting. I feel better. I've modified kaa.metadata's implementation to handle files less than 128k so the first and second hashes will overlap, and for files less than 64k, it will read as many full 64-bit values until it reaches EOF. This makes it work like the first C implementation if the unsigned fsize problem were fixed. Jason. ------------------------------------------------------------------------------ Special Offer-- Download ArcSight Logger for FREE (a $49 USD value)! Finally, a world-class log management solution at an even better price-free! Download using promo code Free_Logger_4_Dev2Dev. Offer expires February 28th, so secure your free ArcSight Logger TODAY! http://p.sf.net/sfu/arcsight-sfd2d _______________________________________________ Freevo-devel mailing list Freevo-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/freevo-devel