On Sun, 2011-01-23 at 14:08 +0100, Dirk Meyer wrote:
> Add file hash based on calculations from
> http://trac.opensubtitles.org/projects/opensubtitles/wiki/HashSourceCodes

Based on the description in the URL, this code is wrong:

> +            filehash = 0
> +            if filesize >= 65536 * 2:
> +                filehash = filesize
> +                [...]

That site says:

        In natural language it calculates: size + 64bit chksum of the
        first and last 64k (even if they overlap because the file is
        smaller than 128k)

Your implementation computes a 0 hash if the file size is less than
128k.  But then the Python implementation at that page (which makes me
bleed from the eyeballs) returns "SizeError" (!!!) in this case too.

Skimming through the various implementations on that page makes it
pretty obvious that many of these are incompatible with each other when
it comes to the corner cases.

The first C implementation at least comes close to the description,
ambiguous as it is.  But this implementation fails to do the second
iteration on files less than 64k, because fseek() fails because MAX(0,
fsize - 65536) returns a large value when fsize < 65536 because fsize is
an unsigned value.  Sigh.

The second C implementation is incompatible with the first for such
files, and in fact could generate random hashes because it fails to
memset the buffer.

The C++ implementation has the same failing fseek() problem as the first
C implementation.  This one even goes out of its way to cast fsize to a
uint64_t, which is _already is_.  In fact, casting to int64_t is what it
should be doing.  But then MAX() accepts and returns int values, rather
than int64_t, so even if the broken cast was fixed, on 32-bit systems
this will be incorrect for files larger than 4G.

So I find myself more than a little irritated having read this page,
trying to understand how to implement this.  I mean, I realize they're
not writing software to launch shuttles into space here, but bloody
hell, a _little_ more rigor would be nice for people like us who are
trying to make a compatible implementation.

It's like reading the user contributions on php.net, where most of them
take the Darwinian approach to coding, randomly mutating their code
until it appears to be correct for the obvious cases, and then proudly
submitting their code as recipes for others to copy.

Ok, I'm done ranting.  I feel better.  I've modified kaa.metadata's
implementation to handle files less than 128k so the first and second
hashes will overlap, and for files less than 64k, it will read as many
full 64-bit values until it reaches EOF.  This makes it work like the
first C implementation if the unsigned fsize problem were fixed.

Jason.


------------------------------------------------------------------------------
Special Offer-- Download ArcSight Logger for FREE (a $49 USD value)!
Finally, a world-class log management solution at an even better price-free!
Download using promo code Free_Logger_4_Dev2Dev. Offer expires 
February 28th, so secure your free ArcSight Logger TODAY! 
http://p.sf.net/sfu/arcsight-sfd2d
_______________________________________________
Freevo-devel mailing list
Freevo-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/freevo-devel

Reply via email to