On Wed, 2008-08-06 at 12:31 -0700, LaundroMat wrote:
> Hi -
>
> I'm trying to calculate unique hash values for binary files,
> independent of their location and filename, and I was wondering
> whether I'm going in the right direction.
>
> Basically, the hash values are calculated thusly:
>
> f = open('binaryfile.bin')
> import hashlib
> h = hashlib.sha1()
> h.update(f.read())
> hash = h.hexdigest()
> f.close()
>
> A quick try-out shows that effectively, after renaming a file, its
> hash remains the same as it was before.
>
> I have my doubts however as to the usefulness of this. As f.read()
> does not seem to read until the end of the file (for a 3.3MB file only
> a string of 639 bytes is being returned, perhaps a 00-byte counts as
> EOF?), is there a high danger for collusion?
>
> Are there better ways of calculating hash values of binary files?
>
> Thanks in advance,
>
> Mathieu
> --
> http://mail.python.org/mailman/listinfo/python-list

Looks like you're doing the right thing from here. file.read() with no size parameter returns the whole file (for completeness, I'll mention that the documentation warns this is not guaranteed if the file is in non-blocking mode, which is not your case). Python never treats null bytes as special in strings, so no, a 00-byte is not giving you an early EOF.

I wouldn't worry about your hashing code; that looks fine. If I were you, I'd try to figure out what's going wrong with your file handles. I suspect that wherever you saw the short read, you either did not open the file in the correct mode (binary files should be opened with 'rb', which matters on Windows, where a text-mode read can stop early) or did not rewind the file (with file.seek(0)) after having previously read data from it.

Apart from that, you'll be fine using the code above; I can't see any other problems with it.

--
John Krukoff <[EMAIL PROTECTED]>
Land Title Guarantee Company
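
For reference, here is a minimal sketch of the same hashing approach with the file opened explicitly in binary mode and read in fixed-size chunks, so a large file never has to fit in memory at once. The function name sha1_of_file and the chunk_size parameter are made up for illustration and do not come from the original post.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Return the SHA-1 hex digest of the file at `path`.

    Hypothetical helper: the name and the chunk_size default are
    illustrative assumptions, not part of the original code.
    """
    h = hashlib.sha1()
    # 'rb' opens the file in binary mode: no newline translation,
    # and no byte value is treated as an end-of-file marker.
    with open(path, 'rb') as f:
        # f.read(chunk_size) returns b'' at end of file, which stops iter().
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()
```

Called as sha1_of_file('binaryfile.bin'), this should give the same digest as hashing the result of a single f.read() on a file opened with 'rb', and the digest depends only on the file's contents, not on its name or location.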