On Wed, Jan 13, 2010 at 15:58, Dermot <paik...@googlemail.com> wrote:
> 2010/1/13 Avi Greenbury <avismailinglistacco...@googlemail.com>:
>
>> You might've missed his point.
>>
>> If two files are of different sizes, they cannot be identical. Getting
>> the size of a file is substantially cheaper than hashing it.
>>
>> So you check all your filesizes, and need only hash those pairs or
>> groups that are all the same size.
>
> Sorry guess I didn't make myself clear. I need to store the SHA in an
> SQLite file.

I think you're putting the cart before the horse. Did someone come up to
you and say, "Dermot, put the SHA value in a database."?

I would have thought that what you *need* is to detect duplicate files
(for example, to avoid processing "the same" file twice). Storing the SHA
in an SQLite file is a method you would *like* to use to accomplish this,
but it may be neither the only way nor the best way.

Along those lines, you may wish to store the filesize in bytes in your
database as well, as a first point of comparison; if the filesize is
unique, then the file must also be unique and you could save yourself the
time spent calculating a digest of the file's contents -- no 1058-byte
file can be the same as any 1927-byte file.

> Incident I get poor results from the MD5 compared with SHA so I can't
> relie on MD5 for
>
> MD5 (md5_base64) results:
> mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32
> MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
> duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
> MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
> PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32
> mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32
> PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32
>
> SHA (b64digest) results:
> mr_485_htu_AST.pdf PqsBpkKgGxdEHvkoNyou1NV5kuY 27
> MR_2891.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27
> duplicate.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27
> MR_2898.pdf ULBRZQB00qZIfIWD7oqdpfVpFtw 27
> PR_A02.pdf 6LdF6sWZnyLdWj44inFI6MSaUY4 27
> mr_485_htu_hrt.pdf 0VNwG7IiaIneEX3jh3SBUBaXMK0 27
> PR_A01.pdf JS33nJhzTo9YTqRWe01xnOb6bEM 27

That's... odd. MD5's guarantee of "same if the hashes match" isn't as
strong as SHA's, but I still wouldn't expect two files to have the same
MD5 sum if their SHA sums don't match.

However, those MD5 sums don't look like base-64 to me: 32 characters drawn
from 0-9 and a-f is what a hex digest looks like, whereas a base-64 MD5
digest would be 22 characters. So maybe you're doing something wrong
somewhere.

Cheers,
Philip
-- 
Philip Newton <philip.new...@gmail.com>
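
P.S. A rough sketch of the size-first idea, written in Python purely for illustration rather than in Perl; it is not a drop-in for your script. The function names (sha1_of, find_duplicates, encoded_lengths) are made up for this example. The first part only hashes files that share a size with at least one other file. The second part checks how long each digest encoding should be, which is why 32-character "base-64" MD5 sums look suspicious. The padding is stripped on the assumption that your base-64 digests are unpadded, as the 27-character SHA values above suggest.

```python
import base64
import hashlib
import os
from collections import defaultdict


def sha1_of(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths):
    """Return sorted groups of paths whose contents hash identically.

    Files are bucketed by size first; a file with a unique size
    cannot have a duplicate, so it is never hashed.
    """
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size: skip the expensive digest
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[sha1_of(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return sorted(groups)


def encoded_lengths(data=b"example"):
    """Return the string length of each digest encoding for some input."""
    md5 = hashlib.md5(data)
    sha1 = hashlib.sha1(data)

    def unpadded_b64(raw):
        # Strip the trailing '=' padding that Python's base64 adds.
        return base64.b64encode(raw).rstrip(b"=")

    return {
        "md5_hex": len(md5.hexdigest()),
        "md5_base64": len(unpadded_b64(md5.digest())),
        "sha1_base64": len(unpadded_b64(sha1.digest())),
    }
```

If you do keep a database, the size and the digest could both be columns, with the digest filled in only for rows whose size is shared with another row. That is a design suggestion, not something I have tried against your data.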