2010/1/13 Avi Greenbury <avismailinglistacco...@googlemail.com>:
> You might've missed his point.
>
> If two files are of different sizes, they cannot be identical. Getting
> the size of a file is substantially cheaper than hashing it.
>
> So you check all your filesizes, and need only hash those pairs or
> groups that are all the same size.
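For reference, the size-first idea quoted above could be sketched roughly as follows. This is a Python illustration rather than the Perl Digest modules the thread appears to use, and the helper names `sha1_of_file` and `find_duplicates` are made up for the example. It assumes SHA-1 is an acceptable hash and that all the files are readable.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file's contents, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Return sorted lists of paths whose contents hash identically."""
    # Cheap pass: bucket files by size, since different sizes cannot match.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        # Expensive pass: hash only the files that share a size.
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[sha1_of_file(path)].append(path)
        duplicates.extend(
            sorted(group) for group in by_hash.values() if len(group) > 1
        )
    return duplicates
```

Files with a unique size are never hashed at all, which is where the saving comes from.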
Sorry, I guess I didn't make myself clear. I need to store the SHA in an SQLite file. I have a few files to handle now, but I will get a constant dribble from now on. I want to make sure that a file I process in the future hasn't already been added to the database.

Incidentally, I get poor results from MD5 compared with SHA, so I can't rely on MD5 for this.

MD5 (md5_base64) results:

mr_485_htu_AST.pdf  116caa6cc1705db23a36feb11c8c4113  32
MR_2891.pdf         01f73c142dae9f9f403bbab543b6aa6f  32
duplicate.pdf       01f73c142dae9f9f403bbab543b6aa6f  32
MR_2898.pdf         01f73c142dae9f9f403bbab543b6aa6f  32
PR_A02.pdf          5552e6587357f9967dc0bc83153cca63  32
mr_485_htu_hrt.pdf  116caa6cc1705db23a36feb11c8c4113  32
PR_A01.pdf          5552e6587357f9967dc0bc83153cca63  32

SHA (b64digest) results:

mr_485_htu_AST.pdf  PqsBpkKgGxdEHvkoNyou1NV5kuY  27
MR_2891.pdf         bQhWA445KFzXy6ldF/DSoG2xTEY  27
duplicate.pdf       bQhWA445KFzXy6ldF/DSoG2xTEY  27
MR_2898.pdf         ULBRZQB00qZIfIWD7oqdpfVpFtw  27
PR_A02.pdf          6LdF6sWZnyLdWj44inFI6MSaUY4  27
mr_485_htu_hrt.pdf  0VNwG7IiaIneEX3jh3SBUBaXMK0  27
PR_A01.pdf          JS33nJhzTo9YTqRWe01xnOb6bEM  27

> Thirdly, be aware of what hashing guarantees. It does *not* guarantee
> uniqueness, it just gives you a very low chance that two files with
> the same hash are different. It does guarantee that files with
> different hashes are different, though.
>

I think that's the best I can hope for. If that 'duplicate.pdf' turned up again, at least I'd be able to correctly identify it. That's the goal.

I will give fdupes a look too.

Thanks all.
Dp.
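
For anyone reading this later, the goal described above (keep each file's SHA in SQLite and spot a file that has already been recorded) could look something like the sketch below. It is written in Python rather than Perl, and several things are assumptions for illustration only: the table name `files`, the function names, and the choice of SHA-1, which is inferred from the 27-character unpadded base64 digests shown above.

```python
import base64
import hashlib
import sqlite3


def sha1_b64(path, chunk_size=65536):
    """Return the unpadded base64 SHA-1 digest of a file's contents."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # 20 digest bytes encode to 28 base64 characters, one of them padding.
    return base64.b64encode(digest.digest()).decode("ascii").rstrip("=")


def open_store(db_path):
    """Open (or create) the database holding one row per distinct digest."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " sha1 TEXT PRIMARY KEY,"
        " name TEXT NOT NULL)"
    )
    return conn


def record_file(conn, path):
    """Store the file's digest; return False if that digest is already stored."""
    digest = sha1_b64(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO files (sha1, name) VALUES (?, ?)", (digest, path)
            )
    except sqlite3.IntegrityError:
        return False  # primary-key clash: same content was seen before
    return True
```

Making the digest the primary key lets the database itself reject a repeat, so a file arriving months later under a different name is still caught as long as its contents are byte-for-byte the same.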