On Tue, 31 Jan 2006, it was written:

> Steven D'Aprano <[EMAIL PROTECTED]> writes:
>
>> This isn't a criticism, it is a genuine question. Why do people
>> compare local files with MD5 instead of doing a byte-to-byte compare?
I often wonder that!

>> Is it purely a caching thing (once you have the checksum, you don't
>> need to read the file again)? Are there any other reasons?
>
> It's not just a matter of comparing two files. The idea is you have
> 10,000 local files and you want to find which ones are duplicates
> (i.e. if files 637 and 2945 have the same contents, you want to
> discover that). The obvious way is make a list of hashes, and sort
> the list.

Obvious, perhaps; prudent, no. To make the list of hashes, you have to
read all of every single file first, which could take a while. If your
files are reasonably random at the beginning, you'd be better off just
using the first N bytes of the file, since this would be just as
effective, and cheaper to read. Looking at some random MP3s i have to
hand, they all differ within the first 20 bytes - probably due to the
ID3 tags - so this should work for those.

Better yet, note that if two files are identical, they must have the
same length, and that finding the length can be done very cheaply, so
a still quicker approach is to make a list of lengths, sort that, and
look for duplicates; when you find one, do a byte-by-byte comparison
of the files (probably terminating in the first block) to see if they
really are the same. (Rough Python sketches of all three approaches -
hashes, prefixes, and lengths - are appended below, after my sig.)

By way of example, of the 2690 music files in my iTunes library, i
have twelve pairs of same-sized files [1], and all of these differ
within the first 18 bytes (mostly, within the first 9 bytes).
Therefore, i could rule out duplication with just 22 data blocks read
from disk (plus rather more blocks of directory information and
inodes, of course). A hash-based approach would have had to wade
through a touch over 13 GB of data before it could even get started.

Of course, there are situations where this is the wrong approach - if
you have a collection of serialised sparse matrices, for example,
which consist of identically-sized blocks of zeroes with a scattering
of ones throughout, then lengths and prefixes will be useless, whereas
hashes will work perfectly. Here, though, we're looking at MP3s, where
lengths and prefixes will be a win.

tom

[1] The distribution of those is a bit weird: ten pairs consist of two
tracks from The Conet Project's 'Recordings of Shortwave Numbers
Stations', one pairs a song from that album with The Doors' 'Horse
Latitudes', and one is between two Calexico songs ('The Ride (Pt II)'
and 'Minas De Cobre'). Why on earth are eleven of the twelve pairs
made up of songs from the same artist? Is it really that they're pairs
of songs from the same compressor (those tracks aren't from CD), i
wonder?

--
Not all legislation can be eye-catching, and it is important that the
desire to achieve the headlines does not mean that small but useful
measures are crowded out of the legislative programme.
   -- Select Committee on Transport
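For the record, here's a minimal sketch of the hash-everything
approach (the names and block size are my own invention; i've used a
dict rather than literally sorting a list of hashes, but it's the same
grouping):

import hashlib

def md5_of(path, blocksize=65536):
    """MD5 of a whole file, read a block at a time."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def duplicates_by_hash(paths):
    """Group paths by digest; any group of two or more is a set of
    (near-certain) duplicates. Note that every byte of every file
    gets read before a single duplicate can be reported."""
    groups = {}
    for path in paths:
        groups.setdefault(md5_of(path), []).append(path)
    return [group for group in groups.values() if len(group) > 1]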
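The first-N-bytes variant is the same shape, but reads one small block
per file rather than the whole thing (N=20 is just my guess from the
MP3s above; files with matching prefixes would still need a full
comparison to confirm they really are duplicates):

def prefix_of(path, n=20):
    """The first n bytes of a file - a single cheap read."""
    with open(path, 'rb') as f:
        return f.read(n)

def candidates_by_prefix(paths, n=20):
    """Group paths by their first n bytes; files in different groups
    cannot possibly be identical."""
    groups = {}
    for path in paths:
        groups.setdefault(prefix_of(path, n), []).append(path)
    return [group for group in groups.values() if len(group) > 1]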
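And the lengths-first version, which reads no file data at all until
two files turn out to be the same size (the standard library's
filecmp.cmp(a, b, shallow=False) would do instead of my
same_contents):

import os
from itertools import combinations

def same_contents(a, b, blocksize=65536):
    """Byte-by-byte comparison, bailing out at the first differing
    block - which, for my MP3s, means after one block of each."""
    with open(a, 'rb') as fa, open(b, 'rb') as fb:
        while True:
            block_a = fa.read(blocksize)
            block_b = fb.read(blocksize)
            if block_a != block_b:
                return False
            if not block_a:  # both files hit EOF together: identical
                return True

def duplicates_by_length(paths):
    """Group paths by length (one stat each, no reads), then compare
    bytes only within the same-length groups."""
    by_size = {}
    for path in paths:
        by_size.setdefault(os.path.getsize(path), []).append(path)
    dupes = []
    for group in by_size.values():
        for a, b in combinations(group, 2):
            if same_contents(a, b):
                dupes.append((a, b))
    return dupes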