Re: sorting with expensive compares?
On Sat, 24 Dec 2005 15:47:17 +1100, Steven D'Aprano wrote:

> On Fri, 23 Dec 2005 17:10:22 +0000, Dan Stromberg wrote:
>> I'm treating each file as a potentially very large string, and
>> sorting the strings.
> Which is a very strange thing to do, but I'll assume you have a good
> reason for doing so.

I believe what the original poster wants to do is eliminate duplicate content from a collection of ogg/whatever files with different names. E.g., he has a python script that goes out and collects all the free music it can find on the web. The same song may appear on many sites under different names, and he wants only one copy of a given song.

In any case, as others have pointed out, sorting by MD5 is sufficient except in cases far less probable than hardware failure - and deliberate collisions. E.g., the RIAA creates collision pairs of MP3 files where one member carries a freely redistributable license and the other a "copy this and we'll sue your ass off" license, in an effort to trap the unwary.

--
Stuart D. Gathman [EMAIL PROTECTED]
Business Management Systems Inc.  Phone: 703 591-0911  Fax: 703 591-6154
"Confutatis maledictis, flamis acribus addictis" - background song for a
Microsoft sponsored "Where do you want to go from here?" commercial.
Re: sorting with expensive compares?
On Fri, 23 Dec 2005 21:56:44 -0800, Paul Rubin wrote:

> Steven D'Aprano [EMAIL PROTECTED] writes:
>>> There are also known ways of deliberately constructing md5 collisions
>>> (i.e. md5 is broken). Whether the OP should care about that depends
>>> on the application.
>> Sure, but I don't think he is deliberately trying to sabotage his own
>> files :-)
> He might have downloaded a file created by a saboteur to have the same
> md5 as some popular music file, but which contains a subliminal
> hypnotic message which will brainwash him if played. Using a stronger
> hash, such as sha256, should protect him from this fate.

But the odds against such a message having the same MD5 as an existing song on his disk are quite a lot higher than 2**64 to one, unless he has a really, really large music collection ;) In the case you propose, two files don't just need to have the same MD5: they also need a whole lot of other characteristics. Both need to be (somewhat) valid MP3s, one needs to be a piece of music (or other sound) that is somewhat to the target's liking, and the other needs to be something playable with a subliminal message the target is likely to respond to.

Calculate-me-odds-on-THAT-wink-ly 'yrs,
--
Thomas Wouters [EMAIL PROTECTED]
Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
Re: sorting with expensive compares?
Thomas Wouters [EMAIL PROTECTED] writes:

> But the odds against such a message having the same MD5 as an existing
> song on his disk are quite a lot higher than 2**64 to one, unless he
> has a really, really large music collection ;) In the case you
> propose, two files don't just need to have the same MD5: they also
> need a whole lot of other characteristics. Both need to be (somewhat)
> valid MP3s, one needs to be a piece of music (or other sound) that is
> somewhat to the target's liking, and the other needs to be something
> playable with a subliminal message the target is likely to respond to.

The way the known collision attack works, the saboteur has to construct both files. However, the attacker does have a fair amount of control over the content. So he can start an innocent file circulating, then replace it with a sabotaged file on some network. A user might possibly somehow end up with both versions. See:

http://www.cits.rub.de/MD5Collisions/

for how that kind of attack can work.
Re: sorting with expensive compares?
Dan Stromberg wrote:

> Hi folks. Python appears to have a good sort method, but when sorting
> array elements that are very large, and hence have very expensive
> compares, is there some sort of already-available sort function that
> will merge like elements into a chain, so that they won't have to be
> recompared as many times? Thanks!

Might be simpler to memoize cmp() - look in the online cookbook, or use something like this decorator:

http://mail.python.org/pipermail/python-list/2005-October/303035.html
http://aspn.activestate.com/ASPN/Python/Cookbook/
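[A sketch of the memoization idea suggested above. The decorator below is illustrative, not the linked cookbook recipe; it keys its cache on object identity, so it assumes the compared objects stay alive for the duration of the sort. In the Python of 2005 the cmp function would be passed to sort() directly; functools.cmp_to_key is the modern equivalent.]

    import functools

    def memoize_cmp(cmp_func):
        """Cache comparison results, keyed on the identities of the operands."""
        cache = {}
        def cached_cmp(a, b):
            key = (id(a), id(b))
            if key not in cache:
                cache[key] = cmp_func(a, b)
            return cache[key]
        return cached_cmp

    def expensive_cmp(a, b):
        # stand-in for a costly comparison, e.g. comparing file contents
        return (a > b) - (a < b)

    files = ["b.ogg", "a.ogg", "c.ogg"]
    files.sort(key=functools.cmp_to_key(memoize_cmp(expensive_cmp)))

[Note that, as Tim Peters points out later in the thread, the sort rarely repeats a comparison, so the cache may buy little in practice.]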
Re: sorting with expensive compares?
Dan Stromberg wrote:

> Hi folks. Python appears to have a good sort method, but when sorting
> array elements that are very large, and hence have very expensive
> compares, is there some sort of already-available sort function that
> will merge like elements into a chain, so that they won't have to be
> recompared as many times? Thanks!

Sounds like DSU time: [a] -> [(hash(a), a)]
Re: sorting with expensive compares?
[EMAIL PROTECTED] wrote:

> Dan Stromberg wrote:
>> Hi folks. Python appears to have a good sort method, but when sorting
>> array elements that are very large, and hence have very expensive
>> compares, is there some sort of already-available sort function that
>> will merge like elements into a chain, so that they won't have to be
>> recompared as many times? Thanks!
> Sounds like DSU time: [a] -> [(hash(a), a)]

Aha! Or: take a log of the array, e.g. log base 10 or some other monotonic transform, and permutation order indexes:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/306862
Re: sorting with expensive compares?
gene tani wrote:

> [EMAIL PROTECTED] wrote:
>> Dan Stromberg wrote:
>>> Hi folks. Python appears to have a good sort method, but when
>>> sorting array elements that are very large, and hence have very
>>> expensive compares, is there some sort of already-available sort
>>> function that will merge like elements into a chain, so that they
>>> won't have to be recompared as many times? Thanks!
>> Sounds like DSU time: [a] -> [(hash(a), a)]
> Aha! Or: take a log of the array, e.g. log base 10 or some other
> monotonic transform, and permutation order indexes:
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/306862

I may have made a mistake in that hash(a) should be some function that returns the order of a, rather than the built-in hash() function.
Re: sorting with expensive compares?
Dan Stromberg wrote:

> Python appears to have a good sort method, but when sorting array
> elements that are very large, and hence have very expensive compares,
> is there some sort of already-available sort function that will merge
> like elements into a chain, so that they won't have to be recompared
> as many times?

It's not really practical - if the list is unsorted, it's non-trivial to determine how many times a given element is duplicated until you've compared it with everything else. That is roughly an O(N*N/2) operation, whereas sorting is typically O(N log N). This is why C++'s 'std::unique' function only operates on sorted lists.

So instead, one alternative would be to use a comparison function that takes the 2 objects and looks for the pair in a dictionary: if that pair is not found, perform the normal comparison and store it in the dictionary for next time, and if it is found, just return it. This way the actual comparison is only done once for each pair.

Alternatively you might be able to produce a cheap comparison from the expensive one if you can summarise the information in a simpler form. Perhaps each sorting criterion can be represented as an integer, and you can instead sort a list of lists containing integers.

--
Ben Sizer
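[A sketch of the "summarise the information in a simpler form" idea above, using file size plus the first few bytes as a cheap sort key. The function name is illustrative. Files that are identical always get equal summaries; unequal files occasionally do too, so a final verification pass over adjacent entries is still needed - but the expensive full comparison is avoided for most pairs.]

    import os

    def summary(path):
        """Cheap-to-compare proxy for an expensive file comparison."""
        with open(path, 'rb') as f:
            head = f.read(16)
        return (os.path.getsize(path), head)

    paths = ["a.ogg", "b.ogg", "c.iso"]
    paths.sort(key=summary)

[Since sort() calls the key function once per element, each file is opened exactly once, no matter how many comparisons the sort performs.]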
Re: sorting with expensive compares?
[EMAIL PROTECTED] wrote:

> Dan Stromberg wrote:
>> Python appears to have a good sort method, but when sorting array
>> elements that are very large, and hence have very expensive compares,
>> is there some sort of already-available sort function that will merge
>> like elements into a chain, so that they won't have to be recompared
>> as many times?
> Sounds like DSU time: [a] -> [(hash(a), a)]

This won't work - elements with different hashes will sort by hash, and elements with the same hash will still be compared, which is exactly what the OP is trying to avoid.

If there is some function of the arrays which sorts in the same order as the natural comparison, then that function can be used as a sort key:

    sorted(arrayList, key=some_proxy_function)

Kent
Re: sorting with expensive compares?
Kent Johnson [EMAIL PROTECTED] writes:

>> [a] -> [(hash(a), a)]
> This won't work - elements with different hashes will sort by hash,
> and elements with the same hash will still be compared, which is
> exactly what the OP is trying to avoid.

    ds = sorted([(hash(c), i) for i, c in enumerate(a)])
    dsu = [a[i] for hc, i in ds]

Is that what you mean? I didn't answer the OP because I couldn't understand the original question. The above brings together clusters of elements with the same hash, so if the clusters are large you can finish the sort with relatively few comparisons.
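[If the real goal is duplicate elimination rather than ordering, the same hashes can be used to bucket the elements directly, with no sort at all. A sketch; hash_func stands for whatever digest of the file contents is in use, and the function name is illustrative.]

    from collections import defaultdict

    def clusters(items, hash_func):
        """Group items whose hashes collide; only items that share a
        bucket can possibly be equal to each other."""
        buckets = defaultdict(list)
        for item in items:
            buckets[hash_func(item)].append(item)
        return [group for group in buckets.values() if len(group) > 1]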
Re: sorting with expensive compares?
In article [EMAIL PROTECTED], Dan Stromberg [EMAIL PROTECTED] wrote:

> Python appears to have a good sort method, but when sorting array
> elements that are very large, and hence have very expensive compares,
> is there some sort of already-available sort function that will merge
> like elements into a chain, so that they won't have to be recompared
> as many times?

I'll just note that Python's sort is specifically designed to reduce the number of compares precisely because *all* compares in Python are relatively expensive.

I'll suggest a variant on the previous suggestion of hash: [a] -> [(hash(a), index(a), a)]

--
Aahz ([EMAIL PROTECTED]) * http://www.pythoncraft.com/
Don't listen to schmucks on USENET when making legal decisions. Hire yourself a competent schmuck. --USENET schmuck (aka Robert Kern)
Re: sorting with expensive compares?
On Thu, 22 Dec 2005 22:06:42 +0000, Dan Stromberg wrote:

> Hi folks. Python appears to have a good sort method, but when sorting
> array elements that are very large, and hence have very expensive
> compares, is there some sort of already-available sort function that
> will merge like elements into a chain, so that they won't have to be
> recompared as many times? Thanks!

I probably should've been more specific.

I'm wanting to sort a large number of files, like a bunch of output files from a large series of rsh or ssh outputs on a large series of distinct machines, a music collection in .ogg format (strictly redistributable and legally purchased music), a collection of .iso cdrom images (strictly redistributable and legally purchased software), and so forth.

I'm not sorting the content of each file individually. I'm treating each file as a potentially very large string, and sorting the strings.

I've been using the following compare function, which in short checks, in order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the other file. Also, files that do not have the same length can never be identical. And of course these items are generated on demand, making it frequently possible to avoid doing a full-file-length compare on a lot of files.)

However, my program is still doing more #6 comparisons than seems strictly necessary, when I could just toss all the filenames describing identical files into a list and avoid re-comparing files with identical content over and over (I don't want to compare them to each other again and again), when there are a lot of identical files in the input list.

Thanks!

    def __cmp__(self, other):
        # sys.stderr.write('Comparing %s and %s\n' % (self.name, other.name))
        if verbosity >= 1:
            sys.stderr.write('Comparing file_class objects %s and %s\n' %
                             (self.name, other.name))
        if self.device == -1:
            if verbosity:
                sys.stderr.write('Getting stat data for file %s\n' % self.name)
            result = os.stat(self.name)
            self.device = result[stat.ST_DEV]
            self.inode = result[stat.ST_INO]
            self.size = result[stat.ST_SIZE]
        if other.device == -1:
            if verbosity:
                sys.stderr.write('Getting stat data for file %s\n' % other.name)
            result = os.stat(other.name)
            other.device = result[stat.ST_DEV]
            other.inode = result[stat.ST_INO]
            other.size = result[stat.ST_SIZE]
        if self.device == other.device and self.inode == other.inode:
            # same device, same inode, must be identical files
            return 0
        if self.size < other.size:
            return -1
        elif self.size > other.size:
            return 1
        # if we've reached this point, the files are not hardlinks, and
        # their lengths are identical - so slurp in the prefixes if
        # needed, then compare them
        if self.prefix == -1:
            if verbosity:
                sys.stderr.write('Getting prefix for file %s\n' % self.name)
            file = open(self.name, 'r')
            self.prefix = file.read(self.max_prefix_length)
            file.close()
        if other.prefix == -1:
            if verbosity:
                sys.stderr.write('Getting prefix for file %s\n' % other.name)
            file = open(other.name, 'r')
            other.prefix = file.read(self.max_prefix_length)
            file.close()
        if self.prefix < other.prefix:
            return -1
        elif self.prefix > other.prefix:
            return 1
        # if we've reached this point, we know that:
        # 1) The files are not hardlinks of each other
        # 2) They have identical sizes
        # 3) Their prefixes are identical
        # 4) We haven't yet tried doing a cryptographic hash, so compute
        #    them if needed, and compare them
        if self.hash == '':
            self.gen_hash()
        if other.hash == '':
            other.gen_hash()
        if self.hash < other.hash:
            return -1
        elif self.hash > other.hash:
            return 1
        # if we've reached this point, we know that:
        # 1) The files are not hardlinks of each other
        # 2) They have identical sizes
        # 3) Their prefixes are identical
Re: sorting with expensive compares?
Dan Stromberg wrote:

> I probably should've been more specific.
>
> I'm wanting to sort a large number of files, like a bunch of output
> files from a large series of rsh or ssh outputs on a large series of
> distinct machines, a music collection in .ogg format (strictly
> redistributable and legally purchased music), a collection of .iso
> cdrom images (strictly redistributable and legally purchased
> software), and so forth.
>
> I'm not sorting the content of each file individually. I'm treating
> each file as a potentially very large string, and sorting the strings.
>
> I've been using the following compare function, which in short checks,
> in order:
>
> 1) device number
> 2) inode number
> 3) file length
> 4) the beginning of the file
> 5) an md5 hash of the entire file
> 6) the entire file
>
> (If #1 and #2 are identical, then the file must be a hardlink to the
> other file. Also, files that do not have the same length can never be
> identical. And of course these items are generated on demand, making
> it frequently possible to avoid doing a full-file-length compare on a
> lot of files.)
>
> However, my program is still doing more #6 comparisons than seems
> strictly necessary, when I could just toss all the filenames
> describing identical files into a list and avoid re-comparing files
> with identical content over and over (I don't want to compare them to
> each other again and again), when there are a lot of identical files
> in the input list.

Why would #5 not be enough as an indicator that the files are identical?
Re: sorting with expensive compares?
Dan Stromberg [EMAIL PROTECTED] wrote:
   ...
> I'm wanting to sort a large number of files, like a bunch of output
> files from a large series of rsh or ssh outputs on a large series of
> distinct machines, a music collection in .ogg format (strictly
> redistributable and legally purchased music), a collection of .iso
> cdrom images (strictly redistributable and legally purchased
> software), and so forth.
>
> I'm not sorting the content of each file individually. I'm treating
> each file as a potentially very large string, and sorting the strings.

OK, very clear.

> I've been using the following compare function, which in short checks,
> in order:
>
> 1) device number
> 2) inode number
> 3) file length
> 4) the beginning of the file
> 5) an md5 hash of the entire file
> 6) the entire file

Makes sense, including the possibility of computing some items on demand. However, using a key-callable rather than a cmp-callable may still make sense -- you just need a callable that extracts the attributes on demand (and caches them thereafter... assuming you have enough memory to keep all the caches, I guess [6] might be a problem). I do see that key-callables won't necessarily work easily here, but since the key-callable's __cmp__ will in turn be called, that might still work better... but, on reflection, it's probably just a minor gain.

> However, my program is still doing more #6 comparisons than seems
> strictly necessary, when I could just toss all the filenames
> describing identical files into a list and avoid re-comparing files
> with identical content over and over (I don't want to compare them to
> each other again and again), when there are a lot of identical files
> in the input list.

The comparison should never get called twice on the same two objects, anyway. So, I'm not sure what you think you could gain by optimizing future comparisons of two objects once you've ascertained they are in fact equal.

Still, assuming for example that self.name is a unique identifier (i.e. the so-called name is in fact a complete path), the optimization (memoization) is quite easy to perform. Rename what you currently have as __cmp__ to another name, say _real_cmp, add a _compared dict to the class, and code the following __cmp__:

    _compared = {}

    def __cmp__(self, other):
        try:
            return -self._compared[other.name, self.name]
        except KeyError:
            pass
        key = self.name, other.name
        if key in self._compared:
            return self._compared[key]
        result = self._compared[key] = self._real_cmp(other)
        return result

I would not expect this optimization to matter at all, because no key should ever be already present in the self._compared dictionary (the same two files should never, ever get compared twice anyway).

However, it might be possible to extend this idea by using the properties you know an ordering should have -- if A and B have never been compared, but both have been compared with C, in some cases you don't need to compare A and B but may exploit already-known results. For example, if A==C and B==C, you already know that A==B without any actual comparison; if A<C and B==C, you already know that A<B; and so on. Of course in some cases you still must work, e.g. if you know A<C and B<C, that doesn't help. I'm not sure if this would help at all, either: it's quite possible that the sort algorithm already exploits these possibilities internally. But, perhaps, it may be worth a try.

Alex
Re: sorting with expensive compares?
Dan Stromberg wrote:

> I'm wanting to sort a large number of files, like a bunch of output
> files from a large series of rsh or ssh outputs on a large series of
> distinct machines, a music collection in .ogg format (strictly
> redistributable and legally purchased music), a collection of .iso
> cdrom images (strictly redistributable and legally purchased
> software), and so forth.

Are you really trying to establish an order or do you want to eliminate the duplicates?

    >>> File("perfectly_legal.ogg") > File("free_of_charge.mp3")
    True

doesn't make that much sense to me, regardless of what the comparison may actually do.

Peter
Re: sorting with expensive compares?
[EMAIL PROTECTED] wrote:

> Dan Stromberg wrote:
> [...]
>> I've been using the following compare function, which in short
>> checks, in order:
>>
>> 1) device number
>> 2) inode number
>> 3) file length
>> 4) the beginning of the file
>> 5) an md5 hash of the entire file
>> 6) the entire file
> [...]
> Why would #5 not be enough as an indicator that the files are
> identical?

Because it doesn't guarantee that the files are identical. It indicates, to a very high degree of probability (particularly when the file lengths are equal), that the two files are the same, but it doesn't guarantee it. Technically there are an infinite number of inputs that can produce the same md5 hash.

regards
Steve
--
Steve Holden  +44 150 684 7255  +1 800 494 3119
Holden Web LLC  www.holdenweb.com
PyCon TX 2006  www.python.org/pycon/
Re: sorting with expensive compares?
On Fri, 23 Dec 2005 09:20:55 -0800, bonono wrote:

> Dan Stromberg wrote:
> [snip]
>> I've been using the following compare function, which in short
>> checks, in order:
>>
>> 1) device number
>> 2) inode number
>> 3) file length
>> 4) the beginning of the file
>> 5) an md5 hash of the entire file
>> 6) the entire file
>>
>> (If #1 and #2 are identical, then the file must be a hardlink to the
>> other file. Also, files that do not have the same length can never be
>> identical. And of course these items are generated on demand, making
>> it frequently possible to avoid doing a full-file-length compare on a
>> lot of files.)
>>
>> However, my program is still doing more #6 comparisons than seems
>> strictly necessary, when I could just toss all the filenames
>> describing identical files into a list and avoid re-comparing files
>> with identical content over and over (I don't want to compare them to
>> each other again and again), when there are a lot of identical files
>> in the input list.
> Why would #5 not be enough as an indicator that the files are
> identical?

(1) Because the MD5 algorithm does include collisions. I was going to say "rare collisions", but there is actually an infinite number of them. The collisions are just distributed thinly -- because MD5 is a good hash algorithm, *very* thinly. (Proof of this comes from the pigeon-hole principle: there is an infinite number of possible files of arbitrary length, and only a finite number of possible hashes. Therefore, an infinite number of files must hash to each possible hash.)

(2) Having said that, unless the original poster is dealing with billions (plus) of files, it is unlikely that he is finding any of the collisions unless there is a bug in his sort routine. Since he claims to be doing more comparisons-by-file-contents than expected (I would suggest *one* is more than expected), I predict a bug in his code, his algorithm, or both.

--
Steven.
Re: sorting with expensive compares?
On Fri, 23 Dec 2005 19:26:11 +0100, Peter Otten wrote:

> Dan Stromberg wrote:
>> I'm wanting to sort a large number of files, like a bunch of output
>> files from a large series of rsh or ssh outputs on a large series of
>> distinct machines, a music collection in .ogg format (strictly
>> redistributable and legally purchased music), a collection of .iso
>> cdrom images (strictly redistributable and legally purchased
>> software), and so forth.
> Are you really trying to establish an order or do you want to
> eliminate the duplicates?
>
>     >>> File("perfectly_legal.ogg") > File("free_of_charge.mp3")
>     True
>
> doesn't make that much sense to me, regardless of what the comparison
> may actually do.

If I have understood the poster's algorithm correctly, it gets even weirder:

Sorted list of files ->
[parrot.ogg, redhat.iso, george.log, fred.log, rhino.ogg, cat.ogg, debian.iso, sys_restore.iso, adrian.log, fox.ogg, ...]

It seems to this little black duck that by sorting by file contents in this way, the effect to the human reader is virtually to randomise the list of file names. Even if you limit yourself to (say) a set of ogg files, and sort by the binary contents ->

# album-track
[parrot-6.ogg, rhino-1.ogg, cat-12.ogg, fox-2.ogg, parrot-3.ogg, ...]

most people looking at the list would guess it had been shuffled, not sorted. So I too don't know what the original poster hopes to accomplish by sorting on the content of large binary files.

--
Steven.
Re: sorting with expensive compares?
Dan Stromberg [EMAIL PROTECTED] writes:

> I've been using the following compare function, which in short checks,
> in order:
>
> 1) device number
> 2) inode number
> 3) file length
> 4) the beginning of the file
> 5) an md5 hash of the entire file
> 6) the entire file
>
> (If #1 and #2 are identical, then the file must be a hardlink to the
> other file. Also, files that do not have the same length can never be
> identical. And of course these items are generated on demand, making
> it frequently possible to avoid doing a full-file-length compare on a
> lot of files.)
>
> However, my program is still doing more #6 comparisons than seems

Just don't do any #6 comparisons. If the hashes are identical, the files are identical--that's the point of a cryptographic hash. OK, to be pedantic:

1) there's a microscopic chance of an accidental collision, but it's much lower than the chance of a hardware failure making your comparison function give the wrong answer.

2) there are known cryptanalytic attacks against md5 that can let someone deliberately construct two distinct files with the same hash, with a fair amount of effort. If you care about this, use sha-1 instead, or better yet, sha-256. (sha-1 has an attack similar to the md5 attack, but the amount of work required is not really practical today, unlike the md5 attack.)
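[This thread predates Python's hashlib module, but with it a SHA-256 digest of a large file takes only a few lines. A sketch; the chunked reads keep memory use flat even for .iso-sized files, and the function name is illustrative.]

    import hashlib

    def sha256_of_file(path, chunk_size=65536):
        """Return the hex SHA-256 digest of a file's contents."""
        digest = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                digest.update(chunk)
        return digest.hexdigest()

[Files with equal digests can then be treated as identical, per the reasoning above, and step #6 dropped entirely.]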
Re: sorting with expensive compares?
On Fri, 23 Dec 2005 17:10:22 +0000, Dan Stromberg wrote:

> I probably should've been more specific.
>
> I'm wanting to sort a large number of files, like a bunch of output
> files from a large series of rsh or ssh outputs on a large series of
> distinct machines, a music collection in .ogg format (strictly
> redistributable and legally purchased music), a collection of .iso
> cdrom images (strictly redistributable and legally purchased
> software), and so forth.
>
> I'm not sorting the content of each file individually. I'm treating
> each file as a potentially very large string, and sorting the strings.

Which is a very strange thing to do, but I'll assume you have a good reason for doing so.

> I've been using the following compare function,

My first suggestion is that you invest a bit of time in building a small-scale set of data that you can test your sorting on, if you haven't already done so. On multi-megabyte files, taking the MD5 hash or comparing the content byte by byte is quite slow. So I would spend a few minutes populating a directory with a dozen or so fake iso and ogg files, only a few bytes in length each, and run my tests on them.

> which in short checks, in order:
>
> 1) device number
> 2) inode number
> 3) file length
> 4) the beginning of the file
> 5) an md5 hash of the entire file
> 6) the entire file
>
> (If #1 and #2 are identical, then the file must be a hardlink to the
> other file. Also, files that do not have the same length can never be
> identical. And of course these items are generated on demand, making
> it frequently possible to avoid doing a full-file-length compare on a
> lot of files.)

My second suggestion is to encapsulate vigorously, at least for testing. At the moment, you have a single __cmp__ function that does everything in-line. I would do something like this:

    def __cmp__(self, other):
        result = compare_by_device_number_and_inode(self, other)
        if result is None:
            result = compare_by_file_length(self, other)
        if result is None:
            ...
        return result

You can always unroll the function calls later, but for now it helps in debugging: you only need to think about one type of comparison at a time, and it makes it easier to think about what each function needs to do.

> However, my program is still doing more #6 comparisons than seems
> strictly necessary, when I could just toss all the filenames
> describing identical files into a list and avoid re-comparing files
> with identical content over and over (I don't want to compare them to
> each other again and again), when there are a lot of identical files
> in the input list.

As others have pointed out, Python's sort never compares the same objects more than once.

Some more comments follow interleaved with your code:

> def __cmp__(self, other):
>     # sys.stderr.write('Comparing %s and %s\n' % (self.name, other.name))

Are you sure you don't want to just sort by file name? That would be much simpler *wink*

>     if verbosity >= 1:
>         sys.stderr.write('Comparing file_class objects %s and %s\n' %
>                          (self.name, other.name))
>     if self.device == -1:
>         if verbosity:
>             sys.stderr.write('Getting stat data for file %s\n' % self.name)
>         result = os.stat(self.name)
>         self.device = result[stat.ST_DEV]
>         self.inode = result[stat.ST_INO]
>         self.size = result[stat.ST_SIZE]

Since you need to compare stat data for all these objects, why not simply pre-fetch them when you create the object? That way you can simplify your __cmp__ function significantly.

[snip]

>     if self.device == other.device and self.inode == other.inode:
>         # same device, same inode, must be identical files return 0

At first I thought that my news client had mangled your function, but going back to the original post, either *your* client has mangled it, or I've found a bug in your code: "return 0" is commented out. Is it possible that all the excess file comparisons are being done only for hard links?

[snip]

>         if verbosity:
>             sys.stderr.write('Getting prefix for file %s\n' % self.name)
>         file = open(self.name, 'r')
>         self.prefix = file.read(self.max_prefix_length)
>         file.close()

"prefix" is probably not the best name. Maybe "header" (as in file header)? If the prefix/header is short enough, don't worry about fetching on demand, at least not while you are still debugging. Just pre-fetch it, get your code working, and then convert back to fetching the header on demand.

>     if other.prefix == -1:
>         if verbosity:
>             sys.stderr.write('Getting prefix for file %s\n' % other.name)
>         file = open(other.name, 'r')
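[Picking up the pre-fetching suggestion above: a minimal sketch of a constructor that grabs the cheap stat fields up front while leaving the expensive fields lazy. The class and attribute names follow the original poster's code; the rest is illustrative.]

    import os

    class file_class:
        def __init__(self, name):
            self.name = name
            st = os.stat(name)     # cheap: one system call per file
            self.device = st.st_dev
            self.inode = st.st_ino
            self.size = st.st_size
            self.prefix = None     # expensive fields stay lazy:
            self.hash = None       # filled in on first use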
Re: sorting with expensive compares?
Steven D'Aprano [EMAIL PROTECTED] writes:

> http://www.iusmentis.com/technology/encryption/pgp/pgpattackfaq/hash/
> the expected number of random unique files you would need to compare
> before finding a single collision in the MD5 hashes is (very roughly)
> 10**70, or ten billion trillion trillion trillion trillion trillion.

That's not right; the right number is around 2**64, or 2e19. See the section of that page about the birthday attack. It's still an awful lot.

There are also known ways of deliberately constructing md5 collisions (i.e. md5 is broken). Whether the OP should care about that depends on the application.
Re: sorting with expensive compares?
On Fri, 23 Dec 2005 21:07:57 -0800, Paul Rubin wrote:

> Steven D'Aprano [EMAIL PROTECTED] writes:
>> http://www.iusmentis.com/technology/encryption/pgp/pgpattackfaq/hash/
>> the expected number of random unique files you would need to compare
>> before finding a single collision in the MD5 hashes is (very roughly)
>> 10**70, or ten billion trillion trillion trillion trillion trillion.
> That's not right; the right number is around 2**64, or 2e19. See the
> section of that page about the birthday attack. It's still an awful
> lot.

Oh poot, a stupid typo! I entered 1e60 in my calculation instead of 1e6, which is doubly embarrassing because the correct number to use is 1e9:

[quote]
In MD5's case, 2**64 messages need to be tried. This is not a feasible attack given today's technology. If you could try 1,000,000 messages per second, it would take 584,942 years to find a collision. A machine that could try 1,000,000,000 messages per second would take 585 years, on average.
[end quote]

So the expected number of files (messages) needed to check to find a collision is:

585 * 365 * 24 * 60 * 60 * 1e9 = 1.8e19

or more than ten million million million, which of course equals 2**64.

> There are also known ways of deliberately constructing md5 collisions
> (i.e. md5 is broken). Whether the OP should care about that depends on
> the application.

Sure, but I don't think he is deliberately trying to sabotage his own files :-)

--
Steven.
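[The corrected arithmetic is easy to sanity-check; a quick snippet, not from the thread:]

    rate = 1e9                             # messages per second
    seconds_per_year = 365 * 24 * 60 * 60
    print(585 * seconds_per_year * rate)   # ~1.845e+19
    print(2.0 ** 64)                       # ~1.845e+19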
Re: sorting with expensive compares?
[Steven D'Aprano]
> ...
> As others have pointed out, Python's sort never compares the same
> objects more than once.

Others have pointed it out, and it's getting repeated now, but it's not true. Since I wrote Python's sorting implementation, it's conceivable that I'm not just making this up ;-)

Here's the shortest counter-example: [10, 30, 20].

The first thing sort() does is compute the length of the longest natural run (whether increasing or decreasing) starting at index 0. It compares 10 and 30, sees that the list starts with an increasing run, then compares 30 and 20 to see whether this run can be extended to length 3. It can't. Since the list has only 3 elements, it decides to use a binary insertion sort to move the 3rd element into place. That requires 2 comparisons, the first of which _again_ compares 30 and 20.

This doesn't bother me, and it would cost more to avoid this than it could possibly save in general. It's true that the workhorse binary insertion and merge sorts never compare the same elements more than once, but there is some potential duplication between those and comparisons done to identify natural runs.
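[The duplicated comparison is easy to observe for yourself. A sketch assuming a CPython whose timsort still detects natural runs as described above; the wrapper class is illustrative.]

    from collections import Counter

    counts = Counter()

    class Noisy:
        """Wrap a value and record every ordered pair that gets compared."""
        def __init__(self, value):
            self.value = value
        def __lt__(self, other):
            counts[(self.value, other.value)] += 1
            return self.value < other.value

    sorted([Noisy(10), Noisy(30), Noisy(20)])
    print(counts)   # the (20, 30) pair is counted twice: once while
                    # detecting the run, once during binary insertion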
Re: sorting with expensive compares?
Steven D'Aprano [EMAIL PROTECTED] writes:

>> There are also known ways of deliberately constructing md5 collisions
>> (i.e. md5 is broken). Whether the OP should care about that depends
>> on the application.
> Sure, but I don't think he is deliberately trying to sabotage his own
> files :-)

He might have downloaded a file created by a saboteur to have the same md5 as some popular music file, but which contains a subliminal hypnotic message which will brainwash him if played. Using a stronger hash, such as sha256, should protect him from this fate.
Re: sorting with expensive compares?
Tim Peters [EMAIL PROTECTED] wrote:

> [Steven D'Aprano]
>> ...
>> As others have pointed out, Python's sort never compares the same
>> objects more than once.
> Others have pointed it out, and it's getting repeated now, but it's
> not true. Since I wrote Python's sorting implementation, it's
> conceivable that I'm not just making this up ;-)
> ...
> could possibly save in general. It's true that the workhorse binary
> insertion and merge sorts never compare the same elements more than
> once, but there is some potential duplication between those and
> comparisons done to identify natural runs.

Since I probably was the first one to point out the erroneous factoid in question, I apologize -- obviously, my memories of the inner workings of sort were faulty, and didn't consider that potential duplication. In this case, the tricks I already (though dubiously ;-) suggested in order to avoid any avoidable comparisons might pay for themselves and then some, if comparisons are indeed extremely costly.

Alex