Dan Stromberg <[EMAIL PROTECTED]> wrote:
   ...
> I'm wanting to sort a large number of files, like a bunch of output files
> from a large series of rsh or ssh outputs on a large series of distinct
> machines, a music collection in .ogg format (strictly redistributable and
> legally purchased music), a collection of .iso cdrom images (strictly
> redistributable and legally purchased software), and so forth.
>
> I'm not sorting the content of each file individually.
>
> I'm treating each file as a potentially very large string, and "sorting
> the strings".
OK, very clear.

> I've been using the following compare function, which in short checks, in
> order:
>
> 1) device number
> 2) inode number
> 3) file length
> 4) the beginning of the file
> 5) an md5 hash of the entire file
> 6) the entire file

Makes sense, including the possibility of computing some items on demand.
However, using a key-callable rather than a cmp-callable may still make
sense -- you just need a callable that extracts the attributes on demand
(and caches them thereafter... assuming you have enough memory to keep all
the caches; I guess [6] might be a problem).  I do see that key-callables
won't necessarily work easily here, but since the key-callable's __cmp__
will in turn be called, that might still work better... though, on
reflection, it's probably just a minor gain.  (A rough, untested sketch of
such a key-callable is appended at the end of this message.)

> However, my program is still doing more #6 comparisons than seems
> strictly necessary when I could just toss all the filenames describing
> identical files into a list, and avoid re-comparing files with identical
> content over and over (I don't want to compare them to each other again
> and again), when there are a lot of identical files in the input list.

The comparison should never get called twice on the same two objects
anyway, so I'm not sure what you think you could gain by optimizing future
comparisons of two objects once you've ascertained they are in fact equal.

Still, assuming for example that self.name is a unique identifier (i.e.,
the so-called name is in fact a complete path), the optimization
(memoization) is quite easy to perform.  Rename what you currently have as
__cmp__ to another name, say _real_cmp, add a _compared dict to the class,
and code the following __cmp__:

    _compared = {}

    def __cmp__(self, other):
        # if this pair was already compared the other way around, reuse
        # (and negate) the cached result
        try:
            return -self._compared[other.name, self.name]
        except KeyError:
            pass
        # otherwise look up, or compute and cache, this pair's result
        key = self.name, other.name
        if key in self._compared:
            return self._compared[key]
        result = self._compared[key] = self._real_cmp(other)
        return result

I would not expect this optimization to matter at all, because no key
should ever be already present in the self._compared dictionary (the same
two files should never, ever get compared twice anyway).

However, it might be possible to extend this idea by using the properties
you know an ordering should have -- if A and B have never been compared,
but both have been compared with C, in some cases you don't need to
compare A and B at all but may exploit already-known results.  For
example, if A==C and B==C, you already know that A==B without any actual
comparison; if A<C and B==C, you already know that A<B; and so on.  Of
course, in some cases you still have to do the work -- e.g., knowing A<C
and B<C doesn't help.  I'm not sure this would help at all, either: it's
quite possible that the sort algorithm already exploits these
possibilities internally.  But, perhaps, it may be worth a try.


Alex
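
For concreteness, here is a rough, untested sketch of the key-callable
idea above (Python 2 style, to match the __cmp__ discussion).  The names
FileKey, head, digest and HEAD_SIZE are my own inventions for
illustration; the dev/inode check just treats hard links to the same file
as equal, and the final whole-file comparison ([6] above) is omitted,
relying on the md5 digest instead:

    import os, md5

    HEAD_SIZE = 8192        # how much of each file to read for check [4]

    class FileKey(object):
        """Wraps a path: cheap attributes (dev, inode, size) are computed
        up front, expensive ones (head, md5 digest) on demand, and cached
        thereafter."""

        def __init__(self, path):
            self.path = path
            st = os.stat(path)
            self.dev, self.ino, self.size = st.st_dev, st.st_ino, st.st_size
            self._head = None
            self._digest = None

        def head(self):
            # first HEAD_SIZE bytes, read lazily and cached
            if self._head is None:
                f = open(self.path, 'rb')
                try:
                    self._head = f.read(HEAD_SIZE)
                finally:
                    f.close()
            return self._head

        def digest(self):
            # md5 of the whole file, computed lazily and cached
            if self._digest is None:
                m = md5.new()
                f = open(self.path, 'rb')
                try:
                    for chunk in iter(lambda: f.read(65536), ''):
                        m.update(chunk)
                finally:
                    f.close()
                self._digest = m.digest()
            return self._digest

        def __cmp__(self, other):
            # same device and inode: hard links to one file, so equal
            if (self.dev, self.ino) == (other.dev, other.ino):
                return 0
            return (cmp(self.size, other.size)
                    or cmp(self.head(), other.head())
                    or cmp(self.digest(), other.digest()))

With that, something like sorted(list_of_paths, key=FileKey) should group
identical files next to each other, stat-ing, reading and hashing each
file at most once.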