Re: sorting with expensive compares?

2005-12-28 Thread Stuart D. Gathman
On Sat, 24 Dec 2005 15:47:17 +1100, Steven D'Aprano wrote:

 On Fri, 23 Dec 2005 17:10:22 +, Dan Stromberg wrote:

 I'm treating each file as a potentially very large string, and sorting
 the strings.
 
 Which is a very strange thing to do, but I'll assume you have a good
 reason for doing so.

I believe what the original poster wants to do is eliminate duplicate
content from a collection of ogg/whatever files with different names. 
E.g., he has a python script that goes out and collects all the free music
it can find on the web.  The same song may appear on many sites under
different names, and he wants only one copy of a given song.

In any case, as others have pointed out, sorting by MD5 is sufficient
except in cases far less probable than hardware failure - and deliberate
collisions.  E.g., the RIAA creates collision pairs of MP3 files where one
member carries a "freely redistributable" license, and the other a "copy
this and we'll sue your ass off" license, in an effort to trap the unwary.

-- 
  Stuart D. Gathman [EMAIL PROTECTED]
Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
Confutatis maledictis, flamis acribus addictis - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-25 Thread Thomas Wouters
On Fri, 23 Dec 2005 21:56:44 -0800, Paul Rubin wrote:

 Steven D'Aprano [EMAIL PROTECTED] writes:
  There are also known ways of deliberately constructing md5 collisions
  (i.e. md5 is broken).  Whether the OP should care about that depends
  on the application.
 
 Sure, but I don't think he is deliberately trying to sabotage his own files :-)
 
 He might have downloaded a file created by a saboteur to have the same
 md5 as some popular music file, but which contains a subliminal
 hypnotic message which will brainwash him if played.  Using a stronger
 hash, such as sha256, should protect him from this fate.

But the odds against such a message having the same MD5 as an existing
song on his disk are quite a lot higher than 2**64 to one, unless he has a
really, really large music collection ;) In the case you propose, two files
don't just need to have the same MD5, but they also need to have a whole
lot of other characteristics: both need to be (somewhat) valid MP3's, one
needs to be a piece of music (or other sound) that is somewhat to the
target's liking, and the other needs to be something playable with a
subliminal message the target is likely to respond to.

Calculate-me-odds-on-THAT-wink-ly 'yrs,
-- 
Thomas Wouters [EMAIL PROTECTED]

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-25 Thread Paul Rubin
Thomas Wouters [EMAIL PROTECTED] writes:
 But the odds against such a message having the same MD5 as an existing
 song on his disk are quite a lot higher than 2**64 to one, unless he has a
 really, really large music collection ;) In the case you propose, two files
 don't just need to have the same MD5, but they also need to have a whole
 lot of other characteristics: both need to be (somewhat) valid MP3's, one
 needs to be a piece of music (or other sound) that is somewhat to the
 target's liking, and the other needs to be something playable with a
 subliminal message the target is likely to respond to.

The way the known collision attack works, the saboteur has to
construct both files.  However, the attacker does have a fair amount
of control over the content.  So he can start an innocent file
circulating, then replace it with a sabotaged file on some network.
A user might possibly somehow end up with both versions.

See: http://www.cits.rub.de/MD5Collisions/ for how that kind of attack
can work.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread gene tani

Dan Stromberg wrote:
 Hi folks.

 Python appears to have a good sort method, but when sorting array elements
 that are very large, and hence have very expensive compares, is there some
 sort of already-available sort function that will merge like elements into
 a chain, so that they won't have to be recompared as many times?

 Thanks!

might be simpler to memoize cmp(); look in the online cookbook, or at
something like this decorator:

http://mail.python.org/pipermail/python-list/2005-October/303035.html
http://aspn.activestate.com/ASPN/Python/Cookbook/
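
A minimal sketch of such a decorator (assumptions: the objects being
compared are hashable, and expensive_cmp stands in for whatever costly
comparison already exists):

    def memoize_cmp(func):
        cache = {}
        def wrapper(a, b):
            # each (a, b) pair hits the expensive compare at most once
            key = (a, b)
            if key not in cache:
                cache[key] = func(a, b)
            return cache[key]
        return wrapper

    # usage (Python 2 cmp-style sort):
    # mylist.sort(memoize_cmp(expensive_cmp))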

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread bonono

Dan Stromberg wrote:
 Hi folks.

 Python appears to have a good sort method, but when sorting array elements
 that are very large, and hence have very expensive compares, is there some
 sort of already-available sort function that will merge like elements into
 a chain, so that they won't have to be recompared as many times?
 
 Thanks!
Sounds like DSU time.

[a] -> [ (hash(a), a) ]
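
Spelled out, the decorate-sort-undecorate step is just the following
sketch (note the caveats raised later in the thread: ties on the cheap
key still fall through to comparing the items themselves, and the
built-in hash() does not preserve the natural order):

    decorated = [(hash(a), a) for a in data]   # decorate with a cheap key
    decorated.sort()                           # tuples compare by cheap key first
    data = [a for (k, a) in decorated]         # undecorate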

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread gene tani

[EMAIL PROTECTED] wrote:
 Dan Stromberg wrote:
  Hi folks.
 
  Python appears to have a good sort method, but when sorting array elements
  that are very large, and hence have very expensive compares, is there some
  sort of already-available sort function that will merge like elements into
  a chain, so that they won't have to be recompared as many times?
 
  Thanks!
 Sounds like DSU time.

 [a] -> [ (hash(a), a) ]

Aha!  OR: take a log of the array, e.g. log base 10 or some other
monotonic transform and permutation order indexes
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/306862

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread bonono

gene tani wrote:
 [EMAIL PROTECTED] wrote:
  Dan Stromberg wrote:
   Hi folks.
  
   Python appears to have a good sort method, but when sorting array elements
   that are very large, and hence have very expensive compares, is there some
   sort of already-available sort function that will merge like elements into
   a chain, so that they won't have to be recompared as many times?
  
   Thanks!
  Sounds like DSU time.
 
  [a] -> [ (hash(a), a) ]

 Aha!  OR: take a log of the array, e.g. log base 10 or some other
 monotonic transform and permutation order indexes
 http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/306862
I may have made a mistake in that hash(a) should be some function that
returns the order of a, rather than the built-in hash() function.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Ben Sizer
Dan Stromberg wrote:
 Python appears to have a good sort method, but when sorting array elements
 that are very large, and hence have very expensive compares, is there some
 sort of already-available sort function that will merge like elements into
 a chain, so that they won't have to be recompared as many times?

It's not really practical - if the list is unsorted, it's non-trivial
to determine how many times a given element is duplicated until you've
compared it with everything else. That is roughly an O(N*N/2) operation
whereas sorting typically is O(NlogN). This is why C++'s 'std::unique'
function only operates on sorted lists.

So instead, one alternative would be to use a comparison function that
takes the 2 objects and looks for the pair in a dictionary: if that
pair is not found, perform the normal comparison and store it in the
dictionary for next time, and if it is found, just return it. This way
the actual comparison is only done once for each pair.
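
A sketch of such a caching comparison (cmp-style; real_compare stands in
for the expensive comparison, and the objects must be hashable):

    pair_cache = {}

    def cached_compare(x, y):
        # look the pair up first; fall back to the real (expensive) compare
        if (x, y) in pair_cache:
            return pair_cache[(x, y)]
        if (y, x) in pair_cache:
            return -pair_cache[(y, x)]   # reuse the reversed pair
        result = pair_cache[(x, y)] = real_compare(x, y)
        return result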

Alternatively you might be able to produce a cheap comparison from the
expensive one if you can summarise the information in a simpler form.
Perhaps each sorting criterion can be represented as an integer, and
you can instead sort a list of lists containing integers.
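
For instance, if the criteria can be reduced to numbers known up front,
something like this sketch would sort on cheap tuples instead (size and
digest are assumed to be precomputed attributes; key= needs Python 2.4+):

    files.sort(key=lambda f: (f.size, f.digest))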

-- 
Ben Sizer

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Kent Johnson
[EMAIL PROTECTED] wrote:
 Dan Stromberg wrote:
Python appears to have a good sort method, but when sorting array elements
that are very large, and hence have very expensive compares, is there some
sort of already-available sort function that will merge like elements into
a chain, so that they won't have to be recompared as many times?
 
 Sounds like DSU time.
 
  [a] -> [ (hash(a), a) ]

This won't work - elements with different hashes will sort by hash and 
elements with the same hash will still be compared which is exactly what 
the OP is trying to avoid.

If there is some function of the arrays which sorts in the same order as 
the natural comparison then that function can be used as a sort key.
sort(arrayList, key=some_proxy_function)
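
For example, a sketch of such a proxy (hypothetical; uses Python 2's md5
module, and imposes a size-then-digest order rather than the raw byte
order, which is enough if the goal is only to bring identical files
together):

    import os, md5

    def proxy_key(path):
        # whole file in memory; chunk the reads for huge files
        f = open(path, 'rb')
        digest = md5.new(f.read()).digest()
        f.close()
        return (os.path.getsize(path), digest)

    arrayList.sort(key=proxy_key)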

Kent
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Paul Rubin
Kent Johnson [EMAIL PROTECTED] writes:
  [a] -> [ (hash(a), a) ]
 
 This won't work - elements with different hashes will sort by hash and
 elements with the same hash will still be compared which is exactly
 what the OP is trying to avoid.

ds = sorted([(hash(c), i) for i,c in enumerate(a)])
dsu = [a[i] for hc,i in ds]

Is that what you mean?  I didn't answer the OP because I couldn't
understand the original question.  The above brings together clusters
of elements with the same hash, so if the clusters are large you can
finish the sort with relatively few comparisons.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Aahz
In article [EMAIL PROTECTED],
Dan Stromberg  [EMAIL PROTECTED] wrote:

Python appears to have a good sort method, but when sorting array
elements that are very large, and hence have very expensive compares,
is there some sort of already-available sort function that will merge
like elements into a chain, so that they won't have to be recompared as
many times?

I'll just note that Python's sort is specifically designed to reduce the
number of compares precisely because *all* compares in Python are
relatively expensive.  I'll suggest a variant on the previous suggestion
of hash:

[a] -> [hash(a), index(a), a]
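
I.e., something like this sketch: the unique index breaks ties among
equal hashes, so the tuple comparison never falls through to the
expensive third element (with the same caveat as before about hash()
not preserving the desired order):

    decorated = [(hash(a), i, a) for i, a in enumerate(data)]
    decorated.sort()   # ties broken by i, so the a's are never compared
    data = [a for (h, i, a) in decorated]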
-- 
Aahz ([EMAIL PROTECTED])   * http://www.pythoncraft.com/

Don't listen to schmucks on USENET when making legal decisions.  Hire
yourself a competent schmuck.  --USENET schmuck (aka Robert Kern)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Dan Stromberg
On Thu, 22 Dec 2005 22:06:42 +, Dan Stromberg wrote:

 
 Hi folks.
 
 Python appears to have a good sort method, but when sorting array elements
 that are very large, and hence have very expensive compares, is there some
 sort of already-available sort function that will merge like elements into
 a chain, so that they won't have to be recompared as many times?
 
 Thanks!

I probably should've been more specific.

I'm wanting to sort a large number of files, like a bunch of output files
from a large series of rsh or ssh outputs on a large series of distinct
machines, a music collection in .ogg format (strictly redistributable and
legally purchased music), a collection of .iso cdrom images (strictly
redistributable and legally purchased software), and so forth.

I'm not sorting the content of each file individually.

I'm treating each file as a potentially very large string, and sorting
the strings.

I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file

(If #1 and #2 are identical, then the file must be a hardlink to the other
file.  Also, files that do not have the same length can never be
identical.  And of course these items are generated on demand, making it
frequently possible to avoid doing full-file-length compare on a lot of
files)

However, my program is still doing more #6 comparisons than seems
strictly necessary when I could just toss all the filenames describing
identical files into a list, and avoid re-comparing files with identical
content over and over (I don't want to compare them to each other again
and again), when there are a lot of identical files in the input list.

Thanks!

def __cmp__(self, other):
    #   sys.stderr.write('Comparing %s and %s\n' % (self.name, other.name))
    if verbosity >= 1:
        sys.stderr.write('Comparing file_class objects %s and %s\n' %
                         (self.name, other.name))
    if self.device == -1:
        if verbosity:
            sys.stderr.write('Getting stat data for file %s\n' % self.name)
        result = os.stat(self.name)
        self.device = result[stat.ST_DEV]
        self.inode = result[stat.ST_INO]
        self.length = result[stat.ST_SIZE]
    if other.device == -1:
        if verbosity:
            sys.stderr.write('Getting stat data for file %s\n' % other.name)
        result = os.stat(other.name)
        other.device = result[stat.ST_DEV]
        other.inode = result[stat.ST_INO]
        other.length = result[stat.ST_SIZE]
    if self.device == other.device and self.inode == other.inode:
        # same device, same inode, must be identical files
        return 0
    if self.length < other.length:
        return -1
    elif self.length > other.length:
        return 1
    # if we've reached this point, the files are not hardlinks, and their
    # lengths are identical - so slurp in the prefixes if needed, then
    # compare them
    if self.prefix == -1:
        if verbosity:
            sys.stderr.write('Getting prefix for file %s\n' % self.name)
        file = open(self.name, 'r')
        self.prefix = file.read(self.max_prefix_length)
        file.close()
    if other.prefix == -1:
        if verbosity:
            sys.stderr.write('Getting prefix for file %s\n' % other.name)
        file = open(other.name, 'r')
        other.prefix = file.read(self.max_prefix_length)
        file.close()
    if self.prefix < other.prefix:
        return -1
    elif self.prefix > other.prefix:
        return 1
    # if we've reached this point, we know that:
    # 1) The files are not hardlinks of each other
    # 2) They have identical sizes
    # 3) Their prefixes are identical
    # 4) We haven't yet tried doing a cryptographic hash, so compute them
    #    if needed, and compare them
    if self.hash == '':
        self.gen_hash()
    if other.hash == '':
        other.gen_hash()
    if self.hash < other.hash:
        return -1
    elif self.hash > other.hash:
        return 1
    # if we've reached this point, we know that:
    # 1) The files are not hardlinks of each other
    # 2) They have identical sizes
    # 3) Their prefixes are identical

Re: sorting with expensive compares?

2005-12-23 Thread bonono

Dan Stromberg wrote:
 On Thu, 22 Dec 2005 22:06:42 +, Dan Stromberg wrote:

 
  Hi folks.
 
  Python appears to have a good sort method, but when sorting array elements
  that are very large, and hence have very expensive compares, is there some
  sort of already-available sort function that will merge like elements into
  a chain, so that they won't have to be recompared as many times?
 
  Thanks!

 I probably should've been more specific.

 I'm wanting to sort a large number of files, like a bunch of output files
 from a large series of rsh or ssh outputs on a large series of distinct
 machines, a music collection in .ogg format (strictly redistributable and
 legally purchased music), a collection of .iso cdrom images (strictly
 redistributable and legally purchased software), and so forth.

 I'm not sorting the content of each file individually.

 I'm treating each file as a potentially very large string, and sorting
 the strings.

 I've been using the following compare function, which in short checks, in
 order:

 1) device number
 2) inode number
 3) file length
 4) the beginning of the file
 5) an md5 hash of the entire file
 6) the entire file

 (If #1 and #2 are identical, then the file must be a hardlink to the other
 file.  Also, files that do not have the same length can never be
 identical.  And of course these items are generated on demand, making it
 frequently possible to avoid doing full-file-length compare on a lot of
 files)

 However, my program is still doing more #6 comparisons than seems
 strictly necessary when I could just toss all the filenames describing
 identical files into a list, and avoid re-comparing files with identical
 content over and over (I don't want to compare them to each other again
 and again), when there are a lot of identical files in the input list.

Why would #5 not be enough as an indicator that the files are identical?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Alex Martelli
Dan Stromberg [EMAIL PROTECTED] wrote:
   ...
 I'm wanting to sort a large number of files, like a bunch of output files
 from a large series of rsh or ssh outputs on a large series of distinct
 machines, a music collection in .ogg format (strictly redistributable and
 legally purchased music), a collection of .iso cdrom images (strictly
 redistributable and legally purchased software), and so forth.
 
 I'm not sorting the content of each file individually.
 
 I'm treating each file as a potentially very large string, and sorting
 the strings.

OK, very clear.

 I've been using the following compare function, which in short checks, in
 order:
 
 1) device number
 2) inode number
 3) file length
 4) the beginning of the file
 5) an md5 hash of the entire file
 6) the entire file

Makes sense, including the possibility of computing some items on
demand.  However, using a key-callable rather than a cmp-callable may
still make sense -- you just need a callable that extracts the
attributes on demand (and caches them thereafter... assuming you have
enough memory to keep all the caches, I guess [6] might be a problem).
I do see that key-callables won't necessarily work easily here, but
since the key-callable's __cmp__ will in turn be called, that might
still work better... but, on reflection, it's probably just a minor
gain.

 However, my program is still doing more #6 comparisons than seems
 strictly necessary when I could just toss all the filenames describing
 identical files into a list, and avoid re-comparing files with identical
 content over and over (I don't want to compare them to each other again
 and again), when there are a lot of identical files in the input list.

The comparison should never get called twice on the same two objects,
anyway.  So, I'm not sure what you think you could gain by optimizing
future comparisons of two objects once you've ascertained they are in
fact equal.  Still, assuming for example that self.name is a unique
identifier (i.e. the so-called name is in fact a complete path), the
optimization (memoization) is quite easy to perform.  Rename what you
currently have as __cmp__ by another name, say _real_cmp, add a
_compared dict to the class, and code the following __cmp__ ...:

_compared = {}
def __cmp__(self, other):
    try: return -self._compared[other.name, self.name]
    except KeyError: pass
    key = self.name, other.name
    if key in self._compared: return self._compared[key]
    result = self._compared[key] = self._real_cmp(other)
    return result

I would not expect this optimization to matter at all, because no key
should ever be already present in the self._compared dictionary (the
same two files should never, ever get compared twice anyway).

However, it might be possible to extend this idea by using the
properties you know an ordering should have -- if A and B have never
been compared, but both have been compared with C, in some cases you
don't need to compare A and B but may exploit already-known results.
For example, if A==C and B==C, you already know that A==B without any
actual comparison; if A<C and B==C, you already know that A<B; and so
on.  Of course in some cases you still must work, e.g. if you know A<C
and B<C that doesn't help.  I'm not sure if this would help at all,
either: it's quite possible that the sort algorithm already exploits
these possibilities internally.  But, perhaps, it may be worth a try.
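
A sketch of the equality half of that idea, layered on the memoization
scheme above (names are assumed to be unique paths; a full union-find
structure would handle long chains of equalities better than this
one-level merge):

    _group = {}   # name -> representative name of its known-identical group

    def __cmp__(self, other):
        a = self._group.get(self.name, self.name)
        b = self._group.get(other.name, other.name)
        if a == b:
            return 0                  # equal by transitivity, no file I/O
        result = self._real_cmp(other)
        if result == 0:
            self._group[b] = a        # merge the two groups (one level only)
        return result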


Alex
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Peter Otten
Dan Stromberg wrote:

 I'm wanting to sort a large number of files, like a bunch of output files
 from a large series of rsh or ssh outputs on a large series of distinct
 machines, a music collection in .ogg format (strictly redistributable and
 legally purchased music), a collection of .iso cdrom images (strictly
 redistributable and legally purchased software), and so forth.

Are you really trying to establish an order, or do you want to eliminate
the duplicates?

 File("perfectly_legal.ogg") < File("free_of_charge.mp3")
True

doesn't make that much sense to me, regardless of what the comparison may
actually do.

Peter

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Steve Holden
[EMAIL PROTECTED] wrote:
 Dan Stromberg wrote:
[...]
I've been using the following compare function, which in short checks, in
order:

1) device number
2) inode number
3) file length
4) the beginning of the file
5) an md5 hash of the entire file
6) the entire file
[...]
 Why would #5 not be enough as an indicator that the files are identical?
 
Because it doesn't guarantee that the files are identical. It indicates, 
to a very high degree of probability (particularly when the file lengths 
are equal), that the two files are the same, but it doesn't guarantee it.

Technically there are an infinite number of inputs that can produce the
same md5 hash.

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006  www.python.org/pycon/

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Steven D'Aprano
On Fri, 23 Dec 2005 09:20:55 -0800, bonono wrote:

 
 Dan Stromberg wrote:

[snip]

 I've been using the following compare function, which in short checks, in
 order:

 1) device number
 2) inode number
 3) file length
 4) the beginning of the file
 5) an md5 hash of the entire file
 6) the entire file

 (If #1 and #2 are identical, then the file must be a hardlink to the other
 file.  Also, files that do not have the same length can never be
 identical.  And of course these items are generated on demand, making it
 frequently possible to avoid doing full-file-length compare on a lot of
 files)

 However, my program is still doing more #6 comparisons than seems
 strictly necessary when I could just toss all the filenames describing
 identical files into a list, and avoid re-comparing files with identical
 content over and over (I don't want to compare them to each other again
 and again), when there are a lot of identical files in the input list.

 Why would #5 not be enough as an indicator that the files are identical?

(1) Because the MD5 algorithm does include collisions. I was going to say
rare collisions, but there is actually an infinite number of them. The
collisions are just distributed thinly -- because MD5 is a good hash
algorithm, *very* thinly.

(Proof of this comes from the pigeon-hole principle: there is an infinite
number of possible files of arbitrary length, and only a finite number of
possible hashes. Therefore, an infinite number of files must hash to each
possible hash.)

(2) Having said that, unless the original poster is dealing with billions
(plus) of files, it is unlikely that he is finding any of the collisions
unless there is a bug in his sort routine. Since he claims to be doing
more comparisons-by-file-contents than expected (I would suggest *one*
is more than expected), I predict a bug in his code, his algorithm, or
both.


-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Steven D'Aprano
On Fri, 23 Dec 2005 19:26:11 +0100, Peter Otten wrote:

 Dan Stromberg wrote:
 
 I'm wanting to sort a large number of files, like a bunch of output files
 from a large series of rsh or ssh outputs on a large series of distinct
 machines, a music collection in .ogg format (strictly redistributable and
 legally purchased music), a collection of .iso cdrom images (strictly
 redistributable and legally purchased software), and so forth.
 
 Are you really trying to establish an order or do want to eliminate the
 duplicates?
 
  File("perfectly_legal.ogg") < File("free_of_charge.mp3")
 True
 
 doesn't make that much sense to me, regardless of what the comparison may
 actually do.

If I have understood the poster's algorithm correctly, it gets even
weirder:


Sorted list of files ->

[parrot.ogg, redhat.iso, george.log, fred.log, rhino.ogg, cat.ogg,
debian.iso, sys_restore.iso, adrian.log, fox.ogg, ...]

It seems to this little black duck that by sorting by file contents in
this way, the effect to the human reader is virtually to randomise the
list of file names.

Even if you limit yourself to (say) a set of ogg files, and sort by the
binary contents -

# album-track
[parrot-6.ogg, rhino-1.ogg, cat-12.ogg, fox-2.ogg, parrot-3.ogg, ...]

most people looking at the list would guess it had been shuffled, not
sorted. So I too don't know what the original poster hopes to accomplish
by sorting on the content of large binary files.



-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Paul Rubin
Dan Stromberg [EMAIL PROTECTED] writes:
 I've been using the following compare function, which in short checks, in
 order:
 
 1) device number
 2) inode number
 3) file length
 4) the beginning of the file
 5) an md5 hash of the entire file
 6) the entire file
 
 (If #1 and #2 are identical, then the file must be a hardlink to the other
 file.  Also, files that do not have the same length can never be
 identical.  And of course these items are generated on demand, making it
 frequently possible to avoid doing full-file-length compare on a lot of
 files)
 
 However, my program is still doing more #6 comparisons than seems

Just don't do any #6 comparisons.  If the hashes are identical, the
files are identical--that's the point of a cryptographic hash.  OK, to
be pedantic:

   1) there's a microscopic chance of an accidental collision, but
   it's much lower than the chance of a hardware failure making your
   comparison function give the wrong answer.

   2) there are known cryptanalytic attacks against md5 that can let
   someone deliberately construct two distinct files with the same
   hash, with a fair amount of effort.  If you care about this, use
   sha-1 instead, or better yet, sha-256.  (sha-1 has an attack
   similar to the md5 attack, but the amount of work required is not
   really practical today, unlike the md5 attack).
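
For reference, a whole-file digest with the stronger hash is only a few
lines (a sketch; hashlib is the module slated for Python 2.5 - on older
versions, the sha module gives you SHA-1 only):

    import hashlib

    def file_digest(path, chunksize=65536):
        # stream the file through the hash so huge files need little memory
        h = hashlib.sha256()
        f = open(path, 'rb')
        while True:
            block = f.read(chunksize)
            if not block:
                break
            h.update(block)
        f.close()
        return h.hexdigest()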
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Steven D'Aprano
On Fri, 23 Dec 2005 17:10:22 +, Dan Stromberg wrote:

 I probably should've been more specific.
 
 I'm wanting to sort a large number of files, like a bunch of output files
 from a large series of rsh or ssh outputs on a large series of distinct
 machines, a music collection in .ogg format (strictly redistributable and
 legally purchased music), a collection of .iso cdrom images (strictly
 redistributable and legally purchased software), and so forth.
 
 I'm not sorting the content of each file individually.
 
 I'm treating each file as a potentially very large string, and sorting
 the strings.

Which is a very strange thing to do, but I'll assume you have a good
reason for doing so.

 I've been using the following compare function, 

My first suggestion is that you invest a bit of time in building a
small-scale set of data that you can test your sorting on, if you haven't
already done so. On multi-megabyte files, taking the MD5 hash or comparing
the content byte by byte is quite slow. So I would spend a few minutes
populating a directory with a dozen or so fake iso and ogg files, only a
few bytes in length each, and run my tests on them.

 which in short checks, in order:
 
 1) device number
 2) inode number
 3) file length
 4) the beginning of the file
 5) an md5 hash of the entire file
 6) the entire file
 
 (If #1 and #2 are identical, then the file must be a hardlink to the
 other file.  Also, files that do not have the same length can never be
 identical.  And of course these items are generated on demand, making it
 frequently possible to avoid doing full-file-length compare on a lot of
 files)

My second suggestion is to encapsulate vigorously, at least for testing.
At the moment, you have a single __cmp__ function that does everything
in-line. I would do something like this:

def __cmp__(self, other):
result = compare_by_device_number_and_inode(self, other)
if result is None:
result = compare_by_file_length(self, other)
if result is None:
... 
return result

You can always unroll the function calls later, but for now it helps in
debugging: you only need to think about one type of comparison at a time,
and makes it easier to think about what each function needs to do.


 However, my program is still doing more #6 comparisons than seems
 strictly necessary when I could just toss all the filenames describing
 identical files into a list, and avoid re-comparing files with identical
 content over and over (I don't want to compare them to each other again
 and again), when there are a lot of identical files in the input list.

As others have pointed out, Python's sort never compares the same objects
more than once.

Some more comments follow interleaved with your code:


 def __cmp__(self, other):
 #   sys.stderr.write('Comparing %s and %s\n' % (self.name, other.name))

Are you sure you don't want to just sort by file name? That would be much
simpler *wink*

   if verbosity >= 1:
       sys.stderr.write('Comparing file_class objects %s and %s\n' %
                        (self.name, other.name))
   if self.device == -1:
       if verbosity:
           sys.stderr.write('Getting stat data for file %s\n' % self.name)
       result = os.stat(self.name)
       self.device = result[stat.ST_DEV]
       self.inode = result[stat.ST_INO]
       self.length = result[stat.ST_SIZE]

Since you need to compare stat data for all these objects, why not
simply pre-fetch them when you create the object? That way you can
simplify your __cmp__ function significantly.
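
E.g., a sketch of such a constructor, mirroring the attributes the posted
code uses:

    def __init__(self, name):
        self.name = name
        result = os.stat(name)             # one stat per file, up front
        self.device = result[stat.ST_DEV]
        self.inode = result[stat.ST_INO]
        self.length = result[stat.ST_SIZE]
        self.prefix = -1                   # still fetched lazily
        self.hash = ''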

[snip]

   if self.device == other.device and self.inode == other.inode:
   # same device, same inode, must be identical files return 0

At first I thought that my news client had mangled your function, but
going back to the original post, either *your* client has mangled it, or
I've found a bug in your code: "return 0" is commented out.

Is it possible that all the excess file comparisons are being done only
for hard links?

[snip]


   if verbosity:
       sys.stderr.write('Getting prefix for file %s\n' % self.name)
   file = open(self.name, 'r')
   self.prefix = file.read(self.max_prefix_length)
   file.close()

"prefix" is probably not the best name. Maybe "header" (as in "file
header")?

If the prefix/header is short enough, don't worry about fetching on
demand, at least not while you are still debugging. Just pre-fetch it, get
your code working, and then convert back to fetching the header on demand.


   if other.prefix == -1:
       if verbosity:
           sys.stderr.write('Getting prefix for file %s\n' % other.name)
       file = open(other.name, 'r')

Re: sorting with expensive compares?

2005-12-23 Thread Paul Rubin
Steven D'Aprano [EMAIL PROTECTED] writes:
 http://www.iusmentis.com/technology/encryption/pgp/pgpattackfaq/hash/
 
 the expected number of random unique files you would need to compare
 before finding a single collision in the MD5 hashes is (very roughly)
 10**70, or ten billion trillion trillion trillion trillion trillion.

That's not right, the right number is around 2**64 or 2e19.  See the
section of that page about the birthday attack.  It's still an awful lot.

There are also known ways of deliberately constructing md5 collisions
(i.e. md5 is broken).  Whether the OP should care about that depends
on the application.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Steven D'Aprano
On Fri, 23 Dec 2005 21:07:57 -0800, Paul Rubin wrote:

 Steven D'Aprano [EMAIL PROTECTED] writes:
 http://www.iusmentis.com/technology/encryption/pgp/pgpattackfaq/hash/
 
 the expected number of random unique files you would need to compare
 before finding a single collision in the MD5 hashes is (very roughly)
 10**70, or ten billion trillion trillion trillion trillion trillion.
 
 That's not right, the right number is around 2**64 or 2e19.  See the
 section of that page about the birthday attack.  It's still an awful lot.

Oh poot, a stupid typo! I entered 1e60 in my calculation instead of 1e6,
which is doubly embarrassing because the correct number to use is 1e9:

[quote]
In MD5's case, 2**64 messages need to be tried. This is not a feasible
attack given today's technology. If you could try 1,000,000 messages per
second, it would take 584,942 years to find a collision. A machine that
could try 1,000,000,000 messages per second would take 585 years, on
average.
[end quote]

So the expected number of files (messages) needed to check to find a
collision is:

585*365*24*60*60*1e9 = 1.8e19

or more than ten million million million, which of course equals 2**64.




 There are also known ways of deliberately constructing md5 collisions
 (i.e. md5 is broken).  Whether the OP should care about that depends
 on the application.

Sure, but I don't think he is deliberately trying to sabotage his own files :-)


-- 
Steven.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Tim Peters
[Steven D'Aprano]
 ...
 As others have pointed out, Python's sort never compares the same objects
 more than once.

Others have pointed it out, and it's getting repeated now, but it's
not true.  Since I wrote Python's sorting implementation, it's
conceivable that I'm not just making this up ;-)

Here's the shortest counter-example:  [10, 30, 20].

The first thing sort() does is compute the length of the longest
natural run (whether increasing or decreasing) starting at index 0. 
It compares 10 and 30, sees that the list starts with an increasing
run, then compares 30 and 20 to see whether this run can be extended
to length 3.  It can't.  Since the list has only 3 elements, it
decides to use a binary insertion sort to move the 3rd element into
place.  That requires 2 comparisons, the first of which _again_
compares 30 and 20.

This doesn't bother me, and it would cost more to avoid this than it
could possibly save in general.  It's true that the workhorse binary
insertion and merge sorts never compare the same elements more than
once, but there is some potential duplication between those and
comparisons done to identify natural runs.
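
You can watch this happen with a wrapper that logs every comparison (a
sketch, using old-style __cmp__ as in the Pythons of the day); sorting
[10, 30, 20] should show the 30-vs-20 pair compared twice:

    class Noisy:
        def __init__(self, v):
            self.v = v
        def __cmp__(self, other):
            print 'comparing', self.v, 'and', other.v
            return cmp(self.v, other.v)

    items = [Noisy(10), Noisy(30), Noisy(20)]
    items.sort()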
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Paul Rubin
Steven D'Aprano [EMAIL PROTECTED] writes:
  There are also known ways of deliberately constructing md5 collisions
  (i.e. md5 is broken).  Whether the OP should care about that depends
  on the application.
 
 Sure, but I don't think he is deliberately trying to sabotage his own files :-)

He might have downloaded a file created by a saboteur to have the same
md5 as some popular music file, but which contains a subliminal
hypnotic message which will brainwash him if played.  Using a stronger
hash, such as sha256, should protect him from this fate.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sorting with expensive compares?

2005-12-23 Thread Alex Martelli
Tim Peters [EMAIL PROTECTED] wrote:

 [Steven D'Aprano]
  ...
  As others have pointed out, Python's sort never compares the same objects
  more than once.
 
 Others have pointed it out, and it's getting repeated now, but it's
 not true.  Since I wrote Python's sorting implementation, it's
 conceivable that I'm not just making this up ;-)
   ...
 could possibly save in general.  It's true that the workhorse binary
 insertion and merge sorts never compare the same elements more than
 once, but there is some potential duplication between those and
 comparisons done to identify natural runs.

Since I probably was the first one to point out the erroneous factoid in
question, I apologize -- obviously, my memories of the inner workings of
sort were faulty, and didn't consider that potential duplication.

In this case, the tricks I already (though dubiously;-) suggested in
order to avoid any avoidable comparisons might pay for themselves and
then some, if comparisons are indeed extremely costly.


Alex
-- 
http://mail.python.org/mailman/listinfo/python-list