Re: A sets algorithm

2016-02-08 Thread Chris Angelico
On Tue, Feb 9, 2016 at 3:13 PM, Steven D'Aprano
 wrote:
> On Tuesday 09 February 2016 02:11, Chris Angelico wrote:
>
>> That's fine for comparing one file against one other. He started out
>> by saying he already had a way to compare files for equality. What he
>> wants is a way to capitalize on that to find all the identical files
>> in a group. A naive approach would simply compare every file against
>> every other, for O(N*N) comparisons - but a hash lookup can make that
>> O(N) on the files themselves, plus (I think) an O(N log N) hash
>> comparison job, which has much lower constant factors.
>
> You're mixing up N's there :-) In the first two (compare every file against
> every other file, and hash lookup), N stands for the number of files. But in
> the hash comparison, well, I'm not sure what you mean by that, unless you
> mean the calculation of the hash itself, which will be O(M) where M is the
> size of the file.
>
> Unfortunately, even using a hash, you still have to do O(N**2) work, or
> rather, O(N*M) where N is the number of files and M is their size.
>

My intention was that every N was "number of files being compared",
but it's possible I made a mistake elsewhere. Assuming the hashes
become integers too large for a bucket sort (which they certainly
will, unless you want inordinate numbers of hash collisions), the code
would look something like this:

from collections import defaultdict

hash_to_filename = defaultdict(list)
for fn in files:
    # Step 1: Hash every file.
    hash = calculate_hash(fn)

    # Step 2: Locate all pairs of files with identical hashes
    hash_to_filename[hash].append(fn)


This loop executes once for each file, so calculate_hash() is called
once for each file. The cost of that is O(N), or possibly O(N*M). If
you're fairly sure the files are going to differ in size, you could
use the file size *as* the hash - it's cheap to calculate (no M
component whatsoever - just a stat call, and even that could be
optimized away in some cases, eg if you're scanning a whole
directory), but isn't cryptographically secure and ignores file
content altogether.
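
For what it's worth, a rough sketch of that size-as-hash idea for the
directory-scanning case (group_by_size is just a name I made up;
os.scandir can often hand you the stat result as a side effect of listing
the directory):

import os
from collections import defaultdict

def group_by_size(directory):
    # The file size acts as the cheap "hash": equal files always have equal sizes.
    size_to_filenames = defaultdict(list)
    for entry in os.scandir(directory):
        if entry.is_file():
            # entry.stat() may reuse information the directory listing already fetched.
            size_to_filenames[entry.stat().st_size].append(entry.path)
    return size_to_filenames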

The second step involves one dictionary lookup or insertion for each
file. That's an O(log N) operation, where N is the number of elements
in the dictionary. So yes, it's not quite the same as the number of
files (if every file exists exactly twice, this N will be half the
other N), but it's broadly going to correspond. And an O(log N)
operation performed N times has an overall cost of O(N log N).

I might have something wrong here, but definitely I meant to have the
Ns all mean the same thing, like X on a Magic: The Gathering card. :)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: A sets algorithm

2016-02-08 Thread Steven D'Aprano
On Tuesday 09 February 2016 02:11, Chris Angelico wrote:

> That's fine for comparing one file against one other. He started out
> by saying he already had a way to compare files for equality. What he
> wants is a way to capitalize on that to find all the identical files
> in a group. A naive approach would simply compare every file against
> every other, for O(N*N) comparisons - but a hash lookup can make that
> O(N) on the files themselves, plus (I think) an O(N log N) hash
> comparison job, which has much lower constant factors.

You're mixing up N's there :-) In the first two (compare every file against 
every other file, and hash lookup), N stands for the number of files. But in 
the hash comparison, well, I'm not sure what you mean by that, unless you 
mean the calculation of the hash itself, which will be O(M) where M is the 
size of the file.

Unfortunately, even using a hash, you still have to do O(N**2) work, or 
rather, O(N*M) where N is the number of files and M is their size.


Assuming that duplicate files are relatively rare, the best way to do this 
is to collate files with the same size. Looking up the file size ought to be 
pretty fast. It is an IO operation, but it is approximately constant time in 
the sense that it doesn't actually depend on the size of the file.

(If you're using a programming language or operating system where the only 
way to find out a file's size is to open it and count the bytes, this 
obviously won't work for you.)


So let's collect files into a separate bucket for each unique file size:

bysize = {}
for file in files:
    bysize.setdefault(file.getsize(), []).append(file)



In practice, you may find that nearly all files have different sizes, and 
most of the buckets have only a single file in them. Those files you can just 
ignore, as they must be unique.

So instead of having to compare N files against each of the others, you are 
left with a much smaller number of buckets, each with (hopefully) only two 
or three files of the same size. You might have gone from N=100 to 
thirty buckets with five files each.
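
(Continuing the sketch: pruning the singleton buckets out of bysize afterwards
is a one-liner, something like

candidates = {size: group for size, group in bysize.items() if len(group) > 1}

which leaves only the buckets that could possibly contain duplicates.)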

And most of those files will probably differ on the very first byte, or at 
least within the first 16K of bytes, so it's hardly worth hashing them 
(especially if the hash algorithm is an expensive cryptographic hash, or if 
the file is large).

I would delay the hash comparison as long as possible, something like this:


def __hash__(self):
    if self._hash is None:
        self._hash = calculate_hash()  # expensive, so we cache it
    return self._hash


def __eq__(self, other):
    size = self.getsize()
    if size != other.getsize():
        return False
    if self._hash and other._hash:
        # Both hashes are already done, so might as well use them.
        return (
            hash(self) == hash(other)
            # in case of collisions...
            and self.read() == other.read()
            )
    if size < MAXSIZE:
        return self.read(size) == other.read(size)
    else:
        if self.read(MAXSIZE) != other.read(MAXSIZE):
            return False
        else:
            return (
                hash(self) == hash(other)
                # in case of collisions...
                and self.read() == other.read()
                )


If your hash is strong enough that collisions are vanishingly rare, you 
could ignore the `and self.read() == other.read()` check.





-- 
Steve

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: A sets algorithm

2016-02-08 Thread Gregory Ewing

Chris Angelico wrote:

hash_to_filename = defaultdict(list)
for fn in files:
    # Step 1: Hash every file.
    hash = calculate_hash(fn)

    # Step 2: Locate all pairs of files with identical hashes
    hash_to_filename[hash].append(fn)


I think you can avoid hashing the files altogether.

First divide the files into groups of the same size. Then for
each group, open all the files at once, read them in parallel
and compare them with each other.

As soon as you find a difference, split the group into
smaller groups. When a group gets down to just one file, you
can stop reading that file.

Assuming that most of the differing files differ near the
beginning, this strategy means that you will hardly ever
have to read a whole file. Hashing the files beforehand,
in contrast, requires reading all of every file every
time.
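
Something like this rough sketch captures the idea (not tested; it assumes
each group is a list of same-size filenames, and that you can afford to keep
a whole group's worth of file handles open at once):

def refine_groups(groups, chunk_size=8192):
    duplicate_sets = []
    # Each work item is a list of open files that are still indistinguishable.
    work = [[open(name, 'rb') for name in group] for group in groups]
    while work:
        group = work.pop()
        by_chunk = {}
        for f in group:
            # Bucket the files by the content of their next chunk.
            by_chunk.setdefault(f.read(chunk_size), []).append(f)
        for chunk, subgroup in by_chunk.items():
            if len(subgroup) == 1:
                subgroup[0].close()       # unique so far: stop reading it
            elif chunk == b'':
                # Same size and every chunk matched, so these are identical.
                duplicate_sets.append([f.name for f in subgroup])
                for f in subgroup:
                    f.close()
            else:
                work.append(subgroup)     # still equal so far: keep reading
    return duplicate_sets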

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: A sets algorithm

2016-02-08 Thread Random832
On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote:
>   a) Use Chris Angelico's suggestion and hash each of the files (use the 
> standard library's 'hashlib' for this).  Identical files will always have 
> identical hashes, but there may be false positives, so you'll need to verify 
> that files that have identical hashes are indeed identical.
>   b) If your files tend to have sections that are very different (e.g., 
> the first 32 bytes tend to be different), then you pretend that section of 
> the file is its hash.  You can then do the same trick as above. (the 
> advantage of this is that you will read in a lot less data than if you have 
> to hash the entire file).
>   c) You may be able to do something clever by reading portions of each 
> file.  That is, use zip() combined with read(1024) to read each of the files 
> in sections, while keeping hashes of the files.  Or, maybe you'll be able to 
> read portions of them and sort the list as you're reading.  In either case, 
> if any files are NOT identical, then you'll be able to stop work as soon as 
> you figure this out, rather than having to read the entire file at once.
> 
> The main purpose of these suggestions is to reduce the amount of reading
> you're doing.

Hashing a file using a conventional hashing algorithm requires reading
the whole file. Unless the files are very likely to be identical _until_
near the end, you're better off just reading the first N bytes of both
files, then the next N bytes, etc, until you find somewhere they're
different. The filecmp module may be useful for this.
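
For instance (filenames made up; shallow=False forces a byte-by-byte content
comparison rather than just comparing os.stat() signatures, and it stops at
the first difference):

import filecmp

if filecmp.cmp('left.dat', 'right.dat', shallow=False):
    print('identical')
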
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: A sets algorithm

2016-02-08 Thread Chris Angelico
On Tue, Feb 9, 2016 at 1:49 AM, Random832  wrote:
> On Sun, Feb 7, 2016, at 20:07, Cem Karan wrote:
>>   a) Use Chris Angelico's suggestion and hash each of the files (use the 
>> standard library's 'hashlib' for this).  Identical files will always have 
>> identical hashes, but there may be false positives, so you'll need to verify 
>> that files that have identical hashes are indeed identical.
>>   b) If your files tend to have sections that are very different (e.g., 
>> the first 32 bytes tend to be different), then you pretend that section of 
>> the file is its hash.  You can then do the same trick as above. (the 
>> advantage of this is that you will read in a lot less data than if you have 
>> to hash the entire file).
>>   c) You may be able to do something clever by reading portions of each 
>> file.  That is, use zip() combined with read(1024) to read each of the files 
>> in sections, while keeping hashes of the files.  Or, maybe you'll be able to 
>> read portions of them and sort the list as you're reading.  In either case, 
>> if any files are NOT identical, then you'll be able to stop work as soon as 
>> you figure this out, rather than having to read the entire file at once.
>>
>> The main purpose of these suggestions is to reduce the amount of reading
>> you're doing.
>
> hashing a file using a conventional hashing algorithm requires reading
> the whole file. Unless the files are very likely to be identical _until_
> near the end, you're better off just reading the first N bytes of both
> files, then the next N bytes, etc, until you find somewhere they're
> different. The filecmp module may be useful for this.

That's fine for comparing one file against one other. He started out
by saying he already had a way to compare files for equality. What he
wants is a way to capitalize on that to find all the identical files
in a group. A naive approach would simply compare every file against
every other, for O(N*N) comparisons - but a hash lookup can make that
O(N) on the files themselves, plus (I think) an O(N log N) hash
comparison job, which has much lower constant factors. The key here is
the hashing algorithm though.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: A sets algorithm

2016-02-07 Thread Chris Angelico
On Mon, Feb 8, 2016 at 8:46 AM, Paulo da Silva
 wrote:
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
>
> Thanks for any suggestions.

Hash them in some way.

This has two costs:
1) You need to figure out some hashing algorithm such that any two
equal files have the same hash, and ideally, that unequal files will
generally have unequal hashes.
2) Hash each file once, and then do your comparisons on the hashes,
finally testing for actual equality only on those with the same hash.
(The last step takes care of hash collisions - where different files
happen to have the same hash - so ideally, you shouldn't have to do
this often.)
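
As a rough sketch of that two-step process (my_files, my_hash() and the
MyFile equality operator here are stand-ins for whatever you already have):

from collections import defaultdict

by_hash = defaultdict(list)
for f in my_files:
    by_hash[my_hash(f)].append(f)

duplicate_sets = []
for bucket in by_hash.values():
    # Only files sharing a hash ever get the (possibly expensive) real comparison.
    while len(bucket) > 1:
        first, rest = bucket[0], bucket[1:]
        same = [f for f in rest if f == first]
        if same:
            duplicate_sets.append([first] + same)
        bucket = [f for f in rest if f != first]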

If your definition of "equal" among MyFiles is simply based on the
file content, it's easy - just hash the content. But if there are ways
for files to be considered equal without being bit-for-bit identical,
you'll have to reflect that in the hash. For instance, if you consider
files equal if they differ only in whitespace, then you'd need to
convert all whitespace to a single space before hashing; or if you
don't care about the order of lines in the file, you could either hash
the lines separately and sum/xor the hashes, or sort the lines before
hashing. But the rest of the job is pretty straightforward.
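
Purely as an illustration of that normalisation step (the function name and
flags are invented, and sha1 is just an arbitrary choice of digest):

import hashlib

def content_hash(path, ignore_whitespace=False, ignore_line_order=False):
    with open(path, 'rb') as f:
        data = f.read()
    if ignore_whitespace:
        data = b' '.join(data.split())                 # collapse runs of whitespace
    if ignore_line_order:
        data = b'\n'.join(sorted(data.splitlines()))   # order-insensitive
    return hashlib.sha1(data).hexdigest()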

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: A sets algorithm

2016-02-07 Thread Oscar Benjamin
On 7 Feb 2016 21:51, "Paulo da Silva" 
wrote:
>
> Hello!
>
> This may not be a strict python question, but ...
>
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
>
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?

If you can make the MyFiles hashable then this is easily done with
defaultdict(list).
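
Roughly (my_files here is just a placeholder for your collection of hashable
MyFile objects):

from collections import defaultdict

groups = defaultdict(list)
for f in my_files:
    groups[f].append(f)    # the MyFile itself is the key, so equal files share a bucket

duplicate_sets = [g for g in groups.values() if len(g) > 1]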

--
Oscar
-- 
https://mail.python.org/mailman/listinfo/python-list


A sets algorithm

2016-02-07 Thread Paulo da Silva
Hello!

This may not be a strict python question, but ...

Suppose I have already a class MyFile that has an efficient method (or
operator) to compare two MyFile s for equality.

What is the most efficient way to obtain all sets of equal files (of
course each set must have more than one file - all single files are
discarded)?

Thanks for any suggestions.
Paulo
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: A sets algorithm

2016-02-07 Thread Paulo da Silva
At 22:17 on 07-02-2016, Tim Chase wrote:
> On 2016-02-07 21:46, Paulo da Silva wrote:
...

> 
> If your MyFile objects can be unique but compare for equality
> (e.g. two files on the file-system that have the same SHA1 hash, but
> you want to know the file-names), you'd have to do a paired search
> which would have worse performance and would need to iterate over the
> data multiple times:
> 
>   all_files = list(generate_MyFile_objects())
>   interesting = [
> (my_file1, my_file2)
> for i, my_file1
> in enumerate(all_files, 1)
> for my_file2
> in all_files[i:]
> if my_file1 == my_file2
> ]
> 
"my_file1 == my_file2" can be implemented into MyFile class taking
advantage of caching sizes (if different files are different), hashes or
even content (for small files) or file headers (first n bytes).
However this seems to have a problem:
all_files: a b c d e ...
If a==b then comparing b with c,d,e is useless.

May be using several steps with dict - sizes, then hashes for same sizes
files, etc ...

Another solution I thought of, could be defining some methods (I still
don't know which ones) in MyFile so that I could use sets intersection.
Would this one be a faster solution?

Thanks

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: A sets algorithm

2016-02-07 Thread Tim Chase
On 2016-02-07 21:46, Paulo da Silva wrote:
> Suppose I have already a class MyFile that has an efficient method
> (or operator) to compare two MyFile s for equality.
> 
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?

If your MyFile grows a __hash__ method that is based off the
uniqueness-qualities that you're comparing, you can use
collections.Counter and filter the results:

  from collections import Counter
  interesting = [
      my_file
      for my_file, count
      in Counter(generate_MyFile_objects()).iteritems()
      if count > 1
  ]

Which should be roughly O(n).  It also has the advantage of iterating
over your generator once.

If your MyFile objects can be unique but compare for equality
(e.g. two files on the file-system that have the same SHA1 hash, but
you want to know the file-names), you'd have to do a paired search
which would have worse performance and would need to iterate over the
data multiple times:

  all_files = list(generate_MyFile_objects())
  interesting = [
      (my_file1, my_file2)
      for i, my_file1
      in enumerate(all_files, 1)
      for my_file2
      in all_files[i:]
      if my_file1 == my_file2
  ]

-tkc



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: A sets algorithm

2016-02-07 Thread Cem Karan

On Feb 7, 2016, at 4:46 PM, Paulo da Silva  
wrote:

> Hello!
> 
> This may not be a strict python question, but ...
> 
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
> 
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
> 
> Thanks for any suggestions.

If you're after strict equality (every byte in a pair of files is identical), 
then here are a few heuristics that may help you:

1) Test for file length, without reading in the whole file.  You can use 
os.path.getsize() to do this (I hope that this is a constant-time operation, 
but I haven't tested it).  As Oscar Benjamin suggested, you can create a 
defaultdict(list) which will make it possible to gather lists of files of equal 
size.  This should help you gather your potentially identical files quickly.

2) Once you have your dictionary from above, you can iterate its values, each 
of which will be a list.  If a list has only one file in it, you know it's 
unique, and you don't have to do any more work on it.  If there are two or more 
files in the list, then you have several different options:
a) Use Chris Angelico's suggestion and hash each of the files (use the 
standard library's 'hashlib' for this).  Identical files will always have 
identical hashes, but there may be false positives, so you'll need to verify 
that files that have identical hashes are indeed identical.
b) If your files tend to have sections that are very different (e.g., 
the first 32 bytes tend to be different), then you can treat that section of the 
file as its hash.  You can then do the same trick as above (the advantage of 
this is that you will read in a lot less data than if you have to hash the 
entire file).
c) You may be able to do something clever by reading portions of each 
file.  That is, use zip() combined with read(1024) to read each of the files in 
sections, while keeping hashes of the files.  Or, maybe you'll be able to read 
portions of them and sort the list as you're reading.  In either case, if any 
files are NOT identical, then you'll be able to stop work as soon as you figure 
this out, rather than having to read the entire file at once.

The main purpose of these suggestions is to reduce the amount of reading you're 
doing.  Storage tends to be slow, and any tricks that reduce the number of 
bytes you need to read in will be helpful to you.
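
As a small sketch of suggestion (a), hashing in fixed-size chunks so that even
huge files never have to fit in memory (file_digest is just an illustrative
name):

import hashlib

def file_digest(path, chunk_size=64 * 1024):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        # iter() with a b'' sentinel keeps reading chunks until end of file.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.digest()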

Good luck!
Cem Karan
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: A sets algorithm

2016-02-07 Thread Tim Chase
On 2016-02-08 00:05, Paulo da Silva wrote:
> At 22:17 on 07-02-2016, Tim Chase wrote:
>>   all_files = list(generate_MyFile_objects())
>>   interesting = [
>>       (my_file1, my_file2)
>>       for i, my_file1
>>       in enumerate(all_files, 1)
>>       for my_file2
>>       in all_files[i:]
>>       if my_file1 == my_file2
>>   ]
> 
> "my_file1 == my_file2" can be implemented into MyFile class taking
> advantage of caching sizes (if different files are different),
> hashes or even content (for small files) or file headers (first n
> bytes). However this seems to have a problem:
> all_files: a b c d e ...
> If a==b then comparing b with c,d,e is useless.

Depends on what the OP wants to have happen if more than one input
file is equal. I.e., a == b == c.  Does one just want "a has
duplicates" (and optionally "and here's one of them"), or does one
want "a == b", "a == c" and "b == c" in the output?

> Another solution I thought of, could be defining some methods (I
> still don't know which ones) in MyFile so that I could use sets
> intersection. Would this one be a faster solution?

Adding __hash__ would allow for the set operations, but would
require (as ChrisA points out) knowing how to create a hash function
that encompasses the information you want to compare.
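
The one firm rule is that __hash__ may only use information that __eq__ also
uses, so that equal files are guaranteed to hash alike. A minimal, purely
hypothetical shape, leaning on the cheapest such property:

import os

class MyFile:
    def __init__(self, path):
        self.path = path

    def getsize(self):
        return os.path.getsize(self.path)

    def __eq__(self, other):
        # Stand-in for the OP's real, more thorough comparison,
        # which starts by comparing sizes anyway.
        return self.getsize() == other.getsize()

    def __hash__(self):
        # Uses only what __eq__ uses, so equal files always share a bucket.
        return hash(self.getsize())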

-tkc


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: A sets algorithm

2016-02-07 Thread Paulo da Silva
At 21:46 on 07-02-2016, Paulo da Silva wrote:
> Hello!
> 
> This may not be a strict python question, but ...
> 
> Suppose I have already a class MyFile that has an efficient method (or
> operator) to compare two MyFile s for equality.
> 
> What is the most efficient way to obtain all sets of equal files (of
> course each set must have more than one file - all single files are
> discarded)?
> 

After reading all the suggestions I decided to try the defaultdict(list)
approach first, as suggested by Oscar, in several steps: first with sizes,
then with partial contents and/or "strong" hashes, as suggested by Cem.

Thank you very much to all who responded for all helpful suggestions.
If I find something better I'll report here.
Paulo

-- 
https://mail.python.org/mailman/listinfo/python-list