Re: Non-identical files with identical md5sums on Debian systems?

2013-08-07 Thread Fabian Greffrath
Dear Peter,

On Wednesday, 07.08.2013, at 00:03 +0100, peter green wrote:
> The bottom line is under practical conditions the only way you
> are going to see two files with the same md5 is if someone went
> out of their way to create them and send them to you.

thank you very much for this insightful analysis. Now I feel more
confident :)

Best regards,

- Fabian



-- 
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/1375858850.24465.17.camel@kff50



re: Non-identical files with identical md5sums on Debian systems?

2013-08-06 Thread peter green

> I do occasionally check for identical files on different systems by
> comparing their md5sums. So, just out of interest, could someone tell me
> (how to find out) how many non-identical files with identical md5sums
> there are on a typical (say, amd64) Debian system?

Assuming the output of md5 is random, uncorrelated 128-bit binary numbers,
and making a couple of other approximations, we can approximate the
number with the formula

(n*(n-1)/2)/(2^128)

where n is the number of unique files on your system.

I used the command "cat /var/lib/dpkg/info/*.list | wc -l" to get an
approximation of the number of Debian files on my main Debian
system with lots of stuff installed. I will assume all these files
are unique.

plugwash@debian:~$ cat /var/lib/dpkg/info/*.list | wc -l
304431

So the expected number of md5 collisions would be approximately

((304431*304430)/2)/(2^128)

Plugging that into octave gives us an answer of

octave:1> ((304431*304430)/2)/(2^128)
ans =  1.3618e-28

The bottom line is under practical conditions the only way you
are going to see two files with the same md5 is if someone went
out of their way to create them and send them to you.
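
The estimate above can also be reproduced outside Octave. The following is a minimal sketch, assuming (as in the text) that digests behave like uniformly random 128-bit numbers; the function name expected_collisions is made up for illustration.

```python
from fractions import Fraction


def expected_collisions(n, bits=128):
    """Expected number of colliding pairs among n distinct files.

    Assumes every digest is an independent, uniformly random value of
    `bits` bits, so each of the n*(n-1)/2 pairs collides with
    probability 1/2**bits.
    """
    pairs = n * (n - 1) // 2  # n*(n-1) is always even, so this is exact
    return Fraction(pairs, 2 ** bits)


# File count taken from the dpkg *.list output quoted above.
print(float(expected_collisions(304431)))
```

For 304431 files this gives a value of about 1.36e-28, in line with the Octave result.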



Archive: http://lists.debian.org/520180ac.90...@p10link.net



Re: Non-identical files with identical md5sums on Debian systems?

2013-08-05 Thread Michael Welle
Hello,

Russ Allbery <r...@debian.org> writes:

> Fabian Greffrath <fab...@greffrath.com> writes:
>
>> I do occasionally check for identical files on different systems by
>> comparing their md5sums. So, just out of interest, could someone tell me
>> (how to find out) how many non-identical files with identical md5sums
>> there are on a typical (say, amd64) Debian system?
>
> Unless you have a collection of MD5 collision attacks, or have installed a
> package that includes a sample MD5 collision, the chances are quite good
> that the answer is zero.  MD5 is no longer considered cryptographically
> strong, but that doesn't mean it's not a fairly random 128-bit hash.  You
> need a *lot* of files before even the birthday paradox will give you much
> likelihood of an MD5 collision that wasn't intentionally constructed.

Exactly. And why don't you run an experiment, Fabian? I guess you have a
typical Debian system at hand, and calculating the MD5 hashes of
all distribution files burns only a few IOPS and CPU cycles ;).

Regards
hmw

PS: Let us see the results ;)


Archive: http://lists.debian.org/8761vkljwm@luisa.c0t0d0s0.de



Re: Non-identical files with identical md5sums on Debian systems?

2013-08-05 Thread Chow Loong Jin
On Mon, Aug 05, 2013 at 06:44:49AM +0200, Fabian Greffrath wrote:
> Hi all,
>
> I do occasionally check for identical files on different systems by
> comparing their md5sums. So, just out of interest, could someone tell me
> (how to find out) how many non-identical files with identical md5sums
> there are on a typical (say, amd64) Debian system?

How about this?


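#!/bin/sh
# Collect every "md5sum  path" pair that dpkg knows about, de-duplicated.
cat /var/lib/dpkg/info/*.md5sums | sort -u > md5sums-files.txt
# List the MD5 sums that are recorded for more than one path.
awk '{print $1}' md5sums-files.txt | uniq -c | awk '$1 > 1 {print $2}' > dup.txt

while read -r md5; do
    # Paths in *.md5sums are relative to /, so put the leading slash back.
    grep "^$md5" md5sums-files.txt | sed -re 's|^[a-f0-9]+[[:space:]]+|/|' |
    (
        read -r file
        shasum1=$(sha256sum "$file" | awk '{print $1}')

        # Report any file whose SHA-256 differs from that of the first file.
        while read -r file; do
            if [ "$(sha256sum "$file" | awk '{print $1}')" != "$shasum1" ]; then
                echo "$md5 $file"
            fi
        done
    )
done < dup.txt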


I tried running it, didn't find anything on my Ubuntu installation.
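
The same check can be sketched in Python for an arbitrary list of files, not only those listed by dpkg. This is an illustrative sketch rather than a port of the script: the names file_digest and find_collisions are invented here, and it treats a differing SHA-256 as proof that two files differ.

```python
import hashlib
from collections import defaultdict


def file_digest(path, algorithm):
    """Return the hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_collisions(paths, weak="md5", strong="sha256"):
    """Return (weak_digest, paths) groups whose members share the weak
    digest but do not all share the strong digest."""
    by_weak = defaultdict(list)
    for path in paths:
        by_weak[file_digest(path, weak)].append(path)

    collisions = []
    for digest, group in by_weak.items():
        # Only groups with at least two different strong digests count.
        if len({file_digest(p, strong) for p in group}) > 1:
            collisions.append((digest, sorted(group)))
    return collisions
```

Calling find_collisions on a list of paths returns an empty list when every group of files with the same MD5 sum also agrees on SHA-256, which is the expected outcome on an ordinary system.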

-- 
Kind regards,
Loong Jin




Re: Non-identical files with identical md5sums on Debian systems?

2013-08-05 Thread Helmut Grohne
On Sun, Aug 04, 2013 at 10:24:59PM -0700, Vincent Cheng wrote:
> On Sun, Aug 4, 2013 at 9:44 PM, Fabian Greffrath <fab...@greffrath.com> wrote:
> > I do occasionally check for identical files on different systems by
> > comparing their md5sums. So, just out of interest, could someone tell me
> > (how to find out) how many non-identical files with identical md5sums
> > there are on a typical (say, amd64) Debian system?
>
> The closest thing to what you want may be dedup.debian.net, but I
> don't think it lets you filter out non-identical files.

Indeed this task can be solved with the software backing
dedup.debian.net. The general assumption is that sha512 is
collision-free. I can give a rough idea on how to do that:

1) Obtain the software.
2) Modify schema.sql to add md5 to the function table.
3) Modify importpkg.py to record md5 hashes.
4) Follow the steps in README to import a local Debian mirror.
   (This takes about 7 hours on a quick 8 core box and 3 days on a
   slower single core.)
5) Look for files that have the same md5 hash but a different sha512 hash.
   Something like this SQL query will give you an answer (untested):

   SELECT h1.cid, h2.cid
     FROM hash AS h1
     JOIN hash AS h2 ON h1.fid = h2.fid AND h1.hash = h2.hash
     JOIN hash AS h3 ON h1.cid = h3.cid
     JOIN hash AS h4 ON h2.cid = h4.cid AND h3.fid = h4.fid
     JOIN function AS f1 ON h1.fid = f1.id
     JOIN function AS f3 ON h3.fid = f3.id
    WHERE h3.hash != h4.hash
      AND f1.name = 'md5'
      AND f3.name = 'sha512';

   It gives keys into the content table to look up the actual filenames
   and packages.
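
To see what the query returns, it can be tried against a toy SQLite database. The schema below is a hypothetical simplification that contains only the columns the query mentions (hash.cid, hash.fid, hash.hash, function.id, function.name); it is not the real dedup.debian.net schema, and the digests are made-up strings.

```python
import sqlite3

QUERY = """
SELECT h1.cid, h2.cid
  FROM hash AS h1
  JOIN hash AS h2 ON h1.fid = h2.fid AND h1.hash = h2.hash
  JOIN hash AS h3 ON h1.cid = h3.cid
  JOIN hash AS h4 ON h2.cid = h4.cid AND h3.fid = h4.fid
  JOIN function AS f1 ON h1.fid = f1.id
  JOIN function AS f3 ON h3.fid = f3.id
 WHERE h3.hash != h4.hash
   AND f1.name = 'md5'
   AND f3.name = 'sha512'
"""

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema, for illustration only.
conn.executescript("""
CREATE TABLE function (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE hash (cid INTEGER, fid INTEGER, hash TEXT);
INSERT INTO function VALUES (1, 'md5'), (2, 'sha512');
-- Contents 1 and 2 share a fake md5 value but differ in sha512.
INSERT INTO hash VALUES (1, 1, 'aaaa'), (1, 2, 's1');
INSERT INTO hash VALUES (2, 1, 'aaaa'), (2, 2, 's2');
-- Content 3 is unrelated.
INSERT INTO hash VALUES (3, 1, 'bbbb'), (3, 2, 's3');
""")

rows = sorted(conn.execute(QUERY).fetchall())
print(rows)
```

Each colliding pair is reported in both orders, so adding a condition such as h1.cid < h2.cid would halve the output.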

In case you have any questions, just ask (mail or #-qa on oftc).

Helmut


Archive: http://lists.debian.org/20130805084636.ga10...@alf.mars



Re: Non-identical files with identical md5sums on Debian systems?

2013-08-05 Thread Adam Borowski
On Sun, Aug 04, 2013 at 10:21:09PM -0700, Russ Allbery wrote:
> Fabian Greffrath <fab...@greffrath.com> writes:
>
> > I do occasionally check for identical files on different systems by
> > comparing their md5sums. So, just out of interest, could someone tell me
> > (how to find out) how many non-identical files with identical md5sums
> > there are on a typical (say, amd64) Debian system?
>
> Unless you have a collection of MD5 collision attacks, or have installed a
> package that includes a sample MD5 collision, the chances are quite good
> that the answer is zero.  MD5 is no longer considered cryptographically
> strong, but that doesn't mean it's not a fairly random 128-bit hash.  You
> need a *lot* of files before even the birthday paradox will give you much
> likelihood of an MD5 collision that wasn't intentionally constructed.

Let's assume every hard drive produced so far in human history is combined
in a single RAID0 array, and formatted using a typical filesystem without
an inode limit, then filled with small files.  If my estimate is correct,
thanks to the birthday paradox there's around 0.001% chance there will be
at least one non-constructed MD5 collision.
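
The scale involved can be estimated by inverting the usual birthday approximation. This is a rough sketch that assumes MD5 output is uniformly random; the function name files_for_collision_probability is made up for illustration and the numbers are not a reconstruction of the estimate above.

```python
import math


def files_for_collision_probability(p, bits=128):
    """Approximate number of random digests needed so that at least one
    collision occurs with probability p.

    Uses the birthday approximation p = 1 - exp(-n^2 / (2 * 2^bits)),
    solved for n.
    """
    return math.sqrt(2.0 * 2.0 ** bits * -math.log1p(-p))


for p in (1e-5, 0.5):
    n = files_for_collision_probability(p)
    print(f"p = {p}: about {n:.2e} files")
```

Under these assumptions a 0.001% chance (p = 1e-5) needs roughly 8e16 files, and an even chance needs roughly 2e19.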

Also, there is no known preimage attack against MD5; collision attacks are
considerably less dangerous, as the attacker would first need to give you a
legitimate version of the file she wants to replace.

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ


Archive: http://lists.debian.org/20130805100834.ga2...@angband.pl



Re: Non-identical files with identical md5sums on Debian systems?

2013-08-05 Thread Ian Jackson
Russ Allbery writes (Re: Non-identical files with identical md5sums on Debian systems?):
> Unless you have a collection of MD5 collision attacks, or have installed a
> package that includes a sample MD5 collision, [...]

For the sake of sanity of our (still) MD5-based tools, I hope that
no-one uploads into our archive a package with an example MD5
collision.  (Unless the colliding files are wrapped up somehow, to
protect our infrastructure from any untoward behaviour.)

Ian.


Archive: http://lists.debian.org/20991.42365.739458.834...@chiark.greenend.org.uk



Re: Non-identical files with identical md5sums on Debian systems?

2013-08-05 Thread Chow Loong Jin
On Mon, Aug 05, 2013 at 02:15:41PM +0100, Ian Jackson wrote:
> Russ Allbery writes (Re: Non-identical files with identical md5sums on Debian systems?):
> > Unless you have a collection of MD5 collision attacks, or have installed a
> > package that includes a sample MD5 collision, [...]
>
> For the sake of sanity of our (still) MD5-based tools, I hope that
> no-one uploads into our archive a package with an example MD5
> collision.  (Unless the colliding files are wrapped up somehow, to
> protect our infrastructure from any untoward behaviour.)

What in our infrastructure would break on an MD5 collision anyway? The closest
thing I could think of is dedup.debian.net, but that appears to use SHA512.

-- 
Kind regards,
Loong Jin




Re: Non-identical files with identical md5sums on Debian systems?

2013-08-04 Thread Russ Allbery
Fabian Greffrath <fab...@greffrath.com> writes:

> I do occasionally check for identical files on different systems by
> comparing their md5sums. So, just out of interest, could someone tell me
> (how to find out) how many non-identical files with identical md5sums
> there are on a typical (say, amd64) Debian system?

Unless you have a collection of MD5 collision attacks, or have installed a
package that includes a sample MD5 collision, the chances are quite good
that the answer is zero.  MD5 is no longer considered cryptographically
strong, but that doesn't mean it's not a fairly random 128-bit hash.  You
need a *lot* of files before even the birthday paradox will give you much
likelihood of an MD5 collision that wasn't intentionally constructed.

-- 
Russ Allbery (r...@debian.org)   http://www.eyrie.org/~eagle/


Archive: http://lists.debian.org/87li4gogqi@windlord.stanford.edu



Re: Non-identical files with identical md5sums on Debian systems?

2013-08-04 Thread Vincent Cheng
On Sun, Aug 4, 2013 at 9:44 PM, Fabian Greffrath <fab...@greffrath.com> wrote:
> Hi all,
>
> I do occasionally check for identical files on different systems by
> comparing their md5sums. So, just out of interest, could someone tell me
> (how to find out) how many non-identical files with identical md5sums
> there are on a typical (say, amd64) Debian system?

The closest thing to what you want may be dedup.debian.net, but I
don't think it lets you filter out non-identical files.

Regards,
Vincent


Archive: http://lists.debian.org/caczd_tcqeftp3si47fzhgtfejf0zwz-ys6_kaaee2jvwnse...@mail.gmail.com