Re: SHA question

2010-01-16 Thread David Precious

Andy Wardley wrote:

On 14/01/2010 17:41, Philip Newton wrote:

Yes - you're missing the fact that in order to compute the differences
(which it has to if it doesn't want to transfer the whole file), it
has to read the entire file over the slow NFS link into your
computer's memory in order to compare it with the local file in
order to tell which pieces have changed.


No, I don't think it does.

My understanding[*] is that it computes a checksum for each block of a file
and only transmits blocks that have different checksums.


Of course, but to compute a checksum for each block of the file, that 
block first needs to be read, over the NFS connection, which is the 
whole issue.


Normally, rsync would be speaking to rsync running on the remote box, 
but the situation David described was one rsync process on box A, 
accessing files on box B via an NFS mount (as opposed to speaking to an 
rsync daemon on box B).


I'm not entirely sure, but I think that rsync will first compare the 
timestamps of the two files, and if the timestamps match (to within 
the window specified with --modify-window, defaulting to an exact 
match), and the sizes match, it will consider the file to be the same, 
and skip generating checksums (so the file's data won't be read over NFS).
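
Roughly, that quick-check logic looks like this in Perl (the paths and the
two-second window below are purely illustrative, not rsync's actual code):

#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the "quick check": if size and mtime match (within a
# window), assume the files are identical and skip the checksums.
my ($local, $remote, $window) = ('/data/file', '/mnt/nfs/file', 2);

my @l = stat $local  or die "stat $local: $!";
my @r = stat $remote or die "stat $remote: $!";

if ($l[7] == $r[7] && abs($l[9] - $r[9]) <= $window) {
    print "quick check passed: no file data read at all\n";
} else {
    print "quick check failed: fall back to checksums or a transfer\n";
}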


Re: SHA question

2010-01-15 Thread Andy Wardley

On 14/01/2010 17:41, Philip Newton wrote:

Yes - you're missing the fact that in order to compute the differences
(which it has to if it doesn't want to transfer the whole file), it
has to read the entire file over the slow NFS link into your
computer's memory in order to compare it with the local file in
order to tell which pieces have changed.


No, I don't think it does.

My understanding[*] is that it computes a checksum for each block of a file
and only transmits blocks that have different checksums.  That's how it
handles incremental changes on large files (e.g. an extra few lines at
the end of a log file doesn't require the whole file to be transmitted).

Some relevant options are:

   --checksum     always checksum
   --block-size   checksum block size
   --whole-file   transmit the whole file
   --size-only    compare file size instead of checksum
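
A toy illustration of the per-block idea in Perl (only a sketch: the block
size and the use of MD5 here are arbitrary, and the real rsync uses a
rolling weak checksum plus a stronger hash rather than this):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Checksum each fixed-size block of two files and report which differ.
my ($file_a, $file_b) = @ARGV;
my $blocksize = 4096;

sub block_sums {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my @sums;
    while (read($fh, my $block, $blocksize)) {
        push @sums, md5_hex($block);
    }
    return \@sums;
}

my ($sums_a, $sums_b) = (block_sums($file_a), block_sums($file_b));
my $blocks = @$sums_a > @$sums_b ? @$sums_a : @$sums_b;
for my $i (0 .. $blocks - 1) {
    no warnings 'uninitialized';
    print "block $i differs\n" if $sums_a->[$i] ne $sums_b->[$i];
}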

A

[*] which could be flawed



Re: SHA question

2010-01-15 Thread Roger Burton West
On Fri, Jan 15, 2010 at 08:16:09PM +, Andy Wardley wrote:

 My understanding[*] is that it computes a checksum for each block of a file
 and only transmits blocks that have different checksums.

And to calculate the checksum on each block of the file, it has to, um,
read each block of the file... yes?

R


Re: SHA question

2010-01-15 Thread ian

On 15/01/2010 20:23, Roger Burton West wrote:

On Fri, Jan 15, 2010 at 08:16:09PM +, Andy Wardley wrote:


My understanding[*] is that it computes a checksum for each block of a file
and only transmits blocks that have different checksums.


And to calculate the checksum on each block of the file, it has to, um,
read each block of the file... yes?

Doesn't rsync *push* rather than *pull*, in which case the files it 
computes the checksum on are all local?


I did not think it worked in the way you mention without an rsync daemon 
running at the remote end doing the checksum for you.


Re: SHA question

2010-01-15 Thread Ask Bjørn Hansen

On Jan 15, 2010, at 14:19, ian wrote:

 My understanding[*] is that it computes a checksum for each block of a file
 and only transmits blocks that have different checksums.
 
 And to calculate the checksum on each block of the file, it has to, um,
 read each block of the file... yes?
 
 Doesn't rsync *push* rather than *pull* in which case the files it computes 
 the checksum on are all local.
 
 I did not think it worked in the way you mention without rsync daemon running 
 at the remote end doing the checksum for you.

But with NFS the remote is local.  You need an rsync box running where the 
storage is to get cheaper checksums.


  - ask


Re: SHA question

2010-01-15 Thread Andy Wardley

On 15/01/2010 20:23, Roger Burton West wrote:

And to calculate the checksum on each block of the file, it has to, um,
read each block of the file... yes?


Sorry, I missed this bit in Philip's message:

 "if both source and destination are on a local file system"

I was thinking about remote comparisons, in which case the remote rsync
daemon computes the checksum.  Yes, it has to read the entire file, but not
transmit it.

A


Re: SHA question

2010-01-14 Thread Peter Corlett
On 13 Jan 2010, at 17:53, David Cantrell wrote:
[...]
 Other hashing algorithms exist and are faster but more prone to
 inadvertant collisions.  If you've got a lot of data to compare, I'd
 use one of them (eg one of the variations on a CRC) and then only
 bring out the big SHA guns when that finds a collision.  

That's a premature optimisation which just complicates the code, unless you 
mean *a lot* such as in the rdiff algorithm.

For de-duping purposes, SHA is still faster than you can pull the files off the 
disk and a secondary cheaper hash is unnecessary.





Re: SHA question

2010-01-14 Thread Philip Newton
On Thu, Jan 14, 2010 at 13:22, Peter Corlett ab...@cabal.org.uk wrote:
 For de-duping purposes, SHA is still faster than you can pull the files off 
 the disk and a secondary cheaper hash is unnecessary.

That reminds me of how I was disappointed to find that rsync generally
transfers complete files (rather than diffs) if both source and
destination are on a local file system -- before I realised that to
compute the diffs, it would have to read the entire first and second
files, and if it's going to read the entire first file from disk
anyway, it can simply dump it over the second file without checking.
Computing diffs would be more work in this case, not less.

So yes, I suppose something similar applies here -- you have to read
the entire file anyway, so you might as well go with
SHA-$number_of_your_choice.

Cheers,
Philip
-- 
Philip Newton philip.new...@gmail.com


Re: SHA question

2010-01-14 Thread David Cantrell
On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote:

 That reminds me of how I was disappointed to find that rsync generally
 transfers complete files (rather than diffs) if both source and
 destination are on a local file system -- before I realised that to
 compute the diffs, it would have to read the entire first and second
 files, and if it's going to read the entire first file from disk
 anyway, it can simply dump it over the second file without checking.
 Computing diffs would be more work in this case, not less.

Shame that "local" includes "at the other end of a really slow NFS
connection to the other side of the world". Mind you, absent running the
rsync daemon at the other end and using that instead of NFS, I'm not
sure if there's a better way of doing it.

-- 
David Cantrell | Reality Engineer, Ministry of Information


Re: SHA question

2010-01-14 Thread Roger Burton West
On Thu, Jan 14, 2010 at 01:59:22PM +, David Cantrell wrote:

Shame that local includes at the other end of a really slow NFS
connection to the other side of the world. Mind you, absent running the
rsync daemon at the other end and using that instead of NFS, I'm not
sure if there's a better way of doing it.

Possibly I'm missing something, but: ssh?

R


Re: SHA question

2010-01-14 Thread Mark Fowler
On Wed, Jan 13, 2010 at 3:16 PM, Philip Newton philip.new...@gmail.com wrote:

 Along those lines, you may wish to store the filesize in bytes in your
 database as well, as a first point of comparison; if the filesize is
 unique, then the file must also be unique and you could save yourself
 the time spent calculating a digest of the file's contents -- no
 1058-byte file can be the same as any 1927-byte file.

This is only possible if you've still got all the pdfs on disk, as, as
soon as you get your suspected duplicate, you'll have to hash both
files' contents to tell whether you really have one or not.  If you've sent
them on to a better place and deleted them, however, then you're out of luck.

I'd just use Digest::MD5 to calculate the filesize.  It's cheap
compared to SHA, you don't care about the exact cryptographic security
of the hash, and it will work even if you don't have the original to
compare against.

#!/usr/bin/perl

use Modern::Perl;
use autodie;
use Digest::MD5;

my $filename = shift;
open my $fh, '<:bytes', $filename;
my $md5 = Digest::MD5->new;
$md5->addfile($fh);
say "The file's md5 is: " . $md5->b64digest;

Don't forget the :bytes (you're comparing bytes, not characters).

Once you've got a toy version up and running and you can get a feel
for how fast it is on your system, you can optimise if you don't like
the performance.

Mark.


Re: SHA question

2010-01-14 Thread David Cantrell
On Thu, Jan 14, 2010 at 02:03:33PM +, Roger Burton West wrote:
 On Thu, Jan 14, 2010 at 01:59:22PM +, David Cantrell wrote:
 Shame that local includes at the other end of a really slow NFS
 connection to the other side of the world. Mind you, absent running the
 rsync daemon at the other end and using that instead of NFS, I'm not
 sure if there's a better way of doing it.
 Possibly I'm missing something, but: ssh?

That boils down to the same thing - it ends up invoking rsync at the
other end in daemon mode and talking the rsync protocol tunnelled
through ssh.

What I was getting at was that I don't see a better way of working if
you have to use a networky filesystem.

-- 
David Cantrell | Reality Engineer, Ministry of Information

  Longum iter est per praecepta, breve et efficax per exempla.


Re: SHA question

2010-01-14 Thread Matthew Boyle

David Cantrell wrote:

On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote:


That reminds me of how I was disappointed to find that rsync generally
transfers complete files (rather than diffs) if both source and
destination are on a local file system -- before I realised that to
compute the diffs, it would have to read the entire first and second
files, and if it's going to read the entire first file from disk
anyway, it can simply dump it over the second file without checking.
Computing diffs would be more work in this case, not less.


Shame that local includes at the other end of a really slow NFS
connection to the other side of the world. Mind you, absent running the
rsync daemon at the other end and using that instead of NFS, I'm not
sure if there's a better way of doing it.


the --no-whole-file option?  or am i missing something?

--matt


--
Matthew Boyle, Systems Administrator, CoreFiling Limited
Telephone: +44-1865-203192  Website: http://www.corefiling.com


Re: SHA question

2010-01-14 Thread Peter Corlett
On 14 Jan 2010, at 14:16, Mark Fowler wrote:
[...]
 I'd just use Digest::MD5 to calculate the filesize.  It's cheap
 compared to SHA, you don't care about the exact cryptographic security
 of the hash, and will work even if you don't have the original to
 compare again.

I assume you wrote "filesize" when you meant "digest".

You should consider MD5 compromised unless you know for sure that your problem 
does not need to defend against the relatively low-effort birthday attack 
against it. At this point in time, you shouldn't be considering anything weaker 
than SHA-256 for new code.

Choosing the weak MD5 over SHA-256 because it's faster or produces a shorter 
key is just premature optimisation.





Re: SHA question

2010-01-14 Thread Matt Lawrence

David Cantrell wrote:

On Thu, Jan 14, 2010 at 02:03:33PM +, Roger Burton West wrote:
  

On Thu, Jan 14, 2010 at 01:59:22PM +, David Cantrell wrote:


Shame that local includes at the other end of a really slow NFS
connection to the other side of the world. Mind you, absent running the
rsync daemon at the other end and using that instead of NFS, I'm not
sure if there's a better way of doing it.
  

Possibly I'm missing something, but: ssh?



That boils down to the same thing - it ends up invoking rsync at the
other end in daemon mode and talking the rsync protocol tunnelled
through ssh.

What I was getting at was that I don't see a better way of working if
you have to use a networky filesystem.

  

Isn't this what the -W flag is for?

Matt


Re: SHA question

2010-01-14 Thread Matt Lawrence

Matthew Boyle wrote:

David Cantrell wrote:

On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote:


That reminds me of how I was disappointed to find that rsync generally
transfers complete files (rather than diffs) if both source and
destination are on a local file system -- before I realised that to
compute the diffs, it would have to read the entire first and second
files, and if it's going to read the entire first file from disk
anyway, it can simply dump it over the second file without checking.
Computing diffs would be more work in this case, not less.


Shame that local includes at the other end of a really slow NFS
connection to the other side of the world. Mind you, absent running the
rsync daemon at the other end and using that instead of NFS, I'm not
sure if there's a better way of doing it.


the --no-whole-file option?  or am i missing something?

This is of course what I was referring to when I mentioned the 
diametrically opposite option. *sigh*


Matt



Re: SHA question

2010-01-14 Thread Philip Newton
On Thu, Jan 14, 2010 at 16:20, Matthew Boyle
mlb-p...@decisionsoft.co.uk wrote:
 David Cantrell wrote:

 On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote:

 That reminds me of how I was disappointed to find that rsync generally
 transfers complete files (rather than diffs) if both source and
 destination are on a local file system -- before I realised that to
 compute the diffs, it would have to read the entire first and second
 files, and if it's going to read the entire first file from disk
 anyway, it can simply dump it over the second file without checking.
 Computing diffs would be more work in this case, not less.

 Shame that local includes at the other end of a really slow NFS
 connection to the other side of the world. Mind you, absent running the
 rsync daemon at the other end and using that instead of NFS, I'm not
 sure if there's a better way of doing it.

 the --no-whole-file option?  or am i missing something?

Yes - you're missing the fact that in order to compute the differences
(which it has to if it doesn't want to transfer the whole file), it
has to read the entire file over the slow NFS link into your
computer's memory in order to compare it with the local file in
order to tell which pieces have changed.

So transferring the whole file is probably faster, at least under the
assumption that reading and writing are about the same speed over that
slow link. (If reading is much faster than writing, then you might
still save some time this way.)

Cheers,
Philip
-- 
Philip Newton philip.new...@gmail.com



Re: SHA question

2010-01-13 Thread Roger Burton West
On Wed, Jan 13, 2010 at 12:44:47PM +, Dermot wrote:

I have a lots of PDFs that I need to catalogue and I want to ensure
the uniqueness of each PDF.  At LWP, Jonathan Rockway mentioned
something similar with SHA1 and binary files.  Am I right in thinking
that the code below is only taking the SHA on the name of the file and
if I want to ensure uniqueness of the content I need to do something
similar but as a file blob?

Yes.

You may want to be slightly cleverer about it - taking a SHAsum is
computationally expensive, and it's only worth doing if the files have
the same size.

If you don't require a pure-Perl solution, bear in mind that all this
has been done for you in the fdupes program, already in Debian or at
http://netdial.caribe.net/~adrian2/programs/ .

Roger


Re: SHA question

2010-01-13 Thread Dermot
2010/1/13 Roger Burton West ro...@firedrake.org:
 On Wed, Jan 13, 2010 at 12:44:47PM +, Dermot wrote:

I have a lots of PDFs that I need to catalogue and I want to ensure
the uniqueness of each PDF.  At LWP, Jonathan Rockway mentioned
something similar with SHA1 and binary files.  Am I right in thinking
that the code below is only taking the SHA on the name of the file and
if I want to ensure uniqueness of the content I need to do something
similar but as a file blob?

 Yes.

 You may want to be slightly cleverer about it - taking a SHAsum is
 computationally expensive, and it's only worth doing if the files have
 the same size.

Unfortunately the size varies quite a bit. There are a few 11MB PDFs
but the majority are under 1MB. This application isn't for public
consumption so I don't have to worry about speed. However there are
other services on the server and I wouldn't want to blindly slurp a
50MB PDF, I guess.

 If you don't require a pure-Perl solution, bear in mind that all this
 has been done for you in the fdupes program, already in Debian or at
 http://netdial.caribe.net/~adrian2/programs/ .

I am using it in a perl class but if I could system(`fdupes`) that
might be preferable. I'll try building the sources and see what
happens. Failing that I'll have to fallback to slurping and SHA or
MD5.

Thanx,
Dp.



Re: SHA question

2010-01-13 Thread Luis Motta Campos
Dermot wrote:
 Hi,
 
 I have a lots of PDFs that I need to catalogue and I want to ensure 
 the uniqueness of each PDF.  At LWP, Jonathan Rockway mentioned 
 something similar with SHA1 and binary files.  Am I right in thinking
  that the code below is only taking the SHA on the name of the file
 and if I want to ensure uniqueness of the content I need to do
 something similar but as a file blob?
 
 [code was here]
 

Yes, your code processes file names, not file contents.

 PS: I don't see many perl questions here, am I breaking a convention?

I believe the official answer to this question would be "The London Perl
Mongers list considers on-topic messages that talk about Ponies, Buffy,
Beer, and Pie. Everything else should be tagged as 'off-topic'."

As I'm really bad at remembering things and also a non-native speaker,
YMMV, wording- and semantic-wise.

Cheers
-- 
Luis Motta Campos is a software engineer,
Perl Programmer, foodie and photographer.


Re: SHA question

2010-01-13 Thread Steffan Davies
Dermot paik...@googlemail.com wrote at 12:44 on 2010-01-13:

 Hi,
 
 I have a lots of PDFs that I need to catalogue and I want to ensure
 the uniqueness of each PDF.  At LWP, Jonathan Rockway mentioned
 something similar with SHA1 and binary files.  Am I right in thinking
 that the code below is only taking the SHA on the name of the file and
 if I want to ensure uniqueness of the content I need to do something
 similar but as a file blob?

Yes, that looks about right. From a brief look at
http://perldoc.perl.org/Digest/SHA.html it appears that you may want 

my $sha = Digest::SHA->new(512);
$sha->addfile($n);
$digest = $sha->digest; # or hexdigest or b64digest

in your inner loop.

S




Re: SHA question

2010-01-13 Thread Roger Burton West
On Wed, Jan 13, 2010 at 01:12:28PM +, Dermot wrote:

Unfortunately the size varies quite a bit. There are a few 11Mb pdfs
but the majority are under 1mb.

No, that's _good_.

I am using it in a perl class

So I won't point out the implications, but there's an obvious one which
will make your life easier.

R


Re: SHA question

2010-01-13 Thread Dermot
2010/1/13 Roger Burton West ro...@firedrake.org:

I am using it in a perl class

 So I won't point out the implications, but there's an obvious one which
 will make your life easier.

You can't leave me hanging there
Dp.


Re: SHA question

2010-01-13 Thread Philip Potter
2010/1/13 Luis Motta Campos luismottacam...@yahoo.co.uk:
 I believe the official answer to this question would be The London Perl
 Mongers list considers on-topic messages that talk about Ponies, Buffy,
 Beer, and Pie. Everything else should be tagged as 'off-toppic'.

There is even a FAQ about this: http://london.pm.org/about/faq.html#topic

Having said that, I've been lurking here a few months now and I've
seen very little talk of any of the aforementioned topics D:

Phil


Re: SHA question

2010-01-13 Thread Avi Greenbury
Dermot wrote:
 2010/1/13 Roger Burton West ro...@firedrake.org:
  You may want to be slightly cleverer about it - taking a SHAsum is
  computationally expensive, and it's only worth doing if the files
  have the same size.
 
 Unfortunately the size varies quite a bit.

You might've missed his point.

If two files are of different sizes, they cannot be identical. Getting
the size of a file is substantially cheaper than hashing it.

So you check all your filesizes, and need only hash those pairs or
groups that are all the same size.
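
Something like this, for instance (just a sketch; the directory walk and
the choice of SHA-256 are arbitrary):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA;

# Bucket files by size first; only hash buckets with more than one file.
my $dir = shift or die "usage: $0 <dir>\n";
opendir my $dh, $dir or die "opendir $dir: $!";
my %by_size;
for my $name (readdir $dh) {
    my $path = "$dir/$name";
    push @{ $by_size{ -s $path } }, $path if -f $path;
}

for my $group (grep { @$_ > 1 } values %by_size) {
    my %by_digest;
    for my $path (@$group) {
        my $sha = Digest::SHA->new(256);
        $sha->addfile($path);
        push @{ $by_digest{ $sha->hexdigest } }, $path;
    }
    print "possible duplicates: @$_\n" for grep { @$_ > 1 } values %by_digest;
}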

-- 
Avi Greenbury


Re: SHA question

2010-01-13 Thread James Laver
On Wed, Jan 13, 2010 at 1:46 PM, Dermot paik...@googlemail.com wrote:
 2010/1/13 Roger Burton West ro...@firedrake.org:

I am using it in a perl class

 So I won't point out the implications, but there's an obvious one which
 will make your life easier.

 You can't leave me hanging there
 Dp.


Well, there are a few things...

Firstly, you are indeed just hashing the filename, not the file contents.

Secondly, you're using Digest::SHA directly. The Digest:: series of
modules are meant to be used through the 'Digest' interface as in the
example Steffan gave. Doing this will make your life easier in most
cases (by providing a standard interface across almost all digest
algorithms and making it easy to switch (though ::Whirlpool disobeys
the rules of the interface :/ )) and provides the handy addfile method
you're looking for.

Thirdly, be aware of what hashing guarantees. It does *not* guarantee
uniqueness, it just gives you a very low chance that two files with
the same hash are different. It does guarantee that files with
different hashes are different, though.

Lastly, as regards on-topicness, Perl is definitely off-topic. Beer,
Pies, Dim Sum and Buffy are on-topic.*

On topic: Buffy eating a dim sum pie and washing it down with beer.

--James
* But you can still post perl here.


Re: SHA question

2010-01-13 Thread Philip Newton
On Wed, Jan 13, 2010 at 15:06, James Laver james.la...@gmail.com wrote:
 Thirdly, be aware of what hashing guarantees. It does *not* guarantee
 uniqueness, it just gives you a very low chance that two files with
 the same hash are different.

Well, that said, is the "very low chance" not on the order of the
chance that you'll be run over by a bus in the morning, or that one of
the files will be changed through cosmic rays or bit rot in the
magnetic domains of the hard disk platter?

In other words, is 1x10^-64 (or whatever it might be) not so small as
to be effectively zero, since there are much higher risks (say,
1x10^-32) which you do not guard against, either?

Cheers,
Philip
-- 
Philip Newton philip.new...@gmail.com


Re: SHA question

2010-01-13 Thread Dermot
2010/1/13 Avi Greenbury avismailinglistacco...@googlemail.com:

 You might've missed his point.

 If two files are of different sizes, they cannot be identical. Getting
 the size of a file is substantially cheaper than hashing it.

 So you check all your filesizes, and need only hash those pairs or
 groups that are all the same size.

Sorry, guess I didn't make myself clear. I need to store the SHA in an
SQLite file. I have a few files to handle now but I will get a
constant dribble from now on. I want to try and ensure that I haven't
already databased a file that I'll process in the future.

Incidentally, I get poor results from MD5 compared with SHA, so I can't
rely on MD5 for this:

MD5 (md5_base64) results:
mr_485_htu_AST.pdf   116caa6cc1705db23a36feb11c8c4113 32
MR_2891.pdf          01f73c142dae9f9f403bbab543b6aa6f 32
duplicate.pdf        01f73c142dae9f9f403bbab543b6aa6f 32
MR_2898.pdf          01f73c142dae9f9f403bbab543b6aa6f 32
PR_A02.pdf           5552e6587357f9967dc0bc83153cca63 32
mr_485_htu_hrt.pdf   116caa6cc1705db23a36feb11c8c4113 32
PR_A01.pdf           5552e6587357f9967dc0bc83153cca63 32

SHA (b64digest) results:
mr_485_htu_AST.pdf   PqsBpkKgGxdEHvkoNyou1NV5kuY 27
MR_2891.pdf          bQhWA445KFzXy6ldF/DSoG2xTEY 27
duplicate.pdf        bQhWA445KFzXy6ldF/DSoG2xTEY 27
MR_2898.pdf          ULBRZQB00qZIfIWD7oqdpfVpFtw 27
PR_A02.pdf           6LdF6sWZnyLdWj44inFI6MSaUY4 27
mr_485_htu_hrt.pdf   0VNwG7IiaIneEX3jh3SBUBaXMK0 27
PR_A01.pdf           JS33nJhzTo9YTqRWe01xnOb6bEM 27


 Thirdly, be aware of what hashing guarantees. It does *not* guarantee
 uniqueness, it just gives you a very low chance that two files with
 the same hash are different. It does guarantee that files with
 different hashes are different, though.


I think that's the best I can hope for. If that 'duplicate.pdf' turned
up again, at least I'd be able to correctly identify it. That's the goal.
I will give fdupes a look too.
Thanks all.
Dp.


Re: SHA question

2010-01-13 Thread Peter Corlett
On 13 Jan 2010, at 14:40, Philip Newton wrote:
[...]
 Well, that said, is the very low chance not on the order of the
 chance that you'll be run over by a bus in the morning, or that one of
 the files will be changed through cosmic rays or bit rot in the
 magnetic domains of the hard disk platter?

In the case of SHA-256, the odds are low enough that the universe is likely to 
end before you find a collision.





Re: SHA question

2010-01-13 Thread Alexander Clouter
Roger Burton West ro...@firedrake.org wrote:

 You may want to be slightly cleverer about it - taking a SHAsum is
 computationally expensive, and it's only worth doing if the files have
 the same size.

 If you don't require a pure-Perl solution, bear in mind that all this
 has been done for you in the fdupes program, already in Debian or at
 http://netdial.caribe.net/~adrian2/programs/ .

*sigh*

The following gives the duplicated hashes (you might prefer '-D' instead
of '-d'):

md5sum /path/to/pdfs/* | sort | uniq -d -w32


Replace the '-d' with '-u' if you want to just see the unique ones.

I'll leave it as an exercise for the reader to pipe the output of '-D'
into some xargs action to 'rm' and 'ln -s' the duplicates.

Cheers

-- 
Alexander Clouter
.sigmonster says: For fast-acting relief, try slowing down.



Re: SHA question

2010-01-13 Thread Philip Newton
On Wed, Jan 13, 2010 at 15:58, Dermot paik...@googlemail.com wrote:
 2010/1/13 Avi Greenbury avismailinglistacco...@googlemail.com:

 You might've missed his point.

 If two files are of different sizes, they cannot be identical. Getting
 the size of a file is substantially cheaper than hashing it.

 So you check all your filesizes, and need only hash those pairs or
 groups that are all the same size.

 Sorry guess I didn't make myself clear. I need to store the SHA in an
 SQLite file.

I think you're putting the cart before the horse.

Did someone come up to you and say, "Dermot, put the SHA value in a database"?

I would have thought that you *need* to make sure that you detect
duplicate files (for example, to avoid processing the same file
twice). Storing the SHA in an SQLite file is a method you would *like*
to use to accomplish this, but may not be the only way nor the best
way.

Along those lines, you may wish to store the filesize in bytes in your
database as well, as a first point of comparison; if the filesize is
unique, then the file must also be unique and you could save yourself
the time spent calculating a digest of the file's contents -- no
1058-byte file can be the same as any 1927-byte file.

 Incident I get poor results from the MD5 compared with SHA so I can't
 relie on MD5 for

 MD5 (md5_base64) results:
 mr_485_htu_AST.pdf   116caa6cc1705db23a36feb11c8c4113 32
 MR_2891.pdf          01f73c142dae9f9f403bbab543b6aa6f 32
 duplicate.pdf         01f73c142dae9f9f403bbab543b6aa6f 32
 MR_2898.pdf          01f73c142dae9f9f403bbab543b6aa6f 32
 PR_A02.pdf           5552e6587357f9967dc0bc83153cca63 32
 mr_485_htu_hrt.pdf   116caa6cc1705db23a36feb11c8c4113 32
 PR_A01.pdf           5552e6587357f9967dc0bc83153cca63 32

 SHA (b64digest) results:
 mr_485_htu_AST.pdf   PqsBpkKgGxdEHvkoNyou1NV5kuY 27
 MR_2891.pdf          bQhWA445KFzXy6ldF/DSoG2xTEY 27
 duplicate.pdf         bQhWA445KFzXy6ldF/DSoG2xTEY 27
 MR_2898.pdf          ULBRZQB00qZIfIWD7oqdpfVpFtw 27
 PR_A02.pdf           6LdF6sWZnyLdWj44inFI6MSaUY4 27
 mr_485_htu_hrt.pdf   0VNwG7IiaIneEX3jh3SBUBaXMK0 27
 PR_A01.pdf           JS33nJhzTo9YTqRWe01xnOb6bEM 27

That's... odd. md5sum's guarantee of "same" if the hashes match isn't
as strong as SHA's, but I still wouldn't expect two files to md5sum
the same if their SHA sums don't match.

However, those MD5 sums don't look like base-64 to me, so maybe you're
doing something wrong somewhere.
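
The length alone is a give-away: an MD5 digest is 22 characters in
base-64 but 32 in hex, which is what those look like. A quick check,
using Digest::MD5's functional interface (the input string is arbitrary):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex md5_base64);

# MD5 is 128 bits: 32 characters as hex, 22 as (unpadded) base 64.
my $data = "example";
printf "hex:    %s (%d chars)\n", md5_hex($data),    length md5_hex($data);
printf "base64: %s (%d chars)\n", md5_base64($data), length md5_base64($data);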

Cheers,
Philip
-- 
Philip Newton philip.new...@gmail.com



Re: SHA question

2010-01-13 Thread Andy Armstrong
On 13 Jan 2010, at 14:58, Dermot wrote:
 Incident I get poor results from the MD5 compared with SHA so I can't
 relie on MD5 for
 
 MD5 (md5_base64) results:
 mr_485_htu_AST.pdf   116caa6cc1705db23a36feb11c8c4113 32
 MR_2891.pdf  01f73c142dae9f9f403bbab543b6aa6f 32
 duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
 MR_2898.pdf  01f73c142dae9f9f403bbab543b6aa6f 32
 PR_A02.pdf   5552e6587357f9967dc0bc83153cca63 32
 mr_485_htu_hrt.pdf   116caa6cc1705db23a36feb11c8c4113 32
 PR_A01.pdf   5552e6587357f9967dc0bc83153cca63 32

If those files are different you're doing it wrong :)

-- 
Andy Armstrong, Hexten





Re: SHA question

2010-01-13 Thread Andy Armstrong
On 13 Jan 2010, at 14:58, Dermot wrote:
 MD5 (md5_base64) results:
 mr_485_htu_AST.pdf   116caa6cc1705db23a36feb11c8c4113 32
 MR_2891.pdf  01f73c142dae9f9f403bbab543b6aa6f 32
 duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
 MR_2898.pdf  01f73c142dae9f9f403bbab543b6aa6f 32
 PR_A02.pdf   5552e6587357f9967dc0bc83153cca63 32
 mr_485_htu_hrt.pdf   116caa6cc1705db23a36feb11c8c4113 32
 PR_A01.pdf   5552e6587357f9967dc0bc83153cca63 32


Oh and run them through md5 in the shell to see what you get - the results 
should be the same.

-- 
Andy Armstrong, Hexten






Re: SHA question

2010-01-13 Thread Roger Burton West
On Wed, Jan 13, 2010 at 02:25:58PM +, Alexander Clouter wrote:

The following gives the duplicated hashes (you might prefer '-D' instead
of '-d'):

But does not take account of hardlinks, and again hashes every file
rather than just the ones that might be duplicates.

R


Re: SHA question

2010-01-13 Thread Dan Rowles

Dermot wrote:
[snip]

Incident I get poor results from the MD5 compared with SHA so I can't
relie on MD5 for

MD5 (md5_base64) results:
mr_485_htu_AST.pdf   116caa6cc1705db23a36feb11c8c4113 32
MR_2891.pdf  01f73c142dae9f9f403bbab543b6aa6f 32
duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
MR_2898.pdf  01f73c142dae9f9f403bbab543b6aa6f 32
PR_A02.pdf   5552e6587357f9967dc0bc83153cca63 32
mr_485_htu_hrt.pdf   116caa6cc1705db23a36feb11c8c4113 32
PR_A01.pdf   5552e6587357f9967dc0bc83153cca63 32

  
I think you must have a bug. Finding three MD5 collisions in seven files 
that are actually different to each other would be a really remarkable 
result.


Dan



Re: SHA question

2010-01-13 Thread Matthew Boyle

Dan Rowles wrote:

Dermot wrote:
[snip]

Incident I get poor results from the MD5 compared with SHA so I can't
relie on MD5 for

MD5 (md5_base64) results:
mr_485_htu_AST.pdf   116caa6cc1705db23a36feb11c8c4113 32
MR_2891.pdf  01f73c142dae9f9f403bbab543b6aa6f 32
duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
MR_2898.pdf  01f73c142dae9f9f403bbab543b6aa6f 32
PR_A02.pdf   5552e6587357f9967dc0bc83153cca63 32
mr_485_htu_hrt.pdf   116caa6cc1705db23a36feb11c8c4113 32
PR_A01.pdf   5552e6587357f9967dc0bc83153cca63 32

  
I think you must have a bug. Finding three MD5 collisions in seven files 
that are actually different to each other would be a really remarkable 
result


depends on where the PDFs came from :-) 
http://www.win.tue.nl/hashclash/Nostradamus/


--matt


--
Matthew Boyle, Systems Administrator, CoreFiling Limited
Telephone: +44-1865-203192  Website: http://www.corefiling.com


Re: SHA question

2010-01-13 Thread A. J. Trickett
On Wed, 13 Jan 2010 at 12:44:47PM +, Dermot wrote:
 Hi,
 
 I have a lots of PDFs that I need to catalogue and I want to ensure
 the uniqueness of each PDF.  At LWP, Jonathan Rockway mentioned
 something similar with SHA1 and binary files.  Am I right in thinking
 that the code below is only taking the SHA on the name of the file and
 if I want to ensure uniqueness of the content I need to do something
 similar but as a file blob?
 

Have a look here: http://en.wikipedia.org/wiki/Fdupes

There are links to Perl examples, that do SHA de-duplication.

-- 
Adam Trickett
Overton, HANTS, UK

A bank is a place where they lend you an umbrella in fair
weather and ask for it back when it begins to rain.
--  Robert Frost


Re: SHA question

2010-01-13 Thread Paul Makepeace
On Wed, Jan 13, 2010 at 07:16, Philip Newton philip.new...@gmail.com wrote:
 On Wed, Jan 13, 2010 at 15:58, Dermot paik...@googlemail.com wrote:
 2010/1/13 Avi Greenbury avismailinglistacco...@googlemail.com:

 You might've missed his point.

 If two files are of different sizes, they cannot be identical. Getting
 the size of a file is substantially cheaper than hashing it.

 So you check all your filesizes, and need only hash those pairs or
 groups that are all the same size.

 Sorry guess I didn't make myself clear. I need to store the SHA in an
 SQLite file.

 I think you're putting the cart before the horse.

 Did someone come up to you and say, Dermot, put the SHA value in a 
 database.?

 I would have thought that you *need* to make sure that you detect
 duplicate files (for example, to avoid processing the same file
 twice). Storing the SHA in an SQLite file is a method you would *like*
 to use to accomplish this, but may not be the only way nor the best
 way.

 Along those lines, you may wish to store the filesize in bytes in your
 database as well, as a first point of comparison; if the filesize is
 unique, then the file must also be unique and you could save yourself
 the time spent calculating a digest of the file's contents -- no
 1058-byte file can be the same as any 1927-byte file.

If you're storing the collision data (size, hash, whatever) to protect
against future collisions, the only way this scheme of avoiding more
expensive ops like hashing will work (AFAICS) is if you have some
fiddlier code to lazily hash an old file when a newer file
comes along that matches an existing file size.
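
Something like this, maybe (a sketch only: the in-memory %index stands in
for the database, and the layout is invented):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA;

# %index maps size => [ { path => ..., sha => ... } ]; sha stays undef
# until a later file with the same size forces us to compute it.
my %index;

sub sha_of {
    my ($path) = @_;
    my $sha = Digest::SHA->new(256);
    $sha->addfile($path);
    return $sha->hexdigest;
}

sub add_file {
    my ($path) = @_;
    my $entries = $index{ -s $path } ||= [];

    if (@$entries) {
        my $new_sha = sha_of($path);
        for my $old (@$entries) {
            $old->{sha} //= sha_of($old->{path});   # lazily hash the old file
            return $old->{path} if $old->{sha} eq $new_sha;
        }
        push @$entries, { path => $path, sha => $new_sha };
    }
    else {
        push @$entries, { path => $path, sha => undef };  # defer hashing
    }
    return;   # not a duplicate
}

for my $path (@ARGV) {
    my $dup = add_file($path);
    print "$path duplicates $dup\n" if $dup;
}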

 Incident I get poor results from the MD5 compared with SHA so I can't
 relie on MD5 for

 MD5 (md5_base64) results:
 mr_485_htu_AST.pdf   116caa6cc1705db23a36feb11c8c4113 32
 MR_2891.pdf          01f73c142dae9f9f403bbab543b6aa6f 32
 duplicate.pdf         01f73c142dae9f9f403bbab543b6aa6f 32
 MR_2898.pdf          01f73c142dae9f9f403bbab543b6aa6f 32
 PR_A02.pdf           5552e6587357f9967dc0bc83153cca63 32
 mr_485_htu_hrt.pdf   116caa6cc1705db23a36feb11c8c4113 32
 PR_A01.pdf           5552e6587357f9967dc0bc83153cca63 32

 SHA (b64digest) results:
 mr_485_htu_AST.pdf   PqsBpkKgGxdEHvkoNyou1NV5kuY 27
 MR_2891.pdf          bQhWA445KFzXy6ldF/DSoG2xTEY 27
 duplicate.pdf         bQhWA445KFzXy6ldF/DSoG2xTEY 27
 MR_2898.pdf          ULBRZQB00qZIfIWD7oqdpfVpFtw 27
 PR_A02.pdf           6LdF6sWZnyLdWj44inFI6MSaUY4 27
 mr_485_htu_hrt.pdf   0VNwG7IiaIneEX3jh3SBUBaXMK0 27
 PR_A01.pdf           JS33nJhzTo9YTqRWe01xnOb6bEM 27

 That's... odd. md5sum's guarantee of same if the hashes match isn't
 as strong as SHA's, but I still wouldn't expect two files to md5sum
 the same if their SHA sums don'T match.

 However, those MD5 sums don't look like base-64 to me, so maybe you're
 doing something wrong somewhere.

 Cheers,
 Philip
 --
 Philip Newton philip.new...@gmail.com






Re: SHA question

2010-01-13 Thread David Cantrell
On Wed, Jan 13, 2010 at 01:12:28PM +, Dermot wrote:

 I am using it in a perl class but if I could system(`fdupes`) that
 might be preferable. I'll try building the sources and see what
 happens. Failing that I'll have to fallback to slurping and SHA or
 MD5.

Other hashing algorithms exist and are faster but more prone to
inadvertent collisions.  If you've got a lot of data to compare, I'd
use one of them (eg one of the variations on a CRC) and then only
bring out the big SHA guns when that finds a collision.  

-- 
David Cantrell | even more awesome than a panda-fur coat

 Nuke a disabled unborn gay baby whale for JESUS!


Re: SHA question

2010-01-13 Thread Dermot
2010/1/13 Paul Makepeace pa...@paulm.com:
 On Wed, Jan 13, 2010 at 07:16, Philip Newton philip.new...@gmail.com wrote:
 On Wed, Jan 13, 2010 at 15:58, Dermot paik...@googlemail.com wrote:
 2010/1/13 Avi Greenbury avismailinglistacco...@googlemail.com:

 I think you're putting the cart before the horse.

 Did someone come up to you and say, Dermot, put the SHA value in a 
 database.?

 I would have thought that you *need* to make sure that you detect
 duplicate files (for example, to avoid processing the same file
 twice). Storing the SHA in an SQLite file is a method you would *like*
 to use to accomplish this, but may not be the only way nor the best
 way.


Yet more background. *sigh* The process runs as follows:

1) A source submits some digital files.
2) Extract EXIF from digital files that may contain the name of the PDF file.
3) Find said PDF on file system.
4) DB - Have I seen this PDF before?
   Yes: Assign existing ID to the new row we're creating for its
        parent record (the digital file).
   No:  Assign PDF an ID, assign ID to parent record, rename,
        post/upload to remote server.

The same PDF can come from a number of sources, so they are not
unique to a source, and the same PDF may appear in more than one (parent)
record. The PDF exists on a remote server after that so, you're right,
I don't want to process the same file twice.


 Along those lines, you may wish to store the filesize in bytes in your
 database as well, as a first point of comparison; if the filesize is
 unique, then the file must also be unique and you could save yourself
 the time spent calculating a digest of the file's contents -- no
 1058-byte file can be the same as any 1927-byte file.

If I go with byte size and do ('PDF')->search({ file_size => 1058 })
and get 3 results I then have to back-track, take the SHA and do the
search again. With SHA, it might be expensive but it's always
unique[1] so I can simply do ('PDF')->find_or_new(\%hash) and get
the ID back.  I don't think you're suggesting that I rely on the file
size as a unique identifier, and I can see how a search with no results
might short-circuit some stuff. But I will need that SHA when I get
files of the same size so I may as well store it from the beginning.
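
Roughly what I have in mind, with plain DBI against the SQLite file
(the table and column names here are made up, not my real schema):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Digest::SHA;

# Look a PDF up by its content digest; insert a row and mint an ID if new.
my $dbh = DBI->connect('dbi:SQLite:dbname=pdfs.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS pdf
          (id INTEGER PRIMARY KEY, sha TEXT UNIQUE, file_size INTEGER)');

sub pdf_id {
    my ($path) = @_;
    my $sha = Digest::SHA->new(256);
    $sha->addfile($path);
    my $digest = $sha->hexdigest;

    my ($id) = $dbh->selectrow_array('SELECT id FROM pdf WHERE sha = ?',
                                     undef, $digest);
    return $id if defined $id;                      # seen this content before

    $dbh->do('INSERT INTO pdf (sha, file_size) VALUES (?, ?)',
             undef, $digest, -s $path);
    return $dbh->last_insert_id(undef, undef, 'pdf', 'id');
}

my $path = shift or die "usage: $0 <file.pdf>\n";
printf "%s => id %d\n", $path, pdf_id($path);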

 If you're storing the collision data (size, hash, whatever) to protect
 against future collisions the only way this scheme of avoiding more
 expensive ops like hashing will work (AFAICS) is if you have some
 fiddlier code to lazily hash an old file when a newer future file
 comes along that matches an existing file size.

 Incident I get poor results from the MD5 compared with SHA so I can't
 relie on MD5 for

 That's... odd. md5sum's guarantee of same if the hashes match isn't
 as strong as SHA's, but I still wouldn't expect two files to md5sum
 the same if their SHA sums don'T match.

 However, those MD5 sums don't look like base-64 to me, so maybe you're
 doing something wrong somewhere.


Yes, I'd better 'fess up here. I had a bug :P I was using the
hex_b4base() in a not too clever way. I should have been using
addfile().
Dp


[1] At least once in 1x10^-64


Re: SHA question

2010-01-13 Thread David Cantrell
On Wed, Jan 13, 2010 at 02:58:59PM +, Dermot wrote:
 2010/1/13 Avi Greenbury avismailinglistacco...@googlemail.com:
  Thirdly, be aware of what hashing guarantees. It does *not* guarantee
  uniqueness, it just gives you a very low chance that two files with
  the same hash are different. It does guarantee that files with
  different hashes are different, though.
 I think that's the best I can hope for. If that 'duplicate.pdf' turned
 up again at least I be able to correctly identify it. That's the goal.
 I will give fdupes a look too.

Of course, if SHA (or whatever) does give you the same result for two
files, verifying that they really are the same is trivial ... (and if
they're not, lots of people would be Really Interested to know).
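
E.g. with the core File::Compare module (a sketch; the two paths are
whichever pair shared a digest):

#!/usr/bin/perl
use strict;
use warnings;
use File::Compare qw(compare);

# compare() returns 0 if the files are byte-for-byte identical,
# 1 if they differ, and -1 on error.
my ($file1, $file2) = @ARGV;
my $result = compare($file1, $file2);
die "error comparing files: $!" if $result < 0;
print $result == 0 ? "really the same\n" : "same digest, different bytes?!\n";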

-- 
David Cantrell | Bourgeois reactionary pig

   When a man is tired of London, he is tired of life
  -- Samuel Johnson


Re: SHA question

2010-01-13 Thread Paul Makepeace
On Wed, Jan 13, 2010 at 09:53, David Cantrell da...@cantrell.org.uk wrote:
 On Wed, Jan 13, 2010 at 01:12:28PM +, Dermot wrote:

 I am using it in a perl class but if I could system(`fdupes`) that
 might be preferable. I'll try building the sources and see what
 happens. Failing that I'll have to fallback to slurping and SHA or
 MD5.

 Other hashing algorithms exist and are faster but more prone to
 inadvertant collisions.  If you've got a lot of data to compare, I'd
 use one of them (eg one of the variations on a CRC) and then only
 bring out the big SHA guns when that finds a collision.

Or cmp ;-)