Re: SHA question
Andy Wardley wrote: On 14/01/2010 17:41, Philip Newton wrote: Yes - you're missing the fact that in order to compute the differences (which it has to if it doesn't want to transfer the whole file), it has to read the entire file over the slow NFS link into your computer's memory in order to compare it with the "local" file in order to tell which pieces have changed. No, I don't think it does. My understanding[*] is that it computes a checksum for each block of a file and only transmits blocks that have different checksums. Of course, but to compute a checksum for each block of the file, that block first needs to be read, over the NFS connection, which is the whole issue. Normally, rsync would be speaking to rsync running on the remote box, but the situation David described was one rsync process on box A, accessing files on box B via an NFS mount (as opposed to speaking to an rsync daemon on box B). I'm not entirely sure, but I think that rsync will first compare the timestamps of the two files, and if the timestamps match (to within the window specified with --modify-window, defaulting to an exact match), and the sizes match, it will consider the file to be the same, and skip generating checksums (so the file's data won't be read over NFS).
Re: SHA question
On 15/01/2010 20:23, Roger Burton West wrote: And to calculate the checksum on each block of the file, it has to, um, read each block of the file... yes? Sorry, I missed this bit in Philip's message: > if both source and destination are on a local file system I was thinking about remote comparisons. In which case the remote rsync daemon computes the checksum. Yes, it has to read the entire file, but not transmit it. A
Re: SHA question
On Jan 15, 2010, at 14:19, ian wrote: >>> My understanding[*] is that it computes a checksum for each block of a file >>> and only transmits blocks that have different checksums. >> >> And to calculate the checksum on each block of the file, it has to, um, >> read each block of the file... yes? >> > Doesn't rsync *push* rather than *pull* in which case the files it computes > the checksum on are all local. > > I did not think it worked in the way you mention without rsync daemon running > at the remote end doing the checksum for you. But with NFS the "remote" is "local". You need rsync running on the box where the storage is to get "cheaper" checksums. - ask
Re: SHA question
On 15/01/2010 20:23, Roger Burton West wrote: On Fri, Jan 15, 2010 at 08:16:09PM +, Andy Wardley wrote: My understanding[*] is that it computes a checksum for each block of a file and only transmits blocks that have different checksums. And to calculate the checksum on each block of the file, it has to, um, read each block of the file... yes? Doesn't rsync *push* rather than *pull* in which case the files it computes the checksum on are all local. I did not think it worked in the way you mention without rsync daemon running at the remote end doing the checksum for you.
Re: SHA question
On Fri, Jan 15, 2010 at 08:16:09PM +, Andy Wardley wrote: > My understanding[*] is that it computes a checksum for each block of a file > and only transmits blocks that have different checksums. And to calculate the checksum on each block of the file, it has to, um, read each block of the file... yes? R
Re: SHA question
On 14/01/2010 17:41, Philip Newton wrote: Yes - you're missing the fact that in order to compute the differences (which it has to if it doesn't want to transfer the whole file), it has to read the entire file over the slow NFS link into your computer's memory in order to compare it with the "local" file in order to tell which pieces have changed. No, I don't think it does. My understanding[*] is that it computes a checksum for each block of a file and only transmits blocks that have different checksums. That's how it handles incremental changes on large files (e.g. an extra few lines at the end of a log file doesn't require the whole file to be transmitted). Some relevant options are:

  --checksum      always checksum
  --block-size    checksum block size
  --whole-file    transmit the whole file
  --size-only     compare file size instead of checksum

A

[*] which could be flawed
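For the NFS-mounted case being discussed, an invocation from Perl might look roughly like this (a sketch only: the paths are invented, and --no-whole-file, mentioned elsewhere in the thread, forces the delta algorithm even when the destination looks local):

#!/usr/bin/perl
# Sketch: drive rsync from Perl, forcing the block-checksum delta
# transfer even though the destination is "local" (an NFS mount).
# The source and destination paths are made up for illustration.
use strict;
use warnings;

my @cmd = (
    'rsync', '-av',
    '--no-whole-file',          # don't just copy whole files
    '/data/pdfs/',              # hypothetical local source
    '/mnt/remote-nfs/pdfs/',    # hypothetical NFS-mounted destination
);
system(@cmd) == 0 or die "rsync exited with status $?\n";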
Re: SHA question
On Thu, Jan 14, 2010 at 16:20, Matthew Boyle wrote: > David Cantrell wrote: >> >> On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote: >> >>> That reminds me of how I was disappointed to find that rsync generally >>> transfers complete files (rather than diffs) if both source and >>> destination are on a local file system -- before I realised that to >>> compute the diffs, it would have to read the entire first and second >>> files, and if it's going to read the entire first file from disk >>> anyway, it can simply dump it over the second file without checking. >>> Computing diffs would be more work in this case, not less. >> >> Shame that "local" includes "at the other end of a really slow NFS >> connection to the other side of the world". Mind you, absent running the >> rsync daemon at the other end and using that instead of NFS, I'm not >> sure if there's a better way of doing it. > > the --no-whole-file option? or am i missing something? Yes - you're missing the fact that in order to compute the differences (which it has to if it doesn't want to transfer the whole file), it has to read the entire file over the slow NFS link into your computer's memory in order to compare it with the "local" file in order to tell which pieces have changed. So transferring the whole file is probably faster, at least under the assumption that reading and writing are about the same speed over that slow link. (If reading is much faster than writing, then you might still save some time this way.) Cheers, Philip -- Philip Newton
Re: SHA question
Matthew Boyle wrote: David Cantrell wrote: On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote: That reminds me of how I was disappointed to find that rsync generally transfers complete files (rather than diffs) if both source and destination are on a local file system -- before I realised that to compute the diffs, it would have to read the entire first and second files, and if it's going to read the entire first file from disk anyway, it can simply dump it over the second file without checking. Computing diffs would be more work in this case, not less. Shame that "local" includes "at the other end of a really slow NFS connection to the other side of the world". Mind you, absent running the rsync daemon at the other end and using that instead of NFS, I'm not sure if there's a better way of doing it. the --no-whole-file option? or am i missing something? This is of course what I was referring to when I mentioned the diametrically opposite option. *sigh* Matt
Re: SHA question
David Cantrell wrote: On Thu, Jan 14, 2010 at 02:03:33PM +, Roger Burton West wrote: On Thu, Jan 14, 2010 at 01:59:22PM +, David Cantrell wrote: Shame that "local" includes "at the other end of a really slow NFS connection to the other side of the world". Mind you, absent running the rsync daemon at the other end and using that instead of NFS, I'm not sure if there's a better way of doing it. Possibly I'm missing something, but: ssh? That boils down to the same thing - it ends up invoking rsync at the other end in daemon mode and talking the rsync protocol tunnelled through ssh. What I was getting at was that I don't see a better way of working if you have to use a networky filesystem. Isn't this what the -W flag is for? Matt
Re: SHA question
On 14 Jan 2010, at 14:16, Mark Fowler wrote: [...] > I'd just use Digest::MD5 to calculate the filesize. It's cheap > compared to SHA, you don't care about the exact cryptographic security > of the hash, and will work even if you don't have the original to > compare again. I assume you wrote "filesize" when you meant "digest". You should consider MD5 compromised unless you know for sure that your problem does not need to defend against the relatively low-effort birthday attack against it. At this point in time, you shouldn't be considering anything weaker than SHA-256 for new code. Choosing the weak MD5 over SHA-256 because it's faster or produces a shorter key is just premature optimisation.
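A minimal sketch of that, using Digest::SHA directly; the command-line handling is only an example:

#!/usr/bin/perl
# Sketch: SHA-256 digest of a file's contents (not its name).
use strict;
use warnings;
use Digest::SHA;

my $filename = shift or die "usage: $0 <file>\n";
my $sha = Digest::SHA->new(256);
$sha->addfile($filename, "b");      # "b" = read the file in binary mode
print "SHA-256($filename) = ", $sha->hexdigest, "\n";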
Re: SHA question
David Cantrell wrote: On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote: That reminds me of how I was disappointed to find that rsync generally transfers complete files (rather than diffs) if both source and destination are on a local file system -- before I realised that to compute the diffs, it would have to read the entire first and second files, and if it's going to read the entire first file from disk anyway, it can simply dump it over the second file without checking. Computing diffs would be more work in this case, not less. Shame that "local" includes "at the other end of a really slow NFS connection to the other side of the world". Mind you, absent running the rsync daemon at the other end and using that instead of NFS, I'm not sure if there's a better way of doing it. the --no-whole-file option? or am i missing something? --matt -- Matthew Boyle, Systems Administrator, CoreFiling Limited Telephone: +44-1865-203192 Website: http://www.corefiling.com
Re: SHA question
On Thu, Jan 14, 2010 at 02:03:33PM +, Roger Burton West wrote: > On Thu, Jan 14, 2010 at 01:59:22PM +, David Cantrell wrote: > >Shame that "local" includes "at the other end of a really slow NFS > >connection to the other side of the world". Mind you, absent running the > >rsync daemon at the other end and using that instead of NFS, I'm not > >sure if there's a better way of doing it. > Possibly I'm missing something, but: ssh? That boils down to the same thing - it ends up invoking rsync at the other end in daemon mode and talking the rsync protocol tunnelled through ssh. What I was getting at was that I don't see a better way of working if you have to use a networky filesystem. -- David Cantrell | Reality Engineer, Ministry of Information Longum iter est per praecepta, breve et efficax per exempla.
Re: SHA question
On Wed, Jan 13, 2010 at 3:16 PM, Philip Newton wrote: > Along those lines, you may wish to store the filesize in bytes in your > database as well, as a first point of comparison; if the filesize is > unique, then the file must also be unique and you could save yourself > the time spent calculating a digest of the file's contents -- no > 1058-byte file can be the same as any 1927-byte file. This is only possible if you've still got all the pdfs on disk, as, as soon as you get your suspected duplicate, you'll have to hash both files' contents to tell if you have one or not. If you've sent them onto a better place and deleted them however, then you're out of luck. I'd just use Digest::MD5 to calculate the filesize. It's cheap compared to SHA, you don't care about the exact cryptographic security of the hash, and will work even if you don't have the original to compare against.

#!/usr/bin/perl
use Modern::Perl;
use autodie;
use Digest::MD5;

my $filename = shift;
open my $fh, "<:bytes", $filename;
my $md5 = Digest::MD5->new;
$md5->addfile($fh);
say "The file's md5 is: " . $md5->b64digest;

Don't forget the "<:bytes" (you're comparing bytes, not characters). Once you've got a toy version up and running and you can get a "feel" for how fast it is on your system, you can optimise if you don't like the performance. Mark.
Re: SHA question
On Thu, Jan 14, 2010 at 01:59:22PM +, David Cantrell wrote: >Shame that "local" includes "at the other end of a really slow NFS >connection to the other side of the world". Mind you, absent running the >rsync daemon at the other end and using that instead of NFS, I'm not >sure if there's a better way of doing it. Possibly I'm missing something, but: ssh? R
Re: SHA question
On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote: > That reminds me of how I was disappointed to find that rsync generally > transfers complete files (rather than diffs) if both source and > destination are on a local file system -- before I realised that to > compute the diffs, it would have to read the entire first and second > files, and if it's going to read the entire first file from disk > anyway, it can simply dump it over the second file without checking. > Computing diffs would be more work in this case, not less. Shame that "local" includes "at the other end of a really slow NFS connection to the other side of the world". Mind you, absent running the rsync daemon at the other end and using that instead of NFS, I'm not sure if there's a better way of doing it. -- David Cantrell | Reality Engineer, Ministry of Information
Re: SHA question
On Thu, Jan 14, 2010 at 13:22, Peter Corlett wrote: > For de-duping purposes, SHA is still faster than you can pull the files off > the disk and a secondary cheaper hash is unnecessary. That reminds me of how I was disappointed to find that rsync generally transfers complete files (rather than diffs) if both source and destination are on a local file system -- before I realised that to compute the diffs, it would have to read the entire first and second files, and if it's going to read the entire first file from disk anyway, it can simply dump it over the second file without checking. Computing diffs would be more work in this case, not less. So yes, I suppose something similar applies here -- you have to read the entire file anyway, so you might as well go with SHA-$number_of_your_choice. Cheers, Philip -- Philip Newton
Re: SHA question
On 13 Jan 2010, at 17:53, David Cantrell wrote: [...] > Other hashing algorithms exist and are faster but more prone to > inadvertant collisions. If you've got a lot of data to compare, I'd > use one of them (eg one of the variations on a CRC) and then only > bring out the big SHA guns when that finds a collision. That's a premature optimisation which just complicates the code, unless you mean *a lot* such as in the rdiff algorithm. For de-duping purposes, SHA is still faster than you can pull the files off the disk and a secondary cheaper hash is unnecessary.
Re: SHA question
On Wed, Jan 13, 2010 at 09:53, David Cantrell wrote: > On Wed, Jan 13, 2010 at 01:12:28PM +, Dermot wrote: > >> I am using it in a perl class but if I could system(`fdupes`) that >> might be preferable. I'll try building the sources and see what >> happens. Failing that I'll have to fallback to slurping and SHA or >> MD5. > > Other hashing algorithms exist and are faster but more prone to > inadvertant collisions. If you've got a lot of data to compare, I'd > use one of them (eg one of the variations on a CRC) and then only > bring out the big SHA guns when that finds a collision. Or cmp ;-)
Re: SHA question
On Wed, Jan 13, 2010 at 02:58:59PM +, Dermot wrote: > 2010/1/13 Avi Greenbury : > > Thirdly, be aware of what hashing guarantees. It does *not* guarantee > > uniqueness, it just gives you a very low chance that two files with > > the same hash are different. It does guarantee that files with > > different hashes are different, though. > I think that's the best I can hope for. If that 'duplicate.pdf' turned > up again at least I be able to correctly identify it. That's the goal. > I will give fdupes a look too. Of course, if SHA (or whatever) does give you the same result for two files, verifying that they really are the same is trivial ... (and if they're not, lots of people would be Really Interested to know). -- David Cantrell | Bourgeois reactionary pig When a man is tired of London, he is tired of life -- Samuel Johnson
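That trivial check can be done with the core File::Compare module; a sketch, reusing a couple of the filenames from earlier in the thread as examples:

#!/usr/bin/perl
# Sketch: if two files hash the same, confirm they really are identical.
use strict;
use warnings;
use File::Compare qw(compare);

my ($file1, $file2) = ('MR_2891.pdf', 'duplicate.pdf');   # example names
my $result = compare($file1, $file2);   # 0 = same, 1 = differ, -1 = error
if    ($result == 0) { print "$file1 and $file2 are byte-for-byte identical\n" }
elsif ($result == 1) { print "$file1 and $file2 differ -- a genuine collision!\n" }
else                 { die "Couldn't compare $file1 and $file2: $!\n" }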
Re: SHA question
2010/1/13 Paul Makepeace : > On Wed, Jan 13, 2010 at 07:16, Philip Newton wrote: >> On Wed, Jan 13, 2010 at 15:58, Dermot wrote: >>> 2010/1/13 Avi Greenbury : >> >> I think you're putting the cart before the horse. >> >> Did someone come up to you and say, "Dermot, put the SHA value in a >> database."? >> I would have thought that you *need* to make sure that you detect >> duplicate files (for example, to avoid processing "the same" file >> twice). Storing the SHA in an SQLite file is a method you would *like* >> to use to accomplish this, but may not be the only way nor the best >> way.

Yet more background. *sigh* The process runs as follows:

1) A source submits some digital files.
2) Extract EXIF from the digital files; this may contain the name of the PDF file.
3) Find said PDF on the file system.
4) DB - Have I seen this PDF before?
   Yes: Assign the existing ID to the new row we're creating for its parent record (the digital file).
   No: Assign the PDF an ID, assign that ID to the parent record, rename, post/upload to the remote server.

The same PDF can come from a number of sources, so PDFs are not unique to a source, and the same PDF may appear in more than one (parent) record. The PDF exists on a remote server after that so, you're right, I don't want to process the same file twice.

>> Along those lines, you may wish to store the filesize in bytes in your >> database as well, as a first point of comparison; if the filesize is >> unique, then the file must also be unique and you could save yourself >> the time spent calculating a digest of the file's contents -- no >> 1058-byte file can be the same as any 1927-byte file.

If I go with byte size and do ('PDF')->search({ file_size => 1058 }) and get 3 results, I then have to back-track, take the SHA and do the search again. With SHA, it might be expensive but it's always unique[1] so I can simply do ('PDF')->find_or_new(\%hash) and get the ID back. I don't think you're suggesting that I rely on the file size as a unique identifier, and I can see how a search with no results might short-circuit some stuff. But I will need that SHA when I get files of the same size so I may as well store it from the beginning.

> If you're storing the collision data (size, hash, whatever) to protect > against future collisions the only way this scheme of avoiding more > expensive ops like hashing will work (AFAICS) is if you have some > fiddlier code to lazily hash an old file when a newer future file > comes along that matches an existing file size. > >>> Incident I get poor results from the MD5 compared with SHA so I can't >>> relie on MD5 for >> >> That's... odd. md5sum's guarantee of "same if the hashes match" isn't >> as strong as SHA's, but I still wouldn't expect two files to md5sum >> the same if their SHA sums don'T match. >> >> However, those MD5 sums don't look like base-64 to me, so maybe you're >> doing something wrong somewhere.

Yes, I'd better 'fess up here. I had a bug :P I was using the hex_b4base() in a not too clever way. I should have been using addfile().

Dp

[1] At least once in 1x10^-64
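For illustration only, the "have I seen this PDF before?" step might look like this with DBIx::Class. This is a sketch: it assumes a schema with a 'PDF' source whose 'sha' column carries a unique constraint, and $schema and $path are assumed to exist already.

use Digest::SHA;

# Sketch: hash the PDF's contents, then find or create its row.
my $sha = Digest::SHA->new(256);
$sha->addfile($path, 'b');
my $digest = $sha->b64digest;

my $pdf = $schema->resultset('PDF')->find_or_new({ sha => $digest });
unless ($pdf->in_storage) {
    # Not seen before: store it, then rename/upload as described above.
    $pdf->insert;
}
my $pdf_id = $pdf->id;    # assign this to the parent record either way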
Re: SHA question
On Wed, Jan 13, 2010 at 01:12:28PM +, Dermot wrote: > I am using it in a perl class but if I could system(`fdupes`) that > might be preferable. I'll try building the sources and see what > happens. Failing that I'll have to fallback to slurping and SHA or > MD5. Other hashing algorithms exist and are faster but more prone to inadvertent collisions. If you've got a lot of data to compare, I'd use one of them (eg one of the variations on a CRC) and then only bring out the big SHA guns when that finds a collision. -- David Cantrell | even more awesome than a panda-fur coat Nuke a disabled unborn gay baby whale for JESUS!
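A sketch of that two-pass approach, assuming the CPAN Digest::CRC module is installed; the file list comes from the command line and everything else is illustrative:

#!/usr/bin/perl
# Sketch: cheap CRC32 pass first; SHA-256 only where the CRCs collide.
use strict;
use warnings;
use Digest::CRC;    # CPAN module -- assumed installed
use Digest::SHA;

my %by_crc;
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or die "$file: $!\n";
    my $crc = Digest::CRC->new(type => 'crc32');
    $crc->addfile($fh);
    close $fh;
    push @{ $by_crc{ $crc->hexdigest } }, $file;
}

for my $suspects (grep { @$_ > 1 } values %by_crc) {
    my %by_sha;
    for my $file (@$suspects) {
        my $sha = Digest::SHA->new(256);
        $sha->addfile($file, 'b');
        push @{ $by_sha{ $sha->hexdigest } }, $file;
    }
    print "duplicates: @$_\n" for grep { @$_ > 1 } values %by_sha;
}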
Re: SHA question
On Wed, Jan 13, 2010 at 07:16, Philip Newton wrote: > On Wed, Jan 13, 2010 at 15:58, Dermot wrote: >> 2010/1/13 Avi Greenbury : >> >>> You might've missed his point. >>> >>> If two files are of different sizes, they cannot be identical. Getting >>> the size of a file is substantially cheaper than hashing it. >>> >>> So you check all your filesizes, and need only hash those pairs or >>> groups that are all the same size. >> >> Sorry guess I didn't make myself clear. I need to store the SHA in an >> SQLite file. > > I think you're putting the cart before the horse. > > Did someone come up to you and say, "Dermot, put the SHA value in a > database."? > > I would have thought that you *need* to make sure that you detect > duplicate files (for example, to avoid processing "the same" file > twice). Storing the SHA in an SQLite file is a method you would *like* > to use to accomplish this, but may not be the only way nor the best > way. > > Along those lines, you may wish to store the filesize in bytes in your > database as well, as a first point of comparison; if the filesize is > unique, then the file must also be unique and you could save yourself > the time spent calculating a digest of the file's contents -- no > 1058-byte file can be the same as any 1927-byte file. If you're storing the collision data (size, hash, whatever) to protect against future collisions the only way this scheme of avoiding more expensive ops like hashing will work (AFAICS) is if you have some fiddlier code to lazily hash an old file when a newer future file comes along that matches an existing file size. >> Incident I get poor results from the MD5 compared with SHA so I can't >> relie on MD5 for >> >> MD5 (md5_base64) results: >> mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 >> MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 >> duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 >> MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 >> PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32 >> mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32 >> PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32 >> >> SHA (b64digest) results: >> mr_485_htu_AST.pdf PqsBpkKgGxdEHvkoNyou1NV5kuY 27 >> MR_2891.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27 >> duplicate.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27 >> MR_2898.pdf ULBRZQB00qZIfIWD7oqdpfVpFtw 27 >> PR_A02.pdf 6LdF6sWZnyLdWj44inFI6MSaUY4 27 >> mr_485_htu_hrt.pdf 0VNwG7IiaIneEX3jh3SBUBaXMK0 27 >> PR_A01.pdf JS33nJhzTo9YTqRWe01xnOb6bEM 27 > > That's... odd. md5sum's guarantee of "same if the hashes match" isn't > as strong as SHA's, but I still wouldn't expect two files to md5sum > the same if their SHA sums don'T match. > > However, those MD5 sums don't look like base-64 to me, so maybe you're > doing something wrong somewhere. > > Cheers, > Philip > -- > Philip Newton > >
Re: SHA question
On Wed, 13 Jan 2010 at 12:44:47PM +, Dermot wrote: > Hi, > > I have a lots of PDFs that I need to catalogue and I want to ensure > the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned > something similar with SHA1 and binary files. Am I right in thinking > that the code below is only taking the SHA on the name of the file and > if I want to ensure uniqueness of the content I need to do something > similar but as a file blob? > Have a look here: http://en.wikipedia.org/wiki/Fdupes There are links to Perl examples, that do SHA de-duplication. -- Adam Trickett Overton, HANTS, UK A bank is a place where they lend you an umbrella in fair weather and ask for it back when it begins to rain. -- Robert Frost
Re: SHA question
Dan Rowles wrote: Dermot wrote: [snip] Incident I get poor results from the MD5 compared with SHA so I can't relie on MD5 for MD5 (md5_base64) results: mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32 mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32 PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32 I think you must have a bug. Finding three MD5 collisions in seven files that are actually different to each other would be a really remarkable result depends on where the PDFs came from :-) http://www.win.tue.nl/hashclash/Nostradamus/ --matt -- Matthew Boyle, Systems Administrator, CoreFiling Limited Telephone: +44-1865-203192 Website: http://www.corefiling.com
Re: SHA question
Dermot wrote: [snip] Incident I get poor results from the MD5 compared with SHA so I can't relie on MD5 for MD5 (md5_base64) results: mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32 mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32 PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32 I think you must have a bug. Finding three MD5 collisions in seven files that are actually different to each other would be a really remarkable result Dan
Re: SHA question
On Wed, Jan 13, 2010 at 02:25:58PM +, Alexander Clouter wrote: >The following gives the duplicated hashes (you might prefer '-D' instead >of '-d'): But does not take account of hardlinks, and again hashes every file rather than just the ones that might be duplicates. R
Re: SHA question
On 13 Jan 2010, at 14:58, Dermot wrote: > MD5 (md5_base64) results: > mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 > MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32 > mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32 > PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32 Oh and run them through md5 in the shell to see what you get - the results should be the same. -- Andy Armstrong, Hexten
Re: SHA question
On 13 Jan 2010, at 14:58, Dermot wrote: > Incident I get poor results from the MD5 compared with SHA so I can't > relie on MD5 for > > MD5 (md5_base64) results: > mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 > MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32 > mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32 > PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32 If those files are different you're doing it wrong :) -- Andy Armstrong, Hexten
Re: SHA question
On Wed, Jan 13, 2010 at 15:58, Dermot wrote: > 2010/1/13 Avi Greenbury : > >> You might've missed his point. >> >> If two files are of different sizes, they cannot be identical. Getting >> the size of a file is substantially cheaper than hashing it. >> >> So you check all your filesizes, and need only hash those pairs or >> groups that are all the same size. > > Sorry guess I didn't make myself clear. I need to store the SHA in an > SQLite file. I think you're putting the cart before the horse. Did someone come up to you and say, "Dermot, put the SHA value in a database."? I would have thought that you *need* to make sure that you detect duplicate files (for example, to avoid processing "the same" file twice). Storing the SHA in an SQLite file is a method you would *like* to use to accomplish this, but may not be the only way nor the best way. Along those lines, you may wish to store the filesize in bytes in your database as well, as a first point of comparison; if the filesize is unique, then the file must also be unique and you could save yourself the time spent calculating a digest of the file's contents -- no 1058-byte file can be the same as any 1927-byte file. > Incident I get poor results from the MD5 compared with SHA so I can't > relie on MD5 for > > MD5 (md5_base64) results: > mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 > MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32 > mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32 > PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32 > > SHA (b64digest) results: > mr_485_htu_AST.pdf PqsBpkKgGxdEHvkoNyou1NV5kuY 27 > MR_2891.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27 > duplicate.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27 > MR_2898.pdf ULBRZQB00qZIfIWD7oqdpfVpFtw 27 > PR_A02.pdf 6LdF6sWZnyLdWj44inFI6MSaUY4 27 > mr_485_htu_hrt.pdf 0VNwG7IiaIneEX3jh3SBUBaXMK0 27 > PR_A01.pdf JS33nJhzTo9YTqRWe01xnOb6bEM 27 That's... odd. md5sum's guarantee of "same if the hashes match" isn't as strong as SHA's, but I still wouldn't expect two files to md5sum the same if their SHA sums don'T match. However, those MD5 sums don't look like base-64 to me, so maybe you're doing something wrong somewhere. Cheers, Philip -- Philip Newton
Re: SHA question
Roger Burton West wrote: > > You may want to be slightly cleverer about it - taking a SHAsum is > computationally expensive, and it's only worth doing if the files have > the same size. > > If you don't require a pure-Perl solution, bear in mind that all this > has been done for you in the "fdupes" program, already in Debian or at > http://netdial.caribe.net/~adrian2/programs/ . >

*sigh* The following gives the duplicated hashes (you might prefer '-D' instead of '-d'):

md5sum /path/to/pdfs/* | sort | uniq -w32 -d

Replace the '-d' with '-u' if you want to just see the unique ones. I'll leave it as an exercise for the reader to pipe the output of '-D' into some xargs action to 'rm' and 'ln -s' the duplicates. Cheers -- Alexander Clouter .sigmonster says: For fast-acting relief, try slowing down.
Re: SHA question
On 13 Jan 2010, at 14:40, Philip Newton wrote: [...] > Well, that said, is the "very low chance" not on the order of the > chance that you'll be run over by a bus in the morning, or that one of > the files will be changed through cosmic rays or bit rot in the > magnetic domains of the hard disk platter? In the case of SHA-256, the odds are low enough that the universe is likely to end before you find a collision.
Re: SHA question
2010/1/13 Avi Greenbury : > You might've missed his point. > > If two files are of different sizes, they cannot be identical. Getting > the size of a file is substantially cheaper than hashing it. > > So you check all your filesizes, and need only hash those pairs or > groups that are all the same size.

Sorry, guess I didn't make myself clear. I need to store the SHA in an SQLite file. I have a few files to handle now but I will get a constant dribble from now on. I want to try and ensure that I haven't already databased a file that I'll process in the future.

Incidentally, I get poor results from MD5 compared with SHA, so I can't rely on MD5:

MD5 (md5_base64) results:
mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32
MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32
mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32
PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32

SHA (b64digest) results:
mr_485_htu_AST.pdf PqsBpkKgGxdEHvkoNyou1NV5kuY 27
MR_2891.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27
duplicate.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27
MR_2898.pdf ULBRZQB00qZIfIWD7oqdpfVpFtw 27
PR_A02.pdf 6LdF6sWZnyLdWj44inFI6MSaUY4 27
mr_485_htu_hrt.pdf 0VNwG7IiaIneEX3jh3SBUBaXMK0 27
PR_A01.pdf JS33nJhzTo9YTqRWe01xnOb6bEM 27

> Thirdly, be aware of what hashing guarantees. It does *not* guarantee > uniqueness, it just gives you a very low chance that two files with > the same hash are different. It does guarantee that files with > different hashes are different, though. >

I think that's the best I can hope for. If that 'duplicate.pdf' turned up again at least I'd be able to correctly identify it. That's the goal. I will give fdupes a look too. Thanks all. Dp.
Re: SHA question
On Wed, Jan 13, 2010 at 15:06, James Laver wrote: > Thirdly, be aware of what hashing guarantees. It does *not* guarantee > uniqueness, it just gives you a very low chance that two files with > the same hash are different. Well, that said, is the "very low chance" not on the order of the chance that you'll be run over by a bus in the morning, or that one of the files will be changed through cosmic rays or bit rot in the magnetic domains of the hard disk platter? In other words, is 1x10^-64 (or whatever it might be) not so small as to be effectively zero, since there are much "higher" risks (say, 1x10^-32) which you do not guard against, either? Cheers, Philip -- Philip Newton
Re: SHA question
On Wed, Jan 13, 2010 at 1:46 PM, Dermot wrote: > 2010/1/13 Roger Burton West : > >>>I am using it in a perl class >> >> So I won't point out the implications, but there's an obvious one which >> will make your life easier. > > You can't leave me hanging there > Dp. > Well, there are a few things... Firstly, you are indeed just hashing the filename, not the file contents. Secondly, you're using Digest::SHA directly. The Digest:: series of modules are meant to be used through the 'Digest' interface as in the example Steffan gave. Doing this will make your life easier in most cases (by providing a standard interface across almost all digest algorithms and making it easy to switch (though ::Whirlpool disobeys the rules of the interface :/ )) and provides the handy addfile method you're looking for. Thirdly, be aware of what hashing guarantees. It does *not* guarantee uniqueness, it just gives you a very low chance that two files with the same hash are different. It does guarantee that files with different hashes are different, though. Lastly, as regards on-topicness, Perl is definitely off-topic. Beer, Pies, Dim Sum and Buffy are on-topic.* On topic: Buffy eating a dim sum pie and washing it down with beer. --James * But you can still post perl here.
Re: SHA question
Dermot wrote: > 2010/1/13 Roger Burton West : > > You may want to be slightly cleverer about it - taking a SHAsum is > > computationally expensive, and it's only worth doing if the files > > have the same size. > > Unfortunately the size varies quite a bit. You might've missed his point. If two files are of different sizes, they cannot be identical. Getting the size of a file is substantially cheaper than hashing it. So you check all your filesizes, and need only hash those pairs or groups that are all the same size. -- Avi Greenbury
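A sketch of that first pass; the glob pattern is only an example, and anything that shares a size would then go on to be hashed:

#!/usr/bin/perl
# Sketch: bucket files by size; a unique size means a unique file,
# so only groups with two or more members ever need hashing.
use strict;
use warnings;

my @pdfs = glob('pdfs/*.pdf');    # example location
my %by_size;
push @{ $by_size{ -s $_ } }, $_ for @pdfs;

for my $size (sort { $a <=> $b } keys %by_size) {
    my @group = @{ $by_size{$size} };
    next if @group == 1;          # unique size => cannot be a duplicate
    print "need to hash ($size bytes): @group\n";
}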
Re: SHA question
2010/1/13 Luis Motta Campos : > I believe the official answer to this question would be "The London Perl > Mongers list considers on-topic messages that talk about Ponies, Buffy, > Beer, and Pie. Everything else should be tagged as 'off-toppic'". There is even a FAQ about this: http://london.pm.org/about/faq.html#topic Having said that, I've been lurking here a few months now and I've seen very little talk of any of the aforementioned topics D: Phil
Re: SHA question
2010/1/13 Roger Burton West : >>I am using it in a perl class > > So I won't point out the implications, but there's an obvious one which > will make your life easier. You can't leave me hanging there Dp.
Re: SHA question
On Wed, Jan 13, 2010 at 01:12:28PM +, Dermot wrote: >Unfortunately the size varies quite a bit. There are a few 11Mb pdfs >but the majority are under 1mb. No, that's _good_. >I am using it in a perl class So I won't point out the implications, but there's an obvious one which will make your life easier. R
Re: SHA question
Dermot wrote at 12:44 on 2010-01-13: > Hi, > > I have a lots of PDFs that I need to catalogue and I want to ensure > the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned > something similar with SHA1 and binary files. Am I right in thinking > that the code below is only taking the SHA on the name of the file and > if I want to ensure uniqueness of the content I need to do something > similar but as a file blob? Yes, that looks about right. From a brief look at http://perldoc.perl.org/Digest/SHA.html it appears that you may want

my $sha = Digest::SHA->new(512);
$sha->addfile($n);
$digest = $sha->digest;   # or hexdigest or b64digest

in your inner loop. S
Re: SHA question
Dermot wrote: > Hi, > > I have a lots of PDFs that I need to catalogue and I want to ensure > the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned > something similar with SHA1 and binary files. Am I right in thinking > that the code below is only taking the SHA on the name of the file > and if I want to ensure uniqueness of the content I need to do > something similar but as a file blob? > > [code was here] > Yes, your code processes file names, not file contents. > PS: I don't see many perl questions here, am I breaking a convention? I believe the official answer to this question would be "The London Perl Mongers list considers on-topic messages that talk about Ponies, Buffy, Beer, and Pie. Everything else should be tagged as 'off-topic'". As I'm really bad at remembering things and also a non-native speaker, YMMV, wording- and semantic-wise. Cheers -- Luis Motta Campos is a software engineer, Perl Programmer, foodie and photographer.
Re: SHA question
2010/1/13 Roger Burton West : > On Wed, Jan 13, 2010 at 12:44:47PM +, Dermot wrote: > >>I have a lots of PDFs that I need to catalogue and I want to ensure >>the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned >>something similar with SHA1 and binary files. Am I right in thinking >>that the code below is only taking the SHA on the name of the file and >>if I want to ensure uniqueness of the content I need to do something >>similar but as a file blob? > > Yes. > > You may want to be slightly cleverer about it - taking a SHAsum is > computationally expensive, and it's only worth doing if the files have > the same size. Unfortunately the size varies quite a bit. There are a few 11Mb pdfs but the majority are under 1mb. This application isn't for public consumption so I don't have to worry about speed. However there are other services on the server and I wouldn't want to blindly slurp a 50mb pdf I guess. > If you don't require a pure-Perl solution, bear in mind that all this > has been done for you in the "fdupes" program, already in Debian or at > http://netdial.caribe.net/~adrian2/programs/ . I am using it in a perl class but if I could system(`fdupes`) that might be preferable. I'll try building the sources and see what happens. Failing that I'll have to fallback to slurping and SHA or MD5. Thanx, Dp.
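If shelling out does turn out to be acceptable, parsing fdupes' output is straightforward. A sketch, assuming fdupes is on the PATH and run with its default output format (duplicate groups separated by blank lines); the directory is only an example:

#!/usr/bin/perl
# Sketch: let fdupes find the duplicates and read its output back.
use strict;
use warnings;

my $dir = shift || './pdfs';                   # example directory
open my $out, '-|', 'fdupes', '-r', $dir       # -r recurses into subdirs
    or die "Can't run fdupes: $!\n";

my @group;
while (my $line = <$out>) {
    chomp $line;
    if (length $line) {
        push @group, $line;                    # one filename per line
    }
    elsif (@group) {                           # blank line ends a group
        print "duplicates: @group\n";
        @group = ();
    }
}
print "duplicates: @group\n" if @group;
close $out;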
Re: SHA question
On Wed, Jan 13, 2010 at 12:44:47PM +, Dermot wrote: >I have a lots of PDFs that I need to catalogue and I want to ensure >the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned >something similar with SHA1 and binary files. Am I right in thinking >that the code below is only taking the SHA on the name of the file and >if I want to ensure uniqueness of the content I need to do something >similar but as a file blob? Yes. You may want to be slightly cleverer about it - taking a SHAsum is computationally expensive, and it's only worth doing if the files have the same size. If you don't require a pure-Perl solution, bear in mind that all this has been done for you in the "fdupes" program, already in Debian or at http://netdial.caribe.net/~adrian2/programs/ . Roger
SHA question
Hi,

I have a lot of PDFs that I need to catalogue and I want to ensure the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned something similar with SHA1 and binary files. Am I right in thinking that the code below is only taking the SHA on the name of the file and if I want to ensure uniqueness of the content I need to do something similar but as a file blob?

Thanx,
Dp.

use strict;
use warnings;
use Digest::SHA qw(sha256_hex);
use FindBin qw($Bin);

my $top = "$Bin/pdfs";
opendir my $dir, "$top" or die "Can't open $top: $!\n";
my @files = grep { /pdf$/ } readdir $dir;

foreach my $n (@files) {
    if ( -e "$top/$n" ) {
        my $digest = sha256_hex($n);
        print "$n\t$digest\t:" . length($digest) . "\n";
    } else {
        print "Can't find $top/$n\n";
    }
}

PS: I don't see many perl questions here, am I breaking a convention?