Re: SHA question
Andy Wardley wrote: On 14/01/2010 17:41, Philip Newton wrote: Yes - you're missing the fact that in order to compute the differences (which it has to if it doesn't want to transfer the whole file), it has to read the entire file over the slow NFS link into your computer's memory in order to compare it with the "local" file in order to tell which pieces have changed. No, I don't think it does. My understanding[*] is that it computes a checksum for each block of a file and only transmits blocks that have different checksums. Of course, but to compute a checksum for each block of the file, that block first needs to be read, over the NFS connection, which is the whole issue. Normally, rsync would be speaking to rsync running on the remote box, but the situation David described was one rsync process on box A, accessing files on box B via an NFS mount (as opposed to speaking to an rsync daemon on box B). I'm not entirely sure, but I think that rsync will first compare the timestamps of the two files, and if the timestamps match (to within the window specified with --modify-window, defaulting to an exact match), and the sizes match, it will consider the file to be the same, and skip generating checksums (so the file's data won't be read over NFS).
Re: SHA question
On 15/01/2010 20:23, Roger Burton West wrote: And to calculate the checksum on each block of the file, it has to, um, read each block of the file... yes? Sorry, I missed this bit in Philip's message: > if both source and destination are on a local file system I was thinking about remote comparisons. In which case the remote rsync daemon computes the checksum. Yes, it has to read the entire file, but not transmit it. A
Re: SHA question
On Jan 15, 2010, at 14:19, ian wrote: >>> My understanding[*] is that it computes a checksum for each block of a file >>> and only transmits blocks that have different checksums. >> >> And to calculate the checksum on each block of the file, it has to, um, >> read each block of the file... yes? >> > Doesn't rsync *push* rather than *pull* in which case the files it computes > the checksum on are all local. > > I did not think it worked in the way you mention without rsync daemon running > at the remote end doing the checksum for you. But with NFS the "remote" is "local". You need rsync running on the box where the storage is to get "cheaper" checksums. - ask
Re: SHA question
On 15/01/2010 20:23, Roger Burton West wrote: On Fri, Jan 15, 2010 at 08:16:09PM +, Andy Wardley wrote: My understanding[*] is that it computes a checksum for each block of a file and only transmits blocks that have different checksums. And to calculate the checksum on each block of the file, it has to, um, read each block of the file... yes? Doesn't rsync *push* rather than *pull* in which case the files it computes the checksum on are all local. I did not think it worked in the way you mention without rsync daemon running at the remote end doing the checksum for you.
Re: SHA question
On Fri, Jan 15, 2010 at 08:16:09PM +, Andy Wardley wrote: > My understanding[*] is that it computes a checksum for each block of a file > and only transmits blocks that have different checksums. And to calculate the checksum on each block of the file, it has to, um, read each block of the file... yes? R
Re: SHA question
On 14/01/2010 17:41, Philip Newton wrote: Yes - you're missing the fact that in order to compute the differences (which it has to if it doesn't want to transfer the whole file), it has to read the entire file over the slow NFS link into your computer's memory in order to compare it with the "local" file in order to tell which pieces have changed. No, I don't think it does. My understanding[*] is that it computes a checksum for each block of a file and only transmits blocks that have different checksums. That's how it handles incremental changes on large files (e.g. an extra few lines at the end of a log file doesn't require the whole file to be transmitted). Some relevant options are:

  --checksum      always checksum
  --block-size    checksum block size
  --whole-file    transmit the whole file
  --size-only     compare file size instead of checksum

A

[*] which could be flawed
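For the NFS-mounted case being discussed, an invocation from Perl might look roughly like this (a sketch only: the paths are invented, and --no-whole-file, mentioned elsewhere in the thread, forces the delta algorithm even when the destination looks local):

#!/usr/bin/perl
# Sketch: drive rsync from Perl, forcing the block-checksum delta
# transfer even though the destination is "local" (an NFS mount).
# The source and destination paths are made up for illustration.
use strict;
use warnings;

my @cmd = (
    'rsync', '-av',
    '--no-whole-file',          # don't just copy whole files
    '/data/pdfs/',              # hypothetical local source
    '/mnt/remote-nfs/pdfs/',    # hypothetical NFS-mounted destination
);
system(@cmd) == 0 or die "rsync exited with status $?\n";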
Re: SHA question
On Thu, Jan 14, 2010 at 16:20, Matthew Boyle wrote: > David Cantrell wrote: >> >> On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote: >> >>> That reminds me of how I was disappointed to find that rsync generally >>> transfers complete files (rather than diffs) if both source and >>> destination are on a local file system -- before I realised that to >>> compute the diffs, it would have to read the entire first and second >>> files, and if it's going to read the entire first file from disk >>> anyway, it can simply dump it over the second file without checking. >>> Computing diffs would be more work in this case, not less. >> >> Shame that "local" includes "at the other end of a really slow NFS >> connection to the other side of the world". Mind you, absent running the >> rsync daemon at the other end and using that instead of NFS, I'm not >> sure if there's a better way of doing it. > > the --no-whole-file option? or am i missing something? Yes - you're missing the fact that in order to compute the differences (which it has to if it doesn't want to transfer the whole file), it has to read the entire file over the slow NFS link into your computer's memory in order to compare it with the "local" file in order to tell which pieces have changed. So transferring the whole file is probably faster, at least under the assumption that reading and writing are about the same speed over that slow link. (If reading is much faster than writing, then you might still save some time this way.) Cheers, Philip -- Philip Newton
Re: SHA question
Matthew Boyle wrote: David Cantrell wrote: On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote: That reminds me of how I was disappointed to find that rsync generally transfers complete files (rather than diffs) if both source and destination are on a local file system -- before I realised that to compute the diffs, it would have to read the entire first and second files, and if it's going to read the entire first file from disk anyway, it can simply dump it over the second file without checking. Computing diffs would be more work in this case, not less. Shame that "local" includes "at the other end of a really slow NFS connection to the other side of the world". Mind you, absent running the rsync daemon at the other end and using that instead of NFS, I'm not sure if there's a better way of doing it. the --no-whole-file option? or am i missing something? This is of course what I was referring to when I mentioned the diametrically opposite option. *sigh* Matt
Re: SHA question
David Cantrell wrote: On Thu, Jan 14, 2010 at 02:03:33PM +, Roger Burton West wrote: On Thu, Jan 14, 2010 at 01:59:22PM +, David Cantrell wrote: Shame that "local" includes "at the other end of a really slow NFS connection to the other side of the world". Mind you, absent running the rsync daemon at the other end and using that instead of NFS, I'm not sure if there's a better way of doing it. Possibly I'm missing something, but: ssh? That boils down to the same thing - it ends up invoking rsync at the other end in daemon mode and talking the rsync protocol tunnelled through ssh. What I was getting at was that I don't see a better way of working if you have to use a networky filesystem. Isn't this what the -W flag is for? Matt
Re: SHA question
On 14 Jan 2010, at 14:16, Mark Fowler wrote: [...] > I'd just use Digest::MD5 to calculate the filesize. It's cheap > compared to SHA, you don't care about the exact cryptographic security > of the hash, and will work even if you don't have the original to > compare again. I assume you wrote "filesize" when you meant "digest". You should consider MD5 compromised unless you know for sure that your problem does not need to defend against the relatively low-effort birthday attack against it. At this point in time, you shouldn't be considering anything weaker than SHA-256 for new code. Choosing the weak MD5 over SHA-256 because it's faster or produces a shorter key is just premature optimisation.
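A minimal sketch of that, using Digest::SHA directly; the command-line handling is only an example:

#!/usr/bin/perl
# Sketch: SHA-256 digest of a file's contents (not its name).
use strict;
use warnings;
use Digest::SHA;

my $filename = shift or die "usage: $0 <file>\n";
my $sha = Digest::SHA->new(256);
$sha->addfile($filename, "b");      # "b" = read the file in binary mode
print "SHA-256($filename) = ", $sha->hexdigest, "\n";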
Re: SHA question
David Cantrell wrote: On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote: That reminds me of how I was disappointed to find that rsync generally transfers complete files (rather than diffs) if both source and destination are on a local file system -- before I realised that to compute the diffs, it would have to read the entire first and second files, and if it's going to read the entire first file from disk anyway, it can simply dump it over the second file without checking. Computing diffs would be more work in this case, not less. Shame that "local" includes "at the other end of a really slow NFS connection to the other side of the world". Mind you, absent running the rsync daemon at the other end and using that instead of NFS, I'm not sure if there's a better way of doing it. the --no-whole-file option? or am i missing something? --matt -- Matthew Boyle, Systems Administrator, CoreFiling Limited Telephone: +44-1865-203192 Website: http://www.corefiling.com
Re: SHA question
On Thu, Jan 14, 2010 at 02:03:33PM +, Roger Burton West wrote: > On Thu, Jan 14, 2010 at 01:59:22PM +, David Cantrell wrote: > >Shame that "local" includes "at the other end of a really slow NFS > >connection to the other side of the world". Mind you, absent running the > >rsync daemon at the other end and using that instead of NFS, I'm not > >sure if there's a better way of doing it. > Possibly I'm missing something, but: ssh? That boils down to the same thing - it ends up invoking rsync at the other end in daemon mode and talking the rsync protocol tunnelled through ssh. What I was getting at was that I don't see a better way of working if you have to use a networky filesystem. -- David Cantrell | Reality Engineer, Ministry of Information Longum iter est per praecepta, breve et efficax per exempla.
Re: SHA question
On Wed, Jan 13, 2010 at 3:16 PM, Philip Newton wrote: > Along those lines, you may wish to store the filesize in bytes in your > database as well, as a first point of comparison; if the filesize is > unique, then the file must also be unique and you could save yourself > the time spent calculating a digest of the file's contents -- no > 1058-byte file can be the same as any 1927-byte file. This is only possible if you've still got all the pdfs on disk, as, as soon as you get your suspected duplicate, you'll have to hash both files' contents to tell if you have one or not. If you've sent them onto a better place and deleted them however, then you're out of luck. I'd just use Digest::MD5 to calculate the filesize. It's cheap compared to SHA, you don't care about the exact cryptographic security of the hash, and will work even if you don't have the original to compare against.

#!/usr/bin/perl
use Modern::Perl;
use autodie;
use Digest::MD5;

my $filename = shift;
open my $fh, "<:bytes", $filename;
my $md5 = Digest::MD5->new;
$md5->addfile($fh);
say "The file's md5 is: " . $md5->b64digest;

Don't forget the "<:bytes" (you're comparing bytes, not characters). Once you've got a toy version up and running and you can get a "feel" for how fast it is on your system, you can optimise if you don't like the performance. Mark.
Re: SHA question
On Thu, Jan 14, 2010 at 01:59:22PM +, David Cantrell wrote: >Shame that "local" includes "at the other end of a really slow NFS >connection to the other side of the world". Mind you, absent running the >rsync daemon at the other end and using that instead of NFS, I'm not >sure if there's a better way of doing it. Possibly I'm missing something, but: ssh? R
Re: SHA question
On Thu, Jan 14, 2010 at 02:02:51PM +0100, Philip Newton wrote: > That reminds me of how I was disappointed to find that rsync generally > transfers complete files (rather than diffs) if both source and > destination are on a local file system -- before I realised that to > compute the diffs, it would have to read the entire first and second > files, and if it's going to read the entire first file from disk > anyway, it can simply dump it over the second file without checking. > Computing diffs would be more work in this case, not less. Shame that "local" includes "at the other end of a really slow NFS connection to the other side of the world". Mind you, absent running the rsync daemon at the other end and using that instead of NFS, I'm not sure if there's a better way of doing it. -- David Cantrell | Reality Engineer, Ministry of Information
Re: SHA question
On Thu, Jan 14, 2010 at 13:22, Peter Corlett wrote: > For de-duping purposes, SHA is still faster than you can pull the files off > the disk and a secondary cheaper hash is unnecessary. That reminds me of how I was disappointed to find that rsync generally transfers complete files (rather than diffs) if both source and destination are on a local file system -- before I realised that to compute the diffs, it would have to read the entire first and second files, and if it's going to read the entire first file from disk anyway, it can simply dump it over the second file without checking. Computing diffs would be more work in this case, not less. So yes, I suppose something similar applies here -- you have to read the entire file anyway, so you might as well go with SHA-$number_of_your_choice. Cheers, Philip -- Philip Newton
Re: SHA question
On 13 Jan 2010, at 17:53, David Cantrell wrote: [...] > Other hashing algorithms exist and are faster but more prone to > inadvertant collisions. If you've got a lot of data to compare, I'd > use one of them (eg one of the variations on a CRC) and then only > bring out the big SHA guns when that finds a collision. That's a premature optimisation which just complicates the code, unless you mean *a lot* such as in the rdiff algorithm. For de-duping purposes, SHA is still faster than you can pull the files off the disk and a secondary cheaper hash is unnecessary.
Re: SHA question
On Wed, Jan 13, 2010 at 09:53, David Cantrell wrote: > On Wed, Jan 13, 2010 at 01:12:28PM +, Dermot wrote: > >> I am using it in a perl class but if I could system(`fdupes`) that >> might be preferable. I'll try building the sources and see what >> happens. Failing that I'll have to fallback to slurping and SHA or >> MD5. > > Other hashing algorithms exist and are faster but more prone to > inadvertant collisions. If you've got a lot of data to compare, I'd > use one of them (eg one of the variations on a CRC) and then only > bring out the big SHA guns when that finds a collision. Or cmp ;-)
Re: SHA question
On Wed, Jan 13, 2010 at 02:58:59PM +, Dermot wrote: > 2010/1/13 Avi Greenbury : > > Thirdly, be aware of what hashing guarantees. It does *not* guarantee > > uniqueness, it just gives you a very low chance that two files with > > the same hash are different. It does guarantee that files with > > different hashes are different, though. > I think that's the best I can hope for. If that 'duplicate.pdf' turned > up again at least I be able to correctly identify it. That's the goal. > I will give fdupes a look too. Of course, if SHA (or whatever) does give you the same result for two files, verifying that they really are the same is trivial ... (and if they're not, lots of people would be Really Interested to know). -- David Cantrell | Bourgeois reactionary pig When a man is tired of London, he is tired of life -- Samuel Johnson
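That trivial check can be done with the core File::Compare module; a sketch, reusing a couple of the filenames from earlier in the thread as examples:

#!/usr/bin/perl
# Sketch: if two files hash the same, confirm they really are identical.
use strict;
use warnings;
use File::Compare qw(compare);

my ($file1, $file2) = ('MR_2891.pdf', 'duplicate.pdf');   # example names
my $result = compare($file1, $file2);   # 0 = same, 1 = differ, -1 = error
if    ($result == 0) { print "$file1 and $file2 are byte-for-byte identical\n" }
elsif ($result == 1) { print "$file1 and $file2 differ -- a genuine collision!\n" }
else                 { die "Couldn't compare $file1 and $file2: $!\n" }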
Re: SHA question
2010/1/13 Paul Makepeace : > On Wed, Jan 13, 2010 at 07:16, Philip Newton wrote: >> On Wed, Jan 13, 2010 at 15:58, Dermot wrote: >>> 2010/1/13 Avi Greenbury : >> >> I think you're putting the cart before the horse. >> >> Did someone come up to you and say, "Dermot, put the SHA value in a >> database."? >> I would have thought that you *need* to make sure that you detect >> duplicate files (for example, to avoid processing "the same" file >> twice). Storing the SHA in an SQLite file is a method you would *like* >> to use to accomplish this, but may not be the only way nor the best >> way.

Yet more background. *sigh* The process runs as follows:

1) A source submits some digital files.
2) Extract EXIF from the digital files; this may contain the name of the PDF file.
3) Find said PDF on the file system.
4) DB - Have I seen this PDF before?
   Yes: Assign the existing ID to the new row we're creating for its parent record (the digital file).
   No: Assign the PDF an ID, assign that ID to the parent record, rename, post/upload to the remote server.

The same PDF can come from a number of sources, so PDFs are not unique to a source, and the same PDF may appear in more than one (parent) record. The PDF exists on a remote server after that so, you're right, I don't want to process the same file twice.

>> Along those lines, you may wish to store the filesize in bytes in your >> database as well, as a first point of comparison; if the filesize is >> unique, then the file must also be unique and you could save yourself >> the time spent calculating a digest of the file's contents -- no >> 1058-byte file can be the same as any 1927-byte file.

If I go with byte size and do ('PDF')->search({ file_size => 1058 }) and get 3 results, I then have to back-track, take the SHA and do the search again. With SHA, it might be expensive but it's always unique[1] so I can simply do ('PDF')->find_or_new(\%hash) and get the ID back. I don't think you're suggesting that I rely on the file size as a unique identifier, and I can see how a search with no results might short-circuit some stuff. But I will need that SHA when I get files of the same size so I may as well store it from the beginning.

> If you're storing the collision data (size, hash, whatever) to protect > against future collisions the only way this scheme of avoiding more > expensive ops like hashing will work (AFAICS) is if you have some > fiddlier code to lazily hash an old file when a newer future file > comes along that matches an existing file size. > >>> Incident I get poor results from the MD5 compared with SHA so I can't >>> relie on MD5 for >> >> That's... odd. md5sum's guarantee of "same if the hashes match" isn't >> as strong as SHA's, but I still wouldn't expect two files to md5sum >> the same if their SHA sums don'T match. >> >> However, those MD5 sums don't look like base-64 to me, so maybe you're >> doing something wrong somewhere.

Yes, I'd better 'fess up here. I had a bug :P I was using the hex_b4base() in a not too clever way. I should have been using addfile().

Dp

[1] At least once in 1x10^-64
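For illustration only, the "have I seen this PDF before?" step might look like this with DBIx::Class. This is a sketch: it assumes a schema with a 'PDF' source whose 'sha' column carries a unique constraint, and $schema and $path are assumed to exist already.

use Digest::SHA;

# Sketch: hash the PDF's contents, then find or create its row.
my $sha = Digest::SHA->new(256);
$sha->addfile($path, 'b');
my $digest = $sha->b64digest;

my $pdf = $schema->resultset('PDF')->find_or_new({ sha => $digest });
unless ($pdf->in_storage) {
    # Not seen before: store it, then rename/upload as described above.
    $pdf->insert;
}
my $pdf_id = $pdf->id;    # assign this to the parent record either way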
Re: SHA question
On Wed, Jan 13, 2010 at 01:12:28PM +, Dermot wrote: > I am using it in a perl class but if I could system(`fdupes`) that > might be preferable. I'll try building the sources and see what > happens. Failing that I'll have to fallback to slurping and SHA or > MD5. Other hashing algorithms exist and are faster but more prone to inadvertent collisions. If you've got a lot of data to compare, I'd use one of them (eg one of the variations on a CRC) and then only bring out the big SHA guns when that finds a collision. -- David Cantrell | even more awesome than a panda-fur coat Nuke a disabled unborn gay baby whale for JESUS!
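A sketch of that two-pass approach, assuming the CPAN Digest::CRC module is installed; the file list comes from the command line and everything else is illustrative:

#!/usr/bin/perl
# Sketch: cheap CRC32 pass first; SHA-256 only where the CRCs collide.
use strict;
use warnings;
use Digest::CRC;    # CPAN module -- assumed installed
use Digest::SHA;

my %by_crc;
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or die "$file: $!\n";
    my $crc = Digest::CRC->new(type => 'crc32');
    $crc->addfile($fh);
    close $fh;
    push @{ $by_crc{ $crc->hexdigest } }, $file;
}

for my $suspects (grep { @$_ > 1 } values %by_crc) {
    my %by_sha;
    for my $file (@$suspects) {
        my $sha = Digest::SHA->new(256);
        $sha->addfile($file, 'b');
        push @{ $by_sha{ $sha->hexdigest } }, $file;
    }
    print "duplicates: @$_\n" for grep { @$_ > 1 } values %by_sha;
}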
Re: SHA question
On Wed, Jan 13, 2010 at 07:16, Philip Newton wrote: > On Wed, Jan 13, 2010 at 15:58, Dermot wrote: >> 2010/1/13 Avi Greenbury : >> >>> You might've missed his point. >>> >>> If two files are of different sizes, they cannot be identical. Getting >>> the size of a file is substantially cheaper than hashing it. >>> >>> So you check all your filesizes, and need only hash those pairs or >>> groups that are all the same size. >> >> Sorry guess I didn't make myself clear. I need to store the SHA in an >> SQLite file. > > I think you're putting the cart before the horse. > > Did someone come up to you and say, "Dermot, put the SHA value in a > database."? > > I would have thought that you *need* to make sure that you detect > duplicate files (for example, to avoid processing "the same" file > twice). Storing the SHA in an SQLite file is a method you would *like* > to use to accomplish this, but may not be the only way nor the best > way. > > Along those lines, you may wish to store the filesize in bytes in your > database as well, as a first point of comparison; if the filesize is > unique, then the file must also be unique and you could save yourself > the time spent calculating a digest of the file's contents -- no > 1058-byte file can be the same as any 1927-byte file. If you're storing the collision data (size, hash, whatever) to protect against future collisions the only way this scheme of avoiding more expensive ops like hashing will work (AFAICS) is if you have some fiddlier code to lazily hash an old file when a newer future file comes along that matches an existing file size. >> Incident I get poor results from the MD5 compared with SHA so I can't >> relie on MD5 for >> >> MD5 (md5_base64) results: >> mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 >> MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 >> duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 >> MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 >> PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32 >> mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32 >> PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32 >> >> SHA (b64digest) results: >> mr_485_htu_AST.pdf PqsBpkKgGxdEHvkoNyou1NV5kuY 27 >> MR_2891.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27 >> duplicate.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27 >> MR_2898.pdf ULBRZQB00qZIfIWD7oqdpfVpFtw 27 >> PR_A02.pdf 6LdF6sWZnyLdWj44inFI6MSaUY4 27 >> mr_485_htu_hrt.pdf 0VNwG7IiaIneEX3jh3SBUBaXMK0 27 >> PR_A01.pdf JS33nJhzTo9YTqRWe01xnOb6bEM 27 > > That's... odd. md5sum's guarantee of "same if the hashes match" isn't > as strong as SHA's, but I still wouldn't expect two files to md5sum > the same if their SHA sums don'T match. > > However, those MD5 sums don't look like base-64 to me, so maybe you're > doing something wrong somewhere. > > Cheers, > Philip > -- > Philip Newton > >
Re: SHA question
On Wed, 13 Jan 2010 at 12:44:47PM +, Dermot wrote: > Hi, > > I have a lots of PDFs that I need to catalogue and I want to ensure > the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned > something similar with SHA1 and binary files. Am I right in thinking > that the code below is only taking the SHA on the name of the file and > if I want to ensure uniqueness of the content I need to do something > similar but as a file blob? > Have a look here: http://en.wikipedia.org/wiki/Fdupes There are links to Perl examples, that do SHA de-duplication. -- Adam Trickett Overton, HANTS, UK A bank is a place where they lend you an umbrella in fair weather and ask for it back when it begins to rain. -- Robert Frost
Re: SHA question
Dan Rowles wrote: Dermot wrote: [snip] Incident I get poor results from the MD5 compared with SHA so I can't relie on MD5 for MD5 (md5_base64) results: mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32 mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32 PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32 I think you must have a bug. Finding three MD5 collisions in seven files that are actually different to each other would be a really remarkable result depends on where the PDFs came from :-) http://www.win.tue.nl/hashclash/Nostradamus/ --matt -- Matthew Boyle, Systems Administrator, CoreFiling Limited Telephone: +44-1865-203192 Website: http://www.corefiling.com
Re: SHA question
Dermot wrote: [snip] Incident I get poor results from the MD5 compared with SHA so I can't relie on MD5 for MD5 (md5_base64) results: mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32 mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32 PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32 I think you must have a bug. Finding three MD5 collisions in seven files that are actually different to each other would be a really remarkable result Dan
Re: SHA question
On Wed, Jan 13, 2010 at 02:25:58PM +, Alexander Clouter wrote: >The following gives the duplicated hashes (you might prefer '-D' instead >of '-d'): But does not take account of hardlinks, and again hashes every file rather than just the ones that might be duplicates. R
Re: SHA question
On 13 Jan 2010, at 14:58, Dermot wrote: > MD5 (md5_base64) results: > mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 > MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32 > mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32 > PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32 Oh and run them through md5 in the shell to see what you get - the results should be the same. -- Andy Armstrong, Hexten
Re: SHA question
On 13 Jan 2010, at 14:58, Dermot wrote: > Incident I get poor results from the MD5 compared with SHA so I can't > relie on MD5 for > > MD5 (md5_base64) results: > mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 > MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32 > mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32 > PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32 If those files are different you're doing it wrong :) -- Andy Armstrong, Hexten
Re: SHA question
On Wed, Jan 13, 2010 at 15:58, Dermot wrote: > 2010/1/13 Avi Greenbury : > >> You might've missed his point. >> >> If two files are of different sizes, they cannot be identical. Getting >> the size of a file is substantially cheaper than hashing it. >> >> So you check all your filesizes, and need only hash those pairs or >> groups that are all the same size. > > Sorry guess I didn't make myself clear. I need to store the SHA in an > SQLite file. I think you're putting the cart before the horse. Did someone come up to you and say, "Dermot, put the SHA value in a database."? I would have thought that you *need* to make sure that you detect duplicate files (for example, to avoid processing "the same" file twice). Storing the SHA in an SQLite file is a method you would *like* to use to accomplish this, but may not be the only way nor the best way. Along those lines, you may wish to store the filesize in bytes in your database as well, as a first point of comparison; if the filesize is unique, then the file must also be unique and you could save yourself the time spent calculating a digest of the file's contents -- no 1058-byte file can be the same as any 1927-byte file. > Incident I get poor results from the MD5 compared with SHA so I can't > relie on MD5 for > > MD5 (md5_base64) results: > mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32 > MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32 > PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32 > mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32 > PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32 > > SHA (b64digest) results: > mr_485_htu_AST.pdf PqsBpkKgGxdEHvkoNyou1NV5kuY 27 > MR_2891.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27 > duplicate.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27 > MR_2898.pdf ULBRZQB00qZIfIWD7oqdpfVpFtw 27 > PR_A02.pdf 6LdF6sWZnyLdWj44inFI6MSaUY4 27 > mr_485_htu_hrt.pdf 0VNwG7IiaIneEX3jh3SBUBaXMK0 27 > PR_A01.pdf JS33nJhzTo9YTqRWe01xnOb6bEM 27 That's... odd. md5sum's guarantee of "same if the hashes match" isn't as strong as SHA's, but I still wouldn't expect two files to md5sum the same if their SHA sums don'T match. However, those MD5 sums don't look like base-64 to me, so maybe you're doing something wrong somewhere. Cheers, Philip -- Philip Newton
Re: SHA question
Roger Burton West wrote: > > You may want to be slightly cleverer about it - taking a SHAsum is > computationally expensive, and it's only worth doing if the files have > the same size. > > If you don't require a pure-Perl solution, bear in mind that all this > has been done for you in the "fdupes" program, already in Debian or at > http://netdial.caribe.net/~adrian2/programs/ . >

*sigh* The following gives the duplicated hashes (you might prefer '-D' instead of '-d'):

md5sum /path/to/pdfs/* | sort | uniq -w32 -d

Replace the '-d' with '-u' if you want to just see the unique ones. I'll leave it as an exercise for the reader to pipe the output of '-D' into some xargs action to 'rm' and 'ln -s' the duplicates. Cheers -- Alexander Clouter .sigmonster says: For fast-acting relief, try slowing down.
Re: SHA question
On 13 Jan 2010, at 14:40, Philip Newton wrote: [...] > Well, that said, is the "very low chance" not on the order of the > chance that you'll be run over by a bus in the morning, or that one of > the files will be changed through cosmic rays or bit rot in the > magnetic domains of the hard disk platter? In the case of SHA-256, the odds are low enough that the universe is likely to end before you find a collision.
Re: SHA question
2010/1/13 Avi Greenbury : > You might've missed his point. > > If two files are of different sizes, they cannot be identical. Getting > the size of a file is substantially cheaper than hashing it. > > So you check all your filesizes, and need only hash those pairs or > groups that are all the same size.

Sorry, guess I didn't make myself clear. I need to store the SHA in an SQLite file. I have a few files to handle now but I will get a constant dribble from now on. I want to try and ensure that I haven't already databased a file that I'll process in the future.

Incidentally, I get poor results from MD5 compared with SHA, so I can't rely on MD5:

MD5 (md5_base64) results:
mr_485_htu_AST.pdf 116caa6cc1705db23a36feb11c8c4113 32
MR_2891.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
duplicate.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
MR_2898.pdf 01f73c142dae9f9f403bbab543b6aa6f 32
PR_A02.pdf 5552e6587357f9967dc0bc83153cca63 32
mr_485_htu_hrt.pdf 116caa6cc1705db23a36feb11c8c4113 32
PR_A01.pdf 5552e6587357f9967dc0bc83153cca63 32

SHA (b64digest) results:
mr_485_htu_AST.pdf PqsBpkKgGxdEHvkoNyou1NV5kuY 27
MR_2891.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27
duplicate.pdf bQhWA445KFzXy6ldF/DSoG2xTEY 27
MR_2898.pdf ULBRZQB00qZIfIWD7oqdpfVpFtw 27
PR_A02.pdf 6LdF6sWZnyLdWj44inFI6MSaUY4 27
mr_485_htu_hrt.pdf 0VNwG7IiaIneEX3jh3SBUBaXMK0 27
PR_A01.pdf JS33nJhzTo9YTqRWe01xnOb6bEM 27

> Thirdly, be aware of what hashing guarantees. It does *not* guarantee > uniqueness, it just gives you a very low chance that two files with > the same hash are different. It does guarantee that files with > different hashes are different, though. >

I think that's the best I can hope for. If that 'duplicate.pdf' turned up again at least I'd be able to correctly identify it. That's the goal. I will give fdupes a look too. Thanks all. Dp.
Re: SHA question
On Wed, Jan 13, 2010 at 15:06, James Laver wrote: > Thirdly, be aware of what hashing guarantees. It does *not* guarantee > uniqueness, it just gives you a very low chance that two files with > the same hash are different. Well, that said, is the "very low chance" not on the order of the chance that you'll be run over by a bus in the morning, or that one of the files will be changed through cosmic rays or bit rot in the magnetic domains of the hard disk platter? In other words, is 1x10^-64 (or whatever it might be) not so small as to be effectively zero, since there are much "higher" risks (say, 1x10^-32) which you do not guard against, either? Cheers, Philip -- Philip Newton
Re: SHA question
On Wed, Jan 13, 2010 at 1:46 PM, Dermot wrote: > 2010/1/13 Roger Burton West : > >>>I am using it in a perl class >> >> So I won't point out the implications, but there's an obvious one which >> will make your life easier. > > You can't leave me hanging there > Dp. > Well, there are a few things... Firstly, you are indeed just hashing the filename, not the file contents. Secondly, you're using Digest::SHA directly. The Digest:: series of modules are meant to be used through the 'Digest' interface as in the example Steffan gave. Doing this will make your life easier in most cases (by providing a standard interface across almost all digest algorithms and making it easy to switch (though ::Whirlpool disobeys the rules of the interface :/ )) and provides the handy addfile method you're looking for. Thirdly, be aware of what hashing guarantees. It does *not* guarantee uniqueness, it just gives you a very low chance that two files with the same hash are different. It does guarantee that files with different hashes are different, though. Lastly, as regards on-topicness, Perl is definitely off-topic. Beer, Pies, Dim Sum and Buffy are on-topic.* On topic: Buffy eating a dim sum pie and washing it down with beer. --James * But you can still post perl here.
Re: SHA question
Dermot wrote: > 2010/1/13 Roger Burton West : > > You may want to be slightly cleverer about it - taking a SHAsum is > > computationally expensive, and it's only worth doing if the files > > have the same size. > > Unfortunately the size varies quite a bit. You might've missed his point. If two files are of different sizes, they cannot be identical. Getting the size of a file is substantially cheaper than hashing it. So you check all your filesizes, and need only hash those pairs or groups that are all the same size. -- Avi Greenbury
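A sketch of that first pass; the glob pattern is only an example, and anything that shares a size would then go on to be hashed:

#!/usr/bin/perl
# Sketch: bucket files by size; a unique size means a unique file,
# so only groups with two or more members ever need hashing.
use strict;
use warnings;

my @pdfs = glob('pdfs/*.pdf');    # example location
my %by_size;
push @{ $by_size{ -s $_ } }, $_ for @pdfs;

for my $size (sort { $a <=> $b } keys %by_size) {
    my @group = @{ $by_size{$size} };
    next if @group == 1;          # unique size => cannot be a duplicate
    print "need to hash ($size bytes): @group\n";
}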
Re: SHA question
2010/1/13 Luis Motta Campos : > I believe the official answer to this question would be "The London Perl > Mongers list considers on-topic messages that talk about Ponies, Buffy, > Beer, and Pie. Everything else should be tagged as 'off-toppic'". There is even a FAQ about this: http://london.pm.org/about/faq.html#topic Having said that, I've been lurking here a few months now and I've seen very little talk of any of the aforementioned topics D: Phil
Re: SHA question
2010/1/13 Roger Burton West : >>I am using it in a perl class > > So I won't point out the implications, but there's an obvious one which > will make your life easier. You can't leave me hanging there Dp.
Re: SHA question
On Wed, Jan 13, 2010 at 01:12:28PM +, Dermot wrote: >Unfortunately the size varies quite a bit. There are a few 11Mb pdfs >but the majority are under 1mb. No, that's _good_. >I am using it in a perl class So I won't point out the implications, but there's an obvious one which will make your life easier. R
Re: SHA question
Dermot wrote at 12:44 on 2010-01-13: > Hi, > > I have a lots of PDFs that I need to catalogue and I want to ensure > the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned > something similar with SHA1 and binary files. Am I right in thinking > that the code below is only taking the SHA on the name of the file and > if I want to ensure uniqueness of the content I need to do something > similar but as a file blob? Yes, that looks about right. From a brief look at http://perldoc.perl.org/Digest/SHA.html it appears that you may want

my $sha = Digest::SHA->new(512);
$sha->addfile($n);
$digest = $sha->digest;   # or hexdigest or b64digest

in your inner loop. S
Re: SHA question
Dermot wrote: > Hi, > > I have a lots of PDFs that I need to catalogue and I want to ensure > the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned > something similar with SHA1 and binary files. Am I right in thinking > that the code below is only taking the SHA on the name of the file > and if I want to ensure uniqueness of the content I need to do > something similar but as a file blob? > > [code was here] > Yes, your code processes file names, not file contents. > PS: I don't see many perl questions here, am I breaking a convention? I believe the official answer to this question would be "The London Perl Mongers list considers on-topic messages that talk about Ponies, Buffy, Beer, and Pie. Everything else should be tagged as 'off-topic'". As I'm really bad at remembering things and also a non-native speaker, YMMV, wording- and semantic-wise. Cheers -- Luis Motta Campos is a software engineer, Perl Programmer, foodie and photographer.
Re: SHA question
2010/1/13 Roger Burton West : > On Wed, Jan 13, 2010 at 12:44:47PM +, Dermot wrote: > >>I have a lots of PDFs that I need to catalogue and I want to ensure >>the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned >>something similar with SHA1 and binary files. Am I right in thinking >>that the code below is only taking the SHA on the name of the file and >>if I want to ensure uniqueness of the content I need to do something >>similar but as a file blob? > > Yes. > > You may want to be slightly cleverer about it - taking a SHAsum is > computationally expensive, and it's only worth doing if the files have > the same size. Unfortunately the size varies quite a bit. There are a few 11Mb pdfs but the majority are under 1mb. This application isn't for public consumption so I don't have to worry about speed. However there are other services on the server and I wouldn't want to blindly slurp a 50mb pdf I guess. > If you don't require a pure-Perl solution, bear in mind that all this > has been done for you in the "fdupes" program, already in Debian or at > http://netdial.caribe.net/~adrian2/programs/ . I am using it in a perl class but if I could system(`fdupes`) that might be preferable. I'll try building the sources and see what happens. Failing that I'll have to fallback to slurping and SHA or MD5. Thanx, Dp.
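If shelling out does turn out to be acceptable, parsing fdupes' output is straightforward. A sketch, assuming fdupes is on the PATH and run with its default output format (duplicate groups separated by blank lines); the directory is only an example:

#!/usr/bin/perl
# Sketch: let fdupes find the duplicates and read its output back.
use strict;
use warnings;

my $dir = shift || './pdfs';                   # example directory
open my $out, '-|', 'fdupes', '-r', $dir       # -r recurses into subdirs
    or die "Can't run fdupes: $!\n";

my @group;
while (my $line = <$out>) {
    chomp $line;
    if (length $line) {
        push @group, $line;                    # one filename per line
    }
    elsif (@group) {                           # blank line ends a group
        print "duplicates: @group\n";
        @group = ();
    }
}
print "duplicates: @group\n" if @group;
close $out;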
Re: SHA question
On Wed, Jan 13, 2010 at 12:44:47PM +, Dermot wrote: >I have a lots of PDFs that I need to catalogue and I want to ensure >the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned >something similar with SHA1 and binary files. Am I right in thinking >that the code below is only taking the SHA on the name of the file and >if I want to ensure uniqueness of the content I need to do something >similar but as a file blob? Yes. You may want to be slightly cleverer about it - taking a SHAsum is computationally expensive, and it's only worth doing if the files have the same size. If you don't require a pure-Perl solution, bear in mind that all this has been done for you in the "fdupes" program, already in Debian or at http://netdial.caribe.net/~adrian2/programs/ . Roger
SHA question
Hi,

I have a lot of PDFs that I need to catalogue and I want to ensure the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned something similar with SHA1 and binary files. Am I right in thinking that the code below is only taking the SHA on the name of the file and if I want to ensure uniqueness of the content I need to do something similar but as a file blob?

Thanx,
Dp.

use strict;
use warnings;
use Digest::SHA qw(sha256_hex);
use FindBin qw($Bin);

my $top = "$Bin/pdfs";
opendir my $dir, "$top" or die "Can't open $top: $!\n";
my @files = grep { /pdf$/ } readdir $dir;

foreach my $n (@files) {
    if ( -e "$top/$n" ) {
        my $digest = sha256_hex($n);
        print "$n\t$digest\t:" . length($digest) . "\n";
    } else {
        print "Can't find $top/$n\n";
    }
}

PS: I don't see many perl questions here, am I breaking a convention?