2010/1/13 Roger Burton West <ro...@firedrake.org>:
> On Wed, Jan 13, 2010 at 12:44:47PM +0000, Dermot wrote:
>
>> I have a lot of PDFs that I need to catalogue, and I want to ensure
>> the uniqueness of each PDF. At LWP, Jonathan Rockway mentioned
>> something similar with SHA1 and binary files. Am I right in thinking
>> that the code below is only taking the SHA of the name of the file,
>> and that if I want to ensure uniqueness of the content I need to do
>> something similar but on the file as a blob?
>
> Yes.
>
> You may want to be slightly cleverer about it - taking a SHAsum is
> computationally expensive, and it's only worth doing if the files have
> the same size.
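For hashing the content rather than the name, a minimal sketch using Digest::SHA (core since Perl 5.10) might look like this; addfile() reads the file in chunks, so even a large PDF is never slurped into memory at once. The sub name file_sha1 is just for illustration:

```perl
use strict;
use warnings;
use Digest::SHA;

# Return the SHA-1 hex digest of a file's *contents*.
# addfile() streams the file in chunks rather than slurping it,
# so memory use stays small even for a 50 MB PDF.
sub file_sha1 {
    my ($path) = @_;
    my $sha = Digest::SHA->new(1);   # 1 => SHA-1
    $sha->addfile($path);            # croaks if the file can't be read
    return $sha->hexdigest;
}
```

Two files with identical bytes then get identical digests regardless of their filenames.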
Unfortunately the size varies quite a bit. There are a few 11 MB PDFs,
but the majority are under 1 MB. This application isn't for public
consumption, so I don't have to worry about speed. However, there are
other services on the server, and I wouldn't want to blindly slurp a
50 MB PDF, I guess.

> If you don't require a pure-Perl solution, bear in mind that all this
> has been done for you in the "fdupes" program, already in Debian or at
> http://netdial.caribe.net/~adrian2/programs/ .

I am using it in a Perl class, but if I could system(`fdupes`) that
might be preferable. I'll try building the sources and see what
happens. Failing that, I'll fall back to slurping and SHA or MD5.

Thanx,
Dp.
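If fdupes doesn't build, the size-first trick Roger suggests is easy to do in pure Perl without slurping: group files by size, and only hash groups that share a size. This is a sketch, not fdupes itself; find_dupes is a hypothetical helper name:

```perl
use strict;
use warnings;
use Digest::SHA;

# Return groups of duplicate files (arrayrefs of paths).
# Files are bucketed by size first, so a file with a unique
# size is never hashed at all; hashing streams via addfile().
sub find_dupes {
    my @files = @_;

    my %by_size;
    push @{ $by_size{ -s $_ } }, $_ for @files;

    my %by_digest;
    for my $group (values %by_size) {
        next if @$group < 2;    # unique size => unique content
        for my $path (@$group) {
            my $sha = Digest::SHA->new(1);
            $sha->addfile($path);
            push @{ $by_digest{ $sha->hexdigest } }, $path;
        }
    }
    return grep { @$_ > 1 } values %by_digest;
}
```

With a catalogue of mostly sub-1 MB files, the size check eliminates most of the hashing work up front.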