Great work, Jonathan!
Notice how simple your script has become - and that's a good sign as well
in Perl. :) We can make it even simpler, however.

As you probably know, Perl has two fundamental types of collections: arrays
(where data is stored as a sequence of elements, data chunks) and hashes
(where data chunks are unordered, but stored with some unique key used to
retrieve it). Sometimes hashes are used just to sort out (non-)unique data,
but that's another story.

Now look at this line:
>       push @{$files{$filesize}}, $File::Find::name;

Don't you see something... weird? You're using hash where filesizes are the
keys - and because, yes, they may well be non-unique, you have to store
arrays of filenames in your hash instead...

But much more natural (at least, for me) is to organize your hash (let's
call it %filedata) so that filenames (which are unique by their nature)
become the keys. And some info about these files - sizes and md5-hashes -
become the values.

For example, our `wanted` (btw, its name is misleading a bit, no? may be
'process' will sound better?) sub may look as follows:

find(\&wanted, $path);

my %filedata;
sub wanted {
  return if substr($_, 0, 1) eq '.' || -d $_;
  my $filesize = -s _;
  open my $fh, '<', $_ or die $!, $/;
  my $filemd5 = Digest::MD5->new->addfile($fh)->hexdigest;
  close $fh;
  $filedata{$_} = [$filesize, $filemd5];
}

(*Notice how you don't have to declare the global filedata hash before the
callback function is called in `find`? It's really interesting topic*)

Then you'll just have to iterate over the %filedata - and it's as easy as
writing...

for my $filename (keys %filedata) {
  my ($size, $md5) = @{ $filedata{$filename} };
  open my $fh, '>', File::Spec->catfile($path, "$filename.md5")
    or die $!, $/;
  print $fh "$filename\t$size bytes\t$md5\n";
  close $fh;
}

... yep, that easy. )

-- iD

P.S. Ordering the output should be an easy task for you; hint - look up
'sort' documentation - or just use sort system routine. :)

2011/12/31 Jonathan Harris <jtnhar...@googlemail.com>

> HI Brandon
>
> Thanks for your response
>
>
> I totally agree with your first point
> Having now used File::Find a little more, I have seen that using opendir
> was totally unneccessary and have removed them from the script
> And guess what.....it works fine without them!
> I think that my initial confusion arose from fundamentally misunderstanding
> File::Find - thinking that it required a handle, not just a path.
>
> I have also now exchanged the while loop with a foreach loop - much better!
>
> >> I assume you meant `hexdigest' here, not 'hex digest'.
>
> You assume correctly! Gmail has started to do annoying auto text complete -
> I must turn it off!!
>
> >> push @files, $File::Find::name if -f $_;
>
> This is nice and clean
>
> Your approach is different to what we have been discussing
> You seem to gather the files with File::Find and then leave that sub alone
> asap
> The processing is then done in the results of that gathering
>
> My script left the processing within sub wanted
> This could possible be a reason that complications arose so quickly
>
> To get file names and sizes at the same time, I am also considering
>
> my %files;
>
> sub wanted {
>    my $filesize = (stat($_))[7];
>    push @{$files{$filesize}}, $File::Find::name;
> }
>
> find(\&wanted, $path);
>
> to hash files and file size results together - then process after
>
> And yep, Igor has been thorough and very helpful
>
> Thanks again for your input on this - hope you manage to get some sleep!
>
> All the best
>
> Jonathan
>

Reply via email to