Re: File Size Script Help - Working Version

2012-01-03 Thread Igor Dovgiy
Hi folks, happy new year to everyone. )

John, you're right, of course. ) The filenames in nested directories could
well overlap, and using $File::Find::name would be safer.
Didn't think of that as a big problem, though, as original script (with
'opendir') ignored all the nested folders overall.

Jonathan, no, you don't have to store the filenames as array: complete
pathnames of the file can't be repeated.
It'll be sufficient just to change this line:
$filedata{$_} = [$filesize, $filemd5]
for
$filedata{$File::Find::name} = [$filesize, $filemd5]

(and replace catfile in the writing block as well, as %filedata keys will
now be the whole filenames themselves).

On sorting: cmp and = is not the same: former compares strings, latter -
numbers.
So, for example, 'abc' cmp 'def' gives you -1, but 'abc' = 'def' gives 0
(and warnings about non-numeric args as well).

It's nice to know the difference, but... do you really need to sort the
output in your script? What output? )
It makes no difference in what order your .md5 files will be created,
right? And you don't need to print the list of files processed
(as I did in my test script, that's why the ordering was ever mentioned).

As for $_, the problem John mentioned is logical, not 'perlical': as $_
variable is assigned a filename processed within File::Find target sub,
and files in different directories could have the same names (but not full
names, with absolute path attached), it may cause a bit of confusion when
they DO have the same names. )

Generally speaking, $_ usage is for comfort and speed (yes, thinking of it
as of 'it' word is right )). Of course, you can reassign it, but it'll make
your scripts bigger (sometimes _much_ bigger) without adding much clarity,
in my opinion. But that, again, is a matter of taste.

For me, I use $_ almost every time I process shallow collections (hashes or
arrays, doesn't matter). When two-level (or more complex) data structure is
processed, it's usually required to use a temporary variable - but even
then inner layers can be iterated with $_ easily.

-- iD


Re: File Size Script Help - Working Version

2012-01-03 Thread Brandon McCaig
Hello:

On Sat, Dec 31, 2011 at 02:56:50AM +0200, Igor Dovgiy wrote:
   $filedata{$_} = [$filesize, $filemd5];
*snip*
   my ($size, $md5) = @{ $filedata{$filename} };

Alternatively, store a nested hash-reference:

$filedata{$File::Find::name} = {
md5 = $file_md5,
size = $file_size,
};

# ...

my ($size, $md5) = @{$filedata{$filename}}{qw/size md5/};

That way you don't need to remember which [arbitrary] order
they're in, and if the order changes, or more fields are added,
the meaning of subsequent code doesn't change.

Regards,


-- 
Brandon McCaig bamcc...@gmail.com bamcc...@castopulence.org
Castopulence Software https://www.castopulence.org/
Blog http://www.bamccaig.com/
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'



signature.asc
Description: Digital signature


Re: File Size Script Help - Working Version

2012-01-02 Thread Jonathan Harris
On Sat, Dec 31, 2011 at 4:29 AM, John W. Krahn jwkr...@shaw.ca wrote:

 Igor Dovgiy wrote:

 Great work, Jonathan!
 Notice how simple your script has become - and that's a good sign as well
 in Perl. :) We can make it even simpler, however.

 As you probably know, Perl has two fundamental types of collections:
 arrays
 (where data is stored as a sequence of elements, data chunks) and hashes
 (where data chunks are unordered, but stored with some unique key used to
 retrieve it). Sometimes hashes are used just to sort out (non-)unique
 data,
 but that's another story.

 Now look at this line:

   push @{$files{$filesize}}, $File::Find::name;


 Don't you see something... weird? You're using hash where filesizes are
 the
 keys - and because, yes, they may well be non-unique, you have to store
 arrays of filenames in your hash instead...

 But much more natural (at least, for me) is to organize your hash (let's
 call it %filedata) so that filenames (which are unique by their nature)
 become the keys. And some info about these files - sizes and md5-hashes -
 become the values.


 Yes, file names in a given directory _have_ to be unique, however...



  For example, our `wanted` (btw, its name is misleading a bit, no? may be
 'process' will sound better?) sub may look as follows:

 find(\wanted, $path);

 my %filedata;
 sub wanted {
   return if substr($_, 0, 1) eq '.' || -d $_;
   my $filesize = -s _;
   open my $fh, '', $_ or die $!, $/;
   my $filemd5 = Digest::MD5-new-addfile($fh)**-hexdigest;
   close $fh;
   $filedata{$_} = [$filesize, $filemd5];


 You are traversing a directory tree, so using $_ as the key may cause
 collisions across different directories.  Better to use $File::Find::name
 which contains the full absolute path name.




 John
 --
 Any intelligent fool can make things bigger and
 more complex... It takes a touch of genius -
 and a lot of courage to move in the opposite
 direction.   -- Albert Einstein

 --
 To unsubscribe, e-mail: beginners-unsubscr...@perl.org
 For additional commands, e-mail: beginners-h...@perl.org
 http://learn.perl.org/



Hi to all on the list still following this thread - and Happy New Year!

Igor...Thanks!!   : )

It does feel like there has been some really good Perl learning progress
being made here - and yep, I cannot believe how trimmed down the script has
now become.
Looking back on the original script makes me laugh!
I wonder if that will become a consistent theme when writing?!

Looking back to the hash - I agree that it makes far more sense to have the
filenames as the keys

Quoting yourself and John:

  filenames (which are unique by their nature)
 Yes, file names in a given directory _have_ to be unique.

I think that we can all be in agreement then that these entries should be
guaranteed to have unique keys and  can have non-unique data such as file
size attributed to them- therefore:

push @{$files{$filename}}, $File::Find::name;

When sorting the hash, there seems to well established code for this

eg:

sorting by file size:

foreach (sort {$filedata{$b} = $filedata{$a}} keys %filedata {   ##
should sort so that the highest value file size is first
...
}

As far as I'm aware, = and cmp are the same thing
Is there a question of precedence over them?
I assume that = has a higher precedence

Interestingly, this is now the second time in this thread that we have been
warned against using $_

From John:

 $_ as the key may cause collisions across different directories

From Shlomi:

 The $_ variable can be easily devastated. You should use a lexical one.

I believe that I understand the use of $_ as the default variable; indeed,
the documentation on CPAN about File::Find states the usage of $_ in the
module

However, it seems that it is a variable that it so easily destroyed and
many have warned against using it

If this is the case, why would we choose (or be required) to use it in the
first place?

I have read the 'Elements to Avoid' page, as recommended by Shlomi

http://perl-begin.org/tutorials/bad-elements/

which is very useful

Would it be correct to say that $_ should be re-assigned asap whenever
using Perl?
I couldn't find any exceptions that state that it is ok to use it

#

Sincere thanks again to you all for your contributions
I hope that others reading this list are learning as much as I am!

All the best

Jonathan


Re: File Size Script Help - Working Version

2011-12-30 Thread Igor Dovgiy
Hi Jonathan,

Argh, really stupid mistake by me. ) But let's use it to explain some
points a bit further, shall we?
A skilled craftsman knows his tools well, and Perl programmer (with CPAN as
THE collection of tools of all sizes and meanings) has an advantage here: even
if documentation is a bit vague about what's going on, we are (usually)
able to check the code itself to find the answers. )

By browsing File::Spec source (found via 'Source' link within the
'File::Spec' page at CPAN)...
http://cpansearch.perl.org/src/SMUELLER/PathTools-3.33/lib/File/Spec.pm
...we soon discover that this module is essentially an adapter for modules
like File::Spec::Unix, File::Spec::Mac, File::Spec::Win32 etc.
So our search goes on (as your mention of .DS_Store file implies) over
there:
http://cpansearch.perl.org/src/SMUELLER/PathTools-3.33/lib/File/Spec/Mac.pm

Now we may either check the documentation (which clearly states that only
the last argument to catfile is considered a filename, and all the others
will be concatenated with catdir), or look right into the code - and come
to the same conclusion:

sub catfile {
my $self = shift;
return '' unless @_;
my $file = pop @_;
return $file unless @_;
my $dir = $self-catdir(@_);
$file =~ s/^://s;
return $dir.$file;
}

So what should we do now? ) Of course, give milk to our cat... and
arguments to File::Spec's catfile! ) Like this:

File::Spec-catfile($path, $dircontents . '.md5')
... or this...
File::Spec-catfile($path, $dircontents.md5)
(check 'variable interpolation in Perl' to see why it's possible - and why
this is essentially the same as previous codeline)
... or even this ...
File::Spec-catfile($path, join '.', $dircontents, 'md5')
(but that would be a bit overkill, of course :)

Speaking of overkills: you used regex (=~ /^\./) to check whether the line
begins with a dot - or not. )
It's ok for this task, but you probably should know that these checks may
be also done with (substr($line, 0, 1) eq '.') code,
which will be a bit (up to 30% at my PC when Benchmark'ed) faster.

-- iD

2011/12/30 Jonathan Harris jtnhar...@googlemail.com

 I tried to use your suggestion
 open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or
 die $!, $/
 but it returned an error on the command line:
  'Not a directory'
  At which point the program dies (which is what it is supposed to do!)
  I used it inside the loop - sorry to bug you for clarification


 
 if ($dircontents=~/^\./ || -d $dircontents) {
  next;
 }

 This is also to avoid the file .DS_Store




Re: File Size Script Help - Working Version

2011-12-30 Thread Igor Dovgiy
Hi John, yes, good point! Totally forgot this. ) Adding new files to a
directory as you browse it is just not right, of course. Possible, but not
right. )

I'd solve this by using hash with filenames as keys and collected 'result'
strings (with md5 and filesizes) as values, filled by File::Find target
routine.
After the whole directory is processed, this hash should be 'written out'
into the target directory.

Another way to do it is to collect all the filenames instead into a list
(with glob operator, for example), and process this list after.

BTW (to Jonathan), I wonder do you really need to store this kind of data
in different files? No offence... but I can hardly imagine how this data
will be used later unless gathered into some array or hash. )

-- iD

2011/12/30 John W. Krahn jwkr...@shaw.ca

 Jonathan Harris wrote:


  Hi John

 Thanks for your 2 cents

 I hadn't considered that the module wouldn't be portable


 That is not what I was implying.  I was saying that when you add new files
 to a directory that you are traversing you _may_ get irregular results.  It
 depends on how your operating system updates directory entries.




Re: File Size Script Help - Working Version

2011-12-30 Thread Jonathan Harris
On Fri, Dec 30, 2011 at 11:58 AM, Igor Dovgiy ivd.pri...@gmail.com wrote:

 Hi John, yes, good point! Totally forgot this. ) Adding new files to a
 directory as you browse it is just not right, of course. Possible, but not
 right. )

 I'd solve this by using hash with filenames as keys and collected 'result'
 strings (with md5 and filesizes) as values, filled by File::Find target
 routine.
 After the whole directory is processed, this hash should be 'written out'
 into the target directory.

 Another way to do it is to collect all the filenames instead into a list
 (with glob operator, for example), and process this list after.

 BTW (to Jonathan), I wonder do you really need to store this kind of data
 in different files? No offence... but I can hardly imagine how this data
 will be used later unless gathered into some array or hash. )

 -- iD

 2011/12/30 John W. Krahn jwkr...@shaw.ca

  Jonathan Harris wrote:
 
 
   Hi John
 
  Thanks for your 2 cents
 
  I hadn't considered that the module wouldn't be portable
 
 
  That is not what I was implying.  I was saying that when you add new
 files
  to a directory that you are traversing you _may_ get irregular results.
  It
  depends on how your operating system updates directory entries.
 
 



Hi All

John -
Thanks for the clarification
In this instance the script has been run on OSX - it seems that adding the
files into the directory that is being traversed works ok this time

However for best practice, I would certainly look into writing to a
separate directory, and then moving the files back, as I appreciate that
this fortune may not necessarily be repeated in a different environment!

Igor -
Firstly -  File::Spec
Thanks for your insight and well explained investigation - I have been
learning a lot from this
File::Spec has proven a most useful tool in joining and 'stringifying' the
paths

In the original post about this script, I had spoken about considering
using a hash for the file data
I'm still convinced that ultimately, this would be the way forwards

I have found some scripts online concerning finding duplicate files
They use md5 and/or file sizes to compare the files
These are written into hashes
Fully understanding some of these scripts is a little beyond my level at
the moment

I have attached an interesting one for you to look at (you may be aware of
it already!)
However, it has proved quite inspiring!

 (substr($line, 0, 1) eq '.')

Haven't learned this yet!
It looks like a good solution if it is so much more efficient - thanks for
the introduction - I'll be reading up asap!

BTW (to Jonathan), I wonder do you really need to store this kind of data
in different files? No offence... but I can hardly imagine how this data
will be used later unless gathered into some array or hash. )

There is a good reason for this!
Talking to guys who work in video on demand, it seems that it is standard
practice to do this for file delivery requirements
As each video file must be identical upon receipt as it was upon delivery (
and that the files are all treated as unique delivery instances )
a separate accompanying file is required
I thought that Perl would be a good choice for accomplishing this
requirement as it is renowned for file handling

#

Thanks to everyone for your help and contributions - particularly Jim,
Shlomi, John and Igor
I have learned crazy amounts already!

Happy New Year to you all!

Jonathan


finddupes3.plx
Description: Binary data
-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/


Re: File Size Script Help - Working Version

2011-12-30 Thread Brandon McCaig
On Thu, Dec 29, 2011 at 03:43:19PM +, Jonathan Harris wrote:
 Hi All

Hello Jonathan:

(Disclaimer: I stayed up all night playing Skyrim and am running
on about 4.5 hours of sleep.. ^_^)

I think most things have already been addressed, but I think Igor
might have had a bit of trouble making it clear.

 opendir (my $in, $path) or die Cannot open $dir: $!\n;
 find (\wanted, $path);
 close $in;
 
 opendir (my $totalin, $path) or die Cannot open $dir: $!\n;
 find (\cleanup, $path);
 close $totalin;

AFAICT, it's completely nonsensical to open a directory file
handle surrounding File::Find::find. I tried to /search the
perldoc (just in case there's some kind of magical optimization
or something) and saw no mention of 'opendir' or 'handle' (except
for the special _ file handle created for stat, lstat, etc..). So
it seems $in and $totalin are completely unnecessary here:
File::Find will worry about opening and processing the
directories for you.

 sub wanted {
  while ($dircontents = readdir($in)) {

I guess this is why you are opening directory handles above, but
it doesn't really make sense. You're basically only using
File::Find to loop at this point, and very obscurely. :)
File::Find's role in life is precisely to find you all the files
within a directory tree. You're reinventing the square wheel with
your use of opendir and readdir. :)

The wanted subroutine is typically used to either process the
file system tree outright, or store applicable files in data
structures for later processing. E.g.,

use strict;
use warnings;
use File::Find;

my @files;

sub wanted
{
# Skip dot files and directories.
return if substr($_, 0, 1) eq '.' || -d $_;

# If current file is a normal file, push into array for
# later.
push @files, $File::Find::name if -f $_;
}

my $path = '.';

find \wanted, $path;

# Now @files should be filled with a recursive list of files to
# process. E.g.,

for my $file (@files)
{
my $md5name = $file.'md5';
# Etc...
}

 my $hex = Digest::MD5-new-addfile($fh)-hex digest;

I assume you meant `hexdigest' here, not 'hex digest'.

 $newname =~ s/\ //;

Ideally if you're going to do something as obscure as this, you
should comment it in both places so future readers and
maintainers understand why it's done, even if they only read one
half of the program.

I think Igor else has already explained how to eliminate this
obscurity though. :)

Regards,


-- 
Brandon McCaig bamcc...@gmail.com bamcc...@castopulence.org
Castopulence Software https://www.castopulence.org/
Blog http://www.bamccaig.com/
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'



signature.asc
Description: Digital signature


Re: File Size Script Help - Working Version

2011-12-30 Thread Jonathan Harris
On Fri, Dec 30, 2011 at 7:11 PM, Brandon McCaig bamcc...@gmail.com wrote:

 On Thu, Dec 29, 2011 at 03:43:19PM +, Jonathan Harris wrote:
  Hi All

 Hello Jonathan:

 (Disclaimer: I stayed up all night playing Skyrim and am running
 on about 4.5 hours of sleep.. ^_^)

 I think most things have already been addressed, but I think Igor
 might have had a bit of trouble making it clear.

  opendir (my $in, $path) or die Cannot open $dir: $!\n;
  find (\wanted, $path);
  close $in;
 
  opendir (my $totalin, $path) or die Cannot open $dir: $!\n;
  find (\cleanup, $path);
  close $totalin;

 AFAICT, it's completely nonsensical to open a directory file
 handle surrounding File::Find::find. I tried to /search the
 perldoc (just in case there's some kind of magical optimization
 or something) and saw no mention of 'opendir' or 'handle' (except
 for the special _ file handle created for stat, lstat, etc..). So
 it seems $in and $totalin are completely unnecessary here:
 File::Find will worry about opening and processing the
 directories for you.

  sub wanted {
   while ($dircontents = readdir($in)) {

 I guess this is why you are opening directory handles above, but
 it doesn't really make sense. You're basically only using
 File::Find to loop at this point, and very obscurely. :)
 File::Find's role in life is precisely to find you all the files
 within a directory tree. You're reinventing the square wheel with
 your use of opendir and readdir. :)

 The wanted subroutine is typically used to either process the
 file system tree outright, or store applicable files in data
 structures for later processing. E.g.,

 use strict;
 use warnings;
 use File::Find;

 my @files;

 sub wanted
 {
# Skip dot files and directories.
return if substr($_, 0, 1) eq '.' || -d $_;

# If current file is a normal file, push into array for
# later.
push @files, $File::Find::name if -f $_;
 }

 my $path = '.';

 find \wanted, $path;

 # Now @files should be filled with a recursive list of files to
 # process. E.g.,

 for my $file (@files)
 {
my $md5name = $file.'md5';
# Etc...
 }

  my $hex = Digest::MD5-new-addfile($fh)-hex digest;

 I assume you meant `hexdigest' here, not 'hex digest'.

  $newname =~ s/\ //;

 Ideally if you're going to do something as obscure as this, you
 should comment it in both places so future readers and
 maintainers understand why it's done, even if they only read one
 half of the program.

 I think Igor else has already explained how to eliminate this
 obscurity though. :)

 Regards,


 --
 Brandon McCaig bamcc...@gmail.com bamcc...@castopulence.org
 Castopulence Software https://www.castopulence.org/
 Blog http://www.bamccaig.com/
 perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
 q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
 tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'


 -BEGIN PGP SIGNATURE-
 Version: GnuPG v1.4.10 (GNU/Linux)

 iQIcBAEBAgAGBQJO/gzlAAoJEN2n1gIi5ZPyMJoP/15LI4aelHowMiNQYFzLB2E1
 nEAjPHvvHX7uTxLV8BOht9HtoQPpOwyQ/fJm2EIe59NGwjjvlKCdpCrfkfxz3cvV
 td0wDuUKrjzQPaACl4cNrGITZLVXe6KtZUKG2o3TjA5dlyqbes5d5F40Mh3j/fB5
 L0VZpWNvp0cTrJ6x5QfTkvyQPdxKx0ARaFDQYpvR3uKfgeD28ZNatNCQQuuymLkj
 ABVqLzQgVVMinaj/4xXii5vedgYFI58DPWF7r0nmhUaiVvDAEFzfd1MlSvkuI8Jr
 EKTQdz3EjMZufRaGxX96rdZvVMEiSTcA/IXNkhis48dOwa4ebfSYm8QpQhp1E6UF
 QuJBRXzF9cHvD095Aw+MSqoSR+2LZS3SFBfdB9I5rdxJEoS7LMSJdJLY95dXftqR
 KdlcU7ds4kdqaJrnxrxB05SIhgZNq1JcCk9xZyuk7NxsPwl/6ZStr/E/V2mA+bdD
 bGWCuP7IKRfZ8cNd01tEYnUppYkfxRaNtYhlNmPVh6TX7rd+2Z4aZLn0anR3Q/J4
 2OsPmtIOakk1jW6f41vTOqZ5cpxSv/R1H6fjZAOVACf7UlrtCgVWVLKYofcO/ryS
 HMsX1T3m3h9H3heSd76FiXMsgDBaMBkti31FzwnS3pOaZEpHk6eHF/dbR2p0sAbA
 gTvG4xDp2MLEtY03Ofhw
 =RFiR
 -END PGP SIGNATURE-



HI Brandon

Thanks for your response


I totally agree with your first point
Having now used File::Find a little more, I have seen that using opendir
was totally unneccessary and have removed them from the script
And guess what.it works fine without them!
I think that my initial confusion arose from fundamentally misunderstanding
File::Find - thinking that it required a handle, not just a path.

I have also now exchanged the while loop with a foreach loop - much better!

 I assume you meant `hexdigest' here, not 'hex digest'.

You assume correctly! Gmail has started to do annoying auto text complete -
I must turn it off!!

 push @files, $File::Find::name if -f $_;

This is nice and clean

Your approach is different to what we have been discussing
You seem to gather the files with File::Find and then leave that sub alone
asap
The processing is then done in the results of that gathering

My script left the processing within sub wanted
This could possible be a reason that complications arose so quickly

To get file names and sizes at the same time, I am also considering

my %files;

sub wanted {
my $filesize = (stat($_))[7];
push @{$files{$filesize}}, $File::Find::name;
}

find(\wanted, $path);

to hash files and file size results together - then process after

And yep, Igor 

Re: File Size Script Help - Working Version

2011-12-30 Thread Igor Dovgiy
Great work, Jonathan!
Notice how simple your script has become - and that's a good sign as well
in Perl. :) We can make it even simpler, however.

As you probably know, Perl has two fundamental types of collections: arrays
(where data is stored as a sequence of elements, data chunks) and hashes
(where data chunks are unordered, but stored with some unique key used to
retrieve it). Sometimes hashes are used just to sort out (non-)unique data,
but that's another story.

Now look at this line:
   push @{$files{$filesize}}, $File::Find::name;

Don't you see something... weird? You're using hash where filesizes are the
keys - and because, yes, they may well be non-unique, you have to store
arrays of filenames in your hash instead...

But much more natural (at least, for me) is to organize your hash (let's
call it %filedata) so that filenames (which are unique by their nature)
become the keys. And some info about these files - sizes and md5-hashes -
become the values.

For example, our `wanted` (btw, its name is misleading a bit, no? may be
'process' will sound better?) sub may look as follows:

find(\wanted, $path);

my %filedata;
sub wanted {
  return if substr($_, 0, 1) eq '.' || -d $_;
  my $filesize = -s _;
  open my $fh, '', $_ or die $!, $/;
  my $filemd5 = Digest::MD5-new-addfile($fh)-hexdigest;
  close $fh;
  $filedata{$_} = [$filesize, $filemd5];
}

(*Notice how you don't have to declare the global filedata hash before the
callback function is called in `find`? It's really interesting topic*)

Then you'll just have to iterate over the %filedata - and it's as easy as
writing...

for my $filename (keys %filedata) {
  my ($size, $md5) = @{ $filedata{$filename} };
  open my $fh, '', File::Spec-catfile($path, $filename.md5)
or die $!, $/;
  print $fh $filename\t$size bytes\t$md5\n;
  close $fh;
}

... yep, that easy. )

-- iD

P.S. Ordering the output should be an easy task for you; hint - look up
'sort' documentation - or just use sort system routine. :)

2011/12/31 Jonathan Harris jtnhar...@googlemail.com

 HI Brandon

 Thanks for your response


 I totally agree with your first point
 Having now used File::Find a little more, I have seen that using opendir
 was totally unneccessary and have removed them from the script
 And guess what.it works fine without them!
 I think that my initial confusion arose from fundamentally misunderstanding
 File::Find - thinking that it required a handle, not just a path.

 I have also now exchanged the while loop with a foreach loop - much better!

  I assume you meant `hexdigest' here, not 'hex digest'.

 You assume correctly! Gmail has started to do annoying auto text complete -
 I must turn it off!!

  push @files, $File::Find::name if -f $_;

 This is nice and clean

 Your approach is different to what we have been discussing
 You seem to gather the files with File::Find and then leave that sub alone
 asap
 The processing is then done in the results of that gathering

 My script left the processing within sub wanted
 This could possible be a reason that complications arose so quickly

 To get file names and sizes at the same time, I am also considering

 my %files;

 sub wanted {
my $filesize = (stat($_))[7];
push @{$files{$filesize}}, $File::Find::name;
 }

 find(\wanted, $path);

 to hash files and file size results together - then process after

 And yep, Igor has been thorough and very helpful

 Thanks again for your input on this - hope you manage to get some sleep!

 All the best

 Jonathan



Re: File Size Script Help - Working Version

2011-12-30 Thread John W. Krahn

Igor Dovgiy wrote:

Great work, Jonathan!
Notice how simple your script has become - and that's a good sign as well
in Perl. :) We can make it even simpler, however.

As you probably know, Perl has two fundamental types of collections: arrays
(where data is stored as a sequence of elements, data chunks) and hashes
(where data chunks are unordered, but stored with some unique key used to
retrieve it). Sometimes hashes are used just to sort out (non-)unique data,
but that's another story.

Now look at this line:

   push @{$files{$filesize}}, $File::Find::name;


Don't you see something... weird? You're using hash where filesizes are the
keys - and because, yes, they may well be non-unique, you have to store
arrays of filenames in your hash instead...

But much more natural (at least, for me) is to organize your hash (let's
call it %filedata) so that filenames (which are unique by their nature)
become the keys. And some info about these files - sizes and md5-hashes -
become the values.


Yes, file names in a given directory _have_ to be unique, however...



For example, our `wanted` (btw, its name is misleading a bit, no? may be
'process' will sound better?) sub may look as follows:

find(\wanted, $path);

my %filedata;
sub wanted {
   return if substr($_, 0, 1) eq '.' || -d $_;
   my $filesize = -s _;
   open my $fh, '', $_ or die $!, $/;
   my $filemd5 = Digest::MD5-new-addfile($fh)-hexdigest;
   close $fh;
   $filedata{$_} = [$filesize, $filemd5];


You are traversing a directory tree, so using $_ as the key may cause 
collisions across different directories.  Better to use 
$File::Find::name which contains the full absolute path name.




John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction.   -- Albert Einstein

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: File Size Script Help - Working Version

2011-12-29 Thread Jonathan Harris
On Thu, Dec 29, 2011 at 5:08 PM, Igor Dovgiy ivd.pri...@gmail.com wrote:

 Hi Jonathan,

 Let's review your script a bit, shall we? )
 It's definitely good for a starter, but still has some rough places.

  #!/usr/bin/perl
 # md5-test.plx
 use warnings;
 use strict;

 use File::Find;

 use Digest::MD5;
 use File::Spec;

 So far, so good. )


 my $dir = shift || '/Users/jonharris/Documents/begperl';

  Nice touch, setting up a default param. ) The name of variable might seem
 too generic to some, but then again, it's the only directory we deal with,
 so...


 my ($dircontents, $path, @directory, $fh, $wr_fh);

 Incoming!

 Well, it's usually better to declare your variables right before you'll
 really need them...
 Your script is short, so you'll hardly have a chance to forget what $fh
 and $wr_fh mean, though. )


 @directory = $dir;
 $path = File::Spec-catfile( @directory, $dircontents );

 Ahem. At least three 'wtf' moments for me. )
 First of all, File::Spec-catfile is really just a glorified join operator
 with some additional operations depending on which system you're using.
 So, second, it makes a little sense to convert $dir into @directory
 (documentation example is just that, an example) and to pass there
 undefined $dircontents as well.
 But the major one is why you'd ever have to pass your $dir through
 File::Spec? It's, you know, user input... )

 opendir (my $in, $path) or die Cannot open $dir: $!\n;

 So you're trying to open $path, but warn about failure to open $dir? )
 But then again, that's a minor quarrel, considering this:


 find (\wanted, $path);

 See, File::Find is convenient method which _emulates_ the whole
 opendir-readdir-closedir pattern for a given directory.
 The 'wanted' subroutine (passed by ref) will be called for each file found
 in $path.
 It's described really well in perldoc (perldoc File::Find).

 close $in;

 Opendir, but close - and not closedir? Now I'm confused. )

 opendir (my $totalin, $path) or die Cannot open $dir: $!\n;
 find (\cleanup, $path);
 close $totalin;

 You don't have to use different variable to store temporary file handle
 (which will be closed in three lines) - and that will save you a few
 moments spent working out a new (but rational) name for it. :)
 But then again, you don't need to open the same dir twice: you can call
 cleanup (with the same 'find (\cleanup)... ' syntax) whenever you want.
 And you don't really need cleanup... whoops, going too far too soon. :)

 print \n\nAll Done!\n\n;


 sub wanted {
 while ($dircontents = readdir($in)) {

 Did I say that you're using two alternative methods of doing the same
 thing? )
 But there's another big 'no-no' here: you're using external variable
 ($dircontents) when you really have perfectly zero reasons to do so.
 Of course, you don't need to push dirhandle ($in) from outer scape into
 this sub, when using find... ($File::Find::dir will do), but that's
 explainable at least. )

  if ($dircontents=~/^\./ || -d $dircontents) {
 next;
 }

 So now the script ignores all the files which names begin with '.', and
 you really wanted just to ignore '.' and '..' ... )

  my $bytes = -s $dircontents;

  print $dircontents, \n;
 print $bytes,  bytes, \tSo far so good!\n;

 Yes. )

  open $fh, , $dircontents or die $!;
 open $wr_fh, +, $path $dircontents.md5 or die $!;  ## was unable to
 concatenate here, hence sub cleanup to remove the ' '

 What was wrong with ...
 open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or
 die $!, $/
 ?


 my $hex = Digest::MD5-new-addfile($fh)-hex digest;
 print Hex Digest: , $hex, \n\n;
  print $wr_fh $hex, \n, $bytes, \n\n;

 All looks great for now: you're calculating md5 and size, and writing it
 into file with md5 extension...

 return($hex);

 ... but now you're abruptly jumping out of the while block, making the
 whole point of cycle, well, completely pointless. Not great.

 close $wr_fh;
 close $fh;

 }
 }




 # The following is mostly not original code - thanks to the author!

 sub cleanup {
 my @filelist = readdir($totalin);
  foreach my $oldname (@filelist) {
 next if -d $oldname;
 my $newname = $oldname;
  $newname =~ s/\ //;

 So you don't have spaces in your filenames. Great for you. )

  rename $oldname, $newname;
 }
 }


 # End #


 And here we finish.

 Computers are not smart. They're dumb. But they're fast. And obedient. )
 That's why they're really helpful in letting you do what you're trying to
 do... but only if you KNOW what you're trying to do.

 Imagine that you're - and not your computer - will be doing this task.
 Sit in one place - and write down your program as you (and not your
 computer) will be running it. Step by step. Bit by bit.

 Then convert your notes into some Perl form - and you'll instantly see the
 difference between now and then. )

 -- iD


Hi Igor

Many thanks for your response

I have started reviewing the things you said
There are some silly mistakes in there - eg not using closedir
It's a good 

Re: File Size Script Help - Working Version

2011-12-29 Thread Igor Dovgiy
Hi Jonathan,

Let's review your script a bit, shall we? )
It's definitely good for a starter, but still has some rough places.

#!/usr/bin/perl
 # md5-test.plx
 use warnings;
 use strict;

use File::Find;

use Digest::MD5;
 use File::Spec;

So far, so good. )


 my $dir = shift || '/Users/jonharris/Documents/begperl';

Nice touch, setting up a default param. ) The name of variable might seem
too generic to some, but then again, it's the only directory we deal with,
so...


 my ($dircontents, $path, @directory, $fh, $wr_fh);

Incoming!

Well, it's usually better to declare your variables right before you'll
really need them...
Your script is short, so you'll hardly have a chance to forget what $fh and
$wr_fh mean, though. )


 @directory = $dir;
 $path = File::Spec-catfile( @directory, $dircontents );

Ahem. At least three 'wtf' moments for me. )
First of all, File::Spec-catfile is really just a glorified join operator
with some additional operations depending on which system you're using.
So, second, it makes a little sense to convert $dir into @directory
(documentation example is just that, an example) and to pass there
undefined $dircontents as well.
But the major one is why you'd ever have to pass your $dir through
File::Spec? It's, you know, user input... )

opendir (my $in, $path) or die Cannot open $dir: $!\n;

So you're trying to open $path, but warn about failure to open $dir? )
But then again, that's a minor quarrel, considering this:


 find (\wanted, $path);

See, File::Find is convenient method which _emulates_ the whole
opendir-readdir-closedir pattern for a given directory.
The 'wanted' subroutine (passed by ref) will be called for each file found
in $path.
It's described really well in perldoc (perldoc File::Find).

close $in;

Opendir, but close - and not closedir? Now I'm confused. )

opendir (my $totalin, $path) or die Cannot open $dir: $!\n;
 find (\cleanup, $path);
 close $totalin;

You don't have to use different variable to store temporary file handle
(which will be closed in three lines) - and that will save you a few
moments spent working out a new (but rational) name for it. :)
But then again, you don't need to open the same dir twice: you can call
cleanup (with the same 'find (\cleanup)... ' syntax) whenever you want.
And you don't really need cleanup... whoops, going too far too soon. :)

print \n\nAll Done!\n\n;


 sub wanted {
 while ($dircontents = readdir($in)) {

Did I say that you're using two alternative methods of doing the same
thing? )
But there's another big 'no-no' here: you're using external variable
($dircontents) when you really have perfectly zero reasons to do so.
Of course, you don't need to push dirhandle ($in) from outer scape into
this sub, when using find... ($File::Find::dir will do), but that's
explainable at least. )

 if ($dircontents=~/^\./ || -d $dircontents) {
 next;
 }

So now the script ignores all the files which names begin with '.', and you
really wanted just to ignore '.' and '..' ... )

my $bytes = -s $dircontents;

print $dircontents, \n;
 print $bytes,  bytes, \tSo far so good!\n;

Yes. )

  open $fh, , $dircontents or die $!;
 open $wr_fh, +, $path $dircontents.md5 or die $!;  ## was unable to
 concatenate here, hence sub cleanup to remove the ' '

What was wrong with ...
open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or
die $!, $/
?


 my $hex = Digest::MD5-new-addfile($fh)-hex digest;
 print Hex Digest: , $hex, \n\n;
  print $wr_fh $hex, \n, $bytes, \n\n;

All looks great for now: you're calculating md5 and size, and writing it
into file with md5 extension...

 return($hex);

... but now you're abruptly jumping out of the while block, making the
whole point of cycle, well, completely pointless. Not great.

 close $wr_fh;
 close $fh;

}
 }




 # The following is mostly not original code - thanks to the author!

sub cleanup {
 my @filelist = readdir($totalin);
  foreach my $oldname (@filelist) {
 next if -d $oldname;
 my $newname = $oldname;
  $newname =~ s/\ //;

So you don't have spaces in your filenames. Great for you. )

 rename $oldname, $newname;
 }
 }


# End #


And here we finish.

Computers are not smart. They're dumb. But they're fast. And obedient. )
That's why they're really helpful in letting you do what you're trying to
do... but only if you KNOW what you're trying to do.

Imagine that you're - and not your computer - will be doing this task.
Sit in one place - and write down your program as you (and not your
computer) will be running it. Step by step. Bit by bit.

Then convert your notes into some Perl form - and you'll instantly see the
difference between now and then. )

-- iD


re: File Size Script Help - Working Version

2011-12-29 Thread Jonathan Harris
Hi All

Firstly, many thanks for your help previously (19/12/11) - it has led to
making a useable script
I don't think it's brilliantly written, it seems a little bodged together
to me... but works fine - not a bad result for a first script

If you are new to this problem and are interested in the previous thread, I
have attached it for you as a text file

I have done everything I can think of now to follow the previous advice
The script is portable, skips directories, creates digests of files, uses
better Perl practice (e.g., no more barewords, correct lexical file handles
etc)

I only have a couple of questions left - wondering if you can help :

- One thing that was recommended was to ensure that the File Handles are
opened outside of the loop
- I really can't figure out how to do this and keep the program working!
Doesn't it need to be inside the loop to be iterated over?

- Also, I had to open two file handles to get addfile()-hexdigest to work
so that the value could be passed - this can't be correct?!

- Writing back was messy - I was struggling with concatenating variables to
keep the script portable
- Is it possible to put values into a hash, and then print each hash entry
to a separate file?


There are clearly better ways to achieve this result - all suggestions are
gratefully received!

Thanks again

Jon


Here's the script:

#

# This program reads in files from a directory, produces a hex digest and
writes the hex, along with
# the file size into a newly created file with the same name and a '.md5'
extension, to the original directory

#!/usr/bin/perl
# md5-test.plx
use warnings;
use strict;

use File::Find;
use Digest::MD5;
use File::Spec;

my $dir = shift || '/Users/jonharris/Documents/begperl';
my ($dircontents, $path, @directory, $fh, $wr_fh);

@directory = $dir;
$path = File::Spec-catfile( @directory, $dircontents );

opendir (my $in, $path) or die Cannot open $dir: $!\n;
find (\wanted, $path);
close $in;

opendir (my $totalin, $path) or die Cannot open $dir: $!\n;
find (\cleanup, $path);
close $totalin;

print \n\nAll Done!\n\n;

sub wanted {
 while ($dircontents = readdir($in)) {
if ($dircontents=~/^\./ || -d $dircontents) {
next;
}
my $bytes = -s $dircontents;
print $dircontents, \n;
print $bytes,  bytes, \tSo far so good!\n;
 open $fh, , $dircontents or die $!;
open $wr_fh, +, $path $dircontents.md5 or die $!;  ## was unable to
concatenate here, hence sub cleanup to remove the ' '

my $hex = Digest::MD5-new-addfile($fh)-hex digest;
print Hex Digest: , $hex, \n\n;
print $wr_fh $hex, \n, $bytes, \n\n;
return($hex);

close $wr_fh;
close $fh;
}
}

# The following is mostly not original code - thanks to the author!
sub cleanup {
my @filelist = readdir($totalin);
foreach my $oldname (@filelist) {
next if -d $oldname;
my $newname = $oldname;
$newname =~ s/\ //;
rename $oldname, $newname;
}
}

# End #

Jonathan Harris
Dec 19 (10 days ago)

to beginners, me 
Hi Perl Pros

This is my first call for help

I am a totally new, self teaching, Perl hopeful

If my approach to this script is simply wrong, please let me know as it will 
help my learning!

The script aims to:

1) Read in a directory either from the command line, or from a default path
2) Produce a hash for future checksum
3) Write this (hex digest) to a separate file, in a sub directory of the 
parent, which has the same name and a .md5 extension
4) Check the original file for its size
5) Add this data to the newly created file on a new line (in bytes)

I have a script that will do most of this, except for analysing the file size - 
I think that the file size being analysed may be the md5 object result as the 
same value is printed to each file

I am running out of ideas and would appreciate any help you could give!
I have tried using File::stat::OO and File::stat  - but to no avail - I could 
be using them incorrectly!

Many thanks in advance...

Jon.

Here are some details:

System:
Mac OSX 10.7.2
Perl version 5.12

Script:

#!/usr/bin/perl
# md5-test-3.plx
use warnings;
use strict;
use Digest::MD5;

my $filesize = 0;
my $dir = shift || '/Users/jonharris/Documents/begperl';
opendir (DH, $dir) or die Couldn't open directory: $!;

my $md5 = Digest::MD5-new;
while ($_ = readdir(DH)) {
$md5-add($_);
$filesize = (stat(DH))[7];

 Is it necessary to put the following into a new loop?

foreach ($_) {
open FH,  /Users/jonharris/Documents/begperl/md5/$_.md5 or 
die $!;
binmode(FH);
print FH $md5-hexdigest, \n, $filesize;
}
close FH;
}
close DH;
print \n$dir\n\n;

###

Jim Gibson via perl.org 
Dec 19 (10 days ago)

to beginners 
On 12/19/11 Mon  Dec 19, 2011  11:32 AM, Jonathan Harris
jtnhar...@googlemail.com scribbled:

 Hi Perl Pros

 This is my first call for help

 I am a totally new, self teaching, Perl hopeful

 If my approach to this script is simply wrong, please let me know as it
 will help my learning!

 The 

Re: File Size Script Help - Working Version

2011-12-29 Thread John W. Krahn

Jonathan Harris wrote:


Hi Igor

Many thanks for your response

I have started reviewing the things you said
There are some silly mistakes in there - eg not using closedir
It's a good lesson in script vigilance

I found the part about opening the file handle particularly interesting
I had no idea that

open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or
die $!, $/

was possible

Now it's time to sit down and digest all this..and rewrite the script
to make it better!


Igor made a lot of good points.  Here are my two cents worth.  You are 
using the File::Find module to traverse the file system and add new 
files along the way.  This _may_ cause problems on some file systems. 
It would probably be better to get a list of applicable files first and 
then use that list to create your new files.  And you should have some 
way to handle the situation where a file exists that already has an 
'.md5' file or an '.md5' file exists with no corresponding plain file.




John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction.   -- Albert Einstein

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: File Size Script Help - Working Version

2011-12-29 Thread Jonathan Harris
On Thu, Dec 29, 2011 at 6:39 PM, John W. Krahn jwkr...@shaw.ca wrote:

 Jonathan Harris wrote:


 Hi Igor

 Many thanks for your response

 I have started reviewing the things you said
 There are some silly mistakes in there - eg not using closedir
 It's a good lesson in script vigilance

 I found the part about opening the file handle particularly interesting
 I had no idea that

 open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or
 die $!, $/

 was possible

 Now it's time to sit down and digest all this..and rewrite the script
 to make it better!


 Igor made a lot of good points.  Here are my two cents worth.  You are
 using the File::Find module to traverse the file system and add new files
 along the way.  This _may_ cause problems on some file systems. It would
 probably be better to get a list of applicable files first and then use
 that list to create your new files.  And you should have some way to handle
 the situation where a file exists that already has an '.md5' file or an
 '.md5' file exists with no corresponding plain file.



 John
 --
 Any intelligent fool can make things bigger and
 more complex... It takes a touch of genius -
 and a lot of courage to move in the opposite
 direction.   -- Albert Einstein

 --
 To unsubscribe, e-mail: beginners-unsubscr...@perl.org
 For additional commands, e-mail: beginners-h...@perl.org
 http://learn.perl.org/



Hi John

Thanks for your 2 cents

I hadn't considered that the module wouldn't be portable
If that is the case, then maybe it would be best to
ditch File::Find altogether?
Have you had experience with the module causing issues on certain systems?

It would be a shame, as I've just got it working! - Thanks to Igor, I no
longer use the unnecessary dir handles!

I agree that it may be worth examining the directory for existing .md5
files and skipping them
I'll look in to adding that in to the code

All the best and thanks for your help

Jonathan


Re: File Size Script Help - Working Version

2011-12-29 Thread Jonathan Harris
On Fri, Dec 30, 2011 at 12:33 AM, Jonathan Harris
jtnhar...@googlemail.comwrote:



 On Thu, Dec 29, 2011 at 6:39 PM, John W. Krahn jwkr...@shaw.ca wrote:

 Jonathan Harris wrote:


 Hi Igor

 Many thanks for your response

 I have started reviewing the things you said
 There are some silly mistakes in there - eg not using closedir
 It's a good lesson in script vigilance

 I found the part about opening the file handle particularly interesting
 I had no idea that

 open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or
 die $!, $/

 was possible

 Now it's time to sit down and digest all this..and rewrite the script
 to make it better!


 Igor made a lot of good points.  Here are my two cents worth.  You are
 using the File::Find module to traverse the file system and add new files
 along the way.  This _may_ cause problems on some file systems. It would
 probably be better to get a list of applicable files first and then use
 that list to create your new files.  And you should have some way to handle
 the situation where a file exists that already has an '.md5' file or an
 '.md5' file exists with no corresponding plain file.



 John
 --
 Any intelligent fool can make things bigger and
 more complex... It takes a touch of genius -
 and a lot of courage to move in the opposite
 direction.   -- Albert Einstein

 --
 To unsubscribe, e-mail: beginners-unsubscr...@perl.org
 For additional commands, e-mail: beginners-h...@perl.org
 http://learn.perl.org/



 Hi John

 Thanks for your 2 cents

 I hadn't considered that the module wouldn't be portable
 If that is the case, then maybe it would be best to
 ditch File::Find altogether?
 Have you had experience with the module causing issues on certain systems?

 It would be a shame, as I've just got it working! - Thanks to Igor, I no
 longer use the unnecessary dir handles!

 I agree that it may be worth examining the directory for existing .md5
 files and skipping them
 I'll look in to adding that in to the code

 All the best and thanks for your help

 Jonathan


Hi All

Final question for Igor

I tried to use your suggestion

open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or
die $!, $/


but it returned an error on the command line:

'Not a directory'

At which point the program dies (which is what it is supposed to do!)

I used it inside the loop - sorry to bug you for clarification





if ($dircontents=~/^\./ || -d $dircontents) {

next;

}


This is also to avoid the file .DS_Store


#


FInally, I was advised by a C programmer to declare all variables at the
start of a program to avoid memory issues

Is this not necessary in Perl?


The rest of it is going really well - hope to post new and improved code
soon!


Re: File Size Script Help - Working Version

2011-12-29 Thread John W. Krahn

Jonathan Harris wrote:

On Thu, Dec 29, 2011 at 6:39 PM, John W. Krahnjwkr...@shaw.ca  wrote:


Igor made a lot of good points.  Here are my two cents worth.  You are
using the File::Find module to traverse the file system and add new files
along the way.  This _may_ cause problems on some file systems. It would
probably be better to get a list of applicable files first and then use
that list to create your new files.  And you should have some way to handle
the situation where a file exists that already has an '.md5' file or an
'.md5' file exists with no corresponding plain file.



Hi John

Thanks for your 2 cents

I hadn't considered that the module wouldn't be portable


That is not what I was implying.  I was saying that when you add new 
files to a directory that you are traversing you _may_ get irregular 
results.  It depends on how your operating system updates directory entries.




If that is the case, then maybe it would be best to
ditch File::Find altogether?
Have you had experience with the module causing issues on certain systems?

It would be a shame, as I've just got it working! - Thanks to Igor, I no
longer use the unnecessary dir handles!

I agree that it may be worth examining the directory for existing .md5
files and skipping them
I'll look in to adding that in to the code




John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction.   -- Albert Einstein

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/




Re: File Size Script Help - Working Version

2011-12-29 Thread John W. Krahn

Jonathan Harris wrote:


FInally, I was advised by a C programmer to declare all variables at the
start of a program to avoid memory issues

Is this not necessary in Perl?


It is not really necessary in C either.



John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction.   -- Albert Einstein

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/