Re: File Size Script Help - Working Version
Hi folks, happy new year to everyone. ) John, you're right, of course. ) The filenames in nested directories could well overlap, and using $File::Find::name would be safer. Didn't think of that as a big problem, though, as original script (with 'opendir') ignored all the nested folders overall. Jonathan, no, you don't have to store the filenames as array: complete pathnames of the file can't be repeated. It'll be sufficient just to change this line: $filedata{$_} = [$filesize, $filemd5] for $filedata{$File::Find::name} = [$filesize, $filemd5] (and replace catfile in the writing block as well, as %filedata keys will now be the whole filenames themselves). On sorting: cmp and = is not the same: former compares strings, latter - numbers. So, for example, 'abc' cmp 'def' gives you -1, but 'abc' = 'def' gives 0 (and warnings about non-numeric args as well). It's nice to know the difference, but... do you really need to sort the output in your script? What output? ) It makes no difference in what order your .md5 files will be created, right? And you don't need to print the list of files processed (as I did in my test script, that's why the ordering was ever mentioned). As for $_, the problem John mentioned is logical, not 'perlical': as $_ variable is assigned a filename processed within File::Find target sub, and files in different directories could have the same names (but not full names, with absolute path attached), it may cause a bit of confusion when they DO have the same names. ) Generally speaking, $_ usage is for comfort and speed (yes, thinking of it as of 'it' word is right )). Of course, you can reassign it, but it'll make your scripts bigger (sometimes _much_ bigger) without adding much clarity, in my opinion. But that, again, is a matter of taste. For me, I use $_ almost every time I process shallow collections (hashes or arrays, doesn't matter). When two-level (or more complex) data structure is processed, it's usually required to use a temporary variable - but even then inner layers can be iterated with $_ easily. -- iD
Re: File Size Script Help - Working Version
Hello: On Sat, Dec 31, 2011 at 02:56:50AM +0200, Igor Dovgiy wrote: $filedata{$_} = [$filesize, $filemd5]; *snip* my ($size, $md5) = @{ $filedata{$filename} }; Alternatively, store a nested hash-reference: $filedata{$File::Find::name} = { md5 = $file_md5, size = $file_size, }; # ... my ($size, $md5) = @{$filedata{$filename}}{qw/size md5/}; That way you don't need to remember which [arbitrary] order they're in, and if the order changes, or more fields are added, the meaning of subsequent code doesn't change. Regards, -- Brandon McCaig bamcc...@gmail.com bamcc...@castopulence.org Castopulence Software https://www.castopulence.org/ Blog http://www.bamccaig.com/ perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }. q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.}; tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say' signature.asc Description: Digital signature
Re: File Size Script Help - Working Version
On Sat, Dec 31, 2011 at 4:29 AM, John W. Krahn jwkr...@shaw.ca wrote: Igor Dovgiy wrote: Great work, Jonathan! Notice how simple your script has become - and that's a good sign as well in Perl. :) We can make it even simpler, however. As you probably know, Perl has two fundamental types of collections: arrays (where data is stored as a sequence of elements, data chunks) and hashes (where data chunks are unordered, but stored with some unique key used to retrieve it). Sometimes hashes are used just to sort out (non-)unique data, but that's another story. Now look at this line: push @{$files{$filesize}}, $File::Find::name; Don't you see something... weird? You're using hash where filesizes are the keys - and because, yes, they may well be non-unique, you have to store arrays of filenames in your hash instead... But much more natural (at least, for me) is to organize your hash (let's call it %filedata) so that filenames (which are unique by their nature) become the keys. And some info about these files - sizes and md5-hashes - become the values. Yes, file names in a given directory _have_ to be unique, however... For example, our `wanted` (btw, its name is misleading a bit, no? may be 'process' will sound better?) sub may look as follows: find(\wanted, $path); my %filedata; sub wanted { return if substr($_, 0, 1) eq '.' || -d $_; my $filesize = -s _; open my $fh, '', $_ or die $!, $/; my $filemd5 = Digest::MD5-new-addfile($fh)**-hexdigest; close $fh; $filedata{$_} = [$filesize, $filemd5]; You are traversing a directory tree, so using $_ as the key may cause collisions across different directories. Better to use $File::Find::name which contains the full absolute path name. John -- Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction. -- Albert Einstein -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/ Hi to all on the list still following this thread - and Happy New Year! Igor...Thanks!! : ) It does feel like there has been some really good Perl learning progress being made here - and yep, I cannot believe how trimmed down the script has now become. Looking back on the original script makes me laugh! I wonder if that will become a consistent theme when writing?! Looking back to the hash - I agree that it makes far more sense to have the filenames as the keys Quoting yourself and John: filenames (which are unique by their nature) Yes, file names in a given directory _have_ to be unique. I think that we can all be in agreement then that these entries should be guaranteed to have unique keys and can have non-unique data such as file size attributed to them- therefore: push @{$files{$filename}}, $File::Find::name; When sorting the hash, there seems to well established code for this eg: sorting by file size: foreach (sort {$filedata{$b} = $filedata{$a}} keys %filedata { ## should sort so that the highest value file size is first ... } As far as I'm aware, = and cmp are the same thing Is there a question of precedence over them? I assume that = has a higher precedence Interestingly, this is now the second time in this thread that we have been warned against using $_ From John: $_ as the key may cause collisions across different directories From Shlomi: The $_ variable can be easily devastated. You should use a lexical one. I believe that I understand the use of $_ as the default variable; indeed, the documentation on CPAN about File::Find states the usage of $_ in the module However, it seems that it is a variable that it so easily destroyed and many have warned against using it If this is the case, why would we choose (or be required) to use it in the first place? I have read the 'Elements to Avoid' page, as recommended by Shlomi http://perl-begin.org/tutorials/bad-elements/ which is very useful Would it be correct to say that $_ should be re-assigned asap whenever using Perl? I couldn't find any exceptions that state that it is ok to use it # Sincere thanks again to you all for your contributions I hope that others reading this list are learning as much as I am! All the best Jonathan
Re: File Size Script Help - Working Version
Hi Jonathan, Argh, really stupid mistake by me. ) But let's use it to explain some points a bit further, shall we? A skilled craftsman knows his tools well, and Perl programmer (with CPAN as THE collection of tools of all sizes and meanings) has an advantage here: even if documentation is a bit vague about what's going on, we are (usually) able to check the code itself to find the answers. ) By browsing File::Spec source (found via 'Source' link within the 'File::Spec' page at CPAN)... http://cpansearch.perl.org/src/SMUELLER/PathTools-3.33/lib/File/Spec.pm ...we soon discover that this module is essentially an adapter for modules like File::Spec::Unix, File::Spec::Mac, File::Spec::Win32 etc. So our search goes on (as your mention of .DS_Store file implies) over there: http://cpansearch.perl.org/src/SMUELLER/PathTools-3.33/lib/File/Spec/Mac.pm Now we may either check the documentation (which clearly states that only the last argument to catfile is considered a filename, and all the others will be concatenated with catdir), or look right into the code - and come to the same conclusion: sub catfile { my $self = shift; return '' unless @_; my $file = pop @_; return $file unless @_; my $dir = $self-catdir(@_); $file =~ s/^://s; return $dir.$file; } So what should we do now? ) Of course, give milk to our cat... and arguments to File::Spec's catfile! ) Like this: File::Spec-catfile($path, $dircontents . '.md5') ... or this... File::Spec-catfile($path, $dircontents.md5) (check 'variable interpolation in Perl' to see why it's possible - and why this is essentially the same as previous codeline) ... or even this ... File::Spec-catfile($path, join '.', $dircontents, 'md5') (but that would be a bit overkill, of course :) Speaking of overkills: you used regex (=~ /^\./) to check whether the line begins with a dot - or not. ) It's ok for this task, but you probably should know that these checks may be also done with (substr($line, 0, 1) eq '.') code, which will be a bit (up to 30% at my PC when Benchmark'ed) faster. -- iD 2011/12/30 Jonathan Harris jtnhar...@googlemail.com I tried to use your suggestion open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or die $!, $/ but it returned an error on the command line: 'Not a directory' At which point the program dies (which is what it is supposed to do!) I used it inside the loop - sorry to bug you for clarification if ($dircontents=~/^\./ || -d $dircontents) { next; } This is also to avoid the file .DS_Store
Re: File Size Script Help - Working Version
Hi John, yes, good point! Totally forgot this. ) Adding new files to a directory as you browse it is just not right, of course. Possible, but not right. ) I'd solve this by using hash with filenames as keys and collected 'result' strings (with md5 and filesizes) as values, filled by File::Find target routine. After the whole directory is processed, this hash should be 'written out' into the target directory. Another way to do it is to collect all the filenames instead into a list (with glob operator, for example), and process this list after. BTW (to Jonathan), I wonder do you really need to store this kind of data in different files? No offence... but I can hardly imagine how this data will be used later unless gathered into some array or hash. ) -- iD 2011/12/30 John W. Krahn jwkr...@shaw.ca Jonathan Harris wrote: Hi John Thanks for your 2 cents I hadn't considered that the module wouldn't be portable That is not what I was implying. I was saying that when you add new files to a directory that you are traversing you _may_ get irregular results. It depends on how your operating system updates directory entries.
Re: File Size Script Help - Working Version
On Fri, Dec 30, 2011 at 11:58 AM, Igor Dovgiy ivd.pri...@gmail.com wrote: Hi John, yes, good point! Totally forgot this. ) Adding new files to a directory as you browse it is just not right, of course. Possible, but not right. ) I'd solve this by using hash with filenames as keys and collected 'result' strings (with md5 and filesizes) as values, filled by File::Find target routine. After the whole directory is processed, this hash should be 'written out' into the target directory. Another way to do it is to collect all the filenames instead into a list (with glob operator, for example), and process this list after. BTW (to Jonathan), I wonder do you really need to store this kind of data in different files? No offence... but I can hardly imagine how this data will be used later unless gathered into some array or hash. ) -- iD 2011/12/30 John W. Krahn jwkr...@shaw.ca Jonathan Harris wrote: Hi John Thanks for your 2 cents I hadn't considered that the module wouldn't be portable That is not what I was implying. I was saying that when you add new files to a directory that you are traversing you _may_ get irregular results. It depends on how your operating system updates directory entries. Hi All John - Thanks for the clarification In this instance the script has been run on OSX - it seems that adding the files into the directory that is being traversed works ok this time However for best practice, I would certainly look into writing to a separate directory, and then moving the files back, as I appreciate that this fortune may not necessarily be repeated in a different environment! Igor - Firstly - File::Spec Thanks for your insight and well explained investigation - I have been learning a lot from this File::Spec has proven a most useful tool in joining and 'stringifying' the paths In the original post about this script, I had spoken about considering using a hash for the file data I'm still convinced that ultimately, this would be the way forwards I have found some scripts online concerning finding duplicate files They use md5 and/or file sizes to compare the files These are written into hashes Fully understanding some of these scripts is a little beyond my level at the moment I have attached an interesting one for you to look at (you may be aware of it already!) However, it has proved quite inspiring! (substr($line, 0, 1) eq '.') Haven't learned this yet! It looks like a good solution if it is so much more efficient - thanks for the introduction - I'll be reading up asap! BTW (to Jonathan), I wonder do you really need to store this kind of data in different files? No offence... but I can hardly imagine how this data will be used later unless gathered into some array or hash. ) There is a good reason for this! Talking to guys who work in video on demand, it seems that it is standard practice to do this for file delivery requirements As each video file must be identical upon receipt as it was upon delivery ( and that the files are all treated as unique delivery instances ) a separate accompanying file is required I thought that Perl would be a good choice for accomplishing this requirement as it is renowned for file handling # Thanks to everyone for your help and contributions - particularly Jim, Shlomi, John and Igor I have learned crazy amounts already! Happy New Year to you all! Jonathan finddupes3.plx Description: Binary data -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/
Re: File Size Script Help - Working Version
On Thu, Dec 29, 2011 at 03:43:19PM +, Jonathan Harris wrote: Hi All Hello Jonathan: (Disclaimer: I stayed up all night playing Skyrim and am running on about 4.5 hours of sleep.. ^_^) I think most things have already been addressed, but I think Igor might have had a bit of trouble making it clear. opendir (my $in, $path) or die Cannot open $dir: $!\n; find (\wanted, $path); close $in; opendir (my $totalin, $path) or die Cannot open $dir: $!\n; find (\cleanup, $path); close $totalin; AFAICT, it's completely nonsensical to open a directory file handle surrounding File::Find::find. I tried to /search the perldoc (just in case there's some kind of magical optimization or something) and saw no mention of 'opendir' or 'handle' (except for the special _ file handle created for stat, lstat, etc..). So it seems $in and $totalin are completely unnecessary here: File::Find will worry about opening and processing the directories for you. sub wanted { while ($dircontents = readdir($in)) { I guess this is why you are opening directory handles above, but it doesn't really make sense. You're basically only using File::Find to loop at this point, and very obscurely. :) File::Find's role in life is precisely to find you all the files within a directory tree. You're reinventing the square wheel with your use of opendir and readdir. :) The wanted subroutine is typically used to either process the file system tree outright, or store applicable files in data structures for later processing. E.g., use strict; use warnings; use File::Find; my @files; sub wanted { # Skip dot files and directories. return if substr($_, 0, 1) eq '.' || -d $_; # If current file is a normal file, push into array for # later. push @files, $File::Find::name if -f $_; } my $path = '.'; find \wanted, $path; # Now @files should be filled with a recursive list of files to # process. E.g., for my $file (@files) { my $md5name = $file.'md5'; # Etc... } my $hex = Digest::MD5-new-addfile($fh)-hex digest; I assume you meant `hexdigest' here, not 'hex digest'. $newname =~ s/\ //; Ideally if you're going to do something as obscure as this, you should comment it in both places so future readers and maintainers understand why it's done, even if they only read one half of the program. I think Igor else has already explained how to eliminate this obscurity though. :) Regards, -- Brandon McCaig bamcc...@gmail.com bamcc...@castopulence.org Castopulence Software https://www.castopulence.org/ Blog http://www.bamccaig.com/ perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }. q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.}; tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say' signature.asc Description: Digital signature
Re: File Size Script Help - Working Version
On Fri, Dec 30, 2011 at 7:11 PM, Brandon McCaig bamcc...@gmail.com wrote: On Thu, Dec 29, 2011 at 03:43:19PM +, Jonathan Harris wrote: Hi All Hello Jonathan: (Disclaimer: I stayed up all night playing Skyrim and am running on about 4.5 hours of sleep.. ^_^) I think most things have already been addressed, but I think Igor might have had a bit of trouble making it clear. opendir (my $in, $path) or die Cannot open $dir: $!\n; find (\wanted, $path); close $in; opendir (my $totalin, $path) or die Cannot open $dir: $!\n; find (\cleanup, $path); close $totalin; AFAICT, it's completely nonsensical to open a directory file handle surrounding File::Find::find. I tried to /search the perldoc (just in case there's some kind of magical optimization or something) and saw no mention of 'opendir' or 'handle' (except for the special _ file handle created for stat, lstat, etc..). So it seems $in and $totalin are completely unnecessary here: File::Find will worry about opening and processing the directories for you. sub wanted { while ($dircontents = readdir($in)) { I guess this is why you are opening directory handles above, but it doesn't really make sense. You're basically only using File::Find to loop at this point, and very obscurely. :) File::Find's role in life is precisely to find you all the files within a directory tree. You're reinventing the square wheel with your use of opendir and readdir. :) The wanted subroutine is typically used to either process the file system tree outright, or store applicable files in data structures for later processing. E.g., use strict; use warnings; use File::Find; my @files; sub wanted { # Skip dot files and directories. return if substr($_, 0, 1) eq '.' || -d $_; # If current file is a normal file, push into array for # later. push @files, $File::Find::name if -f $_; } my $path = '.'; find \wanted, $path; # Now @files should be filled with a recursive list of files to # process. E.g., for my $file (@files) { my $md5name = $file.'md5'; # Etc... } my $hex = Digest::MD5-new-addfile($fh)-hex digest; I assume you meant `hexdigest' here, not 'hex digest'. $newname =~ s/\ //; Ideally if you're going to do something as obscure as this, you should comment it in both places so future readers and maintainers understand why it's done, even if they only read one half of the program. I think Igor else has already explained how to eliminate this obscurity though. :) Regards, -- Brandon McCaig bamcc...@gmail.com bamcc...@castopulence.org Castopulence Software https://www.castopulence.org/ Blog http://www.bamccaig.com/ perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }. q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.}; tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say' -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAEBAgAGBQJO/gzlAAoJEN2n1gIi5ZPyMJoP/15LI4aelHowMiNQYFzLB2E1 nEAjPHvvHX7uTxLV8BOht9HtoQPpOwyQ/fJm2EIe59NGwjjvlKCdpCrfkfxz3cvV td0wDuUKrjzQPaACl4cNrGITZLVXe6KtZUKG2o3TjA5dlyqbes5d5F40Mh3j/fB5 L0VZpWNvp0cTrJ6x5QfTkvyQPdxKx0ARaFDQYpvR3uKfgeD28ZNatNCQQuuymLkj ABVqLzQgVVMinaj/4xXii5vedgYFI58DPWF7r0nmhUaiVvDAEFzfd1MlSvkuI8Jr EKTQdz3EjMZufRaGxX96rdZvVMEiSTcA/IXNkhis48dOwa4ebfSYm8QpQhp1E6UF QuJBRXzF9cHvD095Aw+MSqoSR+2LZS3SFBfdB9I5rdxJEoS7LMSJdJLY95dXftqR KdlcU7ds4kdqaJrnxrxB05SIhgZNq1JcCk9xZyuk7NxsPwl/6ZStr/E/V2mA+bdD bGWCuP7IKRfZ8cNd01tEYnUppYkfxRaNtYhlNmPVh6TX7rd+2Z4aZLn0anR3Q/J4 2OsPmtIOakk1jW6f41vTOqZ5cpxSv/R1H6fjZAOVACf7UlrtCgVWVLKYofcO/ryS HMsX1T3m3h9H3heSd76FiXMsgDBaMBkti31FzwnS3pOaZEpHk6eHF/dbR2p0sAbA gTvG4xDp2MLEtY03Ofhw =RFiR -END PGP SIGNATURE- HI Brandon Thanks for your response I totally agree with your first point Having now used File::Find a little more, I have seen that using opendir was totally unneccessary and have removed them from the script And guess what.it works fine without them! I think that my initial confusion arose from fundamentally misunderstanding File::Find - thinking that it required a handle, not just a path. I have also now exchanged the while loop with a foreach loop - much better! I assume you meant `hexdigest' here, not 'hex digest'. You assume correctly! Gmail has started to do annoying auto text complete - I must turn it off!! push @files, $File::Find::name if -f $_; This is nice and clean Your approach is different to what we have been discussing You seem to gather the files with File::Find and then leave that sub alone asap The processing is then done in the results of that gathering My script left the processing within sub wanted This could possible be a reason that complications arose so quickly To get file names and sizes at the same time, I am also considering my %files; sub wanted { my $filesize = (stat($_))[7]; push @{$files{$filesize}}, $File::Find::name; } find(\wanted, $path); to hash files and file size results together - then process after And yep, Igor
Re: File Size Script Help - Working Version
Great work, Jonathan! Notice how simple your script has become - and that's a good sign as well in Perl. :) We can make it even simpler, however. As you probably know, Perl has two fundamental types of collections: arrays (where data is stored as a sequence of elements, data chunks) and hashes (where data chunks are unordered, but stored with some unique key used to retrieve it). Sometimes hashes are used just to sort out (non-)unique data, but that's another story. Now look at this line: push @{$files{$filesize}}, $File::Find::name; Don't you see something... weird? You're using hash where filesizes are the keys - and because, yes, they may well be non-unique, you have to store arrays of filenames in your hash instead... But much more natural (at least, for me) is to organize your hash (let's call it %filedata) so that filenames (which are unique by their nature) become the keys. And some info about these files - sizes and md5-hashes - become the values. For example, our `wanted` (btw, its name is misleading a bit, no? may be 'process' will sound better?) sub may look as follows: find(\wanted, $path); my %filedata; sub wanted { return if substr($_, 0, 1) eq '.' || -d $_; my $filesize = -s _; open my $fh, '', $_ or die $!, $/; my $filemd5 = Digest::MD5-new-addfile($fh)-hexdigest; close $fh; $filedata{$_} = [$filesize, $filemd5]; } (*Notice how you don't have to declare the global filedata hash before the callback function is called in `find`? It's really interesting topic*) Then you'll just have to iterate over the %filedata - and it's as easy as writing... for my $filename (keys %filedata) { my ($size, $md5) = @{ $filedata{$filename} }; open my $fh, '', File::Spec-catfile($path, $filename.md5) or die $!, $/; print $fh $filename\t$size bytes\t$md5\n; close $fh; } ... yep, that easy. ) -- iD P.S. Ordering the output should be an easy task for you; hint - look up 'sort' documentation - or just use sort system routine. :) 2011/12/31 Jonathan Harris jtnhar...@googlemail.com HI Brandon Thanks for your response I totally agree with your first point Having now used File::Find a little more, I have seen that using opendir was totally unneccessary and have removed them from the script And guess what.it works fine without them! I think that my initial confusion arose from fundamentally misunderstanding File::Find - thinking that it required a handle, not just a path. I have also now exchanged the while loop with a foreach loop - much better! I assume you meant `hexdigest' here, not 'hex digest'. You assume correctly! Gmail has started to do annoying auto text complete - I must turn it off!! push @files, $File::Find::name if -f $_; This is nice and clean Your approach is different to what we have been discussing You seem to gather the files with File::Find and then leave that sub alone asap The processing is then done in the results of that gathering My script left the processing within sub wanted This could possible be a reason that complications arose so quickly To get file names and sizes at the same time, I am also considering my %files; sub wanted { my $filesize = (stat($_))[7]; push @{$files{$filesize}}, $File::Find::name; } find(\wanted, $path); to hash files and file size results together - then process after And yep, Igor has been thorough and very helpful Thanks again for your input on this - hope you manage to get some sleep! All the best Jonathan
Re: File Size Script Help - Working Version
Igor Dovgiy wrote: Great work, Jonathan! Notice how simple your script has become - and that's a good sign as well in Perl. :) We can make it even simpler, however. As you probably know, Perl has two fundamental types of collections: arrays (where data is stored as a sequence of elements, data chunks) and hashes (where data chunks are unordered, but stored with some unique key used to retrieve it). Sometimes hashes are used just to sort out (non-)unique data, but that's another story. Now look at this line: push @{$files{$filesize}}, $File::Find::name; Don't you see something... weird? You're using hash where filesizes are the keys - and because, yes, they may well be non-unique, you have to store arrays of filenames in your hash instead... But much more natural (at least, for me) is to organize your hash (let's call it %filedata) so that filenames (which are unique by their nature) become the keys. And some info about these files - sizes and md5-hashes - become the values. Yes, file names in a given directory _have_ to be unique, however... For example, our `wanted` (btw, its name is misleading a bit, no? may be 'process' will sound better?) sub may look as follows: find(\wanted, $path); my %filedata; sub wanted { return if substr($_, 0, 1) eq '.' || -d $_; my $filesize = -s _; open my $fh, '', $_ or die $!, $/; my $filemd5 = Digest::MD5-new-addfile($fh)-hexdigest; close $fh; $filedata{$_} = [$filesize, $filemd5]; You are traversing a directory tree, so using $_ as the key may cause collisions across different directories. Better to use $File::Find::name which contains the full absolute path name. John -- Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction. -- Albert Einstein -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/
Re: File Size Script Help - Working Version
On Thu, Dec 29, 2011 at 5:08 PM, Igor Dovgiy ivd.pri...@gmail.com wrote: Hi Jonathan, Let's review your script a bit, shall we? ) It's definitely good for a starter, but still has some rough places. #!/usr/bin/perl # md5-test.plx use warnings; use strict; use File::Find; use Digest::MD5; use File::Spec; So far, so good. ) my $dir = shift || '/Users/jonharris/Documents/begperl'; Nice touch, setting up a default param. ) The name of variable might seem too generic to some, but then again, it's the only directory we deal with, so... my ($dircontents, $path, @directory, $fh, $wr_fh); Incoming! Well, it's usually better to declare your variables right before you'll really need them... Your script is short, so you'll hardly have a chance to forget what $fh and $wr_fh mean, though. ) @directory = $dir; $path = File::Spec-catfile( @directory, $dircontents ); Ahem. At least three 'wtf' moments for me. ) First of all, File::Spec-catfile is really just a glorified join operator with some additional operations depending on which system you're using. So, second, it makes a little sense to convert $dir into @directory (documentation example is just that, an example) and to pass there undefined $dircontents as well. But the major one is why you'd ever have to pass your $dir through File::Spec? It's, you know, user input... ) opendir (my $in, $path) or die Cannot open $dir: $!\n; So you're trying to open $path, but warn about failure to open $dir? ) But then again, that's a minor quarrel, considering this: find (\wanted, $path); See, File::Find is convenient method which _emulates_ the whole opendir-readdir-closedir pattern for a given directory. The 'wanted' subroutine (passed by ref) will be called for each file found in $path. It's described really well in perldoc (perldoc File::Find). close $in; Opendir, but close - and not closedir? Now I'm confused. ) opendir (my $totalin, $path) or die Cannot open $dir: $!\n; find (\cleanup, $path); close $totalin; You don't have to use different variable to store temporary file handle (which will be closed in three lines) - and that will save you a few moments spent working out a new (but rational) name for it. :) But then again, you don't need to open the same dir twice: you can call cleanup (with the same 'find (\cleanup)... ' syntax) whenever you want. And you don't really need cleanup... whoops, going too far too soon. :) print \n\nAll Done!\n\n; sub wanted { while ($dircontents = readdir($in)) { Did I say that you're using two alternative methods of doing the same thing? ) But there's another big 'no-no' here: you're using external variable ($dircontents) when you really have perfectly zero reasons to do so. Of course, you don't need to push dirhandle ($in) from outer scape into this sub, when using find... ($File::Find::dir will do), but that's explainable at least. ) if ($dircontents=~/^\./ || -d $dircontents) { next; } So now the script ignores all the files which names begin with '.', and you really wanted just to ignore '.' and '..' ... ) my $bytes = -s $dircontents; print $dircontents, \n; print $bytes, bytes, \tSo far so good!\n; Yes. ) open $fh, , $dircontents or die $!; open $wr_fh, +, $path $dircontents.md5 or die $!; ## was unable to concatenate here, hence sub cleanup to remove the ' ' What was wrong with ... open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or die $!, $/ ? my $hex = Digest::MD5-new-addfile($fh)-hex digest; print Hex Digest: , $hex, \n\n; print $wr_fh $hex, \n, $bytes, \n\n; All looks great for now: you're calculating md5 and size, and writing it into file with md5 extension... return($hex); ... but now you're abruptly jumping out of the while block, making the whole point of cycle, well, completely pointless. Not great. close $wr_fh; close $fh; } } # The following is mostly not original code - thanks to the author! sub cleanup { my @filelist = readdir($totalin); foreach my $oldname (@filelist) { next if -d $oldname; my $newname = $oldname; $newname =~ s/\ //; So you don't have spaces in your filenames. Great for you. ) rename $oldname, $newname; } } # End # And here we finish. Computers are not smart. They're dumb. But they're fast. And obedient. ) That's why they're really helpful in letting you do what you're trying to do... but only if you KNOW what you're trying to do. Imagine that you're - and not your computer - will be doing this task. Sit in one place - and write down your program as you (and not your computer) will be running it. Step by step. Bit by bit. Then convert your notes into some Perl form - and you'll instantly see the difference between now and then. ) -- iD Hi Igor Many thanks for your response I have started reviewing the things you said There are some silly mistakes in there - eg not using closedir It's a good
Re: File Size Script Help - Working Version
Hi Jonathan, Let's review your script a bit, shall we? ) It's definitely good for a starter, but still has some rough places. #!/usr/bin/perl # md5-test.plx use warnings; use strict; use File::Find; use Digest::MD5; use File::Spec; So far, so good. ) my $dir = shift || '/Users/jonharris/Documents/begperl'; Nice touch, setting up a default param. ) The name of variable might seem too generic to some, but then again, it's the only directory we deal with, so... my ($dircontents, $path, @directory, $fh, $wr_fh); Incoming! Well, it's usually better to declare your variables right before you'll really need them... Your script is short, so you'll hardly have a chance to forget what $fh and $wr_fh mean, though. ) @directory = $dir; $path = File::Spec-catfile( @directory, $dircontents ); Ahem. At least three 'wtf' moments for me. ) First of all, File::Spec-catfile is really just a glorified join operator with some additional operations depending on which system you're using. So, second, it makes a little sense to convert $dir into @directory (documentation example is just that, an example) and to pass there undefined $dircontents as well. But the major one is why you'd ever have to pass your $dir through File::Spec? It's, you know, user input... ) opendir (my $in, $path) or die Cannot open $dir: $!\n; So you're trying to open $path, but warn about failure to open $dir? ) But then again, that's a minor quarrel, considering this: find (\wanted, $path); See, File::Find is convenient method which _emulates_ the whole opendir-readdir-closedir pattern for a given directory. The 'wanted' subroutine (passed by ref) will be called for each file found in $path. It's described really well in perldoc (perldoc File::Find). close $in; Opendir, but close - and not closedir? Now I'm confused. ) opendir (my $totalin, $path) or die Cannot open $dir: $!\n; find (\cleanup, $path); close $totalin; You don't have to use different variable to store temporary file handle (which will be closed in three lines) - and that will save you a few moments spent working out a new (but rational) name for it. :) But then again, you don't need to open the same dir twice: you can call cleanup (with the same 'find (\cleanup)... ' syntax) whenever you want. And you don't really need cleanup... whoops, going too far too soon. :) print \n\nAll Done!\n\n; sub wanted { while ($dircontents = readdir($in)) { Did I say that you're using two alternative methods of doing the same thing? ) But there's another big 'no-no' here: you're using external variable ($dircontents) when you really have perfectly zero reasons to do so. Of course, you don't need to push dirhandle ($in) from outer scape into this sub, when using find... ($File::Find::dir will do), but that's explainable at least. ) if ($dircontents=~/^\./ || -d $dircontents) { next; } So now the script ignores all the files which names begin with '.', and you really wanted just to ignore '.' and '..' ... ) my $bytes = -s $dircontents; print $dircontents, \n; print $bytes, bytes, \tSo far so good!\n; Yes. ) open $fh, , $dircontents or die $!; open $wr_fh, +, $path $dircontents.md5 or die $!; ## was unable to concatenate here, hence sub cleanup to remove the ' ' What was wrong with ... open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or die $!, $/ ? my $hex = Digest::MD5-new-addfile($fh)-hex digest; print Hex Digest: , $hex, \n\n; print $wr_fh $hex, \n, $bytes, \n\n; All looks great for now: you're calculating md5 and size, and writing it into file with md5 extension... return($hex); ... but now you're abruptly jumping out of the while block, making the whole point of cycle, well, completely pointless. Not great. close $wr_fh; close $fh; } } # The following is mostly not original code - thanks to the author! sub cleanup { my @filelist = readdir($totalin); foreach my $oldname (@filelist) { next if -d $oldname; my $newname = $oldname; $newname =~ s/\ //; So you don't have spaces in your filenames. Great for you. ) rename $oldname, $newname; } } # End # And here we finish. Computers are not smart. They're dumb. But they're fast. And obedient. ) That's why they're really helpful in letting you do what you're trying to do... but only if you KNOW what you're trying to do. Imagine that you're - and not your computer - will be doing this task. Sit in one place - and write down your program as you (and not your computer) will be running it. Step by step. Bit by bit. Then convert your notes into some Perl form - and you'll instantly see the difference between now and then. ) -- iD
re: File Size Script Help - Working Version
Hi All Firstly, many thanks for your help previously (19/12/11) - it has led to making a useable script I don't think it's brilliantly written, it seems a little bodged together to me... but works fine - not a bad result for a first script If you are new to this problem and are interested in the previous thread, I have attached it for you as a text file I have done everything I can think of now to follow the previous advice The script is portable, skips directories, creates digests of files, uses better Perl practice (e.g., no more barewords, correct lexical file handles etc) I only have a couple of questions left - wondering if you can help : - One thing that was recommended was to ensure that the File Handles are opened outside of the loop - I really can't figure out how to do this and keep the program working! Doesn't it need to be inside the loop to be iterated over? - Also, I had to open two file handles to get addfile()-hexdigest to work so that the value could be passed - this can't be correct?! - Writing back was messy - I was struggling with concatenating variables to keep the script portable - Is it possible to put values into a hash, and then print each hash entry to a separate file? There are clearly better ways to achieve this result - all suggestions are gratefully received! Thanks again Jon Here's the script: # # This program reads in files from a directory, produces a hex digest and writes the hex, along with # the file size into a newly created file with the same name and a '.md5' extension, to the original directory #!/usr/bin/perl # md5-test.plx use warnings; use strict; use File::Find; use Digest::MD5; use File::Spec; my $dir = shift || '/Users/jonharris/Documents/begperl'; my ($dircontents, $path, @directory, $fh, $wr_fh); @directory = $dir; $path = File::Spec-catfile( @directory, $dircontents ); opendir (my $in, $path) or die Cannot open $dir: $!\n; find (\wanted, $path); close $in; opendir (my $totalin, $path) or die Cannot open $dir: $!\n; find (\cleanup, $path); close $totalin; print \n\nAll Done!\n\n; sub wanted { while ($dircontents = readdir($in)) { if ($dircontents=~/^\./ || -d $dircontents) { next; } my $bytes = -s $dircontents; print $dircontents, \n; print $bytes, bytes, \tSo far so good!\n; open $fh, , $dircontents or die $!; open $wr_fh, +, $path $dircontents.md5 or die $!; ## was unable to concatenate here, hence sub cleanup to remove the ' ' my $hex = Digest::MD5-new-addfile($fh)-hex digest; print Hex Digest: , $hex, \n\n; print $wr_fh $hex, \n, $bytes, \n\n; return($hex); close $wr_fh; close $fh; } } # The following is mostly not original code - thanks to the author! sub cleanup { my @filelist = readdir($totalin); foreach my $oldname (@filelist) { next if -d $oldname; my $newname = $oldname; $newname =~ s/\ //; rename $oldname, $newname; } } # End # Jonathan Harris Dec 19 (10 days ago) to beginners, me Hi Perl Pros This is my first call for help I am a totally new, self teaching, Perl hopeful If my approach to this script is simply wrong, please let me know as it will help my learning! The script aims to: 1) Read in a directory either from the command line, or from a default path 2) Produce a hash for future checksum 3) Write this (hex digest) to a separate file, in a sub directory of the parent, which has the same name and a .md5 extension 4) Check the original file for its size 5) Add this data to the newly created file on a new line (in bytes) I have a script that will do most of this, except for analysing the file size - I think that the file size being analysed may be the md5 object result as the same value is printed to each file I am running out of ideas and would appreciate any help you could give! I have tried using File::stat::OO and File::stat - but to no avail - I could be using them incorrectly! Many thanks in advance... Jon. Here are some details: System: Mac OSX 10.7.2 Perl version 5.12 Script: #!/usr/bin/perl # md5-test-3.plx use warnings; use strict; use Digest::MD5; my $filesize = 0; my $dir = shift || '/Users/jonharris/Documents/begperl'; opendir (DH, $dir) or die Couldn't open directory: $!; my $md5 = Digest::MD5-new; while ($_ = readdir(DH)) { $md5-add($_); $filesize = (stat(DH))[7]; Is it necessary to put the following into a new loop? foreach ($_) { open FH, /Users/jonharris/Documents/begperl/md5/$_.md5 or die $!; binmode(FH); print FH $md5-hexdigest, \n, $filesize; } close FH; } close DH; print \n$dir\n\n; ### Jim Gibson via perl.org Dec 19 (10 days ago) to beginners On 12/19/11 Mon Dec 19, 2011 11:32 AM, Jonathan Harris jtnhar...@googlemail.com scribbled: Hi Perl Pros This is my first call for help I am a totally new, self teaching, Perl hopeful If my approach to this script is simply wrong, please let me know as it will help my learning! The
Re: File Size Script Help - Working Version
Jonathan Harris wrote: Hi Igor Many thanks for your response I have started reviewing the things you said There are some silly mistakes in there - eg not using closedir It's a good lesson in script vigilance I found the part about opening the file handle particularly interesting I had no idea that open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or die $!, $/ was possible Now it's time to sit down and digest all this..and rewrite the script to make it better! Igor made a lot of good points. Here are my two cents worth. You are using the File::Find module to traverse the file system and add new files along the way. This _may_ cause problems on some file systems. It would probably be better to get a list of applicable files first and then use that list to create your new files. And you should have some way to handle the situation where a file exists that already has an '.md5' file or an '.md5' file exists with no corresponding plain file. John -- Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction. -- Albert Einstein -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/
Re: File Size Script Help - Working Version
On Thu, Dec 29, 2011 at 6:39 PM, John W. Krahn jwkr...@shaw.ca wrote: Jonathan Harris wrote: Hi Igor Many thanks for your response I have started reviewing the things you said There are some silly mistakes in there - eg not using closedir It's a good lesson in script vigilance I found the part about opening the file handle particularly interesting I had no idea that open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or die $!, $/ was possible Now it's time to sit down and digest all this..and rewrite the script to make it better! Igor made a lot of good points. Here are my two cents worth. You are using the File::Find module to traverse the file system and add new files along the way. This _may_ cause problems on some file systems. It would probably be better to get a list of applicable files first and then use that list to create your new files. And you should have some way to handle the situation where a file exists that already has an '.md5' file or an '.md5' file exists with no corresponding plain file. John -- Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction. -- Albert Einstein -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/ Hi John Thanks for your 2 cents I hadn't considered that the module wouldn't be portable If that is the case, then maybe it would be best to ditch File::Find altogether? Have you had experience with the module causing issues on certain systems? It would be a shame, as I've just got it working! - Thanks to Igor, I no longer use the unnecessary dir handles! I agree that it may be worth examining the directory for existing .md5 files and skipping them I'll look in to adding that in to the code All the best and thanks for your help Jonathan
Re: File Size Script Help - Working Version
On Fri, Dec 30, 2011 at 12:33 AM, Jonathan Harris jtnhar...@googlemail.comwrote: On Thu, Dec 29, 2011 at 6:39 PM, John W. Krahn jwkr...@shaw.ca wrote: Jonathan Harris wrote: Hi Igor Many thanks for your response I have started reviewing the things you said There are some silly mistakes in there - eg not using closedir It's a good lesson in script vigilance I found the part about opening the file handle particularly interesting I had no idea that open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or die $!, $/ was possible Now it's time to sit down and digest all this..and rewrite the script to make it better! Igor made a lot of good points. Here are my two cents worth. You are using the File::Find module to traverse the file system and add new files along the way. This _may_ cause problems on some file systems. It would probably be better to get a list of applicable files first and then use that list to create your new files. And you should have some way to handle the situation where a file exists that already has an '.md5' file or an '.md5' file exists with no corresponding plain file. John -- Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction. -- Albert Einstein -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/ Hi John Thanks for your 2 cents I hadn't considered that the module wouldn't be portable If that is the case, then maybe it would be best to ditch File::Find altogether? Have you had experience with the module causing issues on certain systems? It would be a shame, as I've just got it working! - Thanks to Igor, I no longer use the unnecessary dir handles! I agree that it may be worth examining the directory for existing .md5 files and skipping them I'll look in to adding that in to the code All the best and thanks for your help Jonathan Hi All Final question for Igor I tried to use your suggestion open my $wr_fh, '', File::Spec-catfile($path, $dircontents, '.md5') or die $!, $/ but it returned an error on the command line: 'Not a directory' At which point the program dies (which is what it is supposed to do!) I used it inside the loop - sorry to bug you for clarification if ($dircontents=~/^\./ || -d $dircontents) { next; } This is also to avoid the file .DS_Store # FInally, I was advised by a C programmer to declare all variables at the start of a program to avoid memory issues Is this not necessary in Perl? The rest of it is going really well - hope to post new and improved code soon!
Re: File Size Script Help - Working Version
Jonathan Harris wrote: On Thu, Dec 29, 2011 at 6:39 PM, John W. Krahnjwkr...@shaw.ca wrote: Igor made a lot of good points. Here are my two cents worth. You are using the File::Find module to traverse the file system and add new files along the way. This _may_ cause problems on some file systems. It would probably be better to get a list of applicable files first and then use that list to create your new files. And you should have some way to handle the situation where a file exists that already has an '.md5' file or an '.md5' file exists with no corresponding plain file. Hi John Thanks for your 2 cents I hadn't considered that the module wouldn't be portable That is not what I was implying. I was saying that when you add new files to a directory that you are traversing you _may_ get irregular results. It depends on how your operating system updates directory entries. If that is the case, then maybe it would be best to ditch File::Find altogether? Have you had experience with the module causing issues on certain systems? It would be a shame, as I've just got it working! - Thanks to Igor, I no longer use the unnecessary dir handles! I agree that it may be worth examining the directory for existing .md5 files and skipping them I'll look in to adding that in to the code John -- Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction. -- Albert Einstein -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/
Re: File Size Script Help - Working Version
Jonathan Harris wrote: FInally, I was advised by a C programmer to declare all variables at the start of a program to avoid memory issues Is this not necessary in Perl? It is not really necessary in C either. John -- Any intelligent fool can make things bigger and more complex... It takes a touch of genius - and a lot of courage to move in the opposite direction. -- Albert Einstein -- To unsubscribe, e-mail: beginners-unsubscr...@perl.org For additional commands, e-mail: beginners-h...@perl.org http://learn.perl.org/