Re: Perl best practices

Paul Lussier Thu, 13 Sep 2007 20:59:39 -0700

For all those just tuning in, Ben and I are in violent and vocal
agreement with each other, and at this point are merely quibbling over
semantics :)

"Ben Scott" <[EMAIL PROTECTED]> writes:

>   Er, yes.  "blah" in this case was meta-syntactic, and I was still
> thinking of the first example in this discussion, which had LTS
> (Leaning Toothpick Syndrome).  I will use // if the regexp doesn't
> suffer from LTS.  I use m{} or s{}{} when the regexp otherwise
> contains slashes.

Something about the use of {} and () in regexps really bothers me.  I
think it's because in general, perl overloads too many things to begin
with.  To use {} for regexp delimiting is confusing and completely
non-intuitive to me. They are meant to denote either a hash element or
a code block.  Trying to make my mind use them for regexps hurts :)

To avoid LTS and backslashitis in a regexp, I tend to do something like:

   s|/foo/bar|/bar/baz|g;

The | is close enough to / that it's instantly clear to me.
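
To make the delimiter choice concrete, here is a small sketch showing
that the three spellings perform the same substitution.  The path
string is made up purely for illustration:

```perl
use strict;
use warnings;

# Hypothetical input path, used only for illustration.
my $path = '/foo/bar/index.html';

# The same substitution written with three delimiter styles.
(my $slashes = $path) =~ s/\/foo\/bar/\/bar\/baz/g;  # LTS: every / escaped
(my $pipes   = $path) =~ s|/foo/bar|/bar/baz|g;      # | as the delimiter
(my $braces  = $path) =~ s{/foo/bar}{/bar/baz}g;     # {} as the delimiters

print "$slashes\n$pipes\n$braces\n";
```

One trade-off of the | style: a pattern that needs | for alternation
becomes awkward to write, which is not a problem with {}.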

> "A foolish consistency is the hobgoblin of little minds."  -- Ralph
> Waldo Emerson

Yeah, what the old, dead guy said :)

> As a completely non-contrived example, here is an illustration of
> when I think implicit use of $_ is very appropriate.
[...]
> sub condense_type($) {
> # condense a MIME content type to something shorter, for people
> $_ = $_[0];
> s{^text/plain$}         {text};
> s{^text/html$}          {html};
> s{^text/css$}           {css};
> s{^text/javascript$}    {jscript};
> s{^text/xml$}           {xml};
> s{^text/}               {};
> s{^image/.*}            {image};
> s{^video/.*}            {video};
> s{^audio/.*}            {audio};
> s{^multipart/byteranges}{bytes};
> s{^application/}        {};
> s{^octet-stream$}       {binary};
> s{^x-javascript$}       {jscript};
> s{^x-shockwave-flash$}  {flash};
> s{\*/\*}                {stars};        # some content gets marked */*
> return $_;
> }
>
> I could have used a regular named variable (say, $type) and
> repeated "$type =~" over and over again for 14 lines.  I believe
> that would actually harm the readability of the code.

Agreed, though, we (by we, I mean the company which currently puts
food on my table :) do things like this slightly differently,
completely avoiding the $_ dilemma:

  my $match = shift;
  my %mimeTypes =
    ('^text/plain$'          => "text"   ,
     '^text/html$'           => "html"   ,
     '^text/css$'            => "css"    ,
     '^text/javascript$'     => "jscript",
     '^text/xml$'            => "xml"    ,
     '^text/'                => ""       ,
     '^image/.*'             => "image"  ,
     '^video/.*'             => "video"  ,
     '^audio/.*'             => "audio"  ,
     '^multipart/byteranges' => "bytes"  ,
     '^application/'         => ""       ,
     '^octet-stream$'        => "binary" ,
     '^x-javascript$'        => "jscript",
     '^x-shockwave-flash$'   => "flash"  ,
     '\*/\*'                 => "stars"  ,# some content gets marked */*
    );

  # the hash keys are the patterns; $match is the string being tested
  foreach my $mtype (keys %mimeTypes) {
    if ($match =~ /$mtype/) {
      return $mimeTypes{$mtype};
    }
  }

Also, the foreach could be written as:

  map { ( $match =~ /$_/) && return $mimeTypes{$_}} keys %mimeTypes

Though I find this completely readable, it suffers from the problem
that it's not easily extensible.  If you decide you need to do more
processing within the loop, the foreach is much easier to extend.  You
just plonk another line in there and operate on the already existing
variables.  With the map() style loop, this becomes more difficult.

So, though I love map(), I would have to argue this is not the best
place to use it.  Once readability has been achieved, the next
priority ought to be future maintenance and extensibility, IMO.
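
One caveat with driving this from a hash: keys() returns the keys in
no guaranteed order, and rules such as '^text/' and '^text/plain$' are
order-sensitive.  A rough sketch of a table-driven variant that keeps
the rules in an array instead is below.  The names @rules and
condense_type_table are made up for illustration, and the rules are
applied as substitutions in sequence to mirror the quoted
condense_type(), so treat it as one possible shape rather than the
way it is done here:

```perl
use strict;
use warnings;

# Ordered [pattern, replacement] pairs.  An array preserves the order,
# so specific rules run before the general ones that follow them.
my @rules = (
    [ '^text/plain$'          => 'text'    ],
    [ '^text/html$'           => 'html'    ],
    [ '^text/css$'            => 'css'     ],
    [ '^text/javascript$'     => 'jscript' ],
    [ '^text/xml$'            => 'xml'     ],
    [ '^text/'                => ''        ],
    [ '^image/.*'             => 'image'   ],
    [ '^video/.*'             => 'video'   ],
    [ '^audio/.*'             => 'audio'   ],
    [ '^multipart/byteranges' => 'bytes'   ],
    [ '^application/'         => ''        ],
    [ '^octet-stream$'        => 'binary'  ],
    [ '^x-javascript$'        => 'jscript' ],
    [ '^x-shockwave-flash$'   => 'flash'   ],
    [ '\*/\*'                 => 'stars'   ],  # some content gets marked */*
);

# Hypothetical helper: applies every rule in order to a copy of the
# argument and returns the condensed type.
sub condense_type_table {
    my ($type) = @_;
    foreach my $rule (@rules) {
        my ($pattern, $replacement) = @$rule;
        $type =~ s/$pattern/$replacement/;
    }
    return $type;
}

print condense_type_table('application/octet-stream'), "\n";
```

Adding a rule is still a one-line change, but its position in the
list now matters, which is the price of getting a predictable order.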

>   As a counter-example from the same script, here's something using
> explicit names and grouping which isn't strictly needed, because I
> find it clearer:
>
> sub condense_size($) {
> # condense a byte-count into K/M/G
> my $size = $_[0];
> if    ($size > $gigabyte) { $size = ($size / $gigabyte) . "G"; }
> elsif ($size > $megabyte) { $size = ($size / $megabyte) . "M"; }
> elsif ($size > $kilobyte) { $size = ($size / $kilobyte) . "K"; }
> return $size;
> }

I tend to like this style too, though I'd use a slightly different
syntax.  It's otherwise exactly the same.

  my $size = shift;
  ($size > $gigabyte) && return (($size/$gigabyte) . "G");
  ($size > $megabyte) && return (($size/$megabyte) . "M");
  ($size > $kilobyte) && return (($size/$kilobyte) . "K");
  return $size;

Or, perhaps, if you wanted to be a little more cleverer:

  my  %units = ($gigabyte => sub { int($_[0]/$gigabyte) . 'G'},
                $megabyte => sub { int($_[0]/$megabyte) . 'M'},
                $kilobyte => sub { int($_[0]/$kilobyte) . 'K'},
             );

  foreach my $base (sort {$b <=> $a } keys %units) {
    if ($size > $base) {
      print ($units{$base}->($size),"\n");
      last;
    }
  }

This last approach is too clever by 1, but it is also slightly easier
to maintain, given that to add another size, you add one line.  You
can even add the one line anywhere you want in the hash.  Since the
keys are numbers and the sort compares them with <=>, they'll sort
correctly in the foreach loop.  If you wanted to be *really*
clever(-2 points!), you could probably do this:

  my  %units = ($gigabyte => sub { int($_[0]/$_[1]) . 'G'},
                $megabyte => sub { int($_[0]/$_[1]) . 'M'},
                $kilobyte => sub { int($_[0]/$_[1]) . 'K'},
             );

and call into the hash like this:

  print ($units{$base}->($size, $base),"\n");

But now that I've done that, it ought to be obvious that we can easily
factor out the division so it only happens once:

  my  %units = ($gigabyte => 'G',
                $terabyte => 'T',
                $megabyte => 'M',
                $kilobyte => 'K',
               );

  foreach my $base (sort {$b <=> $a } keys %units) {
    if ($size > $base) {
      print (int($size/$base),$units{$base},"\n");
      last;
    }
  } 

This last form has all the advantages of readability, extensibility,
and ease of maintenance, without any of the repetition.
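
For anyone who wants to run it, here is a self-contained sketch of
that last form.  The unit values (powers of 1024) and the wrapping
into a sub named condense_size are assumptions on my part, since the
definitions of $kilobyte and friends never appear in this thread; the
sub returns the string rather than printing it so it is easy to test:

```perl
use strict;
use warnings;

# Assumed unit sizes (powers of 1024); not shown in the original code.
my $kilobyte = 1024;
my $megabyte = 1024 * $kilobyte;
my $gigabyte = 1024 * $megabyte;
my $terabyte = 1024 * $gigabyte;

# Deliberately listed out of order: the numeric sort below fixes that.
my %units = (
    $gigabyte => 'G',
    $terabyte => 'T',
    $megabyte => 'M',
    $kilobyte => 'K',
);

# Returns the condensed size as a string, largest matching unit first.
sub condense_size {
    my ($size) = @_;
    foreach my $base (sort { $b <=> $a } keys %units) {
        return int($size / $base) . $units{$base} if $size > $base;
    }
    return $size;    # smaller than every unit: leave as a byte count
}

print condense_size(5 * $megabyte), "\n";
```

Because the comparison is > rather than >=, a size of exactly 1024
comes back as 1024, the same as in the code it is modelled on.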

>> Right, which is why you shouldn't depend upon $_ in these contexts and
>> explicitly state a variable name ...
>
>   A named variable would be *two* more things.  ;-)

Not if done correctly, as I did above :) (at least IMO)

>>   my $file = shift;
>
>   You're using an implicit argument to shift there.  ;-)

Yeah, I am.  We don't actually do that here at work.  We do things
like this:

  my ($file) = assertNumArgs(@_, 1);

Which enforces that we only pass in 1 argument if that's all we're
expecting.  assertNumArgs() compares the number passed to it with the
number of elements in @_, and croaks() if they don't match.  It's very
nice error checking.  Alas, assertNumArgs() is a part of a huge
home-grown library of routines we have.
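
Since the real thing lives in that in-house library, I can't paste
it, but a minimal stand-in might look like the sketch below.  This is
a guess at the behavior described above, not the actual
implementation: it assumes the expected count is the last element
passed in, and file_name_length is just an illustrative caller:

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical stand-in for the in-house routine.  The expected count
# is the LAST element, matching the call style assertNumArgs(@_, 1).
sub assertNumArgs {
    my $expected = pop @_;
    croak "expected $expected argument(s), got " . scalar(@_)
        unless @_ == $expected;
    return @_;
}

# Illustrative caller that insists on exactly one argument.
sub file_name_length {
    my ($file) = assertNumArgs(@_, 1);
    return length $file;
}

print file_name_length('notes.txt'), "\n";
```

Calling file_name_length('a', 'b') croaks, and because croak reports
from the caller's perspective, the message points at the offending
call site rather than at assertNumArgs itself.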

>> The compelling argument is this: It should be blatantly obvious to
>> whomever is going to be maintaining your code in 6 months what you
>> were thinking
>
>   I do not think I could agree with you more here.  The thing you seem
> to be ignoring in my argument is that "clarity" is subjective and
> often depends on context.  :)

I'm not ignoring it.  I'm saying that where you have stopped because
you think it is sufficiently clear can in fact be made cleaner and
clearer for the sake of both clarity and future maintenance.

I'm assuming/guessing that you're coming at this argument from the
point of view of "I'm going to write a simple script to do something
not overly complex, so why waste a lot of time with it, it's good
enough".

I'm coming at it from a software development context where whatever is
being written is ultimately going to be part of a much larger whole.

We have literally hundreds of thousands of lines of perl code which
has evolved over a 7+ year period.  Any number of people have had
their hands in this code.  If strict coding standards and practices
had not been adhered to, it would be nigh impossible to maintain,
never mind develop further.  As it is, I, who am *NOT* a
software developer by training (or any other definition :) and who am
nothing more than self-taught in perl, have been able to jump into
this code and contribute quite a bit to it.  I have both furthered
code written by others, and others have extended/debugged mine.

> I'm not going to penalize the competent because there are others who
> are incompetent.

You don't have to.  I don't think any of the code I wrote above
penalizes the competent.  If anything, it's nice, clean code written
without any assumptions other than that of some (very little) competence of
the reader.

Other than the intermediary steps where I was using anonymous
subroutines as the value of a hash element, which is arguably an advanced
programming technique (not at all isolated to perl), there was nothing
very complicated about the code I wrote.  Especially the last version
which used nothing more than an extremely simple hash and a normal
foreach loop.  In my opinion, that last version is both elegant and
easy to read, maintain, and debug.

> Let me restate: A pattern which is powerful and easy-to-use is
> sometimes unavoidably non-obvious.

Agreed.  But it can be written such that it appears obvious.

Was my code not both powerful and easy-to-use?  I'll concede that it
might not be obvious, but when you look at it, is it not obvious what
I'm doing ?

> Or perhaps an example of a similar principle in a different context:
> When invoking tar from a shell script, which of the following do you
> prefer?
>
> tar --create --gzip --verbose --preserve-permissions
> --file=/path/to/file.tar.gz /etc
>
> tar czvpf /path/to/file.tar.gz /etc

Neither:

tar='/usr/bin/tar'
tarOptions='<some list of options in any way that makes sense>'
archivePath='/etc'
outputPath='/path/to/file.tar'

 $tar $tarOptions $outputPath $archivePath

Yes, I know I'm being a p.i.t.a. here :) But I also think that there's
a slight difference between the long/short options for a command line
utility and the nuances of perl written any way you feel like it at the
moment.

There are only a finite number of options for any given command.  The
same is not true for writing perl code.

Either of the above ways you used tar is perfectly fine, IMO, since
the tar man page is fairly short and concise compared to learning a
programming language.  Though, I will note, that if you look at the
process table of an AMANDA client using tar, you will see that they
opted to use all long option names.  Why?  Ease of understanding when
debugging, or, what I refer to as the 3:00am test.

If I need to debug that shell script above (or, as has been more
recently the case, amanda taking too long), it saves me a *huge*
amount of time to be able to see in the ps output that tar is using
--preserve-permissions or --listed-incremental vs. having to resort to
reading the man page to re-remember what the -p or -g means,
especially if the args are not in a sane, logical order, e.g. cpgfz,
which may not be overly intuitive at 3:00 in the morning when that
"simple script that can't possibly have a bug in it" goes kerflooey.

Which would you rather have:

  tar lScCgf - /u1 /var/lib/amanda/gnutar-lists/rc_u1_0.new

or
  tar --create --file - --directory /u1 --one-file-system \
      --listed-incremental /var/lib/amanda/gnutar-lists/rc_u1_0.new \
      --sparse --ignore-failed-read --totals .

I was fortunate that the authors of AMANDA chose the latter.  I
consider myself competent, and not the least penalized by their
choice; rather, thankful for it.

I now return you to your regularly scheduled off-topic conversation :)

-- 
Seeya,
Paul
_______________________________________________
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/
