Re: Elegant quoted word parsing

Jeff 'japhy' Pinyan
Sun, 13 Jun 2004 05:39:49 -0700

On Jun 10, Beau E. Cox said:

>sub parse_words
>{
>    my $line = shift;
>    my @words = ();
>
>    $_ = $line;


You should localize $_ if you're going to assign to it explicitly;
otherwise you overwrite whatever the caller had in $_.

  local $_ = $line;
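
A small sketch of what can go wrong without the local.  The two sub names
below are made up for this illustration (they are not from the original
code), and the foreach loop simply stands in for any caller that happens
to be using $_ at the time:

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.
sub trim_clobber { $_ = shift;       s/^\s+//; return $_ }
sub trim_local   { local $_ = shift; s/^\s+//; return $_ }

my @items = ('keep me');
my ($seen_after_local, $seen_after_clobber);

for (@items) {                    # $_ is aliased to $items[0]
    trim_local('  x');
    $seen_after_local = $_;       # caller's $_ is untouched
}

for (@items) {
    trim_clobber('  x');
    $seen_after_clobber = $_;     # caller's $_ (and $items[0]) was overwritten
}

print "after trim_local:   '$seen_after_local'\n";
print "after trim_clobber: '$seen_after_clobber'\n";
```

The second loop also changes $items[0] itself, since foreach aliases $_ to
the array element.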

>    while( 1 ) {
>        s/^\s*(.*?)\s*$/$1/;

This is not a very efficient way to remove leading and trailing whitespace
from a string (and it breaks if there are newlines INSIDE the string,
because . does not match a newline without the /s modifier).  Sometimes,
one must resist the urge to try to do everything in one regex.

  s/^\s+//;
  s/\s+$//;

will end up being much faster in removing leading and trailing spaces
(although for reasons I don't want to get into, the trailing-spaces regex
is not nearly as efficient as I'd like it to be).
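
To see the newline problem concretely, here is a minimal sketch that runs
both versions on the same string.  The variable names and the sample
string are mine, chosen only to show the difference:

```perl
use strict;
use warnings;

my $original = "  a\nb  ";    # whitespace at both ends, newline inside

# One-regex version from the quoted code.  The . cannot cross the
# embedded newline, so the pattern fails and nothing is changed.
(my $one_regex = $original) =~ s/^\s*(.*?)\s*$/$1/;

# Two-regex version: each substitution only looks at one end.
my $two_regex = $original;
$two_regex =~ s/^\s+//;
$two_regex =~ s/\s+$//;

print "one regex trimmed?  ", ($one_regex eq $original ? "no" : "yes"), "\n";
print "two regexes trimmed? ", ($two_regex eq "a\nb"   ? "yes" : "no"), "\n";
```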

>        last unless length $_;
>        pos( $_ ) = 0;
>        if( /^"(.*?)"/g   || /^'(.*?)'/g   ||
>            /^\/(.*?)\//g || /^\((.*?)\)/g ||
>            /^{(.*?)}/g   || /^\[(.*?)\]/g ||
>            /^<(.*?)>/g   || /^#(.*?)#/g
>            ) {

I would suggest a change in the mechanism you're using.  Instead of doing

  if ( /^(p1)/g or /^(p2)/g or /^(p3)/g or /^(p4)/g ) {
    push @w, $1;
    $_ = substr $_, pos($_);
  }

I would suggest using what I call the "inch-worm" approach, which uses the
\G anchor and the /gc modifiers.

  if ( /\G(p1)/gc or /\G(p2)/gc or /\G(p3)/gc or /\G(p4)/gc ) {
    push @w, $1;
  }

You don't need to keep track of pos() or modify $_ yourself.  The /c
modifier changes the meaning of the /g modifier slightly:  it says that if
the regex doesn't match, it should NOT reset pos(), which a failed /g regex
normally would.  The \G anchor says "match IMMEDIATELY where the last
/g match left off", or more specifically, it anchors the regex to match at
the location of pos().
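
Putting the pieces together, an inch-worm tokenizer might look roughly
like the sketch below.  This is only an illustration under my own
assumptions: the name parse_words_sketch is made up, it handles just the
double- and single-quote delimiters rather than the full list in the
quoted code, and the bare-word fallback (\S+) is my addition so that the
loop always makes progress.

```perl
use strict;
use warnings;

# Hypothetical sketch: split a line into words, keeping quoted runs whole.
sub parse_words_sketch {
    my ($line) = @_;
    my @words;

    pos($line) = 0;                          # start inching from the front
    while (1) {
        $line =~ /\G\s+/gc;                  # skip whitespace, if any
        last if pos($line) >= length $line;  # nothing left to read

        # Each alternative is anchored at pos() by \G.  Thanks to /c,
        # a failed attempt leaves pos() alone for the next alternative.
        if (   $line =~ /\G"(.*?)"/gc
            or $line =~ /\G'(.*?)'/gc
            or $line =~ /\G(\S+)/gc )        # assumed fallback: a bare word
        {
            push @words, $1;
        }
    }
    return @words;
}

my @words = parse_words_sketch(q{foo "bar baz" 'qux quux'  zap});
print join('|', @words), "\n";
```

Note that the string is never modified and nothing is copied with
substr(); pos() alone tracks how far the parse has advanced.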

Here's a demonstration of /gc versus /g:

  $str = "perl";
  $str =~ /../g;  # sets pos($str) to 2
  if ($str =~ /(...)/g or $str =~ /(..)/g) {
    $x = $1;  # $x is 'pe'
  }

  $str = "perl";
  $str =~ /../g;  # sets pos($str) to 2
  if ($str =~ /(...)/gc or $str =~ /(..)/gc) {
    $y = $1;  # $y is 'rl'
  }

$x is 'pe' because when we do /(...)/g on $str, the regex fails to match,
and pos($str) is reset, so then /(..)/g matches the first two characters
of $str.  $y is 'rl' because of the /c modifier -- when /(...)/gc fails,
pos($str) is NOT changed, so the next regex, /(..)/gc, matches, and since
pos($str) is 2, it matches starting at that location (or later).
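
If you want to watch pos() itself rather than infer it from $1, a minimal
sketch like this one shows the same behavior.  The pattern /zzz/ is just
something chosen so that it cannot match:

```perl
use strict;
use warnings;

my $str = "perl";

$str =~ /../g;
my $start = pos($str);       # position after the first two characters

$str =~ /zzz/g;              # fails without /c ...
my $after_g = pos($str);     # ... so pos() has been reset to undef

$str =~ /../g;               # match again from the start of the string
$str =~ /zzz/gc;             # fails with /c ...
my $after_gc = pos($str);    # ... so pos() is left where it was

print "start:    $start\n";
print "after /g:  ", (defined $after_g ? $after_g : 'undef'), "\n";
print "after /gc: $after_gc\n";
```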

Here's a demonstration of \G:

  $str = "Perl";
  $str =~ /(..)/g;   # puts 'Pe' in $1 and sets pos($str) to 2
  $str =~ /\G(.)/g;  # this puts 'r' in $1
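
A short sketch of what \G buys you compared with an unanchored /g match.
Without \G, the regex is free to scan forward from pos() until it finds a
match; with \G, it must match right at pos() or fail.  The variable names
are mine:

```perl
use strict;
use warnings;

my $str = "Perl";
$str =~ /(..)/g;                              # pos($str) is now 2

# Anchored: the character at pos() is 'r', not 'l', so this fails.
my $anchored       = ($str =~ /\Gl/gc) ? 1 : 0;
my $pos_after_fail = pos($str);               # /c keeps pos() in place

# Unanchored: the regex scans forward from pos() and finds the 'l'.
my $floating        = ($str =~ /l/gc) ? 1 : 0;
my $pos_after_float = pos($str);

print "anchored matched: $anchored (pos $pos_after_fail)\n";
print "floating matched: $floating (pos $pos_after_float)\n";
```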

I'd say more, but I'm on vacation and I need to leave for church, so I'll
leave additional comments for later tonight or tomorrow morning.

-- 
Jeff "japhy" Pinyan      [EMAIL PROTECTED]      http://www.pobox.com/~japhy/
RPI Acacia brother #734   http://www.perlmonks.org/   http://www.cpan.org/
CPAN ID: PINYAN    [Need a programmer?  If you like my work, let me know.]
<stu> what does y/// stand for?  <tenderpuss> why, yansliterate of course.

