Re: how to get two matches out

Lawrence Statton Sun, 29 Apr 2012 09:11:06 -0700

On 04/29/2012 10:41 AM, lina wrote:

On Sun, Apr 29, 2012 at 11:26 PM, Lawrence Statton<lawre...@cluon.com>  wrote:

On 04/29/2012 10:21 AM, lina wrote:


Hi,

I have a text file like:

$ more sample.tex

aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}
bbb\cite{inhibitor}aaa


sub read_tex{
        open my $fh, '<', @_;
        while(<$fh>){
                if(/cite\{(.+?)\}/){
                push @citeditems,split/,/,$1;
                }
        }
        close($fh);
}

It only extracts the first \cite part; it fails to extract e1, f1, f2,
f3, and the number of cited items is not known in advance.

Can someone give me some suggestions on how to improve the match part?

Thanks with best regards,


Your regexp only asks for a single \cite... match in each line.

perldoc perlretut

Search for "Global Matching" roughly half-way down the page.


        if($_ =~ m/cite\{(.+?)\}/g){


I have never, in a decade of using Perl every day, used $1 or backrefs.

My personal preference is to always do matching in a list context, e.g.

   my @thing = $target =~ m/Foo: (\w+) bar: (\w+) baz: (\w+)/;

From perlretut, I quote:

       In list context, "//g" returns a list of matched groupings,
       or if there are no groupings, a list of matches to
       the whole regexp.  So if we wanted just the words, we could use

           @word = ($x =~ /(\w+)/g);  # matches,
                                       # $word[0] = 'cat'
                                       # $word[1] = 'dog'
                                       # $word[2] = 'house'

(Note that the docs, at least in my copy of perl, have a typo: they say @words, not @word.)
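
To make the two list-context behaviours concrete, here is a minimal sketch. The sample strings are made up for illustration and are not from the original question. Without /g you get one value per capture group; with /g you get the capture from every match.

```perl
use strict;
use warnings;

# Without /g: list context returns one value per capture group.
# (The string in $target is an invented example.)
my $target = 'Foo: a bar: b baz: c';
my @thing  = $target =~ m/Foo: (\w+) bar: (\w+) baz: (\w+)/;
print "@thing\n";

# With /g: list context returns the captured text of every match.
my $x    = 'cat dog house';
my @word = $x =~ /(\w+)/g;
print "@word\n";
```

Neither form touches $1 directly; the captures land straight in the array.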


So - let's take a quick pass at your problem, breaking it down into pieces.

First - let's get a list of the contents of each cite{...} into an array per line.



 while (my $line = <$fh>) {
   my @cite_for_line = $line =~ m/cite\{(.+)\}/g;
   print ">>$_<<\n" for @cite_for_line;
 }

Which produces (incorrectly) ...

  >>d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3<<
  >>inhibitor<<

Hrm ... it looks like the regexp is matching "too much"

If we go back to perlretut, we'll find around line 746 (your page may vary) the helpful paragraph:


       For all of these quantifiers, Perl will try to match as much of
       the string as possible, while still allowing the regexp to
       succeed.  Thus with "/a?.../", Perl will first try to match the
       regexp with the "a" present; if that fails, Perl will try to
       match the regexp without the "a" present.  For the quantifier
       "*", we get the following:

           $x = "the cat in the hat";
           $x =~ /^(.*)(cat)(.*)$/; # matches,
                                    # $1 = 'the '
                                    # $2 = 'cat'
                                    # $3 = ' in the hat'

       Which is what we might expect, the match finds the only "cat"
       in the string and locks onto it.  Consider, however, this
       regexp:

           $x =~ /^(.*)(at)(.*)$/; # matches,
                                   # $1 = 'the cat in the h'
                                   # $2 = 'at'
                                   # $3 = ''   (0 characters match)

So, we are getting "greedy" matches (search for greedy in perlretut for more information on that).

What we want in our regexp in YOUR case is a "non greedy" match, which (cutting to the chase) looks like THIS:


    my @cite_for_line = $line =~ m/cite\{(.+?)\}/g;

The only difference is there is now a "?" inside the match grouping, which says "rather than pick the LARGEST segment that matches, pick the SMALLEST".


Running this code gives the output:

>>d1,d2<<
>>e1<<
>>f1,f2,f3<<
>>inhibitor<<
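
As an aside, a common alternative to the non-greedy quantifier is a negated character class, which says "anything except a closing brace". This is not what the code in this thread uses; it is a sketch of the same idea under the assumption that cite keys never contain "}". The variable names @lazy and @negated are only for illustration.

```perl
use strict;
use warnings;

my $line = 'aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}';

# Non-greedy: .+? takes as little as possible before the next "}".
my @lazy = $line =~ m/cite\{(.+?)\}/g;

# Alternative (assumption: keys never contain "}"): match non-"}" characters.
my @negated = $line =~ m/cite\{([^}]+)\}/g;

print ">>$_<<\n" for @lazy;
print "same result\n" if "@lazy" eq "@negated";
```

On this sample line both patterns return the same three groups.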

Great .. now, we can break up each of those elements using the split /,/ that you used before....


  my @cite;

  while (my $line = <$fh>) {
    my @cite_for_line = $line =~ m/cite\{(.+?)\}/g;
    push @cite, split /,/, $_ for @cite_for_line;
  }

  print "$_\n" for @cite;


Which gives the (one assumes correct) answer of:

d1
d2
e1
f1
f2
f3
inhibitor

If you want to be even more clever, you can remove the intermediate temporary variable @cite_for_line by doing this:


  while (my $line = <$fh>) {
    push @cite, split /,/, $_ for $line =~ m/cite\{(.+?)\}/g;
  }
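
Putting the pieces together, here is a self-contained sketch you can run as-is. The sub name cited_keys is hypothetical, and the in-memory filehandle is only a stand-in for sample.tex so that no file is needed; with a real file you would open the path instead and check the return value of open in the same way.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every key cited on any line of a filehandle.
sub cited_keys {
    my ($fh) = @_;
    my @cite;
    while (my $line = <$fh>) {
        # One non-greedy match per \cite{...}, then split each on commas.
        push @cite, split /,/, $_ for $line =~ m/cite\{(.+?)\}/g;
    }
    return @cite;
}

# In-memory stand-in for sample.tex (same two lines as in the question).
my $sample = <<'END';
aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}
bbb\cite{inhibitor}aaa
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @keys = cited_keys($fh);
close $fh;

print "$_\n" for @keys;
```

Passing the filehandle in, rather than a filename, keeps the matching logic separate from the file handling.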


--L




