On 04/29/2012 10:41 AM, lina wrote:
On Sun, Apr 29, 2012 at 11:26 PM, Lawrence Statton<lawre...@cluon.com> wrote:
On 04/29/2012 10:21 AM, lina wrote:
Hi,
I have a text file like:
$ more sample.tex
aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}
bbb\cite{inhibitor}aaa
sub read_tex{
open my $fh, '<', @_;
while(<$fh>){
if(/cite\{(.+?)\}/){
push @citeditems,split/,/,$1;
}
}
close($fh);
}
It only extracts the first \cite part; it fails to extract e1, f1, f2,
f3, or however many items happen to be cited.
Can someone suggest how to improve the match part?
Thanks with best regards,
Your regexp only asks for a single \cite... match in each line.
perldoc perlretut
Search for "Global Matching" roughly half-way down the page.
if($_ =~ m/cite\{(.+?)\}/g){
I have never, in a decade of using Perl every day, used $1 or backrefs.
My personal preference is to always do matching in a list context, e.g.
my @thing = $target =~ m/Foo: (\w+) bar: (\w+) baz: (\w+)/;
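As a minimal runnable sketch of that list-context style: the input string below is made up for illustration, and only the pattern comes from the line above. An empty list means the match failed, so the array doubles as the success test.

```perl
use strict;
use warnings;

# Hypothetical input, invented for this example.
my $target = 'Foo: alpha bar: beta baz: gamma';

# In list context the match returns the captured groups,
# or an empty list if the pattern did not match.
my @thing = $target =~ m/Foo: (\w+) bar: (\w+) baz: (\w+)/;

if (@thing) {
    print join('|', @thing), "\n";
} else {
    print "no match\n";
}
```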
From perlretut, I quote:
In list context, "//g" returns a list of matched groupings,
or if there are no groupings, a list of matches to
the whole regexp. So if we wanted just the words, we could use
@word = ($x =~ /(\w+)/g); # matches,
# $word[0] = 'cat'
# $word[1] = 'dog'
# $word[2] = 'house'
(Note that the docs (at least on my copy of perl) have a typo ... it
says @words, not @word.)
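For comparison, in scalar context //g finds one match per evaluation, which is why an "if" around it still sees only the first \cite on a line. A "while" loop walks through them one at a time. This is a sketch of that alternative using $1, not the list-context style I prefer; the sample line is taken from the original post.

```perl
use strict;
use warnings;

# First line of sample.tex from the original post.
my $line = 'aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}';

my @seen;
# Each pass through the condition resumes where the previous match ended.
while ($line =~ m/cite\{(.+?)\}/g) {
    push @seen, $1;
}

print "$_\n" for @seen;
```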
So - let's take a quick pass at your problem, breaking it down into pieces.
First - let's get a list of the contents of each cite{...} into an array
per line.
while (my $line = <$fh>) {
my @cite_for_line = $line =~ m/cite\{(.+)\}/g;
print ">>$_<<\n" for @cite_for_line;
}
Which produces (incorrectly) ...
>>d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3<<
>>inhibitor<<
Hrm ... it looks like the regexp is matching "too much".
If we go back to perlretut, we'll find around line 746 (your page may
vary) the helpful paragraph:
For all of these quantifiers, Perl will try to match as much of
the string as possible, while still allowing the regexp to
succeed. Thus with "/a?.../", Perl will first try to match the
regexp with the "a" present; if that fails, Perl will try to
match the regexp without the "a" present. For the quantifier
"*", we get the following:
$x = "the cat in the hat";
$x =~ /^(.*)(cat)(.*)$/; # matches,
# $1 = 'the '
# $2 = 'cat'
# $3 = ' in the hat'
Which is what we might expect, the match finds the only "cat"
in the string and locks onto it. Consider, however, this
regexp:
$x =~ /^(.*)(at)(.*)$/; # matches,
# $1 = 'the cat in the h'
# $2 = 'at'
# $3 = '' (0 characters match)
So, we are getting "greedy" matches (search for greedy in perlretut for
more information on that.)
What we want in our regexp in YOUR case is a "non greedy" match, which
(cutting to the chase) looks like THIS:
my @cite_for_line = $line =~ m/cite\{(.+?)\}/g;
The only difference is that there is now a "?" after the "+" inside the
match grouping, which says "rather than pick the LARGEST segment that
matches, pick the SMALLEST".
Running this code gives the output:
>>d1,d2<<
>>e1<<
>>f1,f2,f3<<
>>inhibitor<<
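To see the two behaviours side by side, here is a small sketch that runs the greedy and non-greedy patterns against the first sample line. The third pattern, a negated character class, is a common alternative that was not part of this thread; it is included only as an assumption about what else would work for input like this.

```perl
use strict;
use warnings;

# First line of sample.tex from the original post.
my $line = 'aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}';

# Greedy: .+ runs on to the LAST closing brace on the line.
my @greedy  = $line =~ m/cite\{(.+)\}/g;

# Non-greedy: .+? stops at the FIRST closing brace it can.
my @lazy    = $line =~ m/cite\{(.+?)\}/g;

# Alternative (not from the thread): match anything except a closing brace.
my @negated = $line =~ m/cite\{([^}]+)\}/g;

print scalar(@greedy),  " greedy match(es)\n";
print scalar(@lazy),    " non-greedy match(es)\n";
print scalar(@negated), " negated-class match(es)\n";
```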
Great .. now, we can break up each of those elements using the split /,/
that you used before....
my @cite;
while (my $line = <$fh>) {
my @cite_for_line = $line =~ m/cite\{(.+?)\}/g;
push @cite, split /,/, $_ for @cite_for_line;
}
print "$_\n" for @cite;
Which gives the (one assumes correct) answer of:
d1
d2
e1
f1
f2
f3
inhibitor
If you want to be even more clever, you can remove the intermediate
temporary variable @cite_for_line by doing this:
while (my $line = <$fh>) {
push @cite, split /,/, $_ for $line =~ m/cite\{(.+?)\}/g;
}
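Putting the pieces together, here is one way the whole thing might look as a self-contained script. The sub name cited_items is made up for this sketch, and an in-memory filehandle stands in for sample.tex so it runs without any files on disk; swap in a real open on your file name as needed.

```perl
use strict;
use warnings;

# Hypothetical helper: return every key cited in the text read from $fh.
sub cited_items {
    my ($fh) = @_;
    my @cite;
    while (my $line = <$fh>) {
        # Each match is the inside of one \cite{...}; split it on commas.
        push @cite, split /,/, $_ for $line =~ m/cite\{(.+?)\}/g;
    }
    return @cite;
}

# Contents of sample.tex from the original post, held in a string.
my $sample = <<'END_TEX';
aaa \cite{d1,d2},ddd \cite{e1},ccc \cite{f1,f2,f3}
bbb\cite{inhibitor}aaa
END_TEX

# Opening a reference to a scalar gives an in-memory filehandle.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @items = cited_items($fh);
close $fh;

print "$_\n" for @items;
```

Returning the list from the sub, rather than pushing onto a global array, keeps the result with the caller and lets "use strict" stay on.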
--L