Hi Bob,

In one of my tests I added the '>' to the character class [^\w->] but
I still didn't get 'B0->'. I've just learned about character classes
so I am trying to get a better handle on how they work. A lot of my titles
contain physics terms like B0->K- and I would consider 'B0->' a word and
'K-' another word.
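
A minimal sketch of one way to get tokens like these, by matching the
"words" directly instead of splitting on separators. The helper name
tokenize_title and the token pattern are assumptions for illustration, not
part of the original script: a token is taken to be a letter, any word
characters, and then an optional '->' (possibly preceded by a '+' or '-')
or a single trailing '+' or '-'.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): pull tokens out of a
# title by matching them, instead of splitting on separators.
sub tokenize_title {
    my ($title) = @_;
    # A token is a letter, then any word characters, then optionally
    # '->' (possibly preceded by a charge sign) or a lone '+' / '-'.
    # The '->' alternative comes first so 'B0->' is not cut short at 'B0-'.
    return $title =~ /[A-Za-z]\w*(?:[+-]?->|[+-])?/g;
}

for my $title ('B0->K+K-Ks', 'B+->K+KsKs Decays') {
    print join(' | ', tokenize_title($title)), "\n";
}
# prints:
#   B0-> | K+ | K- | Ks
#   B+-> | K+ | KsKs | Decays
```

Under this assumed pattern a hyphenated word such as 'Time-dependent' would
come out as 'Time-' and 'dependent', so the pattern would need adjusting if
titles contain ordinary hyphenated words.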

Thanks for the quick reply.  Chee

On Fri, 30 Jul 2004, Bob Showalter wrote:

> Date: Fri, 30 Jul 2004 13:29:54 -0400
> From: Bob Showalter <[EMAIL PROTECTED]>
> To: 'Charlotte Hee' <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
> Subject: RE: problem with splitting on "words"
>
> Charlotte Hee wrote:
> > Hello All,
> >
> > I am having trouble splitting words from titles from a list of
> > research papers. I thought I could split the title into words like so:
> >
> >   #!/usr/local/bin/perl
> >   use locale;
> >
> >   %forums = ( 1 => 'B0->K+K-Ks',
> >               2 => 'B+->K+KsKs Decays',
> >               3 => 'Measurement of the Total Width',
> >               4 => 'Asymmetries in B0->K0s pi0 Decays'
> >   );
> >
> >   foreach $forum ( sort keys %forums ){
> >      my $title = $forums{$forum};
> >      foreach $w (split /[^\w-]+/, $title) {
> >         next unless ($w =~ /^[A-Za-z]/);
> >         $title =~ /\b\Q$w\E\b/;
> >         print "Journal $forum indexed word = " .  ucfirst($w) . "\n";
> >      }
> >   }
> >
> > exit;
> >
> > But the results show that I'm losing some characters:
> >
> > Journal 1 indexed word = B0-    # this should be B0->
>
> No, because '>' matches the character class [^\w-], so split treats it as
> a separator and drops it.
>
> > Journal 1 indexed word = K      # what happened to the '+'?
>
> Same as above.
>
> > Journal 1 indexed word = K-Ks
> >
> > Journal 2 indexed word = B      # '+->' missing
>
> The '-' is there, but you're only printing tokens that start with a letter.
>
> > Journal 2 indexed word = K      # '+' missing
> > Journal 2 indexed word = KsKs
> > Journal 2 indexed word = Decays
> >
> > Journal 3 indexed word = Measurement
> > Journal 3 indexed word = Of
> > Journal 3 indexed word = The
> > Journal 3 indexed word = Total
> > Journal 3 indexed word = Width
> >
> > Journal 4 indexed word = Asymmetries
> > Journal 4 indexed word = In
> > Journal 4 indexed word = B0-   # should be 'B0->'
> > Journal 4 indexed word = K0s
> > Journal 4 indexed word = Pi0
> > Journal 4 indexed word = Decays
> >
> > These are only example titles but the other titles have similar
> > characters in them as part of a "word". I tried adding the '-' and
> > '>' to my character class but that did not work. What am I doing
> > wrong here?
>
> It's not clear what you're defining as a "word". I'm wondering why you
> aren't just splitting on whitespace?
>
>    foreach $w (split ' ', $title) {
>
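
To illustrate the splits discussed in this thread, here is a small sketch
run on the first example title. The variable names are made up for the
illustration, and the class with '>' added is written as [^\w>-] (hyphen
last) on the assumption that this is what was intended by [^\w->].

```perl
use strict;
use warnings;

my $title = 'B0->K+K-Ks';

# Original split: anything that is not a word character or '-' separates
# tokens, so both '>' and '+' are consumed as separators.
my @by_class = split /[^\w-]+/, $title;

# With '>' added to the class, '>' is kept, but now nothing separates
# 'B0->' from the following 'K', so they stay in one token.
my @with_gt = split /[^\w>-]+/, $title;

# Splitting on whitespace only: the whole decay string is a single token.
my @by_space = split ' ', $title;

print join('|', @by_class), "\n";   # B0-|K|K-Ks
print join('|', @with_gt), "\n";    # B0->K|K-Ks
print join('|', @by_space), "\n";   # B0->K+K-Ks
```

The middle case suggests why adding '>' to the class alone does not yield
'B0->' as its own word: split only removes separators, and there is no
separator between '>' and 'K'.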


