Hi Bob,
In one of my tests I added '>' to the character class, giving [^\w->], but I
still didn't get 'B0->'. I've just learned about character classes, so I am
trying to get a better handle on how they work.

A lot of my titles contain physics terms like B0->K-, and I would consider
'B0->' one word and 'K-' another.

Thanks for the quick reply.

Chee

On Fri, 30 Jul 2004, Bob Showalter wrote:

> Date: Fri, 30 Jul 2004 13:29:54 -0400
> From: Bob Showalter <[EMAIL PROTECTED]>
> To: 'Charlotte Hee' <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
> Subject: RE: problem with splitting on "words"
>
> Charlotte Hee wrote:
> > Hello All,
> >
> > I am having trouble splitting words from titles from a list of
> > research papers. I thought I could split the title into words like so:
> >
> > #!/usr/local/bin/perl
> > use locale;
> >
> > %forums = ( 1 => 'B0->K+K-Ks',
> >             2 => 'B+->K+KsKs Decays',
> >             3 => 'Measurement of the Total Width',
> >             4 => 'Asymmetries in B0->K0s pi0 Decays'
> >           );
> >
> > foreach $forum ( sort keys %forums ){
> >   my $title = $forums{$forum};
> >   foreach $w (split /[^\w-]+/, $title) {
> >     next unless ($w =~ /^[A-Za-z]/);
> >     $title =~ /\b\Q$w\E\b/;
> >     print "Journal $forum indexed word = " . ucfirst($w) . "\n";
> >   }
> > }
> >
> > exit;
> >
> > But the results show that I'm losing some characters:
> >
> > Journal 1 indexed word = B0-     # this should be B0->
>
> No, because > matches the character class [^\w-]
>
> > Journal 1 indexed word = K       # what happened to the '+'?
>
> Same as above.
>
> > Journal 1 indexed word = K-Ks
> >
> > Journal 2 indexed word = B       # '+->' missing
>
> The '-' is there, but you're only printing tokens that start with a letter.
>
> > Journal 2 indexed word = K       # '+' missing
> > Journal 2 indexed word = KsKs
> > Journal 2 indexed word = Decays
> >
> > Journal 3 indexed word = Measurement
> > Journal 3 indexed word = Of
> > Journal 3 indexed word = The
> > Journal 3 indexed word = Total
> > Journal 3 indexed word = Width
> >
> > Journal 4 indexed word = Asymmetries
> > Journal 4 indexed word = In
> > Journal 4 indexed word = B0-     # should be 'B0->'
> > Journal 4 indexed word = K0s
> > Journal 4 indexed word = Pi0
> > Journal 4 indexed word = Decays
> >
> > These are only example titles but the other titles have similar
> > characters in them as part of a "word". I tried adding the '-' and
> > '>' to my character class but that did not work. What am I doing
> > wrong here?
>
> It's not clear what you're defining as a "word". I'm wondering why you
> aren't just splitting on whitespace?
>
>    foreach $w (split ' ', $title) {
>
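
For reference, here is a minimal sketch comparing the three splits discussed in this thread on two of the sample titles. The variable names (@original, @with_gt, @by_space) are illustrative and not from the original script. The class with '>' added is written as [^\w>-], with the hyphen last, on the assumption that this is equivalent to [^\w->]: Perl cannot build a range starting at \w, so it treats that '-' as a literal hyphen and warns about a false range when warnings are enabled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $title = 'B0->K+K-Ks';

# Original class: any run of characters that are neither word characters
# nor '-' acts as a separator, so both '>' and '+' are consumed.
my @original = split /[^\w-]+/, $title;

# With '>' added to the class, only '+' is a separator here.
# Nothing separates '>' from the following 'K', so they stay together.
my @with_gt = split /[^\w>-]+/, $title;

# Bob's suggestion: split on whitespace only.
my @by_space = split ' ', 'Asymmetries in B0->K0s pi0 Decays';

print join('|', @original), "\n";   # B0-|K|K-Ks
print join('|', @with_gt),  "\n";   # B0->K|K-Ks
print join('|', @by_space), "\n";   # Asymmetries|in|B0->K0s|pi0|Decays
```

Since split only cuts where a separator is found, and '>' sits directly next to 'K' in these titles, adding '>' to the class keeps it in the token but cannot produce 'B0->' on its own. Getting 'B0->' and 'K-' as separate words would probably mean matching tokens with a pattern rather than splitting, which depends on how a "word" is defined.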