Ying Liu wrote at Mon, 03 Jun 2002 17:11:22 +0200:

> Is there a good method to do this? I need to remove the stop words from
> the comment field of every record. There are about 20,000 records. The
> comments look like this:
>
> Yersinia pestis strain Nepal (aka CDC 516 or 369 isolated from human)
> 16S-23S intergenic region amplified with 16UNIX and 23UNII primers.
> Sequencing primers were UNI1 and UNI2 5/25/99^^
>
> I should remove 'and' 'in' 'with' 'The', etc. I have set up the stop
> words array. Is there an efficient way to do this?

Let's try:

    my %stop_words = map { $_ => 1 } qw(and in with The);

    my $text = <<'TEXT';
    Yersinia pestis strain Nepal (aka CDC 516 or 369 isolated from human)
    16S-23S intergenic region amplified with 16UNIX and 23UNII primers.
    Sequencing primers were UNI1 and UNI2 5/25/99^^
    TEXT

    my $clean_text = join " ", grep { !$stop_words{$_} } split /\W+/, $text;

(In fact, this will also remove all periods, commas and other
punctuation.)

If you don't want that, try instead:

    my $clean_text = join "", grep { !$stop_words{$_} } split /(\W+)/, $text;

The capturing parentheses make split return the separators as well, so
joining with the empty string puts the original punctuation and
whitespace back. Note that the lookup is case-sensitive: with the list
above, "The" is removed but "the" is not.

Best Wishes,
Janek
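
For the full set of about 20,000 records, the same idea can be wrapped in a subroutine and applied to each comment in turn. What follows is only a sketch under assumptions: the name strip_stop_words, the lower-cased stop list and the sample records are made up for illustration, and matching without regard to case is a choice the snippets above do not make.

```perl
use strict;
use warnings;

# Hypothetical stop list, lower-cased so that lookups can ignore case.
my %stop_words = map { lc($_) => 1 } qw(and in with the);

# Remove stop words from one comment while keeping its punctuation.
sub strip_stop_words {
    my ($text) = @_;

    # The capturing parentheses make split return the separators too,
    # so join '' restores the original punctuation and spacing.
    my @pieces = grep { !$stop_words{ lc $_ } } split /(\W+)/, $text;
    my $clean  = join '', @pieces;

    $clean =~ s/ {2,}/ /g;   # collapse runs of spaces (including any already in the text)
    $clean =~ s/^ +//;       # drop a leading space if the first word was removed
    return $clean;
}

# Hypothetical records; the real ones would come from the database or a file.
my @comments = (
    'The primers were UNI1 and UNI2.',
    'Amplified with 16UNIX in region 5/25/99',
);

print strip_stop_words($_), "\n" for @comments;
```

Because each word costs one hash lookup, this should stay fast for short comments at this volume; if the stop list is long, building %stop_words once outside the loop (as here) is the main thing to get right.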