Ying Liu wrote at Mon, 03 Jun 2002 17:11:22 +0200:

> Is there a good method to do this? I need to remove the stop words from the
> comment field of every record. There are about 20,000 records. The comments
> look like this:
>
> Yersinia pestis strain Nepal (aka CDC 516 or 369 isolated from human)
> 16S-23S in tergenic region amplified with 16UNIX and 23UNII primers.
> Sequencing primers were UNI1 and UNI2   5/25/99^^
>
> I should remove 'and' 'in' 'with' 'The', etc. I have set up the stop words
> array. Is there an efficient way to do this?
>

Let's try:

my %stop_words = map {$_ => 1} qw(and in with The);

my $text = <<'TEXT';
Yersinia pestis strain Nepal (aka CDC 516 or 369 isolated from human)
16S-23S in tergenic region amplified with 16UNIX and 23UNII primers.
Sequencing primers were UNI1 and UNI2   5/25/99^^
TEXT

my $clean_text = 
  join " ", grep {! $stop_words{$_}} split /\W+/, $text;

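Note that this also removes all periods, commas and other punctuation,
because split /\W+/ throws the separators away and the words are rejoined
with single spaces. If you don't want that, put the pattern in capturing
parentheses so that split returns the separators too, and join with an
empty string instead: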

my $clean_text =
  join "", grep {! $stop_words{$_}} split /(\W+)/, $text;
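
One thing to watch with the second version: when a word is dropped, the
separators on both sides of it end up next to each other, so you get doubled
spaces. The lookup is also case-sensitive, so 'The' is removed but 'the' is
not. Here is a minimal sketch that handles both. The sub name
remove_stop_words and the sample sentence are made up for illustration, and
lowercasing the stop list is my assumption about what you want, not
something from your post:

```perl
use strict;
use warnings;

# Stop words stored in lowercase so the lookup can ignore case
# (assumption: 'The' and 'the' should both be removed).
my %stop_words = map { lc($_) => 1 } qw(and in with The);

# Hypothetical helper: removes stop words but keeps punctuation.
sub remove_stop_words {
    my ($text) = @_;

    # The capturing group makes split return the separators as well,
    # so joining with "" restores the original punctuation.
    my $clean = join "", grep { !$stop_words{ lc($_) } } split /(\W+)/, $text;

    # A removed word leaves its two neighbouring separators side by side;
    # collapse runs of spaces/tabs and trim leading whitespace.
    $clean =~ s/[ \t]{2,}/ /g;
    $clean =~ s/^\s+//;

    return $clean;
}

print remove_stop_words("The primers were UNI1 and UNI2, amplified with 16UNIX."), "\n";
```

For 20,000 records the hash lookup is cheap; just build %stop_words once
outside the loop and call the sub for each comment field.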


Best Wishes,
Janek
