RE: remove the stop words

2002-06-04 Thread David Gray

 Is there a good method to do this? I need to remove the stop 
 words from the comment field of every record. There are about 
 20,000 records. The comments look like this: 
 
 Yersinia pestis strain Nepal (aka CDC 516 or 369 isolated 
 from human) 16S-23S in tergenic region amplified with 16UNIX 
 and 23UNII primers. Sequencing primers were UNI1 and UNI2   5/25/99^^
 
 I should remove 'and' 'in' 'with' 'The', etc. I have set up 
 the stop words array. Is there an efficient way to do this?

How about:

 code
 #!perl -w
 use strict;
 
 # ('') x 2 repeats the list, so both variables start as empty strings
 my ($r, $tmp) = ('') x 2;
 my $input = 'blah srand and spin in with within the their';
 my @s_words = qw(and in with the);
 
 # Build an alternation of the form \band\b|\bin\b|...
 for (@s_words) {
   $tmp .= "\\b$_\\b";
   $tmp .= '|' unless $_ eq $s_words[$#s_words];
 }
 $r = qr/$tmp/is;
 print $r;
 
 print "\n\n$input\n\n";
 $input =~ s/$r//g;
 print "$input\n";
 end

It builds a regex using your search words and then applies it to a
string.
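
The same idea can be written more compactly. The following is only a sketch, assuming the stop words are plain strings; the helper name remove_stop_words is made up for illustration. It joins the words into one alternation inside a single pair of \b anchors, escapes them with quotemeta, and collapses the gaps that the removed words leave behind.

```perl
#!perl -w
use strict;

# Hypothetical helper, not part of the original post.
sub remove_stop_words {
    my ($text, @stop_words) = @_;

    # quotemeta escapes any regex metacharacters in the words.
    my $alternation = join '|', map { quotemeta($_) } @stop_words;
    my $re = qr/\b(?:$alternation)\b/i;

    $text =~ s/$re//g;          # drop the stop words
    $text =~ s/\s{2,}/ /g;      # collapse the gaps left behind
    $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return $text;
}

my @s_words = qw(and in with the);
print remove_stop_words('blah srand and spin in with within the their', @s_words), "\n";
# prints: blah srand spin within their
```

Because of the \b anchors, words such as 'srand', 'within' and 'their' are left alone even though they contain a stop word.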

HTH,

 -dave



-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: remove the stop words

2002-06-04 Thread Janek Schleicher

Ying Liu wrote at Mon, 03 Jun 2002 17:11:22 +0200:

 Is there a good method to do this? I need to remove the stop words from the comment field of every
 record. There are about 20,000 records. The comments look like this:
 
 Yersinia pestis strain Nepal (aka CDC 516 or 369 isolated from human) 16S-23S in tergenic region
 amplified with 16UNIX and 23UNII primers. Sequencing primers were UNI1 and UNI2   5/25/99^^
 
 I should remove 'and' 'in' 'with' 'The', etc. I have set up the stop words array. Is there an
 efficient way to do this?
 

Let's try:

my %stop_words = map {$_ => 1} qw(and in with The);

my $text = <<'TEXT';
Yersinia pestis strain Nepal (aka CDC 516 or 369 isolated from human) 16S-23S in tergenic region
amplified with 16UNIX and 23UNII primers. Sequencing primers were UNI1 and UNI2   5/25/99^^
TEXT

my $clean_text = 
  join " ", grep {! $stop_words{$_}} split /\W+/, $text;

(In fact it will remove all dots, commas and so on, too.)
If you don't want that, try instead:

my $clean_text =
  join "", grep {! $stop_words{$_}} split /(\W+)/, $text;
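
For many records it may help to build the hash once and reuse it for every comment. The sketch below assumes that matching should ignore case, which the hash above does not do ('The' is a key, 'the' is not); the helper name strip_stop_words is made up for illustration.

```perl
use strict;
use warnings;

# Build the lookup table once; keys are lower-cased so 'The' and 'the' both match.
my %stop_words = map { lc($_) => 1 } qw(and in with The);

# Hypothetical helper, not part of the original reply.
sub strip_stop_words {
    my ($text) = @_;

    # The capturing group makes split return the separators as well,
    # so punctuation and spacing survive the join.
    my @pieces = grep { !$stop_words{ lc $_ } } split /(\W+)/, $text;

    my $clean = join '', @pieces;
    $clean =~ s/ {2,}/ /g;    # collapse the double spaces left by removed words
    return $clean;
}

print strip_stop_words('Sequencing primers were UNI1 and UNI2'), "\n";
# prints: Sequencing primers were UNI1 UNI2
```

Calling strip_stop_words once per comment field keeps the per-record cost to one split and one hash lookup per word.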


Best Wishes,
Janek





remove the stop words

2002-06-03 Thread Ying Liu


Hi,

Is there a good method to do this? I need to remove the stop words from the comment 
field of every record. There are about 20,000 records. The comments look like this: 

Yersinia pestis strain Nepal (aka CDC 516 or 369 isolated from human) 16S-23S in 
tergenic region amplified with 16UNIX and 23UNII primers. Sequencing primers were UNI1 
and UNI2   5/25/99^^

I should remove 'and' 'in' 'with' 'The', etc. I have set up the stop words array. Is 
there an efficient way to do this?

Thanks,

Ying Liu

 


