RE: remove the stop words
> Is there a good method to do this? I need to remove the stop words from the comment field of every record. There are about 20,000 records. The comments look like this:
>
> Yersinia pestis strain Nepal (aka CDC 516 or 369 isolated from human) 16S-23S in tergenic region amplified with 16UNIX and 23UNII primers. Sequencing primers were UNI1 and UNI2 5/25/99^^
>
> I should remove 'and' 'in' 'with' 'The', etc. I have set up the stop words array. Is there an efficient way to do this?

How about:

    #!perl -w
    use strict;

    my $input   = 'blah srand and spin in with within the their';
    my @s_words = qw(and in with the);

    # Join the stop words into a single alternation. The \b anchors keep
    # 'in' and 'with' from matching inside 'within', or 'the' inside 'their'.
    my $tmp = join '|', map { quotemeta($_) } @s_words;
    my $r   = qr/\b(?:$tmp)\b/i;

    print $r;
    print "\n\n$input\n\n";
    $input =~ s/$r//g;    # removes the words; the spaces around them remain
    print "$input\n";

It builds a regex from your stop words and then applies it to a string.

HTH,
-dave
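For the roughly 20,000 records mentioned in the question, the regex only has to be compiled once and can then be reused for every comment. The following is a minimal sketch of that idea, not a tested solution for the real data: the `strip_stop_words` helper and the `@comments` array are hypothetical stand-ins, and the whitespace cleanup is an addition that the code above does not perform.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @stop_words = qw(and in with the);

# Compile the alternation once; quotemeta guards against regex metacharacters.
my $alt     = join '|', map { quotemeta($_) } @stop_words;
my $stop_re = qr/\b(?:$alt)\b/i;

# Hypothetical helper: strip stop words from one comment and tidy whitespace.
sub strip_stop_words {
    my ($text) = @_;
    $text =~ s/$stop_re//g;
    $text =~ s/\s{2,}/ /g;      # collapse the gaps left by removed words
    $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return $text;
}

# Hypothetical stand-in for the comment fields of the real records.
my @comments = (
    'Sequencing primers were UNI1 and UNI2',
    'blah srand and spin in with within the their',
);

print strip_stop_words($_), "\n" for @comments;
```

Because of the `\b` anchors, words such as "within" and "their" are left alone even though they contain stop words.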
Re: remove the stop words
Ying Liu wrote at Mon, 03 Jun 2002 17:11:22 +0200:

> Is there a good method to do this? I need to remove the stop words from the comment field of every record. There are about 20,000 records. The comments look like this:
>
> Yersinia pestis strain Nepal (aka CDC 516 or 369 isolated from human) 16S-23S in tergenic region amplified with 16UNIX and 23UNII primers. Sequencing primers were UNI1 and UNI2 5/25/99^^
>
> I should remove 'and' 'in' 'with' 'The', etc. I have set up the stop words array. Is there an efficient way to do this?

Let's try:

    # Hash lookup is case-sensitive: 'The' is a stop word here, 'the' is not.
    my %stop_words = map { $_ => 1 } qw(and in with The);

    my $text = <<'TEXT';
    Yersinia pestis strain Nepal (aka CDC 516 or 369 isolated from human) 16S-23S in tergenic region amplified with 16UNIX and 23UNII primers. Sequencing primers were UNI1 and UNI2 5/25/99^^
    TEXT

    # Split on runs of non-word characters and keep only the non-stop words.
    my $clean_text = join " ", grep { !$stop_words{$_} } split /\W+/, $text;

(In fact, it will remove all dots, commas and so on, too.)

If you don't want that, try instead:

    # The capturing group makes split return the separators as well,
    # so punctuation and spacing survive the join.
    my $clean_text = join "", grep { !$stop_words{$_} } split /(\W+)/, $text;

Best Wishes,
Janek
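The hash-lookup approach above can be made case-insensitive by lowercasing both the keys and each token before the lookup. What follows is only a sketch under assumptions: the `remove_stop_words` helper name and the sample sentence are hypothetical, and the space-squeezing step is an addition, needed because the separators on both sides of a removed word are kept.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Store the stop words in lowercase so 'The', 'the' and 'THE' all match.
my %stop_words = map { lc($_) => 1 } qw(and in with The);

# Hypothetical helper: drop stop words but keep punctuation.
sub remove_stop_words {
    my ($text) = @_;

    # The capturing group keeps the separators in the token list.
    my @tokens = split /(\W+)/, $text;
    my $clean  = join '', grep { !$stop_words{ lc($_) } } @tokens;

    $clean =~ s/ {2,}/ /g;    # squeeze the spaces left around removed words
    $clean =~ s/^\s+//;       # drop the gap left by a leading stop word
    return $clean;
}

# Hypothetical sample, modelled on the comment quoted in the question.
my $text = 'The 16S-23S region amplified with 16UNIX and 23UNII primers.';
print remove_stop_words($text), "\n";
```

A hash lookup per token is constant time, so the cost grows with the length of the text rather than with the size of the stop word list.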
remove the stop words
Hi,

Is there a good method to do this? I need to remove the stop words from the comment field of every record. There are about 20,000 records. The comments look like this:

Yersinia pestis strain Nepal (aka CDC 516 or 369 isolated from human) 16S-23S in tergenic region amplified with 16UNIX and 23UNII primers. Sequencing primers were UNI1 and UNI2 5/25/99^^

I should remove 'and' 'in' 'with' 'The', etc. I have set up the stop words array. Is there an efficient way to do this?

Thanks,
Ying Liu