On Fri, Nov 11, 2011 at 09:46:26PM -0800, Sumtingwong wrote:
> What is needed?  ;-)  A frequency count of each word in the input file
> for each file that was searched.  For example, the first word of the
> input file is "it".  Document one is searched for "it" and it shows up
> 248 times.  Optimal output would be (in tabbed columns):
> it     Document 1     248

Here's what I came up with:

#!/usr/local/bin/perl

use strict;
use warnings;

use Getopt::Long;

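GetOptions(
  "dictionary=s" => \ my $dictionary,
  "zero!"        => \ my $zero,
) or exit 1;

# The word list is required; fail with a usage message if -d is missing.
defined $dictionary
  or die "Usage: $0 -d word_list [-z] file ...\n";

open my $dict_fh, '<', $dictionary
  or die "Can't open '$dictionary': $!\n";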

my %dictionary;

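# Load the word list, one word per line.  Keys are lowercased so that
# the lookups below ignore case ("It" in the text counts toward "it").
while (<$dict_fh>) {
  chomp;
  next unless length;
  $dictionary{lc $_} = 0;
}

while (<>) {
  while (/([a-z]+(?:[\'\-][a-z]+)*)/ig) {
    my $word = lc $1;
    $dictionary{$word}++
      if exists $dictionary{$word};
  }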
} continue {
  if (eof) {
    foreach my $word (sort keys %dictionary) {
      print "$word\t$ARGV\t$dictionary{$word}\n"
        if $zero || $dictionary{$word};
      $dictionary{$word} = 0;
    }
  }
}
__END__

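The script reads the word list into a hash, then breaks up the text of each
file into words and increments the count of every word that appears in the
word list.  At the end of each file it prints the counts, sorted by word, and
resets them for the next file.  Words can contain inner apostrophes and/or
dashes.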
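
To see what the pattern treats as a word, here is a small standalone sketch
that applies the same regular expression to a sample string.  The sub name
split_words and the sample sentence are made up for illustration; they are
not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every word the script's pattern finds.
# In list context, a /g match with one capture group returns all captures.
sub split_words {
  my ($text) = @_;
  return $text =~ /([a-z]+(?:['\-][a-z]+)*)/ig;
}

# Inner apostrophes and dashes stay inside a word; leading or trailing
# ones, and a bare "--", are dropped.
print join("|", split_words("It's a 'well-known' fact -- isn't it?")), "\n";
# prints: It's|a|well-known|fact|isn't|it
```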

You would run it like so:

perl word_freq.pl -d my_word_list file1 file2 ...

If you want the output to include words with a count of zero, add -z before
the list of files.

Now, there is an important caveat.  You mentioned that you're working with
foreign language text.  If your words include characters outside of
[a-zA-Z\'\-], then the regular expression to find all the words in the text
will need to be modified.  It's important to make sure the regular
expression matches whole words.
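
One possible way to adapt the pattern, sketched here under the assumption
that your files are encoded as UTF-8: decode the input and match Unicode
letters with \p{L} instead of [a-z].  The sub name extract_words and the
sample sentence are hypothetical, and you may need a different character
class depending on the language.

```perl
use strict;
use warnings;
use utf8;                            # this file contains UTF-8 literals
use open qw(:std :encoding(UTF-8));  # assumption: files and STDOUT are UTF-8

# Hypothetical helper: letters in any script, optionally joined by
# single inner apostrophes or dashes.
sub extract_words {
  my ($text) = @_;
  return $text =~ /(\p{L}+(?:['\-]\p{L}+)*)/g;
}

print join("|", extract_words("Émile dit: l'été, c'est peut-être déjà fini.")), "\n";
# prints: Émile|dit|l'été|c'est|peut-être|déjà|fini
```

With the original [a-z] class, "été" would not be matched at all and "déjà"
would yield only the fragment "d", which is why the pattern has to cover
every letter your words can contain.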

Ronald
