On Jun 22, 5:48 pm, [EMAIL PROTECTED] (Andrej Kastrin) wrote:
> Dear all,
>
> I wrote a simple SQL query to count co-occurrences between words, but it
> performs very slowly on large datasets. So, it's time to do it with
> Perl. I need just a short tip to start out: which structure to use to
> count all possible co-occurrences between letters (e.g. A, B and C) under
> a particular document number. My dataset looks like the following:
>
> 1 A
> 1 B
> 1 C
> 1 B
> 2 A
> 2 A
> 2 B
> 2 C
> etc. till doc. number 100.000
>
> The result file should then be similar to:
> A B 4 ### 2 co-occurrences under doc. number 1 + 2 co-occurrences
> under doc. number 2
> A C 3 ### 1 co-occurrence under doc. number 1 + 2 co-occurrences under
> doc. number 2
> B C 3 ### 2 co-occurrences under doc. number 1 + 1 co-occurrence under
> doc. number 2
>
> Thanks in advance for any pointers.
>
> Best, Andrej
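use strict;
use warnings;

my %pairs;    # "first second" => total co-occurrence count over all documents

{
    my ( $prev_doc_id, %word_count, $doc_id );

    # I've written this inner sub as anonymous even though in this
    # simple example script there's no outer sub. In the general case
    # there would be an outer sub, and Perl doesn't properly support
    # nested named subs.
    #
    # Folds the word counts of the document just finished into %pairs.
    # Every ordered pair is recorded, so both "A B" and "B A" appear in
    # the output.
    my $add_to_total = sub {
        for my $first ( keys %word_count ) {
            for my $second ( keys %word_count ) {
                unless ( $first eq $second ) {
                    $pairs{"$first $second"} +=
                        $word_count{$first} * $word_count{$second};
                }
            }
        }
        %word_count  = ();
        $prev_doc_id = $doc_id;
    };

    # Assumes the input is grouped by document number.
    while (<DATA>) {
        ( $doc_id, my $word ) = /^(\d+) (\w+)$/
            or die "Malformed input at line $.: $_";

        # A new document number means the previous document is complete.
        $add_to_total->()
            unless defined $prev_doc_id && $doc_id eq $prev_doc_id;
        $word_count{$word}++;
    }
    $add_to_total->();    # flush the last document
}

for ( sort keys %pairs ) {
    print "$_ $pairs{$_}\n";
}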
__DATA__
1 A
1 B
1 C
1 B
2 A
2 A
2 B
2 C
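
The script above records every pair in both orders ("A B" as well as "B A"), whereas the result file in the question lists each pair only once. If one line per unordered pair is wanted, a variant along the following lines might do. This is only a sketch: the helper name count_unordered_pairs is made up for illustration, and it assumes the whole dataset fits in memory as a hash of per-document word counts. In exchange, the input no longer has to be grouped by document number.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): takes a reference to
# an array of "doc_id word" lines and returns a hash reference mapping
# "first second" (with first lt second) to the total co-occurrence count.
sub count_unordered_pairs {
    my ($lines) = @_;
    my ( %per_doc, %pairs );

    # Count each word per document; input order does not matter here.
    for my $line (@$lines) {
        my ( $doc_id, $word ) = $line =~ /^(\d+)\s+(\w+)\s*$/
            or die "Malformed input: $line";
        $per_doc{$doc_id}{$word}++;
    }

    # Within each document, visit every pair of distinct words once,
    # with the pair in sorted order.
    for my $count ( values %per_doc ) {
        my @words = sort keys %$count;
        for my $i ( 0 .. $#words - 1 ) {
            for my $j ( $i + 1 .. $#words ) {
                $pairs{"$words[$i] $words[$j]"} +=
                    $count->{ $words[$i] } * $count->{ $words[$j] };
            }
        }
    }
    return \%pairs;
}

my $pairs = count_unordered_pairs(
    [ '1 A', '1 B', '1 C', '1 B', '2 A', '2 A', '2 B', '2 C' ] );
print "$_ $pairs->{$_}\n" for sort keys %$pairs;
```

On the sample data this prints "A B 4", "A C 3" and "B C 3", matching the result file described in the question. For a very large dataset, the streaming approach of the script above uses less memory, since it only keeps the counts for the current document.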