Paul Lalli wrote:
On Jun 22, 12:48 pm, [EMAIL PROTECTED] (Andrej Kastrin) wrote:

I wrote a simple SQL query to count co-occurrences between words, but it
runs very slowly on large datasets. So, it's time to do it with
Perl. I need just a short tip to start out: which structure to use to
count all possible co-occurrences between letters (e.g. A, B and C) within
each document number. My dataset looks like the following:

1 A
1 B
1 C
1 B
2 A
2 A
2 B
2 C
etc., up to doc. number 100,000

The result file should then be similar to:
A B 4   ### 2 co-occurrences under doc. number 1 + 2 co-occurrences under doc. number 2
A C 3   ### 1 co-occurrence under doc. number 1 + 2 co-occurrences under doc. number 2
B C 3   ### 2 co-occurrences under doc. number 1 + 1 co-occurrence under doc. number 2

Maybe I'm just a little slow on the uptake, but I don't at all
understand the correlation between your sample input and sample
output.  Where did "A B 4" come from, and what does "2 co-occurrences"
under doc number 1 mean?  What is a co-occurrence? I see one
instance of "1 A", and two instances of "1 B".  How does that
translate to "2 co-occurrences" of "A B"?

Can you explain your desired goal a little better?

Paul Lalli


1. under document number 1, letter A co-occurs two times with letter B (there are two A-B pairs: one A and two Bs);
2. under document number 2, letter A co-occurs two times with letter B (there are two As and one B);
3. then you sum up and the result is 4 A-B pairs

Andrej
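
The counting rule described above (for each document, multiply the number of occurrences of one letter by the number of occurrences of the other, then sum over documents) can be sketched roughly as follows. This is only an illustration, written in Python rather than Perl; the function name count_cooccurrences and the variable names are made up for this example, and the input is assumed to be whitespace-separated "doc_id item" lines as in the sample. The same nested-hash idea (document => item => count) should carry over to Perl hashes.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_cooccurrences(lines):
    """Sum count(x) * count(y) over all documents for each unordered pair x < y.

    Hypothetical helper for illustration; assumes each line is "doc_id item".
    """
    # First pass: per-document tallies, i.e. doc_id -> item -> occurrences.
    per_doc = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        doc_id, item = fields
        per_doc[doc_id][item] += 1

    # Second pass: within each document, every pair of distinct items
    # contributes the product of their occurrence counts.
    pairs = Counter()
    for counts in per_doc.values():
        for x, y in combinations(sorted(counts), 2):
            pairs[(x, y)] += counts[x] * counts[y]
    return pairs


sample = """1 A
1 B
1 C
1 B
2 A
2 A
2 B
2 C""".splitlines()

result = count_cooccurrences(sample)
for (x, y), n in sorted(result.items()):
    print(x, y, n)
# prints:
# A B 4
# A C 3
# B C 3
```

If the real file is already grouped by document number, the per-document tally could probably be flushed each time the number changes instead of keeping all 100,000 documents in memory; that is an assumption about the data, not something stated in the thread.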
