On 8 Dec 2014, at 21:21, apeshifter <ch_k...@gmx.de> wrote:

> The last relic of the afore-mentioned for-loop that goes through all the
> word pairs and tries to calculate some statistics on them is the following
> line of code:
>> typefreq.after1[i]<-length(unique(word2[which(word1==word1[i])]))
> (where word1 and word2 are the first and second word within the two-word
> sequence (all.word.pairs, above))

It is difficult to tell without a fully reproducible example, but from this 
code I get the impression that word1 and word2 represent word pair _tokens_ 
rather than pair _types_ (otherwise you wouldn't need the unique()).  That's a 
very inefficient way of dealing with co-occurrence data, especially since 
you've already computed the set of pair types in order to get the co-occurrence 
counts.

If word1, word2 are type vectors (i.e. every pair occurs just once), then this 
should give you what you want:

        tapply(BB$word2, BB$word1, length)
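For instance, with a toy pair-type table (the data frame BB and its values here are purely illustrative):

```r
# Toy pair-type table: each (word1, word2) combination occurs exactly once
BB <- data.frame(
  word1 = c("the", "the", "a"),
  word2 = c("cat", "dog", "cat"),
  stringsAsFactors = FALSE
)

# With type vectors, counting rows per word1 equals counting distinct word2 values
after1 <- tapply(BB$word2, BB$word1, length)
after1
#   a the
#   1   2
```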

If they are token vectors, you need to supply your own type-counting function,
which will be a bit slower:

        tapply(BB$word2, BB$word1, function (x) length(unique(x)))

On my machine, this takes about 0.2s for 770,000 word pairs.
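To see the token-vector version in action, here is a minimal sketch on toy data (repeated rows stand for multiple occurrences of the same pair; unique() makes sure each pair type is counted only once per word1):

```r
# Toy token data: one row per occurrence of a word pair,
# so the same (word1, word2) pair can appear several times
BB <- data.frame(
  word1 = c("the", "the", "the", "a", "a"),
  word2 = c("cat", "dog", "cat", "cat", "cat"),
  stringsAsFactors = FALSE
)

# Number of distinct right-hand neighbours per word1 (each type counted once)
after1 <- tapply(BB$word2, BB$word1, function (x) length(unique(x)))
after1
#   a the
#   1   2
```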


BTW, you might want to take a look at Unit 4 of the SIGIL course

        http://sigil.r-forge.r-project.org/

which has some tips on how you can deal efficiently with co-occurrence data in 
R.

Best,
Stefan

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.