On 8 Dec 2014, at 21:21, apeshifter <ch_k...@gmx.de> wrote:

> The last relic of the afore-mentioned for-loop that goes through all the
> word pairs and tries to calculate some statistics on them is the following
> line of code:
>
>     typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))
>
> (where word1 and word2 are the first and second word within the two-word
> sequence (all.word.pairs, above))
It is difficult to tell without a fully reproducible example, but from this
code I get the impression that word1 and word2 represent word pair _tokens_
rather than pair _types_ (otherwise you wouldn't need the unique()). That's
a very inefficient way of dealing with co-occurrence data, especially since
you've already computed the set of pair types in order to get the
co-occurrence counts.

If word1 and word2 are type vectors (i.e. every pair occurs just once),
then this should give you what you want:

    tapply(BB$word2, BB$word1, length)

If they are token vectors, you need to supply your own type-counting
function, which will be a bit slower:

    tapply(BB$word2, BB$word1, function (x) length(unique(x)))

On my machine, this takes about 0.2 s for 770,000 word pairs.

BTW, you might want to take a look at Unit 4 of the SIGIL course

    http://sigil.r-forge.r-project.org/

which has some tips on how you can deal efficiently with co-occurrence
data in R.

Best,
Stefan

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
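To make the token/type distinction above concrete, here is a minimal, self-contained sketch. The data frame below is invented toy data; the names BB, word1 and word2 merely mirror the objects discussed in the thread, not the poster's actual data.

```r
# Toy token-level pair data: each row is one occurrence of a word pair,
# so the same (word1, word2) combination can appear several times.
BB <- data.frame(
  word1 = c("the", "the", "the", "a", "a"),
  word2 = c("cat", "cat", "dog", "cat", "mouse"),
  stringsAsFactors = FALSE
)

# Token counts: how many pair tokens begin with each word1.
token.freq <- tapply(BB$word2, BB$word1, length)

# Type counts: how many *distinct* word2 follow each word1,
# using a custom counting function as suggested above.
type.freq <- tapply(BB$word2, BB$word1, function (x) length(unique(x)))

token.freq  # "the" occurs in 3 pair tokens, "a" in 2
type.freq   # "the" is followed by 2 distinct words (cat, dog)
```

Both calls do one grouped pass over the data instead of re-scanning word1 inside a loop, which is where the speed-up over the original for-loop comes from.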