The data.table package might be of use to you, but lacking a reproducible example [1] I will leave figuring out exactly how up to you.
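That said, the general shape of a grouped distinct-count in data.table might look something like this (a sketch only; word1 and word2 are the vector names from your post below, and I am assuming they are plain character vectors):

  library(data.table)
  # one row per observed pair, built from the vectors named in the post
  DT <- data.table(word1 = word1, word2 = word2)
  # uniqueN() counts distinct values; by = word1 computes each group
  # once instead of re-scanning the whole vector for every i
  DT[, typefreq.after1 := uniqueN(word2), by = word1]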

Being on Nabble, you may not be able to see the footer appended to every message on this mailing list. For your benefit, here it is:

* R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
* https://stat.ethz.ch/mailman/listinfo/r-help
* PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
* and provide commented, minimal, self-contained, reproducible code.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

On Mon, 8 Dec 2014, apeshifter wrote:

Dear all,

For the past two weeks, I've been working on a script to retrieve word pairs
and calculate some of their statistics using R. Everything seemed to work
fine until I switched from a small test dataset to the 'real thing' and
noticed what a runtime monster I had devised!

I could reduce processing time significantly when I realized that with R, I
did not have to do everything in loops and count things vector element by
vector element, but could just have the program count everything with
tables, e.g. with
 > freq.w1w2.2 <- table(all.word.pairs)[all.word.pairs]
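For illustration, here is a tiny self-contained version of that trick on toy data (the pairs here are invented, not from my corpus):

  all.word.pairs <- c("a b", "a b", "a c", "b c")
  # table() counts each distinct pair once; indexing the table by the
  # original vector expands the counts back to one value per element
  freq.w1w2.2 <- table(all.word.pairs)[all.word.pairs]
  # freq.w1w2.2 now holds the named counts 2, 2, 1, 1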

However, now I seem to have run into a performance problem that I cannot
solve. I hope there's a kind soul on this list who has some advice for me.
On to the problem:

The last relic of the aforementioned for-loop that goes through all the
word pairs and tries to calculate some statistics on them is the following
line of code:
 > typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))
(where word1 and word2 are the first and second words within each two-word
sequence in all.word.pairs, above).

Here I am trying to count the number of distinct second-word 'types',
linguistically speaking, that occur after the first word of each two-word
sequence (later, I do the same for the first word within the sequence). The
expression works, but given my ~400,000 word pairs (and as many word1's and
word2's), it takes quite some time: about 10 hours on my machine, in fact,
since R cannot use the other three of the four cores. Since I want to repeat
the process for another 20 corpora of similar size, I would definitely
appreciate some help on this subject.
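I suspect a grouped computation might be the right direction, something like the following base-R sketch (untested against my real data; it assumes word1 and word2 are plain character vectors of equal length):

  # count the distinct second words once per distinct first word
  type.counts <- tapply(word2, word1, function(x) length(unique(x)))
  # then expand the per-group results back to one value per word pair
  typefreq.after1 <- as.numeric(type.counts[word1])

That would call unique() once per distinct first word rather than once per pair.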

I have been trying 'typefreq.after1 <- table(unique(word2[word1]))[2]' and
the subset() function, and both seem to work (though I haven't checked
whether all the numbers are in fact calculated correctly), but they take
about the same amount of time, so that's no use to me.

Does anybody have any tips to speed this up?

Thank you very much!

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
