Re: [R] how to efficiently compute set unique?
Hi All,

I think I have figured out what the problem is. I have been a Matlab user, so in all my code I kept everything in as.matrix format, on which unique is much slower. I tried skipping the as.matrix conversion, and now unique, as well as the other computations, takes just a few seconds.

Thanks a lot Duncan, Steve, David, and Douglas.

Hopefully this case can also help future Matlab->R users who get stuck in the Matlab way of thinking.

Gang

On Mon, Jun 21, 2010 at 7:01 PM, Douglas Bates wrote:
> On Mon, Jun 21, 2010 at 8:38 PM, David Winsemius wrote:
>>
>> On Jun 21, 2010, at 9:18 PM, Duncan Murdoch wrote:
>>
>>> On 21/06/2010 9:06 PM, G FANG wrote:
>>>>
>>>> Hi,
>>>>
>>>> I want to get the unique set from a large numeric k by 1 vector, k is
>>>> in tens of millions
>>>>
>>>> when I used the matlab function unique, it takes less than 10 secs
>>>>
>>>> but when I tried to use the unique in R with similar CPU and memory,
>>>> it is not done in minutes
>>>>
>>>> I am wondering, am I using the function in the right way?
>>>>
>>>> dim(cntxtn)
>>>> [1] 13584763 1
>>>> uniqueCntxt = unique(cntxtn); # this is taking really long
>>>
>>> What type is cntxtn? If I do that sort of thing on a numeric vector, it's
>>> quite fast:
>>>
>>> > x <- sample(10, size=13584763, replace=T)
>>> > system.time(unique(x))
>>>    user  system elapsed
>>>    3.61    0.14    3.75
>>
>> If it's a factor, it could be as simple as:
>>
>> levels(cntxtn) # since the work of "unique-ification" has already been
>> done.
>
> Not quite. When you generate a factor, as you do in your example, the
> levels correspond to the unique values of the original vector. But
> when you take a subset of a factor the levels are preserved intact,
> even if some of those levels do not occur in the subset. This is why
> there are unusual arguments with names like drop.unused.levels in
> functions like model.frame. It is also a subtle difference in the
> behavior of factor(x) and as.factor(x) when x is already a factor.
>
> > ff <- factor(sample.int(200, 1000, replace = TRUE))
> > ff1 <- ff[1:40]
> > length(levels(ff))
> [1] 199
> > length(levels(ff1))
> [1] 199
> > length(levels(as.factor(ff1)))
> [1] 199
> > length(levels(factor(ff1)))
> [1] 34
>
>> > x <- factor(sample(10, size=13584763, replace=T))
>> > system.time(levels(x))
>>    user  system elapsed
>>       0       0       0
>> > system.time(y <- levels(x))
>>    user  system elapsed
>>       0       0       0

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to efficiently compute set unique?
On Mon, Jun 21, 2010 at 8:38 PM, David Winsemius wrote:
>
> On Jun 21, 2010, at 9:18 PM, Duncan Murdoch wrote:
>
>> On 21/06/2010 9:06 PM, G FANG wrote:
>>>
>>> Hi,
>>>
>>> I want to get the unique set from a large numeric k by 1 vector, k is
>>> in tens of millions
>>>
>>> when I used the matlab function unique, it takes less than 10 secs
>>>
>>> but when I tried to use the unique in R with similar CPU and memory,
>>> it is not done in minutes
>>>
>>> I am wondering, am I using the function in the right way?
>>>
>>> dim(cntxtn)
>>> [1] 13584763 1
>>> uniqueCntxt = unique(cntxtn); # this is taking really long
>>
>> What type is cntxtn? If I do that sort of thing on a numeric vector, it's
>> quite fast:
>>
>> > x <- sample(10, size=13584763, replace=T)
>> > system.time(unique(x))
>>    user  system elapsed
>>    3.61    0.14    3.75
>
> If it's a factor, it could be as simple as:
>
> levels(cntxtn) # since the work of "unique-ification" has already been
> done.

Not quite. When you generate a factor, as you do in your example, the
levels correspond to the unique values of the original vector. But
when you take a subset of a factor the levels are preserved intact,
even if some of those levels do not occur in the subset. This is why
there are unusual arguments with names like drop.unused.levels in
functions like model.frame. It is also a subtle difference in the
behavior of factor(x) and as.factor(x) when x is already a factor.

> ff <- factor(sample.int(200, 1000, replace = TRUE))
> ff1 <- ff[1:40]
> length(levels(ff))
[1] 199
> length(levels(ff1))
[1] 199
> length(levels(as.factor(ff1)))
[1] 199
> length(levels(factor(ff1)))
[1] 34

> > x <- factor(sample(10, size=13584763, replace=T))
> > system.time(levels(x))
>    user  system elapsed
>       0       0       0
> > system.time(y <- levels(x))
>    user  system elapsed
>       0       0       0
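The difference between levels(), as.factor() and factor() described above can be reproduced on a tiny factor. The following is only a minimal sketch; the names ff and sub and the four-letter data are made up for illustration and are not from the thread.

```r
# Illustrative data (not from the thread): four levels, six values.
ff  <- factor(c("a", "b", "c", "d", "a", "b"))
sub <- ff[1:2]              # only "a" and "b" occur in the subset

levels(sub)                 # subsetting keeps all four levels
levels(as.factor(sub))      # as.factor() returns an existing factor unchanged
levels(factor(sub))         # factor() rebuilds levels from the values present
as.character(unique(sub))   # the values that actually occur in the subset
```

So levels() is a safe shortcut for unique() only when the factor is known to carry no unused levels; otherwise factor(sub) or unique(sub) gives the values that are really present.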
Re: [R] how to efficiently compute set unique?
The original question was about a matrix, not a vector, and unique() on a matrix is much slower:

x <- sample(10, size=13584763, replace=T)
dim(x) <- c(13584763, 1)
system.time(unique(x))

So the solution would be:

unique(as.vector(x))

>>> From: Duncan Murdoch
To: G FANG
CC:
Date: 22/Jun/2010 1:20p
Subject: Re: [R] how to efficiently compute set unique?

On 21/06/2010 9:06 PM, G FANG wrote:
> Hi,
>
> I want to get the unique set from a large numeric k by 1 vector, k is
> in tens of millions
>
> when I used the matlab function unique, it takes less than 10 secs
>
> but when I tried to use the unique in R with similar CPU and memory,
> it is not done in minutes
>
> I am wondering, am I using the function in the right way?
>
> dim(cntxtn)
> [1] 13584763 1
> uniqueCntxt = unique(cntxtn); # this is taking really long

What type is cntxtn? If I do that sort of thing on a numeric vector, it's quite fast:

> x <- sample(10, size=13584763, replace=T)
> system.time(unique(x))
   user  system elapsed
   3.61    0.14    3.75
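The matrix-versus-vector point above can be checked on a smaller example. This is a sketch only: n is reduced from the thread's 13584763 so that it runs quickly, the names m, u_vec, u_mat and u_fix are illustrative, and the timings will vary by machine and R version. It assumes, as base R documents, that unique() is generic and that a matrix is handled by the row-wise matrix method rather than the default vector method.

```r
# Smaller n than in the thread, so the matrix case finishes quickly.
set.seed(1)
n <- 1e5
x <- sample(10, size = n, replace = TRUE)   # plain integer vector
m <- x
dim(m) <- c(n, 1)                           # same data as an n-by-1 matrix

u_vec <- unique(x)              # default method: works on the vector directly
u_mat <- unique(m)              # matrix method: compares rows, returns a matrix
u_fix <- unique(as.vector(m))   # drop the dim attribute first, then unique

is.matrix(u_mat)                # the matrix method keeps the matrix shape
identical(u_fix, u_vec)         # same result as on the original vector

# Compare timings; the gap grows with n.
system.time(unique(m))
system.time(unique(as.vector(m)))
```

The two calls return the same distinct values; only the shape of the result and the amount of work differ.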
Re: [R] how to efficiently compute set unique?
On Jun 21, 2010, at 9:18 PM, Duncan Murdoch wrote:

> On 21/06/2010 9:06 PM, G FANG wrote:
>> Hi,
>>
>> I want to get the unique set from a large numeric k by 1 vector, k is
>> in tens of millions
>>
>> when I used the matlab function unique, it takes less than 10 secs
>>
>> but when I tried to use the unique in R with similar CPU and memory,
>> it is not done in minutes
>>
>> I am wondering, am I using the function in the right way?
>>
>> dim(cntxtn)
>> [1] 13584763 1
>> uniqueCntxt = unique(cntxtn); # this is taking really long
>
> What type is cntxtn? If I do that sort of thing on a numeric vector, it's
> quite fast:
>
> > x <- sample(10, size=13584763, replace=T)
> > system.time(unique(x))
>    user  system elapsed
>    3.61    0.14    3.75

If it's a factor, it could be as simple as:

levels(cntxtn) # since the work of "unique-ification" has already been done.

> x <- factor(sample(10, size=13584763, replace=T))
> system.time(levels(x))
   user  system elapsed
      0       0       0
> system.time(y <- levels(x))
   user  system elapsed
      0       0       0

--
David Winsemius, MD
West Hartford, CT
Re: [R] how to efficiently compute set unique?
On 21/06/2010 9:06 PM, G FANG wrote:
> Hi,
>
> I want to get the unique set from a large numeric k by 1 vector, k is
> in tens of millions
>
> when I used the matlab function unique, it takes less than 10 secs
>
> but when I tried to use the unique in R with similar CPU and memory,
> it is not done in minutes
>
> I am wondering, am I using the function in the right way?
>
> dim(cntxtn)
> [1] 13584763 1
> uniqueCntxt = unique(cntxtn); # this is taking really long

What type is cntxtn? If I do that sort of thing on a numeric vector, it's quite fast:

> x <- sample(10, size=13584763, replace=T)
> system.time(unique(x))
   user  system elapsed
   3.61    0.14    3.75
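One way to answer the "what type is it?" question before timing anything is to look at the class, storage type and dimensions of the object. The sketch below uses a hypothetical helper, describe_object, which is not part of R or of the thread, and small stand-in objects in place of cntxtn.

```r
# Stand-ins for the object in question (cntxtn itself is not available here).
x <- sample(10, size = 1e5, replace = TRUE)   # plain vector
m <- as.matrix(x)                             # 100000-by-1 matrix
f <- factor(x)                                # factor

# Hypothetical helper: collects the attributes that decide how unique() behaves.
describe_object <- function(obj) {
  list(class = class(obj), typeof = typeof(obj), dim = dim(obj))
}

str(describe_object(x))   # no dim: treated as a plain vector
str(describe_object(m))   # has a dim: treated as a matrix
str(describe_object(f))   # class "factor": distinct values live in levels()
```

A non-NULL dim is the sign that unique() will treat the object as a matrix rather than as a vector.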