> -----Original Message-----
> From: William Dunlap
> Sent: Thursday, November 25, 2010 9:31 AM
> To: 'randomcz'; r-help@r-project.org
> Subject: RE: [R] help: program efficiency
>
> If the input vector t is known to be ordered
> (or if you only care about runs of duplicated
> values, not all duplicated values) the following
> is pretty quick
>
>   nodup3 <- function (t) {
>      t + (sequence(rle(t)$lengths) - 1)/100
>   }
>
> If you don't know if the input will be ordered
> then ave() will do it a bit faster than your
> code
>
>   nodup2 <- function (t) {
>      ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
>   }
>
> E.g., for a sorted sequence of 300,000 numbers drawn with
> replacement from 1:100,000 I get:
>
>   > a2 <- sort(sample(1:1e5, size=3e5, replace=TRUE))
>   > system.time(v <- nodup(a2))
>      user  system elapsed
>      2.78    0.05    3.97
>   > system.time(v2 <- nodup2(a2))
>      user  system elapsed
>      1.83    0.02    2.66
>   > system.time(v3 <- nodup3(a2))
>      user  system elapsed
>      0.18    0.00    0.14
>   > identical(v,v2) && identical(v,v3)
>   [1] TRUE
>
> If speed is truly an issue, the built-in sequence may
> be replaced by a faster one that does the same thing:
>
>   nodup3a <- function (t) {
>      faster.sequence <- function(nvec) {
>         seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
>      }
>      t + (faster.sequence(rle(t)$lengths) - 1)/100
>   }
>
> That took 0.05 seconds on the a2 dataset and produced
> identical results.
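
For readers following the quoted message, here is a small sketch of what the pieces of nodup3 produce on the vector from the original question, and how nodup3 (runs only) can differ from nodup2 (all duplicates) when the input is not sorted. The vector b and the names run_lengths, within_run, r3, b2 and b3 are made up for illustration; the two functions are copied from the quoted message.

```r
# Definitions copied from the quoted message
nodup2 <- function(t) ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
nodup3 <- function(t) t + (sequence(rle(t)$lengths) - 1)/100

a <- c(2, 1, 1, 3, 3, 3, 4)           # small vector from the original question
run_lengths <- rle(a)$lengths         # lengths of consecutive runs: 1 2 3 1
within_run  <- sequence(run_lengths)  # position inside each run: 1 1 2 1 2 3 1
r3 <- nodup3(a)                       # approximately 2 1 1.01 3 3.01 3.02 4

# Hypothetical unsorted input in which the value 1 recurs in separate runs
b  <- c(1, 2, 1)
b3 <- nodup3(b)   # only looks at runs, so b comes back unchanged: 1 2 1
b2 <- nodup2(b)   # groups all equal values: approximately 1 2 1.01
```

Note that, as in the quoted functions, the first element of each run or group keeps its original value; only later repeats are shifted by multiples of 0.01.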
rle() computes a sort of second difference, and nodup3a
then computes a cumsum on that second difference to get
back to a first difference.  The following avoids that
wasted operation (along with rle's computation of the
values component of its output).

  nodup4 <- function(t) {
     n <- length(t)
     p <- c(0L, which(t[-1L] != t[-n]), n)
     t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
  }

That reduced nodup3a's time by about 30% on that dataset.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> > -----Original Message-----
> > From: r-help-boun...@r-project.org
> > [mailto:r-help-boun...@r-project.org] On Behalf Of randomcz
> > Sent: Thursday, November 25, 2010 6:49 AM
> > To: r-help@r-project.org
> > Subject: [R] help: program efficiency
> >
> >
> > hey guys,
> >
> > I am working on a function to make a duplicated value unique.
> > For example,
> > the original vector would be like : a = c(2,1,1,3,3,3,4)
> > I'd like to transform it into:
> > a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
> > basically, find the duplicates and assign a unique value by
> > adding a small
> > amount and keep it in order.
> > I came up with the following code, but it runs slowly if t is
> > large. Is there
> > a better way to do it?
> > nodup = function(t)
> > {
> >   t.index=0
> >   t.dup=duplicated(t)
> >   for (i in 2:length(t))
> >   {
> >     if (t.dup[i]==T)
> >       t.index=t.index+0.01
> >     else t.index=0
> >     t[i]=t[i]+t.index
> >   }
> >   return(t)
> > }
> >
> >
> > --
> > View this message in context:
> > http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3059079.html
> > Sent from the R help mailing list archive at Nabble.com.
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
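
As a rough illustration of how nodup4 from this message finds run boundaries directly, the sketch below prints its intermediate values on a sorted copy of the vector from the original question, then checks it against nodup3 on random sorted data. The names run_start, offset and x, the seed, and the sample sizes are all assumptions chosen for the example, not part of the thread.

```r
# nodup3 and nodup4 as posted in this thread
nodup3 <- function(t) t + (sequence(rle(t)$lengths) - 1)/100
nodup4 <- function(t) {
   n <- length(t)
   p <- c(0L, which(t[-1L] != t[-n]), n)
   t + (seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p))) / 100
}

a <- sort(c(2, 1, 1, 3, 3, 3, 4))      # 1 1 2 3 3 3 4
n <- length(a)

# Index of the last element of each run, with a leading 0: 0 2 3 6 7
p <- c(0L, which(a[-1L] != a[-n]), n)

# Start index of the run each element belongs to: 1 1 3 4 4 4 7
run_start <- rep.int(p[-length(p)] + 1L, diff(p))

# Zero-based position of each element within its run: 0 1 0 0 1 2 0
offset <- seq_len(n) - run_start

# Compare the two versions on hypothetical random sorted input
set.seed(1)
x <- sort(sample(1:1000, size = 5000, replace = TRUE))
same <- isTRUE(all.equal(nodup4(x), nodup3(x)))
```

Here diff(p) gives the run lengths without ever building the values component that rle() returns, which is the saving described above.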