Re: [R] help: program efficiency
So in this example, it seems more efficient to sort first and use the algorithm that assumes the data is sorted. There is probably a way to be smarter in nodup_cpp, where the bottleneck is likely related to map::find.

If you just use a hash table, std::map should work too; I don't see what there is to sort, see my earlier post. You do however need to be careful about sum-of-pieces timing, especially if you ever end up in a VM. Memory coherence can be a big deal, and removing a sort can slow other things down later in some cases. I hate to ask, but those variables foo[i] are not maps, are they? If you care about efficiency you should be using arrays here; IIRC map has to handle these as sparse arrays, and that slows things down. However, if you made a map of prior occurrences of each value, foo[v[i]], that may be faster than doing a sort; hard to say.

Profiling reveals this:

  Rprof()
  for(i in 1:100) {
      res6 <- nodup_cpp_hybrid( x, sort.list(x) )
  }
  Rprof(NULL)
  summaryRprof()

  $by.self
              self.time self.pct total.time total.pct
  sort.list        6.50    90.03       6.50     90.03
  .Call            0.42     5.82       0.42      5.82
  file.exists      0.30     4.16       0.30      4.16

  $by.total
                   total.time total.pct self.time self.pct
  nodup_cpp_hybrid       7.22    100.00      0.00     0.00
  sort.list              6.50     90.03      6.50    90.03
  .Call                  0.42      5.82      0.42     5.82
  file.exists            0.30      4.16      0.30     4.16

  $sample.interval
  [1] 0.02

  $sampling.time
  [1] 7.22

The 4.16% taken by file.exists indicates that someone in the inline project has to do some work (on my TODO list). I've never used the R profiler, but according to the docs this is wall-clock time, so time spent blocking for IO may dominate depending on how the filesystem behaves. I often point out that IO can dominate things everyone expects to be CPU-bound; this often comes up with cygwin, where you have another layer of stuff over the OS, but it can happen anywhere. But otherwise sort.list dominates the time.
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] help: program efficiency
-----Original Message-----
From: William Dunlap
Sent: Thursday, November 25, 2010 9:31 AM
To: 'randomcz'; r-help@r-project.org
Subject: RE: [R] help: program efficiency

If the input vector t is known to be ordered (or if you only care about runs of duplicated values, not all duplicated values) the following is pretty quick:

  nodup3 <- function (t) {
      t + (sequence(rle(t)$lengths) - 1)/100
  }

If you don't know whether the input will be ordered, then ave() will do it a bit faster than your code:

  nodup2 <- function (t) {
      ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
  }

E.g., for a sorted sequence of 300,000 numbers drawn with replacement from 1:100,000 I get:

  a2 <- sort(sample(1:1e5, size=3e5, replace=TRUE))
  system.time(v <- nodup(a2))
     user  system elapsed
     2.78    0.05    3.97
  system.time(v2 <- nodup2(a2))
     user  system elapsed
     1.83    0.02    2.66
  system.time(v3 <- nodup3(a2))
     user  system elapsed
     0.18    0.00    0.14
  identical(v,v2)
  identical(v,v3)
  [1] TRUE

If speed is truly an issue, the built-in sequence may be replaced by a faster one that does the same thing:

  nodup3a <- function (t) {
      faster.sequence <- function(nvec) {
          seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
      }
      t + (faster.sequence(rle(t)$lengths) - 1)/100
  }

That took 0.05 seconds on the a2 dataset and produced identical results.

rle() computes a sort of second difference, and nodup3a computes a cumsum on that second difference to get back to a first difference. The following avoids that wasted operation (along with rle's computation of the values component of its output):

  nodup4 <- function(t) {
      n <- length(t)
      p <- c(0L, which(t[-1L] != t[-n]), n)
      t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
  }

That reduced nodup3a's time by about 30% on that dataset.
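As a quick side-by-side check (my addition, not part of the original message), the two run-based variants can be compared on a small sorted vector. Note that both number the first element of each run with offset .00, unlike the .01-based output in the original request:

```r
# Bill Dunlap's run-based variants, restated for a small sorted input
nodup3 <- function(t) t + (sequence(rle(t)$lengths) - 1)/100

nodup4 <- function(t) {
  n <- length(t)
  p <- c(0L, which(t[-1L] != t[-n]), n)
  t + (seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)))/100
}

a <- c(1, 1, 2, 3, 3, 3, 4)   # duplicates must be adjacent (sorted input)
nodup3(a)                     # 1.00 1.01 2.00 3.00 3.01 3.02 4.00
stopifnot(identical(nodup3(a), nodup4(a)))
```

Both functions vectorize the per-run counting, which is why they beat the element-by-element loop by such a margin.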
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
Re: [R] help: program efficiency
See if this works for you.

  a <- c(2,1,1,3,3,3,4)
  a.fac <- as.factor(a)
  b <- split(a, f = a.fac)
  system.time(lapply(X = b, FUN = function(x) {
      swn <- seq(from = 0, to = 0 + 0.01*length(x), by = 0.01)
      out <- x + swn
      return(out)
  }))

Cheers,
Roman
Re: [R] help: program efficiency
Oops, tiny mistake. Try

  lapply(X = b, FUN = function(x) {
      swn <- seq(from = 0, to = (0 + 0.01*length(x)) - 0.01, by = 0.01)
      out <- x + swn
      return(out)
  })
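One detail this snippet leaves open is getting the per-group results back into the original element order; a small sketch (my addition, not from the thread) using unsplit(), which inverts split():

```r
# Reassemble Roman's per-group offsets into the original element order
a <- c(2, 1, 1, 3, 3, 3, 4)
b <- split(a, a)                      # groups of equal values
res <- lapply(b, function(x) x + seq(0, by = 0.01, length.out = length(x)))
unsplit(res, a)                       # 2.00 1.00 1.01 3.00 3.01 3.02 4.00
```

Note the offsets here start at .00 for the first occurrence of each value, unlike the .01-based numbering in the original request.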
Re: [R] help: program efficiency
Hello,

Can we really make the assumption that the data is sorted? The original example was not:

  I am working on a function to make a duplicated value unique.
  For example, the original vector would be like:
  a = c(2,1,1,3,3,3,4)

If we can make the assumption, here is a C++ based version:

  nodup_cpp_assumingsorted <- cxxfunction( signature( x_ = "numeric" ), '
      // since we modify x, we need to make a copy
      NumericVector x = clone<NumericVector>(x_) ;
      int n = x.size() ;
      double current, previous = x[0] ;
      int index = 0 ;
      for( int i = 1; i < n; i++ ){
          current = x[i] ;
          if( current == previous ){
              x[i] = current + (++index) / 100.0 ;
          } else {
              index = 0 ;
          }
          previous = current ;
      }
      return x ;
  ', plugin = "Rcpp" )

with these results:

  x <- sort( sample( 1:10, size = 30, replace = TRUE ) )
  system.time( nodup3( x ) )
     user  system elapsed
    0.090   0.004   0.094
  system.time( nodup3a( x ) )
     user  system elapsed
    0.036   0.005   0.040
  system.time( nodup4( x ) )
     user  system elapsed
    0.025   0.004   0.029
  system.time( nodup_cpp_assumingsorted( x ) )
     user  system elapsed
    0.003   0.001   0.004

Now, if we don't make the assumption that the data is sorted, here is another C++ based version:

  require( inline )
  require( Rcpp )
  nodup_cpp <- cxxfunction( signature( x_ = "numeric" ), '
      // since we modify x, we need to make a copy
      NumericVector x = clone<NumericVector>(x_) ;
      typedef std::map<double,int> imap ;
      typedef imap::value_type pair ;
      imap index ;
      int n = x.size() ;
      double current, previous = x[0] ;
      index.insert( pair( previous, 0 ) ) ;
      imap::iterator it = index.begin() ;
      for( int i = 1; i < n; i++ ){
          current = x[i] ;
          if( current == previous ){
              x[i] = current + ( ++(it->second) / 100.0 ) ;
          } else {
              it = index.find(current) ;
              if( it == index.end() ){
                  it = index.insert( current > previous ? it : index.begin(),
                                     pair( current, 0 ) ) ;
              } else {
                  x[i] = current + ( ++(it->second) / 100.0 ) ;
              }
              previous = current ;
          }
      }
      return x ;
  ', plugin = "Rcpp" )

which gives me this:

  x <- sample( 1:10, size = 30, replace = TRUE )
  system.time( nodup_cpp( x ) )
     user  system elapsed
    0.111   0.002   0.113
  system.time( nodup3( sort( x ) ) )
     user  system elapsed
    0.162   0.011   0.172
  system.time( nodup3a( sort( x ) ) )
     user  system elapsed
    0.099   0.009   0.109
  system.time( nodup4( sort( x ) ) )
     user  system elapsed
    0.089   0.004   0.094

so nodup4 is still faster, but the values are not in the right order:

  x <- c( 2, 1, 1, 2 )
  nodup4( sort( x ) )
  [1] 1.00 1.01 2.00 2.01
  nodup_cpp( x )
  [1] 2.00 1.00 1.01 2.01

Romain

On 26/11/10 20:01, William Dunlap wrote:
[...]
Re: [R] help: program efficiency
On 26/11/10 21:13, Romain Francois wrote:
[...]
  so nodup4 is still faster, but the values are not in the right order:

    x <- c( 2, 1, 1, 2 )
    nodup4( sort( x ) )
    [1] 1.00 1.01 2.00 2.01
    nodup_cpp( x )
    [1] 2.00 1.00 1.01 2.01

I think this gives a fairer comparison:

  system.time( nodup_cpp( x ) )
     user  system elapsed
    0.113   0.002   0.114
  system.time( { oo <- order(order(x)) ; nodup3( sort( x ) )[oo] } )
     user  system elapsed
    0.336   0.012   0.347
  system.time( { oo <- order(order(x)) ; nodup3a( sort( x ) )[oo] } )
     user  system elapsed
    0.251   0.011   0.262
  system.time( { oo <- order(order(x)) ; nodup4( sort( x ) )[oo] } )
     user  system elapsed
    0.287   0.006   0.294

Romain

On 26/11/10 20:01, William Dunlap wrote:
[...]
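Why order(order(x)) is the right fix-up here (my note, not from the thread): order(order(x)) is the rank permutation of x, so indexing the sorted vector with it restores the original order:

```r
# order(order(x)) maps positions in sort(x) back to positions in x
x <- c(2, 1, 1, 2)
oo <- order(order(x))
sort(x)[oo]                         # gives back 2 1 1 2
stopifnot(identical(sort(x)[oo], x))
```

This is why the sorted-input variants can be made order-preserving at the cost of two extra order() calls, which is exactly what the timings above measure.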
Re: [R] help: program efficiency
Date: Fri, 26 Nov 2010 11:25:26 -0800
From: roman.lust...@gmail.com
To: r-help@r-project.org
Subject: Re: [R] help: program efficiency

  Oops, tiny mistake. Try

    lapply(X = b, FUN = function(x) {
        swn <- seq(from = 0, to = (0 + 0.01*length(x)) - 0.01, by = 0.01)
        out <- x + swn
        return(out)
    })

The way the OP stated the question, it wasn't clear what he was really after and what he thought was a concession to what is easy. That is, it may seem easy to add incrementing offsets to lift the degeneracy, but my earlier suggestion, just adding random numbers, is probably the shortest R code you can write and should do what is needed. If you really want to add a fixed amount to possibly equal integers, probably the easiest thing is to make a hash table (does R have a hash structure?). Then, for each element, increment the value stored in your hash for that element's value and add it to the element. The hash table is keyed on the integer value, and the key retrieves either the last increment or the number of prior occurrences. If you had to write this as a loop, pseudo-code would be something like:

  for i = 0 to n {
      vold = hash.get(v[i])
      vnew = vold + .01
      hash.put(v[i], vnew)
      v[i] += vnew
  }
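R does have a usable hash structure: an environment created with new.env(hash = TRUE). A sketch of the pseudo-code above in R terms (the function name and the .00-based numbering of first occurrences are my choices, not from the thread):

```r
# Hash-table version: count prior occurrences of each value in an environment
nodup_hash <- function(v) {
  h <- new.env(hash = TRUE)
  out <- v
  for (i in seq_along(v)) {
    key <- as.character(v[i])
    # number of times this value has been seen before
    n_prior <- if (exists(key, envir = h, inherits = FALSE))
      get(key, envir = h) + 1L else 0L
    assign(key, n_prior, envir = h)
    out[i] <- v[i] + n_prior / 100
  }
  out
}

nodup_hash(c(2, 1, 1, 3, 3, 3, 4))  # 2.00 1.00 1.01 3.00 3.01 3.02 4.00
```

Unlike the sort-based variants, this preserves the input order without an order(order(x)) fix-up, at the cost of an interpreted loop.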
Re: [R] help: program efficiency
Thanks guys, the rle function works pretty well. Thank you all for the efforts.

Zheng
Re: [R] help: program efficiency
one way is the following:

  a <- c(2,1,1,3,3,3,4)
  d <- unlist(sapply(rle(a)$lengths, function (x)
      if (x > 1) seq(0.01, by = 0.01, len = x) else 0))
  a + d

I hope it helps.

Best,
Dimitris

On 11/25/2010 3:49 PM, randomcz wrote:
[...]

--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014
Web: http://www.erasmusmc.nl/biostatistiek/
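For reference (my check, not in the original message), this variant reproduces the poster's requested output exactly, since it gives every element of a duplicated run a non-zero offset while leaving singletons untouched:

```r
# Dimitris's rle-based version against the poster's expected result
a <- c(2, 1, 1, 3, 3, 3, 4)
d <- unlist(sapply(rle(a)$lengths, function(x)
  if (x > 1) seq(0.01, by = 0.01, length.out = x) else 0))
a + d   # 2.00 1.01 1.02 3.01 3.02 3.03 4.00
```

Like nodup3, it assumes equal values are adjacent, since rle() only sees runs.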
Re: [R] help: program efficiency
If the input vector t is known to be ordered (or if you only care about runs of duplicated values, not all duplicated values) the following is pretty quick:

  nodup3 <- function (t) {
      t + (sequence(rle(t)$lengths) - 1)/100
  }

If you don't know whether the input will be ordered, then ave() will do it a bit faster than your code:

  nodup2 <- function (t) {
      ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
  }

E.g., for a sorted sequence of 300,000 numbers drawn with replacement from 1:100,000 I get:

  a2 <- sort(sample(1:1e5, size=3e5, replace=TRUE))
  system.time(v <- nodup(a2))
     user  system elapsed
     2.78    0.05    3.97
  system.time(v2 <- nodup2(a2))
     user  system elapsed
     1.83    0.02    2.66
  system.time(v3 <- nodup3(a2))
     user  system elapsed
     0.18    0.00    0.14
  identical(v,v2)
  identical(v,v3)
  [1] TRUE

If speed is truly an issue, the built-in sequence may be replaced by a faster one that does the same thing:

  nodup3a <- function (t) {
      faster.sequence <- function(nvec) {
          seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
      }
      t + (faster.sequence(rle(t)$lengths) - 1)/100
  }

That took 0.05 seconds on the a2 dataset and produced identical results.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of randomcz
Sent: Thursday, November 25, 2010 6:49 AM
To: r-help@r-project.org
Subject: [R] help: program efficiency
[...]
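A quick equivalence check of the faster.sequence replacement (my addition): it should agree with base sequence() for any vector of run lengths:

```r
# faster.sequence rebuilds 1:k within each run using one seq_len and one rep
faster.sequence <- function(nvec) {
  seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
}

nvec <- c(2L, 1L, 3L, 1L)
faster.sequence(nvec)   # 1 2 1 1 2 3 1
stopifnot(identical(faster.sequence(nvec), sequence(nvec)))
```

The trick is that subtracting the cumulative offset of the preceding runs resets the counter to 1 at each run boundary, with no per-run loop.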
Re: [R] help: program efficiency
Date: Thu, 25 Nov 2010 06:49:19 -0800
From: rando...@gmail.com
To: r-help@r-project.org
Subject: [R] help: program efficiency

  hey guys,

  I am working on a function to make a duplicated value unique. For example, the original vector would be like:

    a = c(2,1,1,3,3,3,4)

  I'd like to transform it into:

    a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4

  Basically, find the duplicates and assign a unique value by adding a small amount, keeping things in order. I came up with the following code, but it runs slowly if t is large. Is there a better way to do it?

I guess I'd just make a vector of uniform or even normal random numbers and add it to your input vector. This of course is not guaranteed, and it adds to the unique values too, but you can test and repeat; it is probably close to what you want, though I am only speculating on your objectives.

  nodup = function(t) {
      t.index = 0
      t.dup = duplicated(t)
      for (i in 2:length(t)) {
          if (t.dup[i] == TRUE) t.index = t.index + 0.01
          else t.index = 0
          t[i] = t[i] + t.index
      }
      return(t)
  }

--
View this message in context: http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3059079.html
Sent from the R help mailing list archive at Nabble.com.
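The random-jitter suggestion above can be sketched as follows (the function name and epsilon are mine; note it perturbs values that were already unique, unlike the poster's example output):

```r
# Break ties by adding tiny uniform noise; retry until all values are unique
nodup_jitter <- function(v, eps = 1e-6) {
  repeat {
    out <- v + runif(length(v), 0, eps)
    if (!anyDuplicated(out)) return(out)
  }
}

set.seed(1)
res <- nodup_jitter(c(2, 1, 1, 3, 3, 3, 4))
stopifnot(!anyDuplicated(res))
```

Collisions among the jittered values are vanishingly unlikely, so the retry loop almost always runs once; the trade-off is that the offsets are random rather than the ordered .01, .02, ... increments the poster asked for.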