Re: [R] help: program efficiency

2010-11-27 Thread Romain Francois
> system.time( nodup3( sort( x ) ) )
   user  system elapsed
  0.162   0.011   0.172
> system.time( nodup3a( sort( x ) ) )
   user  system elapsed
  0.099   0.009   0.109
> system.time( nodup4( sort( x ) ) )
   user  system elapsed
  0.089   0.004   0.094

so nodup4 is still faster, but the values are not in the right order:

> x <- c( 2, 1, 1, 2 )
> nodup4( sort( x ) )
[1] 1.00 1.01 2.00 2.01
> nodup_cpp( x )
[1] 2.00 1.00 1.01 2.01

Romain


I think this gives a fairer comparison:

> system.time( nodup_cpp( x ) )
   user  system elapsed
  0.113   0.002   0.114
> system.time( { oo <- order(order(x)) ; nodup3( sort( x ) )[oo] } )
   user  system elapsed
  0.336   0.012   0.347
> system.time( { oo <- order(order(x)) ; nodup3a( sort( x ) )[oo] } )
   user  system elapsed
  0.251   0.011   0.262
> system.time( { oo <- order(order(x)) ; nodup4( sort( x ) )[oo] } )
   user  system elapsed
  0.287   0.006   0.294
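The `order(order(x))` trick works because ordering the order vector yields the inverse of the sorting permutation: element i of the result tells you where x[i] landed in sort(x). A minimal C++ sketch of the same idea (the function name is illustrative, not from the thread):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Inverse sorting permutation, the C++ analogue of order(order(x)) - 1:
// inv[i] is the 0-based position that x[i] occupies in the sorted vector,
// so results computed on sort(x) can be read back in the original order.
std::vector<int> inverse_sort_permutation(const std::vector<double>& x) {
    std::vector<int> ord(x.size());
    std::iota(ord.begin(), ord.end(), 0);          // ord = order(x) - 1
    std::stable_sort(ord.begin(), ord.end(),
                     [&x](int a, int b) { return x[a] < x[b]; });
    std::vector<int> inv(x.size());                // invert the permutation
    for (std::size_t r = 0; r < ord.size(); ++r)
        inv[ord[r]] = static_cast<int>(r);
    return inv;
}
```

For x = c(2, 1, 1, 2) this gives inv = {2, 0, 1, 3}; indexing the sorted results {1.00, 1.01, 2.00, 2.01} by inv reproduces nodup_cpp's output {2.00, 1.00, 1.01, 2.01}.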


Romain


On 26/11/10 20:01, William Dunlap wrote:



-Original Message-
From: William Dunlap
Sent: Thursday, November 25, 2010 9:31 AM
To: 'randomcz'; r-help@r-project.org
Subject: RE: [R] help: program efficiency

If the input vector t is known to be ordered
(or if you only care about runs of duplicated
values, not all duplicated values) the following
is pretty quick

nodup3 <- function (t) {
    t + (sequence(rle(t)$lengths) - 1)/100
}

If you don't know whether the input will be ordered
then ave() will do it a bit faster than your
code

nodup2 <- function (t) {
    ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
}

E.g., for a sorted sequence of 300,000 numbers drawn with
replacement from 1:100,000 I get:


> a2 <- sort(sample(1:1e5, size=3e5, replace=TRUE))
> system.time(v <- nodup(a2))
   user  system elapsed
   2.78    0.05    3.97
> system.time(v2 <- nodup2(a2))
   user  system elapsed
   1.83    0.02    2.66
> system.time(v3 <- nodup3(a2))
   user  system elapsed
   0.18    0.00    0.14
> identical(v,v2) && identical(v,v3)
[1] TRUE

If speed is truly an issue, the built-in sequence may
be replaced by a faster one that does the same thing:

nodup3a <- function (t) {
    faster.sequence <- function(nvec) {
        seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])),
            nvec)
    }
    t + (faster.sequence(rle(t)$lengths) - 1)/100
}

That took 0.05 seconds on the a2 dataset and produced
identical results.


rle() computes a sort of second difference and
nodup3a computes a cumsum on that second difference,
to get back to a first difference. The following
avoids that wasted operation (along with rle's
computation of the values component of its output).

nodup4 <- function(t) {
    n <- length(t)
    p <- c(0L, which(t[-1L] != t[-n]), n)
    t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
}

That reduced nodup3a's time by about 30% on that dataset.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


-Original Message-
From: r-help-boun...@r-project.org
[mailto:r-help-boun...@r-project.org] On Behalf Of randomcz
Sent: Thursday, November 25, 2010 6:49 AM
To: r-help@r-project.org
Subject: [R] help: program efficiency


hey guys,

I am working on a function to make a duplicated value unique.
For example,
the original vector would be like : a = c(2,1,1,3,3,3,4)
I'll like to transform it into:
a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
basically, find the duplicates and assign a unique value by
adding a small
amount and keep it in order.
I come up with the following codes, but it runs slow if t is
large. Is there
a better way to do it?
nodup = function(t)
{
t.index=0
t.dup=duplicated(t)
for (i in 2:length(t))
{
if (t.dup[i]==T)
t.index=t.index+0.01
else t.index=0
t[i]=t[i]+t.index
}
return(t)
}


--
View this message in context:
http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3059079.html

Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
|- http://bit.ly/9VOd3l : ZAT! 2010
|- http://bit.ly/c6DzuX : Impressionnism with R
`- http://bit.ly/czHPM7 : Rcpp Google tech talk on youtube



Re: [R] help: program efficiency

2010-11-27 Thread Mike Marchywka




 So in this example, it seems more efficient to sort first and use the
 algorithm assuming that the data is sorted.

 There is probably a way to be smarter in nodup_cpp where the bottleneck
 is likely to be related to map::find.

If you just use a hash table, std::map should work too;
I don't see what there is to sort (see my earlier
post). You do, however, need to be careful about sum-of-pieces timing,
especially if you ever end up in a VM. Memory coherence can be a big deal;
removing a sort can slow other things down later in some cases.
I hate to ask, but those variables foo[i] are not maps, are they?
If you care about efficiency you should be using arrays here; IIRC
map has to handle these as sparse arrays, and that slows things down.
However, if you made a map of prior occurrences of each value,
foo[v[i]], that may be faster than doing a sort; hard to say.



 Profiling reveals this:

> Rprof()
> for(i in 1:100) { res6 <- ( nodup_cpp_hybrid( x, sort.list(x) ) ) }
> Rprof(NULL)
> summaryRprof()
$by.self
            self.time self.pct total.time total.pct
sort.list        6.50    90.03       6.50     90.03
.Call            0.42     5.82       0.42      5.82
file.exists      0.30     4.16       0.30      4.16

$by.total
                 total.time total.pct self.time self.pct
nodup_cpp_hybrid       7.22    100.00      0.00     0.00
sort.list              6.50     90.03      6.50    90.03
.Call                  0.42      5.82      0.42     5.82
file.exists            0.30      4.16      0.30     4.16

 $sample.interval
 [1] 0.02

 $sampling.time
 [1] 7.22


 The 4.16 % taken by file.exists indicates that someone in the inline
 project has to do some work (on my TODO list).

I've never used the R profiler, but according to the docs this is wall-clock
time. Time blocking for IO may dominate, depending on how the filesystem works.
I often do point out that IO can dominate things that everyone expects
to be CPU bound; this often comes up with cygwin, where you have another layer
of stuff over the OS, but it can happen anywhere.



 But otherwise sort.list dominates the time.
 


Re: [R] help: program efficiency

2010-11-26 Thread William Dunlap
 -Original Message-
 From: William Dunlap 
 Sent: Thursday, November 25, 2010 9:31 AM
 To: 'randomcz'; r-help@r-project.org
 Subject: RE: [R] help: program efficiency
 
 If the input vector t is known to be ordered
 (or if you only care about runs of duplicated
 values, not all duplicated values) the following
 is pretty quick
 
 nodup3 <- function (t) {
     t + (sequence(rle(t)$lengths) - 1)/100
 }
 
 If you don't know whether the input will be ordered
 then ave() will do it a bit faster than your
 code
 
 nodup2 <- function (t) {
     ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
 }
 
 E.g., for a sorted sequence of 300,000 numbers drawn with
 replacement from 1:100,000 I get:
 
 > a2 <- sort(sample(1:1e5, size=3e5, replace=TRUE))
 > system.time(v <- nodup(a2))
    user  system elapsed 
    2.78    0.05    3.97 
 > system.time(v2 <- nodup2(a2))
    user  system elapsed 
    1.83    0.02    2.66 
 > system.time(v3 <- nodup3(a2))
    user  system elapsed 
    0.18    0.00    0.14 
 > identical(v,v2) && identical(v,v3)
 [1] TRUE
 
 If speed is truly an issue, the built-in sequence may
 be replaced by a faster one that does the same thing:
 
 nodup3a <- function (t) {
     faster.sequence <- function(nvec) {
         seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), 
             nvec)
     }
     t + (faster.sequence(rle(t)$lengths) - 1)/100
 }
 
 That took 0.05 seconds on the a2 dataset and produced
 identical results.

rle() computes a sort of second difference and
nodup3a computes a cumsum on that second difference,
to get back to a first difference.  The following
avoids that wasted operation (along with rle's
computation of the values component of its output).

nodup4 <- function(t) {
    n <- length(t)
    p <- c(0L, which(t[-1L] != t[-n]), n)
    t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
}

That reduced nodup3a's time by about 30% on that dataset.
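The run-boundary idea behind nodup4 (find where runs end, then number the positions within each run) collapses to a single pass when written imperatively. A C++ sketch of the same numbering scheme, assuming sorted input (names are illustrative, not from the post):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Within-run numbering: walk the (sorted) vector once, reset a counter at
// every run boundary, and add counter/100 to each element. This is the
// imperative equivalent of nodup4's vectorised R code.
std::vector<double> nodup_runs(const std::vector<double>& t) {
    std::vector<double> out(t);
    int k = 0;                                // position within current run
    for (std::size_t i = 1; i < t.size(); ++i) {
        k = (t[i] == t[i - 1]) ? k + 1 : 0;   // run continues, or restarts
        out[i] = t[i] + k / 100.0;
    }
    return out;
}
```

On c(1, 1, 2, 2, 2, 4) this yields 1.00 1.01 2.00 2.01 2.02 4.00, the same numbering nodup3/nodup4 produce on sorted data.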
 
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  
 
  -Original Message-
  From: r-help-boun...@r-project.org 
  [mailto:r-help-boun...@r-project.org] On Behalf Of randomcz
  Sent: Thursday, November 25, 2010 6:49 AM
  To: r-help@r-project.org
  Subject: [R] help: program efficiency
  
  
  hey guys,
  
  I am working on a function to make a duplicated value unique. 
  For example,
  the original vector would be like : a = c(2,1,1,3,3,3,4)
  I'll like to transform it into:
  a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
  basically, find the duplicates and assign a unique value by 
  adding a small
  amount and keep it in order.
  I come up with the following codes, but it runs slow if t is 
  large. Is there
  a better way to do it?
  nodup = function(t)
  {
t.index=0
t.dup=duplicated(t)
for (i in 2:length(t))
{
  if (t.dup[i]==T)
t.index=t.index+0.01
  else t.index=0
  t[i]=t[i]+t.index
}
return(t)
  }
  
  
  -- 
  View this message in context: 
  http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3059079.html
  Sent from the R help mailing list archive at Nabble.com.
  


Re: [R] help: program efficiency

2010-11-26 Thread Roman Luštrik

See if this works for you.

a <- c(2,1,1,3,3,3,4)
a.fac <- as.factor(a)
b <- split(a, f = a.fac)
system.time(lapply(X = b, FUN = function(x) {
    swn <- seq(from = 0, to = 0 + 0.01*length(x), by = 0.01)
    out <- x + swn
    return(out)
}))

Cheers,
Roman
-- 
View this message in context: 
http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3060801.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] help: program efficiency

2010-11-26 Thread Roman Luštrik

Oops, tiny mistake. Try

lapply(X = b, FUN = function(x) {
    swn <- seq(from = 0, to = (0 + 0.01*length(x)) - 0.01, by = 0.01)
    out <- x + swn
    return(out)
})
-- 
View this message in context: 
http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3060806.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] help: program efficiency

2010-11-26 Thread Romain Francois

Hello,

Can we really make the assumption that the data is sorted? The original 
example was not:



I am working on a function to make a duplicated value unique. For example,
the original vector would be like : a = c(2,1,1,3,3,3,4)


If we can make the assumption, here is a C++ based version:


nodup_cpp_assumingsorted <- cxxfunction( signature( x_ = "numeric" ), '

    // since we modify x, we need to make a copy
    NumericVector x = clone<NumericVector>(x_) ;

    int n = x.size() ;
    double current, previous = x[0] ;
    int index = 0 ;
    for( int i=1; i<n; i++){
        current = x[i] ;
        if( current == previous ){
            x[i] = current + (++index) / 100.0 ;
        } else {
            index = 0 ;
        }
        previous = current ;
    }
    return x ;
', plugin = "Rcpp" )


with these results:

> x <- sort( sample( 1:10, size = 30, replace = TRUE ) )

> system.time( nodup3( x ) )
   user  system elapsed
  0.090   0.004   0.094
> system.time( nodup3a( x ) )
   user  system elapsed
  0.036   0.005   0.040
> system.time( nodup4( x ) )
   user  system elapsed
  0.025   0.004   0.029
> system.time( nodup_cpp_assumingsorted( x ) )
   user  system elapsed
  0.003   0.001   0.004



Now, if we don't make the assumption that the data is sorted, here is 
another C++ based version:


require( inline )
require( Rcpp )
nodup_cpp <- cxxfunction( signature( x_ = "numeric" ), '

    // since we modify x, we need to make a copy
    NumericVector x = clone<NumericVector>(x_) ;

    typedef std::map<double,int> imap ;
    typedef imap::value_type pair ;
    imap index ;
    int n = x.size() ;
    double current, previous = x[0] ;
    index.insert( pair( previous, 0 ) );

    imap::iterator it = index.begin() ;
    for( int i=1; i<n; i++){
        current = x[i] ;
        if( current == previous ){
            x[i] = current + ( ++(it->second) / 100.0 ) ;
        } else {
            it = index.find(current) ;
            if( it == index.end() ){
                it = index.insert(
                    current > previous ? it : index.begin(),
                    pair( current, 0 )
                ) ;
            } else {
                x[i] = current + ( ++(it->second) / 100.0 ) ;
            }
            previous = current ;
        }
    }
    return x ;
', plugin = "Rcpp" )


which gives me this:

> x <- sample( 1:10, size = 30, replace = TRUE )

> system.time( nodup_cpp( x ) )
   user  system elapsed
  0.111   0.002   0.113
> system.time( nodup3( sort( x ) ) )
   user  system elapsed
  0.162   0.011   0.172
> system.time( nodup3a( sort( x ) ) )
   user  system elapsed
  0.099   0.009   0.109
> system.time( nodup4( sort( x ) ) )
   user  system elapsed
  0.089   0.004   0.094

so nodup4 is still faster, but the values are not in the right order:

> x <- c( 2, 1, 1, 2 )
> nodup4( sort( x ) )
[1] 1.00 1.01 2.00 2.01
> nodup_cpp( x )
[1] 2.00 1.00 1.01 2.01

Romain

On 26/11/10 20:01, William Dunlap wrote:



-Original Message-
From: William Dunlap
Sent: Thursday, November 25, 2010 9:31 AM
To: 'randomcz'; r-help@r-project.org
Subject: RE: [R] help: program efficiency

If the input vector t is known to be ordered
(or if you only care about runs of duplicated
values, not all duplicated values) the following
is pretty quick

nodup3 <- function (t) {
    t + (sequence(rle(t)$lengths) - 1)/100
}

If you don't know whether the input will be ordered
then ave() will do it a bit faster than your
code

nodup2 <- function (t) {
    ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
}

E.g., for a sorted sequence of 300,000 numbers drawn with
replacement from 1:100,000 I get:


> a2 <- sort(sample(1:1e5, size=3e5, replace=TRUE))
> system.time(v <- nodup(a2))
   user  system elapsed
   2.78    0.05    3.97
> system.time(v2 <- nodup2(a2))
   user  system elapsed
   1.83    0.02    2.66
> system.time(v3 <- nodup3(a2))
   user  system elapsed
   0.18    0.00    0.14
> identical(v,v2) && identical(v,v3)
[1] TRUE

If speed is truly an issue, the built-in sequence may
be replaced by a faster one that does the same thing:

nodup3a <- function (t) {
    faster.sequence <- function(nvec) {
        seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])),
            nvec)
    }
    t + (faster.sequence(rle(t)$lengths) - 1)/100
}

That took 0.05 seconds on the a2 dataset and produced
identical results.


rle() computes a sort of second difference and
nodup3a computes a cumsum on that second difference,
to get back to a first difference.  The following
avoids that wasted operation (along with rle's
computation of the values component of its output).

nodup4 <- function(t) {
    n <- length(t)
    p <- c(0L, which(t[-1L] != t[-n]), n)
    t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
}

That reduced nodup3a's time by about 30% on that dataset.


Re: [R] help: program efficiency

2010-11-26 Thread Mike Marchywka

 Date: Fri, 26 Nov 2010 11:25:26 -0800
 From: roman.lust...@gmail.com
 To: r-help@r-project.org
 Subject: Re: [R] help: program efficiency


 Oops, tiny mistake. Try

  lapply(X = b, FUN = function(x) {
      swn <- seq(from = 0, to = (0 + 0.01*length(x)) - 0.01, by = 0.01)
      out <- x + swn
      return(out)
  })
 --

The way the OP stated the question, it wasn't clear what he was really
after and what he thought was a concession to what is easy. That is,
it may seem easy to add incrementing offsets to lift degeneracy, but
my earlier suggestion, just adding random numbers, is probably the shortest
R code you can make and should do what is needed. If you really want
to add a fixed amount to possibly equal integers, probably the easiest
thing to do is make a hash table; does R have a hash structure?
Then, for each element, increment the value in your hash for the element value
and add it to the element. The hash table is keyed on integer value, and
the hash key retrieves a value of either the last increment or the number of
prior occurrences.

If you had to write this as a loop, pseudo code would be something
like this:

for i = 0 to n
{ vold = hash.get(v[i]); vnew = vold + .01; hash.put(v[i], vnew); v[i] += vnew; }
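A minimal C++ rendering of that hash-table loop, using std::unordered_map (names are illustrative; unlike the pseudo code above, this adds 0.00 rather than 0.01 to the first occurrence, matching nodup3's numbering):

```cpp
#include <cassert>
#include <cmath>
#include <unordered_map>
#include <vector>

// Hash-based de-duplication: count how many times each value has been seen
// before and add count/100 to the current occurrence. One pass, no sorting,
// and the original input order is preserved.
std::vector<double> nodup_hash(const std::vector<double>& v) {
    std::unordered_map<double, int> seen;     // value -> occurrences so far
    std::vector<double> out;
    out.reserve(v.size());
    for (double value : v) {
        int k = seen[value]++;                // 0 for the first occurrence
        out.push_back(value + k / 100.0);
    }
    return out;
}
```

On the OP's example c(2, 1, 1, 3, 3, 3, 4) this produces 2.00 1.00 1.01 3.00 3.01 3.02 4.00, the same result nodup_cpp gives without sorting.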




  


Re: [R] help: program efficiency

2010-11-26 Thread randomcz

Thanks guys, the rle function works pretty well. Thank you all for the
efforts.

Zheng
-- 
View this message in context: 
http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3061103.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] help: program efficiency

2010-11-25 Thread Dimitris Rizopoulos

one way is the following:

a <- c(2,1,1,3,3,3,4)

d <- unlist(sapply(rle(a)$length, function (x)
    if (x > 1) seq(0.01, by = 0.01, len = x) else 0))

a + d


I hope it helps.

Best,
Dimitris


On 11/25/2010 3:49 PM, randomcz wrote:


hey guys,

I am working on a function to make a duplicated value unique. For example,
the original vector would be like : a = c(2,1,1,3,3,3,4)
I'll like to transform it into:
a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
basically, find the duplicates and assign a unique value by adding a small
amount and keep it in order.
I come up with the following codes, but it runs slow if t is large. Is there
a better way to do it?
nodup = function(t)
{
    t.index = 0
    t.dup = duplicated(t)
    for (i in 2:length(t))
    {
        if (t.dup[i] == T)
            t.index = t.index + 0.01
        else t.index = 0
        t[i] = t[i] + t.index
    }
    return(t)
}




--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014
Web: http://www.erasmusmc.nl/biostatistiek/



Re: [R] help: program efficiency

2010-11-25 Thread William Dunlap
If the input vector t is known to be ordered
(or if you only care about runs of duplicated
values, not all duplicated values) the following
is pretty quick

nodup3 <- function (t) {
    t + (sequence(rle(t)$lengths) - 1)/100
}

If you don't know whether the input will be ordered
then ave() will do it a bit faster than your
code

nodup2 <- function (t) {
    ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
}

E.g., for a sorted sequence of 300,000 numbers drawn with
replacement from 1:100,000 I get:

> a2 <- sort(sample(1:1e5, size=3e5, replace=TRUE))
> system.time(v <- nodup(a2))
   user  system elapsed 
   2.78    0.05    3.97 
> system.time(v2 <- nodup2(a2))
   user  system elapsed 
   1.83    0.02    2.66 
> system.time(v3 <- nodup3(a2))
   user  system elapsed 
   0.18    0.00    0.14 
> identical(v,v2) && identical(v,v3)
[1] TRUE

If speed is truly an issue, the built-in sequence may
be replaced by a faster one that does the same thing:

nodup3a <- function (t) {
    faster.sequence <- function(nvec) {
        seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), 
            nvec)
    }
    t + (faster.sequence(rle(t)$lengths) - 1)/100
}

That took 0.05 seconds on the a2 dataset and produced
identical results.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of randomcz
 Sent: Thursday, November 25, 2010 6:49 AM
 To: r-help@r-project.org
 Subject: [R] help: program efficiency
 
 
 hey guys,
 
 I am working on a function to make a duplicated value unique. 
 For example,
 the original vector would be like : a = c(2,1,1,3,3,3,4)
 I'll like to transform it into:
 a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
 basically, find the duplicates and assign a unique value by 
 adding a small
 amount and keep it in order.
 I come up with the following codes, but it runs slow if t is 
 large. Is there
 a better way to do it?
  nodup = function(t)
  {
      t.index = 0
      t.dup = duplicated(t)
      for (i in 2:length(t))
      {
          if (t.dup[i] == T)
              t.index = t.index + 0.01
          else t.index = 0
          t[i] = t[i] + t.index
      }
      return(t)
  }
 
 
 -- 
  View this message in context: 
  http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3059079.html
  Sent from the R help mailing list archive at Nabble.com.
 


Re: [R] help: program efficiency

2010-11-25 Thread Mike Marchywka





 Date: Thu, 25 Nov 2010 06:49:19 -0800
 From: rando...@gmail.com
 To: r-help@r-project.org
 Subject: [R] help: program efficiency


 hey guys,

 I am working on a function to make a duplicated value unique. For example,
 the original vector would be like : a = c(2,1,1,3,3,3,4)
 I'll like to transform it into:
 a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
 basically, find the duplicates and assign a unique value by adding a small
 amount and keep it in order.
 I come up with the following codes, but it runs slow if t is large. Is there
 a better way to do it?


I guess I'd just make a vector of uniform or even normal random numbers
and add it to your input vector. This of course is not guaranteed, and it
also perturbs values that were already unique, but you can test and repeat,
and it is probably closer to what you want; I am only speculating on your
objectives.
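A sketch of that jitter idea in C++, under stated assumptions (fixed seed for reproducibility, eps chosen smaller than the spacing of distinct values so the original ordering is preserved; names are illustrative, not from the post):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Break ties by adding a small uniform random offset to every element,
// then test for remaining duplicates and re-draw until all values differ.
std::vector<double> jitter_unique(std::vector<double> v, double eps = 1e-3) {
    std::mt19937 gen(42);                     // fixed seed: reproducible demo
    std::uniform_real_distribution<double> u(-eps, eps);
    std::vector<double> out;
    for (;;) {
        out = v;
        for (double& x : out) x += u(gen);
        std::vector<double> s(out);           // the "test" step: any ties left?
        std::sort(s.begin(), s.end());
        if (std::adjacent_find(s.begin(), s.end()) == s.end())
            return out;                       // all values are now distinct
    }
}
```

With continuous offsets, a single draw almost surely breaks every tie, so the retry loop is essentially a safety net.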


 nodup = function(t)
 {
 t.index=0
 t.dup=duplicated(t)
 for (i in 2:length(t))
 {
 if (t.dup[i]==T)
 t.index=t.index+0.01
 else t.index=0
 t[i]=t[i]+t.index
 }
 return(t)
 }


 --
 View this message in context: 
 http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3059079.html
 Sent from the R help mailing list archive at Nabble.com.
