Re: [R] data.table error

2010-04-29 Thread Tom Short
Johannes, please try the latest version on R-forge (1.4). That error
has been fixed, and it's much faster. We hope to have it on CRAN
reasonably soon.

To install, use:

  install.packages("data.table", repos="http://R-Forge.R-project.org")
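
As for why an R 2.10.1 session could trigger a "< 2.4.0" warning at all: one plausible cause (a guess, not confirmed anywhere in this thread) is a version check that compares version numbers as character strings. A small sketch of that pitfall, using only base R:

```r
# Assumption: the old check compared versions as strings. This is a guess
# at the cause; nothing in the thread confirms it.
as_strings  <- "2.10.1" < "2.4.0"   # character comparison: "1" sorts before "4"
as_versions <- package_version("2.10.1") < package_version("2.4.0")
print(c(strings = as_strings, versions = as_versions))

# A check that compares version components numerically:
if (getRversion() < "2.4.0") warning("This R session is < 2.4.0")
```

getRversion() returns a numeric_version object, so the comparison is done component by component rather than character by character.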

- Tom

Tom Short

On Thu, Apr 29, 2010 at 3:40 PM, johannes rara  wrote:
> I'm trying to learn data.table package but I get a following annoying
> error message:
>
>> install.packages("data.table")
> trying URL 
> 'http://www.freestatistics.org/cran/bin/macosx/universal/contrib/2.10/data.table_1.2.tgz'
> Content type 'application/x-gzip' length 66823 bytes (65 Kb)
> opened URL
> ==
> downloaded 65 Kb
>
>
> The downloaded packages are in
>        
> /var/folders/n-/n-wPTanPGa4PpVd0bTgCOU+++TI/-Tmp-//RtmppqPptG/downloaded_packages
>> library(data.table)
>> cr <- data.table(cars)
>> cr[speed == 20]
>     speed dist
> [1,]    20   32
> [2,]    20   48
> [3,]    20   52
> [4,]    20   56
> [5,]    20   64
> Warning messages:
> 1: In `[.data.table`(cr, speed == 20) :
>  This R session is < 2.4.0. Please upgrade to 2.4.0+.
> 2: In `[.data.table`(cr, speed == 20) :
>  This R session is < 2.4.0. Please upgrade to 2.4.0+.
>>
>
> I'm using R 2.10.1 (see sessioninfo below), so why this error message
> keeps popping up?
>
>> sessionInfo()
> R version 2.10.1 (2009-12-14)
> i386-apple-darwin8.11.1
>
> locale:
> [1] fi_FI.UTF-8/fi_FI.UTF-8/C/C/fi_FI.UTF-8/fi_FI.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] data.table_1.2 ref_0.97
>
> loaded via a namespace (and not attached):
> [1] tools_2.10.1
>>
>
> -J
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Re: [R] Code is too slow: mean-centering variables in a data frame by subgroup

2010-04-07 Thread Tom Short
Another way that Matthew Dowle showed me for this type of problem is
to reshape frame to a long format. It makes it easier to manipulate
and can be faster.

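> # rep(..., times = 7) lines the group labels up with c(a,b,c,d,e,f,g),
> # which stacks whole columns one after another
> longdt <- with(frame, data.table(group = rep(group, times = 7),
+                                  x = c(a,b,c,d,e,f,g)))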
>
> system.time(new.frame4 <- longdt[, x/mean(x, na.rm = TRUE), by = "group"])
   user  system elapsed
   0.54    0.04    0.61
>
> # Or, remove the NAs ahead of time for more speed:
>
> longdt2 <- longdt[!is.na(longdt$x),]
> system.time(new.frame4 <- longdt2[, x/mean(x), by = "group"])
   user  system elapsed
   0.17    0.00    0.17
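
For readers without data.table installed, the long-format idea can be sketched in base R. This is only an illustration on a tiny made-up frame, not the timed code above; the `variable` column and the `scaled` name are assumptions added here so that each value is divided by the mean of its own column within its group.

```r
# Tiny made-up data: two groups, two measurement columns, some NAs
frame <- data.frame(group = rep(c("g1", "g2"), each = 3),
                    a = c(1, 2, NA, 4, 5, 6),
                    b = c(10, NA, 30, 40, 50, 60))
vars <- c("a", "b")

# Long format: one row per (original row, variable) pair.
# Columns are stacked one after another, so group repeats with times =.
long <- data.frame(
  group    = rep(frame$group, times = length(vars)),
  variable = rep(vars, each = nrow(frame)),
  x        = unlist(frame[vars], use.names = FALSE)
)

# Divide each value by the mean of its group/variable cell, ignoring NAs
long$scaled <- ave(long$x, long$group, long$variable,
                   FUN = function(v) v / mean(v, na.rm = TRUE))
print(long)
```

Keeping the variable name as a column is what lets one grouped operation handle all measurement columns at once.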

- Tom

On Wed, Apr 7, 2010 at 3:46 PM, Tom Short  wrote:
> Here's how I would have done the data.table method. It's a bit faster
> than the ave approach on my machine:
>
>> # install.packages("data.table",repos="http://R-Forge.R-project.org")
>> library(data.table)
>>
>> f3 <- function(frame) {
> +   frame <- as.data.table(frame)
> +   frame[, lapply(.SD[,2:ncol(.SD), with = FALSE],
> +                  function(x) x / mean(x, na.rm = TRUE)),
> +         by = "group"]
> + }
>>
>> system.time(new.frame2 <- f2(frame)) # ave
>   user  system elapsed
>   0.50    0.08    1.24
>> system.time(new.frame3 <- f3(frame)) # data.table
>   user  system elapsed
>   0.25    0.01    0.30
>
> - Tom
>
> Tom Short
>
>
> On Wed, Apr 7, 2010 at 12:46 PM, Dimitri Liakhovitski  
> wrote:
>> I would like to thank once more everyone who helped me with this question.
>> I compared the speed for different approaches. Below are the results
>> of my comparisons - in case anyone is interested:
>>
>> ### Building an EXAMPLE FRAME with N rows - with groups and a lot of NAs:
>> N<-10
>> set.seed(1234)
>> frame<-data.frame(group=rep(paste("group",1:10),N/10),a=rnorm(1:N),b=rnorm(1:N),c=rnorm(1:N),d=rnorm(1:N),e=rnorm(1:N),f=rnorm(1:N),g=rnorm(1:N))
>> frame<-frame[order(frame$group),]
>>
>> ## Introducing 60% NAs:
>> names.used<-names(frame)[2:length(frame)]
>> set.seed(1234)
>> for(i in names.used){
>>      i.for.NA<-sample(1:N,round((N*.6),0))
>>      frame[[i]][i.for.NA]<-NA
>> }
>> lapply(frame[2:8], function(x) length(x[is.na(x)])) # Checking that it worked
>> ORIGframe<-frame ## placeholder for the unchanged original frame
>>
>> ### Objective of the code - divide each value by its group mean 
>>
>> ### METHOD 1 - the FASTEST - using ave():##
>> frame<-ORIGframe
>> f2 <- function(frame) {
>>  for(i in 2:ncol(frame)) {
>>     frame[,i] <- ave(frame[,i], frame[,1], 
>> FUN=function(x)x/mean(x,na.rm=TRUE))
>>  }
>>  frame
>> }
>> system.time({new.frame<-f2(frame)})
>> # Took me 0.23-0.27 sec
>> ###
>>
>> ### METHOD 2 - fast, just a bit slower - using data.table:
>> ##
>>
>> # If you don't have it - install the package - NOT from CRAN:
>> install.packages("data.table",repos="http://R-Forge.R-project.org")
>> library(data.table)
>> frame<-ORIGframe
>> system.time({
>> table<-data.table(frame)
>> colMeanFunction<-function(data,key){
>>  data[[key]]=NULL
>>  ret=as.matrix(data)/matrix(rep(as.numeric(colMeans(as.data.frame(data),na.rm=T)),nrow(data)),nrow=nrow(data),ncol=ncol(data),byrow=T)
>>  return(ret)
>> }
>> groupedMeans = table[,colMeanFunction(.SD, "group"), by="group"]
>> names.to.use<-names(groupedMeans)
>> for(i in 
>> 1:length(groupedMeans)){groupedMeans[[i]]<-as.data.frame(groupedMeans[[i]])}
>> groupedMeans<-do.call(cbind, groupedMeans)
>> names(groupedMeans)<-names.to.use
>> })
>> # Took me 0.37-.45 sec
>> ###
>>
>> ### METHOD 3 - fast, a tad slower (using model.matrix & matrix multiplication):##
>> frame<-ORIGframe
>> system.time({
>> mat <- as.matrix(frame[,-1])
>> mm <- model.matrix(~0+group,frame)
>> col.grp.N <- crossprod( !is.na(mat), mm ) # Use this line if don't want to use NAs for mean calculations
>> # col.grp.N <- crossprod( mat != 0 , mm ) # Use this line if don't want to use zeros for mean calculations
>> mat[is.na(mat)] <- 0.0
>> col.grp.sum <- crossprod( mat, mm )
>> mat <- mat / ( t(col.grp.sum/col.grp.N)[ frame$group,] )
>> is.na(mat) <- is.na(frame[,-1])

Re: [R] Code is too slow: mean-centering variables in a data frame by subgroup

2010-04-07 Thread Tom Short
Here's how I would have done the data.table method. It's a bit faster
than the ave approach on my machine:

> # install.packages("data.table",repos="http://R-Forge.R-project.org")
> library(data.table)
>
> f3 <- function(frame) {
+   frame <- as.data.table(frame)
+   frame[, lapply(.SD[,2:ncol(.SD), with = FALSE],
+                  function(x) x / mean(x, na.rm = TRUE)),
+         by = "group"]
+ }
>
> system.time(new.frame2 <- f2(frame)) # ave
   user  system elapsed
   0.50    0.08    1.24
> system.time(new.frame3 <- f3(frame)) # data.table
   user  system elapsed
   0.25    0.01    0.30

- Tom

Tom Short


On Wed, Apr 7, 2010 at 12:46 PM, Dimitri Liakhovitski  wrote:
> I would like to thank once more everyone who helped me with this question.
> I compared the speed for different approaches. Below are the results
> of my comparisons - in case anyone is interested:
>
> ### Building an EXAMPLE FRAME with N rows - with groups and a lot of NAs:
> N<-10
> set.seed(1234)
> frame<-data.frame(group=rep(paste("group",1:10),N/10),a=rnorm(1:N),b=rnorm(1:N),c=rnorm(1:N),d=rnorm(1:N),e=rnorm(1:N),f=rnorm(1:N),g=rnorm(1:N))
> frame<-frame[order(frame$group),]
>
> ## Introducing 60% NAs:
> names.used<-names(frame)[2:length(frame)]
> set.seed(1234)
> for(i in names.used){
>      i.for.NA<-sample(1:N,round((N*.6),0))
>      frame[[i]][i.for.NA]<-NA
> }
> lapply(frame[2:8], function(x) length(x[is.na(x)])) # Checking that it worked
> ORIGframe<-frame ## placeholder for the unchanged original frame
>
> ### Objective of the code - divide each value by its group mean 
>
> ### METHOD 1 - the FASTEST - using ave():##
> frame<-ORIGframe
> f2 <- function(frame) {
>  for(i in 2:ncol(frame)) {
>     frame[,i] <- ave(frame[,i], frame[,1], 
> FUN=function(x)x/mean(x,na.rm=TRUE))
>  }
>  frame
> }
> system.time({new.frame<-f2(frame)})
> # Took me 0.23-0.27 sec
> ###
>
> ### METHOD 2 - fast, just a bit slower - using data.table:
> ##
>
> # If you don't have it - install the package - NOT from CRAN:
> install.packages("data.table",repos="http://R-Forge.R-project.org")
> library(data.table)
> frame<-ORIGframe
> system.time({
> table<-data.table(frame)
> colMeanFunction<-function(data,key){
>  data[[key]]=NULL
>  ret=as.matrix(data)/matrix(rep(as.numeric(colMeans(as.data.frame(data),na.rm=T)),nrow(data)),nrow=nrow(data),ncol=ncol(data),byrow=T)
>  return(ret)
> }
> groupedMeans = table[,colMeanFunction(.SD, "group"), by="group"]
> names.to.use<-names(groupedMeans)
> for(i in 
> 1:length(groupedMeans)){groupedMeans[[i]]<-as.data.frame(groupedMeans[[i]])}
> groupedMeans<-do.call(cbind, groupedMeans)
> names(groupedMeans)<-names.to.use
> })
> # Took me 0.37-.45 sec
> ###
>
> ### METHOD 3 - fast, a tad slower (using model.matrix & matrix multiplication):##
> frame<-ORIGframe
> system.time({
> mat <- as.matrix(frame[,-1])
> mm <- model.matrix(~0+group,frame)
> col.grp.N <- crossprod( !is.na(mat), mm ) # Use this line if don't want to use NAs for mean calculations
> # col.grp.N <- crossprod( mat != 0 , mm ) # Use this line if don't want to use zeros for mean calculations
> mat[is.na(mat)] <- 0.0
> col.grp.sum <- crossprod( mat, mm )
> mat <- mat / ( t(col.grp.sum/col.grp.N)[ frame$group,] )
> is.na(mat) <- is.na(frame[,-1])
> mat<-as.data.frame(mat)
> })
> # Took me 0.44-0.50 sec
> ###
>
> ### METHOD 5 - much slower - it's the one I started with:##
> frame<-ORIGframe
> system.time({
> frame <- do.call(cbind, lapply(names.used, function(x){
>        unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=T)))
>        }))
> })
> # Took me 1.25-1.32 min
> ###
>
> ### METHOD 6 - the slowest; using "plyr" and "ddply":##
> frame<-ORIGframe
> library(plyr)
> function3 <- function(x) x / mean(x, na.rm = TRUE)
> system.time({
> grouping.factor<-"group"
> myvariables<-names(frame)[2:8]
> frame3<-ddply(frame, grouping.factor, colwise(function3, myvariables))
> })
> # Took me 1.36-1.47 min
> ###
>
>
> Thanks again!
> Dimitri
>
>
> On Wed, Mar 31, 2010 at 8:29 PM, William Dunlap  wrote:
>> Dimitri,

Re: [R] rpad ?

2010-03-23 Thread Tom Short
As the author of Rpad, I'll say that it is officially abandoned. I
just don't have the time or the need for my job. If someone is
interested in maintaining it, I'll try to answer questions (the email
address listed on the package hasn't worked for a while, and the
mailing list got overwhelmed with spam).

Of the other R web interfaces I've played with or looked at, RApache
is the most promising. It offers more performance and security than
the Rpad approach. You can also make some pretty interactive pages.
The trade-off is that it's harder to build applications (the last time
I looked anyway). To get interactivity, the RApache approach requires
a fair amount of javascript programming. Rpad gives you interactivity
fairly automatically as a webpage with embedded R code.

- Tom

Tom Short


On Tue, Mar 23, 2010 at 4:46 PM, Erich Neuwirth
 wrote:
> We are using RPad for a teaching application here.
> But we had to find many things the hard way,
> and additionally, it did not survive the latest R release change.
> There is a minimal repair, but the maintainer does not answer any email
> any more. We did the repair and are giving a modified version to our
> students, but we do not have enough resource to take over maintenance.
>
>
>
> On 3/23/2010 8:00 PM, sjaffe wrote:
>>
>> Is anyone using rpad? Is there any documentation or examples beyond that in
>> the 'man' directory of the source?
>>
>
> --
> Erich Neuwirth, University of Vienna
> Faculty of Computer Science
> Computer Supported Didactics Working Group
> Visit our SunSITE at http://sunsite.univie.ac.at
> Phone: +43-1-4277-39464 Fax: +43-1-4277-39459
>
>



Re: [R] data.table evaluating columns

2010-03-02 Thread Tom Short
On Tue, Mar 2, 2010 at 7:09 PM, Rob Forler  wrote:
> Hi everyone,
>
> I have the following code that works in data frames that I would like to
> work in data.tables. However, I'm not really sure how to go about it.
>
> I basically have the following
>
> names = c("data1", "data2")
> frame = data.frame(list(key1=as.integer(c(1,2,3,4,5,6)),
> key2=as.integer(c(1,2,3,2,5,6)),data1 =  c(3,3,2,3,5,2), data2=
> c(3,3,2,3,5,2)))
>
> for(i in 1:length(names)){
> frame[, paste(names[i], "flag")] = frame[,names[i]] < 3
>
> }
>
> Now I try with data.table code:
> names = c("data1", "data2")
> frame = data.table(list(key1=as.integer(c(1,2,3,4,5,6)),
> key2=as.integer(c(1,2,3,2,5,6)),data1 =  c(3,3,2,3,5,2), data2=
> c(3,3,2,3,5,2)))
>
> for(i in 1:length(names)){
> frame[, paste(names[i], "flag"), with=F] = as.matrix(frame[,names[i],
> with=F] )< 3
>
> }

Rob, this type of question is better for the package maintainer(s)
directly rather than R-help. That said, one answer is to use list
addressing:

for(i in 1:length(names)){
frame[[paste(names[i], "flag")]] = frame[[names[i]]] < 3
}
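
As a self-contained illustration of the list-addressing idea, here is a sketch on a plain data frame, which is a list of columns and so accepts `[[<-` in the same way. The name `flag_cols` is made up for this example, and how a particular data.table version behaves is as described in this message, not something the sketch verifies.

```r
# Columns to flag; flag_cols is a hypothetical name used only in this sketch
flag_cols <- c("data1", "data2")
frame <- data.frame(key1  = 1:6,
                    key2  = c(1L, 2L, 3L, 2L, 5L, 6L),
                    data1 = c(3, 3, 2, 3, 5, 2),
                    data2 = c(3, 3, 2, 3, 5, 2))

# [[ ]] returns the column itself (a vector), so the comparison yields a
# logical vector that can be assigned back as a new column.
for (nm in flag_cols) {
  frame[[paste(nm, "flag")]] <- frame[[nm]] < 3
}
print(frame)
```

Note that paste() inserts a space by default, so the new columns are named "data1 flag" and "data2 flag", as in the original loop.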

Another option is to manipulate frame as a data frame and convert to
data.table when you need that functionality (conversion is quick).

In the data table version, frame[,names[i], with=F] is the same as
frame[,names[i], drop=FALSE] (the answer is a list, not a vector).
Normally, it's easier to use [[]] or $ indexing to get this. Also,
frame[i,j] <- something assignment is still a bit buggy for
data.tables.

- Tom

Tom Short



Re: [R] dramatic speed difference in lapply

2010-02-26 Thread Tom Short
I'm sorry, Rob, but that code is dense enough and formatted badly
enough that it's hard to dig through.

You may want to try the data.table package. The development version on
R-forge is pretty fast for grouping operations like this. I'm not sure
if this is what you're really after. It's hard to tell from your
example.

Compare some speeds:

> dat <- data.frame(D=sample(32000:33000, 666000,T),
+   Fid=sample(1:10,666000,T),
+   A=sample(1:5,666000,T))
>
> ### one of your examples
> system.time(ret <- fedb.ddplyWrapper2(dat, c("D", "Fid"),
+ function(x) c(sum(x[,"A"], na.rm=T),
+               sum(x[,"A"], na.rm=T))))
   user  system elapsed
  21.78   14.42   36.35
>
>
> ### data.table
> install.packages("data.table",repos="http://R-Forge.R-project.org")
> library(data.table)
> dt <- as.data.table(dat)
> system.time(ret2 <- dt[, sum(A, na.rm=T), by = "D,Fid"])
   user  system elapsed
   0.27    0.00    0.28
>
>
> ### plyr for comparison, too
> library(plyr)
> system.time(ret3 <- ddply(dat, .(D,Fid), function(x) sum(x$A, na.rm=T)))
   user  system elapsed
  28.94   12.16   41.23

> head(ret)
  [,1] [,2]
1  175  175
2  222  222
3  221  221
4  134  134
5  253  253
6  194  194

> head(ret2)
         D Fid  V1
[1,] 32000   1 228
[2,] 32000   2 209
[3,] 32000   3 182
[4,] 32000   4 180
[5,] 32000   5 181
[6,] 32000   6 222

> head(ret3)
      D Fid  V1
1 32000   1 175
2 32000   2 222
3 32000   3 221
4 32000   4 134
5 32000   5 253
6 32000   6 194
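
When comparing approaches like these, it can help to cross-check the grouped sums on a small hand-made sample first. A base R sketch, with made-up data and the hypothetical name `res`; it is not one of the timed runs above:

```r
# Small made-up sample with the same column names as dat above
small <- data.frame(D   = c(1, 1, 2, 2, 2),
                    Fid = c(1, 1, 1, 2, 2),
                    A   = c(1, NA, 3, 4, 5))

# Sum A within each (D, Fid) combination. The formula interface drops
# rows with NA by default, which matches na.rm = TRUE for a sum.
res <- aggregate(A ~ D + Fid, data = small, FUN = sum)
print(res)
```

aggregate() is far slower than the grouped data.table call on large data, so this is only meant as a correctness check on a subset.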


- Tom


On Fri, Feb 26, 2010 at 2:58 PM, Rob Forler  wrote:
> So I have a function that does lapply's for me based on dimension. Currently
> only works for length(pivotColumns)=2 because I haven't fixed the rbinds. I
> have two versions. One runs WAYYY faster than the other. And I'm not sure
> why.
>
> Fast Version:
>
> fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions,
> ...){
>    lapplyFunctionRecurse <- function(cdata, level=1, ...){
>        if(level==1){
>
> return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
> function(x) lapplyFunctionRecurse(x, level+1, ...)))
>        } else if (level==length(pivotColumns)) {
>            #
> return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T),
> function(x, ...) listNameFunctions(data[x,], ...)))
>            return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]],
> data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T),
> sum(data[cdata,"A"], na.rm=T))))
>        } else {
>            return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
>        }
>    }
>    result = lapplyFunctionRecurse(data, ...)
>    matrix2 <- do.call('rbind', lapply(result, function(x)
> do.call('rbind',x)))
>    return(matrix2)
> }
>
>
> dat <- data.frame(D=sample(32000:33000, 666000,
> T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))
>> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"),
> function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T)));
> proc.time()-temp
>   user  system elapsed
>  4.616   0.006   4.630
> #note in this case the anonymous function I pass in isn't used because I
> hardcode the function into the lapply.
>
> approx 4 seconds
>
> This runs very fast. This runs very slow:
>
> fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){
>    lapplyFunctionRecurse <- function(cdata, level=1, ...){
>        if(level==1){
>
> return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
> function(x) lapplyFunctionRecurse(x, level+1, ...)))
>        } else if (level==length(pivotColumns)) {
>            #this line is different. it essentially calls the function you
> pass in
>            return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x, ...) listNameFunctions(data[x,], ...)))
>        } else {
>            return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
>        }
>    }
>    result = lapplyFunctionRecurse(data, ...)
>    matrix2 <- do.call('rbind', lapply(result, function(x)
> do.call('rbind',x)))
>    return(matrix2)
> }
>
> dat <- data.frame(D=sample(32000:33000, 666000,
> T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))
>> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"),
> function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T)));
> proc.time()-temp
>   user  system elapsed
>  16.346  65.059  81.680
>
>
>
> Can anyone explain to me why there is a 4x time difference? I don't want to
> have to hardcode into the recursion function, but if I have to I will.
>
> Thanks,
> Rob
>
>        [[alternative HTML version deleted]]
>

Re: [R] how to fast extract values from different list elements

2010-02-25 Thread Tom Short
On Thu, Feb 25, 2010 at 4:10 AM, Heym, Peter-Paul  wrote:

> this works fine but it is very slow (since A and B can be very large and I
> have to repeat this about 5000 times). I would like to make this faster using
> e.g. apply or lapply but I didn't get it work using these methods. Does
> anybody know an EFFICIENT or FAST way extract the values from L using the
> values from A and B?

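Instead of looping over L[[A[i]]][B[i]], flatten the list once with
unlist(L) and index it with per-element offsets. (L[A][B] would not do
it: that subsets the list by A and then subsets the resulting list by B.)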
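
A minimal sketch of pairwise extraction without an explicit loop, assuming L is a list of atomic vectors and A and B are equal-length integer index vectors (the original question is only partly quoted, so these are assumptions). The names `lens`, `offsets`, `loop_result` and `fast_result` are made up for the example.

```r
L <- list(c(10, 11, 12), c(20, 21), c(30, 31, 32, 33))
A <- c(3L, 1L, 2L, 3L)   # which list element
B <- c(4L, 2L, 1L, 1L)   # which position inside that element

# Reference result: the element-by-element loop
loop_result <- numeric(length(A))
for (i in seq_along(A)) loop_result[i] <- L[[A[i]]][B[i]]

# Vectorized: flatten once, then add each element's starting offset
lens    <- vapply(L, length, integer(1))
offsets <- c(0L, cumsum(lens))[A]          # values stored before L[[A[i]]]
fast_result <- unlist(L, use.names = FALSE)[offsets + B]

print(identical(loop_result, fast_result))
```

The flattening cost is paid once, so repeating the lookup many times with different A and B only costs a vector index each time.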

- Tom



Re: [R] how to rearrange a dataframe

2010-02-23 Thread Tom Short
Try this:

a <- b <- read.table(textConnection("
1 + name1 1 2 3
2 + name2 5 9 10
2 - name3 56 74 93
1 - name4 65 75 98"), skip=1, header=FALSE)

swapidx <- with(a, (V1 == 2 & V2 == "+") | (V1 == 1 & V2 == "-"))
b[swapidx,] <- b[swapidx, c(1:3,6:4)]

This creates an indexing vector that identifies which rows to swap,
then the 6:4 flips around the fourth through sixth columns.

- Tom

On Tue, Feb 23, 2010 at 5:27 PM, Laura Rodriguez Murillo
 wrote:
> Hi all,
>
> I'd appreciate if anyone can help me with this...
>
> I have a data frame that looks like this:
>
> 1 + name1 1 2 3
> 2 + name2 5 9 10
> 2 - name3 56 74 93
> 1 - name4 65 75 98
>
> I need to rearrange this in a way so that the rows with  "1" in the
> first column, and "-" in the second column; then columns 4 and 6
> should switch places. That is, column 6 would be now column 4 and
> column 4 would be column 6 (column 5 should stay as column 5)
> In the same way, if the first column is "2" and the second is "+",
> then the same rearrangement should be done.
> Rows with the first two entries 1 + or 2 - should stay in the same order.
> This should be done for each row independently.
>
> Thanks a lot for your help!
>
>



Re: [R] Large dataset importing, columns merging and splitting

2010-01-26 Thread Tom Short
If you need more aggregations on the stock (I assume that's what the
first column is), I'd use the data.table package. It allows fast
indexing and merge operations. That's handy if you have other features
of a stock (like company size or industry sector) that you'd like to
include in the aggregation. Like Gabor, I'd probably use chron for
keeping track of the dates.

Here's some code to get you started:

Lines <- "CVX 20070201 9 30 51 73.25 81400 0
CVX 20070201 9 30 51 73.25 100 0
CVX 20070201 9 30 51 73.25 100 0
CVX 20070201 9 30 51 73.25 300 0
CVX 20070201 9 30 51 73.25 81400 0
CVX 20070201 9 40 51 74.25 100 0
CVX 20070201 9 40 52 74.25 100 0
CVX 20070201 9 40 53 74.25 300 0
CVX 20070301 9 30 51 74.25 100 0
CVX 20070301 9 30 51 74.25 100 0
CVX 20070301 9 30 51 74.25 300 0
CVX 20070301 9 30 51 74.25 81400 0
CVX 20070301 9 40 51 74.25 100 0
CVX 20070301 9 40 52 74.25 100 0
CVX 20070301 9 40 53 74.25 300 0
DVX 20070201 9 30 51 73.25 81400 0
DVX 20070201 9 30 51 73.25 100 0
DVX 20070201 9 30 51 73.25 100 0
DVX 20070201 9 30 51 73.25 300 0
DVX 20070201 9 30 51 73.25 81400 0
DVX 20070201 9 40 51 74.25 100 0
DVX 20070201 9 40 52 74.25 100 0
DVX 20070201 9 40 53 74.25 300 0
DVX 20070301 9 30 51 74.25 100 0
DVX 20070301 9 30 51 74.25 100 0
DVX 20070301 9 30 51 74.25 300 0
DVX 20070301 9 30 51 74.25 81400 0
DVX 20070301 9 40 51 74.25 100 0
DVX 20070301 9 40 52 74.25 100 0
DVX 20070301 9 40 53 74.25 300 0"


library(data.table)
library(chron)
dt <- data.table(read.table(textConnection(Lines),
                            colClasses = c("character", "numeric", "numeric",
                                           "numeric", "numeric", "numeric",
                                           "numeric", "numeric"),
                            col.names = c("stock", "date", "h", "m", "s",
                                          "Price", "Volume", "xx")))
# Add the time of day (as a fraction of a day) to the date
dt$date <- as.chron(as.Date(as.character(dt$date), format = "%Y%m%d")) +
  dt$h/24 + dt$m/(60*24) + dt$s/(60*60*24)
# Index of the 5-minute bin; data.table likes integers
dt$roundeddate <- as.integer(floor(as.numeric(dt$date) * (24 * 12)))

dt[,list(meanprice = mean(Price), volume = sum(Volume)), by = "roundeddate"]
dt[,list(meanprice = mean(Price), volume = sum(Volume)),
   by = "stock,roundeddate"]

You'd still probably want to turn the roundeddate back into a real
chron object. If you use aggregation a lot, the development version of
data.table has faster aggregations:
http://r-forge.r-project.org/projects/datatable/
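
The 5-minute binning itself does not depend on data.table or chron. A base R sketch of the same idea on a few made-up trades (one day only, so the date is left out); `trades`, `secs` and `bin` are names chosen for this illustration:

```r
# Made-up trades: hour, minute, second, price, volume
trades <- data.frame(h = c(9, 9, 9), m = c(30, 34, 40), s = c(51, 59, 52),
                     Price = c(73.25, 74.25, 75.25),
                     Volume = c(100, 300, 100))

# Seconds since midnight, then integer division by 300 s = 5 minutes
secs <- trades$h * 3600 + trades$m * 60 + trades$s
trades$bin <- secs %/% 300

# Mean price and total volume per bin
meanprice <- tapply(trades$Price, trades$bin, mean)
volume    <- tapply(trades$Volume, trades$bin, sum)
print(cbind(meanprice, volume))
```

Multiplying a bin index by 300 gives back the bin's start time in seconds since midnight, which is one way to turn the integer key into a readable time again.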

- Tom

On Tue, Jan 26, 2010 at 11:23 AM, Gabor Grothendieck
 wrote:
> Try this using the development version of read.zoo in zoo (which we
> source from the R-Forge on the fly).
>
> We use "NULL" in colClasses for those columns we don't need but in
> col.names we still have to include dummy names for
> them.  Of what is left the index is the first three columns (1:3)
> which we convert to chron class times in FUN and then truncate to 5
> seconds in FUN2.  Finally we use aggregate = mean to average over the
> 5 second intervals.
>
> Lines <- "CVX 20070201 9 30 51 73.25 81400 0
> CVX 20070201 9 30 51 73.25 100 0
> CVX 20070201 9 30 51 73.25 100 0
> CVX 20070201 9 30 51 73.25 300 0
> CVX 20070201 9 30 51 73.25 81400 0
> CVX 20070201 9 40 51 73.25 100 0
> CVX 20070201 9 40 52 73.25 100 0
> CVX 20070201 9 40 53 73.25 300 0"
>
>
> library(zoo)
> source("http://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/*checkout*/pkg/zoo/R/read.zoo.R?rev=611&root=zoo")
> library(chron)
>
> z <- read.zoo(textConnection(Lines),
>        colClasses = c("NULL", "NULL", "numeric", "numeric", "numeric", 
> "numeric",
>                "numeric", "NULL"),
>        col.names = c("V1", "V2", "V3", "V4", "V5", "Price", "Volume", "V8"),
>        index = 1:3,
>        FUN = function(tt) times(paste(tt[,1], tt[,2], tt[,3], sep = ":")),
>        FUN2 = function(tt) trunc(tt, "00:00:05"),
>        aggregate = mean)
>
> The result of running the above is:
>
>> z
>         Price     Volume
> 09:30:50 73.25 32660.0000
> 09:40:50 73.25   166.6667
>
> On Tue, Jan 26, 2010 at 10:48 AM, Manta  wrote:
>>
>> Dear All,
>> I have a large data set that looks like this:
>>
>> CVX 20070201 9 30 51 73.25 81400 0
>> CVX 20070201 9 30 51 73.25 100 0
>> CVX 20070201 9 30 51 73.25 100 0
>> CVX 20070201 9 30 51 73.25 300 0
>>
>> First, I would like to import it by merging column 3 4 and 5, since that is
>> the timestamp. Then, I would like to aggregate the data by splitting them in
>> bins of 5 minutes size, therefore from 93000 up to 93459 etc, givin as
>> output the average price and volume in the 5 minutes bin.
>>
>> Hope this helps,
>> Best,
>>
>> Marco
>> --
>> View this message in context: 
>> http://n4.nabble.com/Large-dataset-importing-columns-merging-and-splitting-tp1294668p1294668.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>

Re: [R] Getting file name from pdf device?

2009-08-01 Thread Tom Short
On Fri, Jul 31, 2009 at 8:49 AM, Rainer M Krug wrote:
> My question: how can I get the filename of the pdf from the device
> before it is closed?

I've also looked for this and couldn't find a way. I had a similar
use, where I wanted to get an R transcript with embedded plots in
emacs (see prettyR for another transcript-with-plots option). What I
did was use dev2bitmap to write out a PNG file. You could do something
similar with dev.copy2pdf to create the pdf after you do the plotting.
You could also use dev2bitmap in this manner to drive ghostscript to
create pdf's for you (I don't know if it'll compress like you want).
Here's what I did:

show <- function(file = paste(tempfile(), ".png", sep = "")) {
    dev2bitmap(file)
    # I do some post-processing in emacs to see the embedded graphic
    cat("[[", file, "]]\n", sep = "")
}

My use case was that plots would be inserted where I used "show" as follows:

plot(sin)
show()  # <-- plot inserted into transcript here
plot(cos)
show("cos.png") # this time, a named local file instead of a temp file
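
If the goal is simply to know the pdf file name, a different route from the one above is to choose the name before opening the device and hand it back to the caller. A minimal sketch; `plot_to_pdf` is a hypothetical helper written for this example, not an existing function:

```r
# Hypothetical helper: opens a pdf device on a known file, runs the
# plotting code, closes the device, and returns the file name.
plot_to_pdf <- function(plot_fun, file = tempfile(fileext = ".pdf")) {
  pdf(file)
  on.exit(dev.off())   # close the device even if plot_fun() fails
  plot_fun()
  invisible(file)
}

f <- plot_to_pdf(function() plot(sin, -pi, pi))
cat("plot written to", f, "\n")
```

Because the name is fixed up front, any post-processing such as compression can be run on the returned path after the device is closed.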

- Tom



Re: [R] Excel Export in a beauty way

2009-06-07 Thread Tom Short
Another useful way to create a formatted Excel file is to write out an
HTML file, but put an XLS extension on it. When Excel reads it, it
will convert it. Users will treat it like an Excel file. This trick
allows you to add formatted titles, table footnotes, links to other
files (pdf graphs for example), and more.

To create HTML, you have several packages that can help you out:
R2HTML, Rpad, hwriter, and xtable. Not everything might convert
properly, so you may have to experiment. Data frames as tables
normally convert nicely.
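
Here is a rough sketch of the trick using only base R, for cases where none of those packages is at hand. `df_to_html_xls` is a made-up helper name; it does no HTML escaping and no styling, and how well Excel converts the result is something to test on your own setup.

```r
# Hypothetical helper: write a data frame as an HTML table to a file
# that carries an .xls extension.
df_to_html_xls <- function(df, file, title = NULL) {
  cell <- function(x, tag) paste0("<", tag, ">", x, "</", tag, ">")
  header <- paste0("<tr>", paste0(cell(names(df), "th"), collapse = ""), "</tr>")
  # One character vector of <td> cells per column, pasted together row-wise
  cols <- unname(lapply(df, function(col) cell(as.character(col), "td")))
  rows <- paste0("<tr>", do.call(paste0, cols), "</tr>")
  writeLines(c("<html><body>",
               if (!is.null(title)) cell(title, "h2"),
               "<table border=\"1\">", header, rows, "</table>",
               "</body></html>"),
             file)
  invisible(file)
}

out <- df_to_html_xls(data.frame(x = 1:2, y = c("a", "b")),
                      tempfile(fileext = ".xls"), title = "Demo table")
html_lines <- readLines(out)
```

The title line shows where formatted headings, footnotes or links could go; anything Excel understands as HTML can be added around the table.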

- Tom

Tom Short



Re: [R] Do you use R for data manipulation?

2009-05-06 Thread Tom Short
Another tool I find useful is Matthew Dowle's data.table package. It
has very fast indexing, can have much lower memory requirements than a
data frame, and has some built-in data manipulation capability.
Especially with a 64-bit OS, you can use this to keep things in memory
where you otherwise would have to use a database.

See here: http://article.gmane.org/gmane.comp.lang.r.packages/282

- Tom
