Re: [Rd] How to create data frame column name in a function

2012-05-03 Thread Rui Barradas
Hello,


pvshankar wrote
 
 Hello all,
 
 I have a data frame with column names s1, s2, s3, ..., s11
 
 I have a function that gets two parameters: one is used as a subscript for
 the column names and another is used as an index into the chosen column.
 
 For example:
 
 my_func <- function(subscr, index)
 {
   if (subscr == 1)
   {
     df$s1[index] <- some_value
   }
 }
 
 The problem is, I do not want to create a bunch of if statements (one for
 each of the 1:11 column names).
 Instead, I want to create the column name in run time based on subscr
 value.
 
 I tried eval(as.name(paste("df$s", subscr, sep = "")))[index] <- some_value
 
 and it complains that object df$s1 is not found.
 
 Could someone please help me with this?
 (Needless to say, I have just started programing in R)
 
 Thanks,
 Shankar
 

Instead of the operator `$`, use the replacement function `[<-` with the right indexes.


cname <- paste("s", subscr, sep = "")
DF[index, cname] <- value

See
?"[<-.data.frame"

(And df is the name of an R function; use something else, as it can be
confusing.)
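Putting the advice together, a self-contained sketch (the data frame, the helper name set_cell, and the values are invented for illustration):

```r
## Hypothetical data frame with columns s1 ... s11, five rows each
DF <- as.data.frame(setNames(replicate(11, numeric(5), simplify = FALSE),
                             paste("s", 1:11, sep = "")))

## Build the column name at run time; "[<-.data.frame" does the assignment.
## Idiomatic R: pass the data frame in and return the modified copy.
set_cell <- function(df, subscr, index, value) {
  cname <- paste("s", subscr, sep = "")
  df[index, cname] <- value
  df
}

DF <- set_cell(DF, 3, 2, 99)
DF$s3[2]  ## 99
```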

Hope this helps,

Rui Barradas



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Proposal: model.data

2012-05-03 Thread Paul Johnson
Greetings:

I'm still working on functions to make it easier for students to
interpret regression predictions.  I am working out a scheme to
more easily create newdata objects (for use in predict functions).
This has been done before in packages like Zelig, Effects,
rms, and others. Nothing is exactly right, though. Stata users keep
flashing their predicted probability tables at me, and I'd like
something like that to use with R.

I'm proposing here a function model.data that receives a regression
and creates a dataframe of raw data from which newdata objects
can be constructed. This follows a suggestion that Bill Dunlap made
to me in response to a question I posted in r-help.

While studying termplot code, I saw the carrier function approach
to deducing the raw predictors.  However, it does not always work.

Here is one problem: termplot mistakes the 10 in log(10 + x1) for a variable.

Example:

dat <- data.frame(x1 = rnorm(100), x2 = rpois(100, lambda = 7))
STDE <- 10
dat$y <- 1.2 * log(10 + dat$x1) + 2.3 * dat$x2 + rnorm(100, sd = STDE)

m1 <- lm(y ~ log(10 + x1) + x2, data = dat)
termplot(m1)

## See the trouble? termplot thinks 10 is the term to plot.

Another problem is that predict(type = "terms") does not always behave
sensibly. RHS terms of a formula that involve nonlinear transformations
are misunderstood as separate terms.

## Example:
dat$y2 <- 1.2 * log(10 + dat$x1) + 2.3 * dat$x1^2 + rnorm(100, sd = STDE)

m2 <- lm(y2 ~ log(10 + x1) + sin(x1), data = dat)
summary(m2)

predict(m2, type = "terms")

## Output:
##   log(10 + x1)     sin(x1)
## 1   1.50051781 -2.04871711
## 2  -0.14707391  0.31131124

What I wish would happen instead is one correct prediction
for each value of x1. This should be the output:

predict(m2, newdata = data.frame(x1 = dat$x1))

##  predict(m2, newdata = data.frame(x1 = dat$x1))
##        1        2        3        4        5        6        7        8
## 17.78563 18.49806 17.50719 19.70093 17.45071 19.69718 18.84137 18.89971

The fix I'm testing now is the following new function, model.data,
which tries to re-create the data object that would be
consistent with a fitted model. This follows a suggestion from
Bill Dunlap in r-help on 2012-04-22.



##' Creates a raw (UNTRANSFORMED) data frame equivalent
##' to the input data that would be required to fit the given model.
##'
##' Unlike model.frame and model.matrix, this does not return transformed
##' variables.
##'
##' @param model A fitted regression model in which the data argument
##' is specified. This function will fail if the model was not fit
##' with the data option.
##' @return A data frame
##' @export
##' @author Paul E. Johnson pauljohn@@ku.edu
##' @example inst/examples/model.data-ex.R
model.data <- function(model){
    fmla <- formula(model)
    allnames <- all.vars(fmla) ## all variable names
    ## indep variables, includes d in poly(x, d)
    ivnames <- all.vars(formula(delete.response(terms(model))))
    ## datOrig: original data frame
    datOrig <- eval(model$call$data, environment(formula(model)))
    if (is.null(datOrig)) stop("model.data: input model has no data frame")
    ## dat: almost right, but includes d in poly(x, d)
    dat <- get_all_vars(fmla, datOrig)
    ## Get rid of d and other non-variable names that are not in datOrig:
    keepnames <- intersect(names(dat), names(datOrig))
    ## Keep only rows actually used in the model fit, and the correct columns
    dat <- dat[row.names(model$model), keepnames]
    ## keep ivnames that exist in datOrig
    attr(dat, "ivnames") <- intersect(ivnames, names(datOrig))
    invisible(dat)
}


This works for the test cases like log(10+x) and so forth:

## Examples:

head(m1.data <- model.data(m1))

head(m2.data <- model.data(m2))

##  head(m1.data <- model.data(m1))
##          y          x1 x2
## 1 18.53846  0.46176539  8
## 2 28.24759  0.09720934  7
## 3 23.88184  0.67602556  9
## 4 23.50130 -0.74877054  8
## 5 25.81714  1.02555255  5
## 6 24.75052 -0.69659539  6
##  head(m2.data <- model.data(m2))
##          y          x1
## 1 18.53846  0.46176539
## 2 28.24759  0.09720934
## 3 23.88184  0.67602556
## 4 23.50130 -0.74877054
## 5 25.81714  1.02555255
## 6 24.75052 -0.69659539


d <- 2
m4 <- lm(y ~ poly(x1, d), data = dat)

head(m4.data <- model.data(m4))

##          y          x1
## 1 18.53846  0.46176539
## 2 28.24759  0.09720934
## 3 23.88184  0.67602556

Another strength of this approach is that the return object has an
attribute "ivnames". If R's termplot used model.data instead of the
carrier functions, this would make for a much tighter set of code.

What flaws do you see in this?

One flaw is that I did not know how to re-construct data from the
parent environment, so I insist that the regression model be fit with a
data argument. Is this necessary, or can one of the R experts help?
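One hedged sketch of how the data-argument requirement might be relaxed: when no data= was supplied, fall back on the environment of the formula, which get_all_vars() searches by default. This is an untested variant of the function above, not part of the proposal itself:

```r
model.data2 <- function(model) {
    fmla <- formula(model)
    datOrig <- eval(model$call$data, environment(fmla))
    if (is.null(datOrig)) {
        ## No data= argument: look the variables up in the environment
        ## where the formula was created (usually the caller's workspace).
        dat <- get_all_vars(fmla)
    } else {
        dat <- get_all_vars(fmla, datOrig)
        ## drop d in poly(x, d) and similar non-variables
        dat <- dat[, intersect(names(dat), names(datOrig)), drop = FALSE]
    }
    ## keep only the rows actually used in the fit
    dat[row.names(model$model), , drop = FALSE]
}
```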

Another possible flaw: I'm keeping only the columns from the data frame
that are needed to re-construct the model.frame, and to match
the rows, I'm using the row.names of the model.frame.

Are there other formulae that 

Re: [Rd] Proposal: model.data

2012-05-03 Thread Brian G. Peterson
On Thu, 2012-05-03 at 10:51 -0500, Paul Johnson wrote:
 If somebody in R Core would like this and think about putting it, or
 something like it, into the base, then many chores involving predicted
 values would become much easier.
 
Why does this need to be in base?  Implement it in a package.
 
If it works, and is additive, people will use it.  Look at 'reshape' or
'xts' or 'Matrix' just to name a few examples of widely used packages.

Regards,

   - Brian



Re: [Rd] Proposal: model.data

2012-05-03 Thread Paul Johnson
Greetings:

On Thu, May 3, 2012 at 11:36 AM, Brian G. Peterson br...@braverock.com wrote:
 On Thu, 2012-05-03 at 10:51 -0500, Paul Johnson wrote:
 If somebody in R Core would like this and think about putting it, or
 something like it, into the base, then many chores involving predicted
 values would become much easier.

 Why does this need to be in base?  Implement it in a package.

 If it works, and is additive, people will use it.  Look at 'reshape' or
 'xts' or 'Matrix' just to name a few examples of widely used packages.


I can't use it to fix termplot unless it is in base.

Or are you suggesting I create my own termplot replacement?


 Regards,

   - Brian





-- 
Paul E. Johnson
Professor, Political Science    Assoc. Director
1541 Lilac Lane, Room 504     Center for Research Methods
University of Kansas               University of Kansas
http://pj.freefaculty.org            http://quant.ku.edu



Re: [Rd] Proposal: model.data

2012-05-03 Thread Brian G. Peterson
On Thu, 2012-05-03 at 12:09 -0500, Paul Johnson wrote:
 Greetings:
 
 On Thu, May 3, 2012 at 11:36 AM, Brian G. Peterson br...@braverock.com 
 wrote:
  On Thu, 2012-05-03 at 10:51 -0500, Paul Johnson wrote:
  If somebody in R Core would like this and think about putting it, or
  something like it, into the base, then many chores involving predicted
  values would become much easier.
 
  Why does this need to be in base?  Implement it in a package.
 
  If it works, and is additive, people will use it.  Look at 'reshape' or
  'xts' or 'Matrix' just to name a few examples of widely used packages.
 
 
 I can't use it to fix termplot unless it is in base.
 
 Or are you suggesting I create my own termplot replacement?

I was suggesting that you create a package that has all the features
that you think it needs.  

If you have a *patch* for termplot that would fix what you perceive to
be its problems, and not break existing code, then the usual method
would be to propose that.  It seems, though, that you are proposing more
significant changes to functionality, and it seems as though that would
run a risk of breaking backwards compatibility, which is usually a bad
idea.

Regards,

   - Brian

-- 
Brian G. Peterson
http://braverock.com/brian/
Ph: 773-459-4973
IM: bgpbraverock



[Rd] loading multiple CSV files into a single data frame

2012-05-03 Thread victor jimenez
Sometimes I have hundreds of CSV files scattered in a directory tree,
resulting from experiments' executions. For instance, giving an example
from my field, I may want to collect the performance of a processor for
several design parameters such as cache size (possible values: 2, 4, 8
and 16) and cache associativity (possible values: direct-mapped, 4-way,
fully-associative). The results of all these experiments will be stored in
a directory tree like:

results
  |-- direct-mapped
  |   |-- 2 -- data.csv
  |   |-- 4 -- data.csv
  |   |-- 8 -- data.csv
  |   |-- 16 -- data.csv
  |-- 4-way
  |   |-- 2 -- data.csv
  |   |-- 4 -- data.csv
...
  |-- fully-associative
  |   |-- 2 -- data.csv
  |   |-- 4 -- data.csv
...

I am developing a package that would allow me to gather all those CSV into
a single data frame. Currently, I just need to execute the following
statement:

dframe <- gather("results/@ASSOC@/@SIZE@/data.csv")

and this command returns a data frame containing the columns ASSOC, SIZE
and all the remaining columns inside the CSV files (in my case the
processor performance), effectively loading all the CSV files into a single
data frame. So, I would get something like:

ASSOC,         SIZE, PERF
direct-mapped,    2,  1.4
direct-mapped,    4,  1.6
direct-mapped,    8,  1.7
direct-mapped,   16,  1.7
4-way,            2,  1.4
4-way,            4,  1.5
...

I would like to ask whether there is any similar functionality already
implemented in R. If so, there is no need to reinvent the wheel :)
If it is not implemented and the R community believes that this feature
would be useful, I would be glad to contribute my code.
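For reference, a minimal sketch of what such a gather() could look like, where "@NAME@" marks a path component that becomes a column. This is a hypothetical illustration, not the package code:

```r
## Sketch: expand the @NAME@ template into a glob, read every matching
## CSV, and prepend one column per @NAME@ taken from the file's path.
gather <- function(template) {
  tparts <- strsplit(template, "/")[[1]]
  vars   <- gsub("@", "", grep("^@.*@$", tparts, value = TRUE))
  files  <- Sys.glob(gsub("@[^@]+@", "*", template))
  do.call(rbind, lapply(files, function(f) {
    parts <- strsplit(f, "/")[[1]]
    meta  <- as.list(parts[grepl("^@.*@$", tparts)])  # path components
    names(meta) <- vars
    cbind(as.data.frame(meta, stringsAsFactors = FALSE), read.csv(f))
  }))
}

## dframe <- gather("results/@ASSOC@/@SIZE@/data.csv")
```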

Thank you,
Victor

P.S: I was not sure whether to submit this question to R-devel or R-help,
but since it may lead to some programming discussion I decided to post it
to R-devel. Please, let me know if it is better to move it to the other
list.



Re: [Rd] loading multiple CSV files into a single data frame

2012-05-03 Thread Gabor Grothendieck
On Thu, May 3, 2012 at 2:07 PM, victor jimenez betaband...@gmail.com wrote:
 Sometimes I have hundreds of CSV files scattered in a directory tree,
 resulting from experiments' executions. For instance, giving an example
 from my field, I may want to collect the performance of a processor for
 several design parameters such as cache size (possible values: 2, 4, 8
 and 16) and cache associativity (possible values: direct-mapped, 4-way,
 fully-associative). The results of all these experiments will be stored in
 a directory tree like:

 results
  |-- direct-mapped
  |       |-- 2 -- data.csv
  |       |-- 4 -- data.csv
  |       |-- 8 -- data.csv
  |       |-- 16 -- data.csv
  |-- 4-way
  |       |-- 2 -- data.csv
  |       |-- 4 -- data.csv
 ...
  |-- fully-associative
  |       |-- 2 -- data.csv
  |       |-- 4 -- data.csv
 ...

 I am developing a package that would allow me to gather all those CSV into
 a single data frame. Currently, I just need to execute the following
 statement:

 dframe <- gather("results/@ASSOC@/@SIZE@/data.csv")

 and this command returns a data frame containing the columns ASSOC, SIZE
 and all the remaining columns inside the CSV files (in my case the
 processor performance), effectively loading all the CSV files into a single
 data frame. So, I would get something like:

 ASSOC,          SIZE, PERF
 direct-mapped,       2,     1.4
 direct-mapped,       4,     1.6
 direct-mapped,       8,     1.7
 direct-mapped,     16,     1.7
 4-way,                   2,     1.4
 4-way,                   4,     1.5
 ...

 I would like to ask whether there is any similar functionality already
 implemented in R. If so, there is no need to reinvent the wheel :)
 If it is not implemented and the R community believes that this feature
 would be useful, I would be glad to contribute my code.


If your csv files all have the same columns and represent time series
then read.zoo in the zoo package can read multiple csv files in at
once using a single read.zoo command producing a single zoo object.

library(zoo)
?read.zoo
vignette("zoo-read")

Also see the other zoo vignettes and help files.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [Rd] The constant part of the log-likelihood in StructTS

2012-05-03 Thread Thomas Lumley
On Thu, May 3, 2012 at 3:36 AM, Mark Leeds marklee...@gmail.com wrote:
 Hi Ravi: As far as I know (well, really read), and Bert et al. can say
 more, the AIC is not dependent on the models being nested as long as the
 sample sizes used are the same when comparing. In some cases, say comparing
 MA(2) and AR(1), you have to be careful with sample size usage, but there is
 no nesting requirement for AIC at least, I'm pretty sure.

This is only partly true.  The expected value of the AIC will behave
correctly even if models are non-nested, but there is no general
guarantee that the standard deviation is small, so AIC need not even
asymptotically lead to optimal model choice for prediction in
arbitrary non-nested models.

Having said that, 'nearly' nested models like these are probably ok.
I believe it's sufficient that all your models are nested in a common
model, with a bound on the degree of freedom difference, but my copy
of Claeskens & Hjort's book on model selection and model averaging is
currently with a student so I can't be definitive.
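A concrete illustration of comparing two non-nested models fit to the same sample with AIC (the data and model forms here are invented for the example):

```r
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * log(abs(x) + 1) + rnorm(100)

## Two non-nested mean structures, same 100 observations
m_lin <- lm(y ~ x)
m_log <- lm(y ~ log(abs(x) + 1))

## AIC is comparable here because both fits use identical data;
## the lower value suggests the better-predicting model.
AIC(m_lin, m_log)
```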


   -thomas

-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland



Re: [Rd] loading multiple CSV files into a single data frame

2012-05-03 Thread victor jimenez
First of all, thank you for the answers. I did not know about zoo. However,
it seems that none of the approaches can do exactly what I want (please
correct me if I am wrong).

Probably, it was not clear in my original question. The CSV files only
contain the performance values. The other two columns (ASSOC and SIZE) are
obtained from the existing values in the directory tree. So, in my opinion,
none of the proposed solutions would work, unless every single data.csv
file contained all three columns (ASSOC, SIZE and PERF).

In my case, my experimentation framework basically outputs a CSV with some
values read from the processor's performance counters (PMCs). For each
cache size and associativity I conduct an experiment, creating a CSV file,
and placing that file into its own directory. I could modify the
experimentation framework, so that it also outputs the cache size and
associativity, but that may not be ideal in some circumstances, and I also
have a significant amount of old results that I want to keep using without
manually fixing the CSV files.

Has anyone else faced such a situation? Any good solutions?

Thank you,
Victor

On Thu, May 3, 2012 at 8:54 PM, Gabor Grothendieck
ggrothendi...@gmail.comwrote:

 On Thu, May 3, 2012 at 2:07 PM, victor jimenez betaband...@gmail.com
 wrote:
  Sometimes I have hundreds of CSV files scattered in a directory tree,
  resulting from experiments' executions. For instance, giving an example
  from my field, I may want to collect the performance of a processor for
  several design parameters such as cache size (possible values: 2, 4, 8
  and 16) and cache associativity (possible values: direct-mapped, 4-way,
  fully-associative). The results of all these experiments will be stored
 in
  a directory tree like:
 
  results
   |-- direct-mapped
   |   |-- 2 -- data.csv
   |   |-- 4 -- data.csv
   |   |-- 8 -- data.csv
   |   |-- 16 -- data.csv
   |-- 4-way
   |   |-- 2 -- data.csv
   |   |-- 4 -- data.csv
  ...
   |-- fully-associative
   |   |-- 2 -- data.csv
   |   |-- 4 -- data.csv
  ...
 
  I am developing a package that would allow me to gather all those CSV
 into
  a single data frame. Currently, I just need to execute the following
  statement:
 
  dframe - gather(results/@ASSOC@/@SIZE@/data.csv)
 
  and this command returns a data frame containing the columns ASSOC, SIZE
  and all the remaining columns inside the CSV files (in my case the
  processor performance), effectively loading all the CSV files into a
 single
  data frame. So, I would get something like:
 
  ASSOC,  SIZE, PERF
  direct-mapped,   2, 1.4
  direct-mapped,   4, 1.6
  direct-mapped,   8, 1.7
  direct-mapped, 16, 1.7
  4-way,   2, 1.4
  4-way,   4, 1.5
  ...
 
  I would like to ask whether there is any similar functionality already
  implemented in R. If so, there is no need to reinvent the wheel :)
  If it is not implemented and the R community believes that this feature
  would be useful, I would be glad to contribute my code.
 

 If your csv files all have the same columns and represent time series
 then read.zoo in the zoo package can read multiple csv files in at
 once using a single read.zoo command producing a single zoo object.

 library(zoo)
 ?read.zoo
 vignette(zoo-read)

 Also see the other zoo vignettes and help files.

 --
 Statistics  Software Consulting
 GKX Group, GKX Associates Inc.
 tel: 1-877-GKX-GROUP
 email: ggrothendieck at gmail.com




Re: [Rd] loading multiple CSV files into a single data frame

2012-05-03 Thread Cook, Malcolm
Victor,

I understand you as follows

The first two columns of the desired combined dataframe are the last two
levels of the pathname to the csv file.

The columns in all the data.csv files are the same; namely, there is only
one column, and it is named PERF.

If so, the following should work (on unix)

do.call(rbind, lapply(Sys.glob('results/*/*/data.csv'), function(path) {
    within(read.csv(path), {
        SIZE  <- basename(dirname(path))
        ASSOC <- basename(dirname(dirname(path)))
    })
}))


On 5/3/12 4:40 PM, victor jimenez betaband...@gmail.com wrote:

First of all, thank you for the answers. I did not know about zoo.
However,
it seems that none approach can do what I exactly want (please, correct me
if I am wrong).

Probably, it was not clear in my original question. The CSV files only
contain the performance values. The other two columns (ASSOC and SIZE) are
obtained from the existing values in the directory tree. So, in my
opinion,
none of the proposed solutions would work, unless every single data.csv
file contained all the three columns (ASSOC, SIZE and PERF).

In my case, my experimentation framework basically outputs a CSV with some
values read from the processor's performance counters (PMCs). For each
cache size and associativity I conduct an experiment, creating a CSV file,
and placing that file into its own directory. I could modify the
experimentation framework, so that it also outputs the cache size and
associativity, but that may not be ideal in some circumstances and I also
have a significant amount of old results and I want keep using them
without
manually fixing the CSV files.

Has anyone else faced such a situation? Any good solutions?

Thank you,
Victor

On Thu, May 3, 2012 at 8:54 PM, Gabor Grothendieck
ggrothendi...@gmail.comwrote:

 On Thu, May 3, 2012 at 2:07 PM, victor jimenez betaband...@gmail.com
 wrote:
  Sometimes I have hundreds of CSV files scattered in a directory tree,
  resulting from experiments' executions. For instance, giving an
example
  from my field, I may want to collect the performance of a processor
for
  several design parameters such as cache size (possible values: 2,
4, 8
  and 16) and cache associativity (possible values: direct-mapped,
4-way,
  fully-associative). The results of all these experiments will be
stored
 in
  a directory tree like:
 
  results
   |-- direct-mapped
   |   |-- 2 -- data.csv
   |   |-- 4 -- data.csv
   |   |-- 8 -- data.csv
   |   |-- 16 -- data.csv
   |-- 4-way
   |   |-- 2 -- data.csv
   |   |-- 4 -- data.csv
  ...
   |-- fully-associative
   |   |-- 2 -- data.csv
   |   |-- 4 -- data.csv
  ...
 
  I am developing a package that would allow me to gather all those CSV
 into
  a single data frame. Currently, I just need to execute the following
  statement:
 
  dframe - gather(results/@ASSOC@/@SIZE@/data.csv)
 
  and this command returns a data frame containing the columns ASSOC,
SIZE
  and all the remaining columns inside the CSV files (in my case the
  processor performance), effectively loading all the CSV files into a
 single
  data frame. So, I would get something like:
 
  ASSOC,  SIZE, PERF
  direct-mapped,   2, 1.4
  direct-mapped,   4, 1.6
  direct-mapped,   8, 1.7
  direct-mapped, 16, 1.7
  4-way,   2, 1.4
  4-way,   4, 1.5
  ...
 
  I would like to ask whether there is any similar functionality already
  implemented in R. If so, there is no need to reinvent the wheel :)
  If it is not implemented and the R community believes that this
feature
  would be useful, I would be glad to contribute my code.
 

 If your csv files all have the same columns and represent time series
 then read.zoo in the zoo package can read multiple csv files in at
 once using a single read.zoo command producing a single zoo object.

 library(zoo)
 ?read.zoo
 vignette(zoo-read)

 Also see the other zoo vignettes and help files.

 --
 Statistics  Software Consulting
 GKX Group, GKX Associates Inc.
 tel: 1-877-GKX-GROUP
 email: ggrothendieck at gmail.com




Re: [Rd] loading multiple CSV files into a single data frame

2012-05-03 Thread Simon Urbanek

On May 3, 2012, at 5:40 PM, victor jimenez wrote:

 First of all, thank you for the answers. I did not know about zoo. However,
 it seems that none approach can do what I exactly want (please, correct me
 if I am wrong).
 
 Probably, it was not clear in my original question. The CSV files only
 contain the performance values. The other two columns (ASSOC and SIZE) are
 obtained from the existing values in the directory tree. So, in my opinion,
 none of the proposed solutions would work, unless every single data.csv
 file contained all the three columns (ASSOC, SIZE and PERF).
 
 In my case, my experimentation framework basically outputs a CSV with some
 values read from the processor's performance counters (PMCs). For each
 cache size and associativity I conduct an experiment, creating a CSV file,
 and placing that file into its own directory. I could modify the
 experimentation framework, so that it also outputs the cache size and
 associativity, but that may not be ideal in some circumstances and I also
 have a significant amount of old results and I want keep using them without
 manually fixing the CSV files.
 

You don't need to touch the CSV files, simply add values at load time - this is 
all easily doable in one line ;)

> do.call(rbind, lapply(Sys.glob("*/*/data.csv"), function(d)
+   cbind(read.csv(d), as.data.frame(t(strsplit(d, "/")[[1]])))))
  A B V1 V2       V3
1 1 2  1  a data.csv
2 3 4  1  a data.csv
3 1 2  1  b data.csv
4 3 4  1  b data.csv
5 1 2  2  a data.csv
6 3 4  2  a data.csv


 Has anyone else faced such a situation? Any good solutions?
 
 Thank you,
 Victor
 
 On Thu, May 3, 2012 at 8:54 PM, Gabor Grothendieck
 ggrothendi...@gmail.comwrote:
 
 On Thu, May 3, 2012 at 2:07 PM, victor jimenez betaband...@gmail.com
 wrote:
 Sometimes I have hundreds of CSV files scattered in a directory tree,
 resulting from experiments' executions. For instance, giving an example
 from my field, I may want to collect the performance of a processor for
 several design parameters such as cache size (possible values: 2, 4, 8
 and 16) and cache associativity (possible values: direct-mapped, 4-way,
 fully-associative). The results of all these experiments will be stored
 in
 a directory tree like:
 
 results
 |-- direct-mapped
 |   |-- 2 -- data.csv
 |   |-- 4 -- data.csv
 |   |-- 8 -- data.csv
 |   |-- 16 -- data.csv
 |-- 4-way
 |   |-- 2 -- data.csv
 |   |-- 4 -- data.csv
 ...
 |-- fully-associative
 |   |-- 2 -- data.csv
 |   |-- 4 -- data.csv
 ...
 
 I am developing a package that would allow me to gather all those CSV
 into
 a single data frame. Currently, I just need to execute the following
 statement:
 
 dframe - gather(results/@ASSOC@/@SIZE@/data.csv)
 
 and this command returns a data frame containing the columns ASSOC, SIZE
 and all the remaining columns inside the CSV files (in my case the
 processor performance), effectively loading all the CSV files into a
 single
 data frame. So, I would get something like:
 
 ASSOC,  SIZE, PERF
 direct-mapped,   2, 1.4
 direct-mapped,   4, 1.6
 direct-mapped,   8, 1.7
 direct-mapped, 16, 1.7
 4-way,   2, 1.4
 4-way,   4, 1.5
 ...
 
 I would like to ask whether there is any similar functionality already
 implemented in R. If so, there is no need to reinvent the wheel :)
 If it is not implemented and the R community believes that this feature
 would be useful, I would be glad to contribute my code.
 
 
 If your csv files all have the same columns and represent time series
 then read.zoo in the zoo package can read multiple csv files in at
 once using a single read.zoo command producing a single zoo object.
 
 library(zoo)
 ?read.zoo
 vignette(zoo-read)
 
 Also see the other zoo vignettes and help files.
 
 --
 Statistics  Software Consulting
 GKX Group, GKX Associates Inc.
 tel: 1-877-GKX-GROUP
 email: ggrothendieck at gmail.com
 
 


[Rd] Setting up a windows system for rcpp

2012-05-03 Thread Owe Jessen
I am running into a wall getting my system to work with Rcpp and inline.
Following Dirk's advice on Stack Overflow, I hope someone is able to help
me.

My steps were to install MinGW 32 bit first, then installing Rtools, I 
disabled MinGW's entry in the PATH.

I am trying to get the following code to work:

library(Rcpp)
library(inline)

body <- '
NumericVector xx(x);
return wrap( std::accumulate( xx.begin(), xx.end(), 0.0));'

add <- cxxfunction(signature(x = "numeric"), body, plugin = "Rcpp",
verbose = TRUE)

x <- 1
y <- 2
res <- add(c(x, y))
res


I get the following error messages:

setting environment variables:
PKG_LIBS =  C:/Users/Owe/Documents/R/win-library/2.15/Rcpp/lib/x64/libRcpp.a

LinkingTo : Rcpp
CLINK_CPPFLAGS =  -IC:/Users/Owe/Documents/R/win-library/2.15/Rcpp/include

Program source :

1 :
2 : // includes from the plugin
3 :
4 : #include <Rcpp.h>
5 :
6 :
7 : #ifndef BEGIN_RCPP
8 : #define BEGIN_RCPP
9 : #endif
   10 :
   11 : #ifndef END_RCPP
   12 : #define END_RCPP
   13 : #endif
   14 :
   15 : using namespace Rcpp;
   16 :
   17 :
   18 : // user includes
   19 :
   20 :
   21 : // declarations
   22 : extern "C" {
   23 : SEXP file10bc7da0783e( SEXP x) ;
   24 : }
   25 :
   26 : // definition
   27 :
   28 : SEXP file10bc7da0783e( SEXP x ){
   29 : BEGIN_RCPP
   30 :
   31 : NumericVector xx(x);
   32 : return wrap( std::accumulate( xx.begin(), xx.end(), 0.0));
   33 : END_RCPP
   34 : }
   35 :
   36 :
Compilation argument:
  C:/R_curr/R_2_15_0/bin/x64/R CMD SHLIB file10bc7da0783e.cpp 2> 
file10bc7da0783e.cpp.err.txt
g++ -m64 -IC:/R_curr/R_2_15_0/include -DNDEBUG
-IC:/Users/Owe/Documents/R/win-library/2.15/Rcpp/include 
-Id:/RCompile/CRANpkg/extralibs64/local/include -O2 -Wall  -mtune=core2 
-c file10bc7da0783e.cpp -o file10bc7da0783e.o
g++ -m64 -shared -s -static-libgcc -o file10bc7da0783e.dll tmp.def 
file10bc7da0783e.o 
C:/Users/Owe/Documents/R/win-library/2.15/Rcpp/lib/x64/libRcpp.a 
-Ld:/RCompile/CRANpkg/extralibs64/local/lib/x64 
-Ld:/RCompile/CRANpkg/extralibs64/local/lib -LC:/R_curr/R_2_15_0/bin/x64 -lR
cygwin warning:
   MS-DOS style path detected: C:/R_curr/R_2_15_0/etc/x64/Makeconf
   Preferred POSIX equivalent is: /cygdrive/c/R_curr/R_2_15_0/etc/x64/Makeconf
   CYGWIN environment variable option nodosfilewarning turns off this warning.
   Consult the user's guide for more details about POSIX paths:
 http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
Cannot export Rcpp::Vector<14>::update(): symbol not defined
Cannot export Rcpp::Vector<14>::~Vector(): symbol not defined
Cannot export Rcpp::Vector<14>::~Vector(): symbol not defined
Cannot export typeinfo for Rcpp::VectorBase<14, true, Rcpp::Vector<14> >: 
symbol not defined
Cannot export typeinfo for Rcpp::Vector<14>: symbol not defined
Cannot export typeinfo for Rcpp::traits::expands_to_logical__impl<14>: symbol 
not defined
Cannot export typeinfo for Rcpp::RObject: symbol not defined
Cannot export typeinfo for Rcpp::internal::eval_methods<14>: symbol not defined
Cannot export typeinfo for std::exception: symbol not defined
Cannot export typeinfo name for Rcpp::VectorBase<14, true, Rcpp::Vector<14> >: 
symbol not defined
Cannot export typeinfo name for Rcpp::Vector<14>: symbol not defined
Cannot export typeinfo name for Rcpp::traits::expands_to_logical__impl<14>: 
symbol not defined
Cannot export typeinfo name for Rcpp::RObject: symbol not defined
Cannot export typeinfo name for Rcpp::internal::eval_methods<14>: symbol not 
defined
Cannot export typeinfo name for std::exception: symbol not defined
Cannot export vtable for Rcpp::Vector<14>: symbol not defined
Cannot export _file10bc7da0783e: symbol not defined
file10bc7da0783e.o:file10bc7da0783e.cpp:(.text+0x1a4): undefined reference to 
`SEXPREC* Rcpp::internal::r_true_cast<14>(SEXPREC*)'
file10bc7da0783e.o:file10bc7da0783e.cpp:(.text+0x1c9): undefined reference to 
`Rcpp::RObject::setSEXP(SEXPREC*)'
file10bc7da0783e.o:file10bc7da0783e.cpp:(.text+0x244): undefined reference to 
`double* Rcpp::internal::r_vector_start<14, double>(SEXPREC*)'
file10bc7da0783e.o:file10bc7da0783e.cpp:(.text+0x27c): undefined reference to 
`Rcpp::RObject::~RObject()'
file10bc7da0783e.o:file10bc7da0783e.cpp:(.text+0x389): undefined reference to 
`Rcpp::RObject::~RObject()'
file10bc7da0783e.o:file10bc7da0783e.cpp:(.text+0x420): undefined reference to 
`forward_exception_to_r(std::exception const&)'
file10bc7da0783e.o:file10bc7da0783e.cpp:(.text$_ZN4Rcpp6VectorILi14EED1Ev[Rcpp::Vector<14>::~Vector()]+0x13):
 undefined reference to `Rcpp::RObject::~RObject()'
file10bc7da0783e.o:file10bc7da0783e.cpp:(.text$_ZN4Rcpp6VectorILi14EE6updateEv[Rcpp::Vector<14>::update()]+0xd):
 undefined reference to `double* Rcpp::internal::r_vector_start<14, 
double>(SEXPREC*)'
file10bc7da0783e.o:file10bc7da0783e.cpp:(.text$_ZN4Rcpp6VectorILi14EED0Ev[Rcpp::Vector<14>::~Vector()]+0x13):
 undefined reference to `Rcpp::RObject::~RObject()'
collect2: ld returned 1 exit status

ERROR(s) during compilation: 

Re: [Rd] Setting up a windows system for rcpp

2012-05-03 Thread Dirk Eddelbuettel

On 4 May 2012 at 00:07, Owe Jessen wrote:
| I am running into a wall getting my system to work with rcpp and inline. 
| Following Dirk's advice on stackoverflow, I hope someone is able to help 
| me.

There is a dedicated mailing list for Rcpp:  rcpp-devel.   

Please let us try to continue the discussion over there. Subscription is
required as on some other R lists, so please subscribe before posting.


In general, you need Rtools correctly set up. If and when you compile a basic
R package (also containing C or C++ files) from sources, you should be fine.

A decent 60+ page tutorial is available at:

  
http://howtomakeanrpackage.pbworks.com/f/How_To_Make_An_R_Package-v1.14-01-11-10.pdf
 
Once you have that sorted out, working with Rcpp and inline should just
work as it does on other operating systems.

| My steps were to install MinGW 32 bit first, then installing Rtools, I 
| disabled MinGW's entry in the PATH.

What do you mean by "MinGW's path entry disabled"?  You need mingw.
 
| I am trying to get the following code to work:
| 
| library(Rcpp)
| library(inline)
| 
| body <- '
| NumericVector xx(x);
| return wrap( std::accumulate( xx.begin(), xx.end(), 0.0));'
| 
| add <- cxxfunction(signature(x = "numeric"), body, plugin = "Rcpp", 
| verbose=T)
| 
| x <- 1
| y <- 2
| res <- add(c(x, y))
| res
| 
| 
| I get the following error messages:
| 
| setting environment variables:
| PKG_LIBS =  C:/Users/Owe/Documents/R/win-library/2.15/Rcpp/lib/x64/libRcpp.a
| 
| LinkingTo : Rcpp
| CLINK_CPPFLAGS =  -IC:/Users/Owe/Documents/R/win-library/2.15/Rcpp/include
| 
| Program source :
| 
| 1 :
| 2 : // includes from the plugin
| 3 :
| 4 : #include <Rcpp.h>
| 5 :
| 6 :
| 7 : #ifndef BEGIN_RCPP
| 8 : #define BEGIN_RCPP
| 9 : #endif
|10 :
|11 : #ifndef END_RCPP
|12 : #define END_RCPP
|13 : #endif
|14 :
|15 : using namespace Rcpp;
|16 :
|17 :
|18 : // user includes
|19 :
|20 :
|21 : // declarations
|22 : extern "C" {
|23 : SEXP file10bc7da0783e( SEXP x) ;
|24 : }
|25 :
|26 : // definition
|27 :
|28 : SEXP file10bc7da0783e( SEXP x ){
|29 : BEGIN_RCPP
|30 :
|31 : NumericVector xx(x);
|32 : return wrap( std::accumulate( xx.begin(), xx.end(), 0.0));
|33 : END_RCPP
|34 : }
|35 :
|36 :
| Compilation argument:
|   C:/R_curr/R_2_15_0/bin/x64/R CMD SHLIB file10bc7da0783e.cpp 2> file10bc7da0783e.cpp.err.txt
| g++ -m64 -IC:/R_curr/R_2_15_0/include -DNDEBUG
-IC:/Users/Owe/Documents/R/win-library/2.15/Rcpp/include 
-Id:/RCompile/CRANpkg/extralibs64/local/include -O2 -Wall  -mtune=core2 
-c file10bc7da0783e.cpp -o file10bc7da0783e.o

Looks like compilation worked.

| g++ -m64 -shared -s -static-libgcc -o file10bc7da0783e.dll tmp.def 
file10bc7da0783e.o 
C:/Users/Owe/Documents/R/win-library/2.15/Rcpp/lib/x64/libRcpp.a 
-Ld:/RCompile/CRANpkg/extralibs64/local/lib/x64 
-Ld:/RCompile/CRANpkg/extralibs64/local/lib -LC:/R_curr/R_2_15_0/bin/x64 -lR
| cygwin warning:
|MS-DOS style path detected: C:/R_curr/R_2_15_0/etc/x64/Makeconf
|Preferred POSIX equivalent is: /cygdrive/c/R_curr/R_2_15_0/etc/x64/Makeconf
|CYGWIN environment variable option "nodosfilewarning" turns off this 
warning.
|Consult the user's guide for more details about POSIX paths:
|  http://cygwin.com/cygwin-ug-net/using.html#using-pathnames

That is just noise and can be ignored.

The rest is bad:

| Cannot export Rcpp::Vector<14>::update(): symbol not defined
| Cannot export Rcpp::Vector<14>::~Vector(): symbol not defined
| Cannot export Rcpp::Vector<14>::~Vector(): symbol not defined
| Cannot export typeinfo for Rcpp::VectorBase<14, true, Rcpp::Vector<14> >: symbol not defined
| Cannot export typeinfo for Rcpp::Vector<14>: symbol not defined
| Cannot export typeinfo for Rcpp::traits::expands_to_logical__impl<14>: symbol not defined
| Cannot export typeinfo for Rcpp::RObject: symbol not defined
| Cannot export typeinfo for Rcpp::internal::eval_methods<14>: symbol not defined
| Cannot export typeinfo for std::exception: symbol not defined
| Cannot export typeinfo name for Rcpp::VectorBase<14, true, Rcpp::Vector<14> >: symbol not defined
| Cannot export typeinfo name for Rcpp::Vector<14>: symbol not defined
| Cannot export typeinfo name for Rcpp::traits::expands_to_logical__impl<14>: symbol not defined
| Cannot export typeinfo name for Rcpp::RObject: symbol not defined
| Cannot export typeinfo name for Rcpp::internal::eval_methods<14>: symbol not defined
| Cannot export typeinfo name for std::exception: symbol not defined
| Cannot export vtable for Rcpp::Vector<14>: symbol not defined
| Cannot export _file10bc7da0783e: symbol not defined
| file10bc7da0783e.o:file10bc7da0783e.cpp:(.text+0x1a4): undefined reference to `SEXPREC* Rcpp::internal::r_true_cast<14>(SEXPREC*)'
| file10bc7da0783e.o:file10bc7da0783e.cpp:(.text+0x1c9): undefined reference to `Rcpp::RObject::setSEXP(SEXPREC*)'
| 
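Stripped of the Rcpp machinery, the inline body being compiled above only sums a vector with std::accumulate; a standalone C++ sketch of that core (no R headers, illustration only -- the real snippet receives the vector from R as a SEXP and wraps the result back):

```cpp
#include <numeric>
#include <vector>

// The core of the inline body: sum the elements, starting from 0.0.
// The 0.0 initial value matters -- an integer 0 would truncate the sum.
double sum_vec(const std::vector<double>& xx) {
    return std::accumulate(xx.begin(), xx.end(), 0.0);
}
```

With the thread's inputs x <- 1 and y <- 2, add(c(x, y)) computes the same sum as sum_vec({1.0, 2.0}).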

Re: [Rd] The constant part of the log-likelihood in StructTS

2012-05-03 Thread Ravi Varadhan
Thanks, Tom, for the reply, as well as for the reference to Claeskens & Hjort.

Ravi

From: Thomas Lumley [tlum...@uw.edu]
Sent: Thursday, May 03, 2012 4:41 PM
To: Mark Leeds
Cc: Ravi Varadhan; r-devel@r-project.org
Subject: Re: [Rd] The constant part of the log-likelihood in StructTS

On Thu, May 3, 2012 at 3:36 AM, Mark Leeds marklee...@gmail.com wrote:
 Hi Ravi: As far as I know (well, really read), and Bert et al. can say
 more, the AIC does not depend on the models being nested as long as the
 sample sizes used are the same when comparing. In some cases, say comparing
 MA(2) and AR(1), you have to be careful with sample-size usage, but there is
 no nesting requirement for AIC, at least I'm pretty sure.

This is only partly true.  The expected value of the AIC will behave
correctly even if models are non-nested, but there is no general
guarantee that the standard deviation is small,  so AIC need not even
asymptotically lead to optimal model choice for prediction in
arbitrary non-nested models.
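For readers following along, the criterion being compared here is Akaike's information criterion, in the standard notation

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
```

where k is the number of estimated parameters and \hat{L} the maximized likelihood; for models fitted to the same data, the smaller AIC is preferred.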

Having said that, 'nearly' nested models like these are probably ok.
I believe it's sufficient that all your models are nested in a common
model, with a bound on the degrees-of-freedom difference, but my copy
of Claeskens & Hjort's book on model selection and model averaging is
currently with a student, so I can't be definitive.


   -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

2012-05-03 Thread Jeff Ryan
A bit late and possibly tangential. 

The mmap package has something called struct() which is really a row-wise array 
of heterogeneous columns.

As Simon and others have pointed out, R has no way to handle this natively, but 
mmap does provide a very measurable performance gain by orienting rows together 
in memory (mapped memory, to be specific).  Since it is all outside of R, so to 
speak, it (mmap) even supports many non-native types, from bit vectors to 
64-bit ints, with the applicable conversion caveats. 

example(struct) shows some performance gains with this approach. 

There are even some crude methods to convert as-is data.frames to mmap struct 
objects directly (hint: as.mmap).

Again, likely not enough to shoehorn into your effort, but worth a look to see 
if it might be useful, and/or see the C design underlying it. 
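The row-oriented layout being described -- fixed-width records packed back to back in one flat buffer, as a memory-mapped file would hold them -- can be sketched in C++ (invented names; this is not mmap's actual C API):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// One fixed-width record per row; rows sit contiguously in a flat byte
// buffer, so row i lives at byte offset i * sizeof(Record).
#pragma pack(push, 1)
struct Record {
    int32_t id;
    double  value;
};
#pragma pack(pop)

// Copy row i straight out of the buffer, as a mapped file would be read.
Record read_row(const std::vector<unsigned char>& buf, std::size_t i) {
    Record r;
    std::memcpy(&r, buf.data() + i * sizeof(Record), sizeof(Record));
    return r;
}

// Write row i into the buffer, growing it if needed.
void write_row(std::vector<unsigned char>& buf, std::size_t i, Record r) {
    if (buf.size() < (i + 1) * sizeof(Record))
        buf.resize((i + 1) * sizeof(Record));
    std::memcpy(buf.data() + i * sizeof(Record), &r, sizeof(Record));
}
```

Because a whole row is one contiguous read, row access touches a single cache line or page, which is the performance gain described above.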

Best,
Jeff

Jeffrey Ryan|Founder|jeffrey.r...@lemnica.com

www.lemnica.com

On May 1, 2012, at 1:44 PM, Antonio Piccolboni anto...@piccolboni.info wrote:

 On Tue, May 1, 2012 at 11:29 AM, Simon Urbanek
 simon.urba...@r-project.org wrote:
 
 
 On May 1, 2012, at 1:26 PM, Antonio Piccolboni anto...@piccolboni.info
 wrote:
 
 It seems like people need to hear more context, happy to provide it. I am
 implementing a serialization format (typedbytes, HADOOP-1722 if people
 want
 the gory details) to make R and Hadoop interoperate better (RHadoop
 project, package rmr). It is a row-first format and it's already
 implemented as a C extension for R for lists and atomic vectors, where
 each
 element  of a vector is a row. I need to extend it to accept data frames
 and I was wondering if I can use the existing C code by converting a data
 frame to a list of its rows. It sounds like the answer is that it is not
 a
 good idea,
 
 Just think about it -- data frames are lists of *columns* because the type
 of each column is fixed. Treating them row-wise is extremely inefficient,
 because you can't use any vector type to represent such thing (other than a
 generic vector containing vectors of length 1).
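Simon's column-wise point can be sketched in C++ (hypothetical types, for illustration only): a data frame is a struct of equal-length homogeneous columns, and splitting it by row means manufacturing one heterogeneous record per row:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A data.frame-like container is a struct of equal-length columns, not an
// array of row structs: each column stays one homogeneous, contiguous vector.
struct Frame {
    std::vector<int>         x;   // column "x"
    std::vector<double>      y;   // column "y"
    std::vector<std::string> id;  // column "id"
    std::size_t nrow() const { return x.size(); }
};

// "Splitting by row" forces one tiny heterogeneous record per row --
// the per-row allocation overhead the thread is measuring.
struct Row { int x; double y; std::string id; };

Row row_of(const Frame& f, std::size_t i) {
    return Row{f.x[i], f.y[i], f.id[i]};
}
```

Extracting all rows allocates nrow() separate Row objects, whereas the Frame itself holds only three vectors regardless of row count.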
 
 
 Thanks, let's say this together with the experiments and other converging
 opinions lays the question to rest.
 
 
 that's helpful too in a way because it restricts the options. I
 thought I may be missing a simple primitive, like a t() for data frames
 (that doesn't coerce to matrix).
 
 See above - I think you are misunderstanding data frames - t() makes no
 sense for data frames.
 
 
 I think you are misunderstanding my use of t(). Thanks
 
 
 Antonio
 
 
 
 Cheers,
 Simon
 
 
 
 On Tue, May 1, 2012 at 5:46 AM, Prof Brian Ripley rip...@stats.ox.ac.uk
 wrote:
 
 On 01/05/2012 00:28, Antonio Piccolboni wrote:
 
 Hi,
 I was wondering if there is anything more efficient than split to do
 the
 kind of conversion in the subject. If I create a data frame as in
 
  system.time({fd = data.frame(x = 1:2000, y = rnorm(2000), id = paste("x",
  1:2000, sep = ""))})
 user  system elapsed
 0.004   0.000   0.004
 
 and then I try to split it
 
 system.time(split(fd, 1:nrow(fd)))
 
 user  system elapsed
 0.333   0.031   0.415
 
 
 You will be quick to notice the roughly two orders of magnitude
 difference
 in time between creation and conversion. Granted, it's not written
 anywhere
 
 
 Unsurprising when you create three orders of magnitude more data frames,
 is it?  That's a list of 2000 data frames.  Try
 
 system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id =
  paste0("x", i)))
 
 
 
 that they should be similar but the latter seems interpreter-slow to me
  (split is implemented with a lapply in the data frame case). There is also
  a
 memory issue when I hit about 2 elements (allocating 3GB when
 interrupted). So before I resort to Rcpp, despite the electrifying
 feeling
 of approaching the bare metal and for the sake of getting things done,
 I
 thought I would ask the experts. Thanks
 
 
 You need to re-think your data structures: 1-row data frames are not
 sensible.
 
 
 
 
 Antonio
 
 
 
 
 
 
 --
 Brian D. Ripley,  rip...@stats.ox.ac.uk
  Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
 University of Oxford, Tel:  +44 1865 272861 (self)
 1 South Parks Road, +44 1865 272866 (PA)
 Oxford OX1 3TG, UKFax:  +44 1865 272595
 
 
 
 
 
 
 
 
 
