[R] Time Series prediction

2013-05-27 Thread Giovanni Azua
Hello,

I would like to use a parametric TS model and predictor as benchmark to
compare against other ML methods I'm employing. I currently build a simple
e.g. ARIMA model using the convenient auto.arima function like this:

library(forecast)
df <- read.table("/Users/bravegag/data/myts.dat")
# btw my test data doesn't have seasonality but periodicity so the value
# 2 is arbitrarily set, using a freq of yearly or 1 would make unhappy some
# R ts functions
tsdata <- ts(df$signal, freq=2)
arimamodel <- auto.arima(tsdata, max.p=15, max.q=10, stationary=FALSE,
ic="bic", stepwise=TRUE, seasonal=FALSE, parallel=FALSE, num.cores=4,
trace=TRUE, allowdrift=TRUE)
arimapred <- forecast.Arima(arimamodel, h=20)
plot(arimapred)

The problem is that the forecast.Arima function is apparently doing a "free
run" i.e. it uses the forecast(t+1) value as input to compute forecast(t+2)
and I'm instead interested in a prediction mode where it always use the
observed tsdata(t+1) value to predict forecast(t+2), the observed
tsdata(t+2) to predict forecast(t+3) and so on.

Can anyone please advice how to achieve this?

TIA,
Best regards,
Giovanni

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Text mining

2013-01-26 Thread Giovanni Azua
Hi Steve,

IMO this problem does not need a classifier but rather a database and a
simple query. I would just build a database with all city names including
the geo information, and then say whether it is north or south exactly. 

If there was such a "rule" (which I doubt) I would expect it to have many
exceptions and therefore a bunch of false-positives on both sides. Why
overcomplicate a simple problem? 

HTH,
Ciao,
Giovanni

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Steve Stephenson
Sent: Saturday, January 26, 2013 10:08 PM
To: r-help@r-project.org
Subject: [R] Text mining

Hallo to everybody,
I would like to perform an analysis but I don't know how to proceed and
whether R packages are available for my purpose or not. Therefore I'm here
to request your support.
*The idea is the following:* I noticed that the names of the towns and
villages in northern Italy most of the time sound differently from names of
cities based on southern Italy. Just to give you an idea "Caronno
Pertusella" is a northern Italy village while Frascati is a center Italy
town. Most of the time I am able to recognize where the town is located just
hearing the name but I cannot say why, that is to say that I didn't find a
"rule".
What I would like to do is to find a classification rule/engine that is able
to "locate" the city starting from its name. *I think the classification
method should be based on the sequence of letters belonging to the town's
name*. But this is just an intuition not yet formalized!
I know that mine is a strange request and idea, anyway advices are very
appreciated and welcome!
Many thanks in advance to all.

Steve



--
View this message in context:
http://r.789695.n4.nabble.com/Text-mining-tp4656732.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] use subset to trim data but include last per category

2012-09-09 Thread Giovanni Azua
Hello,

This solves my problem in a horribly inelegant way that works:

df <- data.frame(n=newInput$n, iter=newInput$iter, Error=newInput$Error, 
Duality_Gap=newInput$Duality, Runtime=newInput$Acc)
df_last <- aggregate(x=df$iter, by=list(df$n), FUN=max)
names(df_last)[names(df_last)=="Group.1"] <- "n"
names(df_last)[names(df_last)=="x"] <- "iter"
# n  iter
#1 1000  2518
#2 2000  5700
#3 3000 10026
#4 4000 13916
#5 5000 17962

df$last <- FALSE
df$last[df$n == 1000 & df$iter == 2518] <- TRUE
df$last[df$n == 2000 & df$iter == 5700] <- TRUE
df$last[df$n == 3000 & df$iter == 10026] <- TRUE
df$last[df$n == 4000 & df$iter == 13916] <- TRUE
df$last[df$n == 5000 & df$iter == 17962] <- TRUE

df <- subset(df, (iter %% 500 == 0) | (df$last == TRUE))

How can I do the same without hardwiring these numbers?

TIA,
Best regards,
Giovanni
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] use subset to trim data but include last per category

2012-09-09 Thread Giovanni Azua
Hi Jeff,

Thanks for your help, but this doesn't work, there are two problems. First and 
most important I need to keep the last _per category_ where my category is n 
and not the last globally. Second, there seems to be an issue with the subset 
variation that ends up not filtering anything ... but this is a minor thing.

Best.
Giovanni

On Sep 9, 2012, at 5:59 PM, Jeff Newmiller wrote:

> dfthin <- df[ c(which(iter %% 500 == 0),nrow(df) ]
> 
> or
> 
> dfthin <- subset(df, (iter %% 500 == 0) | (seq.int(nrow(df)==nrow(df)))
> 
> N.B. You should avoid using the name "df" for your variables, because it is 
> the name of a built-in function that you are hiding by doing so. Others may 
> be confused, and eventually you may want to use that function yourself. One 
> solution is to use DF for your variables... another is to use more 
> descriptive names.
> ---
> Jeff NewmillerThe .   .  Go Live...
> DCN:Basics: ##.#.   ##.#.  Live Go...
>  Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/BatteriesO.O#.   #.O#.  with
> /Software/Embedded Controllers)   .OO#.   .OO#.  rocks...1k
> ------- 
> Sent from my phone. Please excuse my brevity.
> 
> Giovanni Azua  wrote:
> 
>> Hello,
>> 
>> I bumped into the following funny use-case. I have too much data for a
>> given plot. I have the following data frame df: 
>> 
>>> str(df)
>> 'data.frame':5015 obs. of  5 variables:
>> $ n  : Factor w/ 5 levels "1000","2000",..: 1 1 1 1 1 1 1 1 1 1
>> ...
>> $ iter   : int  10 20 30 40 50 60 70 80 90 100 ...
>> $ Error  : num  1.05e-02 1.24e-03 3.67e-04 1.08e-04 4.05e-05 ...
>> $ Duality_Gap: num  20080 3789 855 443 321 ...
>> $ Runtime: num  0.00536 0.01353 0.01462 0.01571 0.01681 ...
>> 
>> But if I plot e.g. Runtime vs log(Duality Gap) I have too many
>> observations due to taking a snapshot every 10 iterations rather than
>> say 500 and the plot looks very cluttered. So I would like to trim the
>> data frame including only those records for which iter is multiple of
>> 500 and so I do this:
>> 
>> df <- subset(df, iter %% 500 == 0)
>> 
>> This gives me almost exactly what I need except that the last and most
>> important Duality Gap observations are of course gone due to the
>> filtering ... I would like to change the subset clause to be iter %%
>> 500 _or_ the record is the last per n (n is my problem size and
>> category in this case) ... how can I do that?
>> 
>> I thought of adding a new column that flags whether a given row is the
>> last element per category as "last" Boolean but this is a bit too
>> complicated .. is there a simpler condition construct that can be used
>> with the subset command?
>> 
>> TIA,
>> Best regards,
>> Giovanni
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] use subset to trim data but include last per category

2012-09-09 Thread Giovanni Azua
Hello,

I bumped into the following funny use-case. I have too much data for a given 
plot. I have the following data frame df: 

> str(df)
'data.frame':   5015 obs. of  5 variables:
 $ n  : Factor w/ 5 levels "1000","2000",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ iter   : int  10 20 30 40 50 60 70 80 90 100 ...
 $ Error  : num  1.05e-02 1.24e-03 3.67e-04 1.08e-04 4.05e-05 ...
 $ Duality_Gap: num  20080 3789 855 443 321 ...
 $ Runtime: num  0.00536 0.01353 0.01462 0.01571 0.01681 ...

But if I plot e.g. Runtime vs log(Duality Gap) I have too many observations due 
to taking a snapshot every 10 iterations rather than say 500 and the plot looks 
very cluttered. So I would like to trim the data frame including only those 
records for which iter is multiple of 500 and so I do this:

df <- subset(df, iter %% 500 == 0)

This gives me almost exactly what I need except that the last and most 
important Duality Gap observations are of course gone due to the filtering ... 
I would like to change the subset clause to be iter %% 500 _or_ the record is 
the last per n (n is my problem size and category in this case) ... how can I 
do that?

I thought of adding a new column that flags whether a given row is the last 
element per category as "last" Boolean but this is a bit too complicated .. is 
there a simpler condition construct that can be used with the subset command?

TIA,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Grid package: how to customize cell spacing?

2012-09-08 Thread Giovanni Azua
Hello,

I am using the recipe below to place plots side by side: 
http://wiki.stdout.org/rcookbook/Graphs/Multiple%20graphs%20on%20one%20page%20%28ggplot2%29/

How can I reduce or customize the horizontal spacing between the grid cells? I 
have researched the Grid package but can't find the way to do so.

TIA,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] simplest way (set of functions) to parse a file

2012-08-27 Thread Giovanni Azua
;Step1','Step2','Step3', 'Step4'))
> > x
>   n iter variable   value
> 1102Step1 0.001779780
> 2204Step1 0.000955757
> 364   10Step1 0.001613035
> 464   20Step1 0.0
> 564   30Step1 0.0
> 664   40Step1 0.0
> 764   50Step1 0.0
> 864   53Step1 0.0
> 9     10Step1 0.002975209
> 10    20Step1 0.0
> 11    30Step1 0.0
> 12    40Step1 0.0
> 13    50Step1 0.0
> 14    60Step1 0.0
> 15    70Step1 0.0
> 16    80Step1 0.0
> 17    81Step1 0.0
> 18   102Step2 0.72146
> 19   204Step2 0.000230161
> 20   64   10Step2 0.003956920
> 21   64   20Step2 0.004390998
> 22   64   30Step2 0.004326610
> 23   64   40Step2 0.004329726
> 24   64   50Step2 0.004327526
> 25   64   53Step2 0.001327239
> 26    10Step2 0.018730820
> 27    20Step2 0.020846370
> 28    30Step2 0.020831700
> 29    40Step2 0.020882840
> 30    50Step2 0.020459540
> 31    60Step2 0.020825860
> 32    70Step2 0.020832040
> 33    80Step2 0.020335710
> 34    81Step2 0.002100072
> 
> 
> On Mon, Aug 27, 2012 at 6:24 AM, Giovanni Azua  wrote:
> Hello,
> 
> What would be the best set of R functions to parse and transform a file?
> 
> My file looks as shown below. I would like to plot this data and I need to 
> parse it into a single data frame that sorts of "transposes the data" with 
> the following structure:
> 
> > df <- data.frame(n=c(1,1,2,2),iter=c(1,2,1,2),step=as.factor(c('Step 1', 
> > 'Step2', 'Step 1', 'Step 2')),value=c(10, 10, 10, 10))
> > str(df)
> 'data.frame':   4 obs. of  4 variables:
>  $ n: num  1 1 2 2
>  $ iter : num  1 2 1 2
>  $ step : Factor w/ 3 levels "Step 1","Step 2",..: 1 3 1 2
>  $ value: num  10 10 10 10
> 
> n=extracted from the file name "logdet_two_moons_n>>>>10<<<<.txt"
> iter=iter
> step=column Step1, Step2, Step3, Step4
> value=value of the specific Step column
> 
> And this is one possible data frame variation to be able to plot the time 
> proportions for the different steps of my algorithm.
> 
> TIA,
> Best regards,
> Giovanni
> 
> Iteration   JminError   Elapsed Corral  Duality Gap Step1   Step2 
>   Step3   Step4
> 2   2   0.00e+001.912976e-031   0.00e+00
> 1.779780e-037.214600e-051.243600e-052.246700e-05
> ../test/genmoons_data/logdet_two_moons_n10.txt,2,2,1.754115e-02,0.00e+00,9.799000e+03,0.00e+00,5.586293e-01,0.00e+00
> Iteration   JminError   Elapsed Corral  Duality Gap Step1   Step2 
>   Step3   Step4
> 4   9   0.00e+001.280841e-032   -7.105427e-15   
> 9.557570e-042.301610e-041.571100e-052.177300e-05
> ../test/genmoons_data/logdet_two_moons_n20.txt,4,5,6.062756e-03,0.00e+00,1.365970e+05,0.00e+00,2.253051e+01,0.00e+00
> Iteration   JminError   Elapsed Corral  Duality Gap Step1   Step2 
>   Step3   Step4
> 10  32  3.133476e-036.075853e-038   4.057531e-01
> 1.613035e-033.956920e-033.077200e-054.390900e-05
> 20  28  5.597685e-044.376530e-0316  4.711146e-03
> 0.00e+004.390998e-032.229600e-052.517100e-05
> 30  27  1.148159e-044.357923e-0322  8.408166e-06
> 0.00e+004.326610e-032.697700e-053.233200e-05
> 40  27  4.036778e-054.388260e-0329  2.529294e-07
> 0.00e+004.329726e-032.713000e-053.558400e-05
> 50  27  1.840383e-054.357373e-0336  1.34e-07
> 0.00e+004.327526e-033.097000e-053.255700e-05
> 53  27  0.00e+001.322382e-0336  -2.842171e-14   
> 0.00e+001.327239e-031.420400e-052.043500e-05
> ../test/genmoons_data/logdet_two_moons_n64.txt,53,69,3.330987e-02,0.00e+00,2.229830e+07,0.00e+00,6.694201e+02,0.00e+00
> Iteration   JminError   Elapsed Corral  Duality Gap Step1   Step2 
>   Step3   Step4
> 10  70  7.739525e-032.389529e-028   1.494829e+00
> 2.975209e-031.873082e-024.713600e-055.837200e-05
> 20  74  3.379192e-032.084753e-0215  3.372041e-01
> 0.00e+002.084637e-024.302400e-053.907800e-05
> 30  76  1.322821e-032.093204e-022

[R] simplest way (set of functions) to parse a file

2012-08-27 Thread Giovanni Azua
Hello,

What would be the best set of R functions to parse and transform a file?

My file looks as shown below. I would like to plot this data and I need to 
parse it into a single data frame that sorts of "transposes the data" with the 
following structure:

> df <- data.frame(n=c(1,1,2,2),iter=c(1,2,1,2),step=as.factor(c('Step 1', 
> 'Step2', 'Step 1', 'Step 2')),value=c(10, 10, 10, 10))
> str(df)
'data.frame':   4 obs. of  4 variables:
 $ n: num  1 1 2 2
 $ iter : num  1 2 1 2
 $ step : Factor w/ 3 levels "Step 1","Step 2",..: 1 3 1 2
 $ value: num  10 10 10 10

n=extracted from the file name "logdet_two_moons_n10.txt"
iter=iter
step=column Step1, Step2, Step3, Step4
value=value of the specific Step column 

And this is one possible data frame variation to be able to plot the time 
proportions for the different steps of my algorithm. 

TIA,
Best regards,
Giovanni

Iteration   JminError   Elapsed Corral  Duality Gap Step1   Step2   
Step3   Step4
2   2   0.00e+001.912976e-031   0.00e+00
1.779780e-037.214600e-051.243600e-052.246700e-05
../test/genmoons_data/logdet_two_moons_n10.txt,2,2,1.754115e-02,0.00e+00,9.799000e+03,0.00e+00,5.586293e-01,0.00e+00
 
Iteration   JminError   Elapsed Corral  Duality Gap Step1   Step2   
Step3   Step4
4   9   0.00e+001.280841e-032   -7.105427e-15   
9.557570e-042.301610e-041.571100e-052.177300e-05
../test/genmoons_data/logdet_two_moons_n20.txt,4,5,6.062756e-03,0.00e+00,1.365970e+05,0.00e+00,2.253051e+01,0.00e+00
 
Iteration   JminError   Elapsed Corral  Duality Gap Step1   Step2   
Step3   Step4
10  32  3.133476e-036.075853e-038   4.057531e-01
1.613035e-033.956920e-033.077200e-054.390900e-05
20  28  5.597685e-044.376530e-0316  4.711146e-03
0.00e+004.390998e-032.229600e-052.517100e-05
30  27  1.148159e-044.357923e-0322  8.408166e-06
0.00e+004.326610e-032.697700e-053.233200e-05
40  27  4.036778e-054.388260e-0329  2.529294e-07
0.00e+004.329726e-032.713000e-053.558400e-05
50  27  1.840383e-054.357373e-0336  1.34e-07
0.00e+004.327526e-033.097000e-053.255700e-05
53  27  0.00e+001.322382e-0336  -2.842171e-14   
0.00e+001.327239e-031.420400e-052.043500e-05
../test/genmoons_data/logdet_two_moons_n64.txt,53,69,3.330987e-02,0.00e+00,2.229830e+07,0.00e+00,6.694201e+02,0.00e+00
 
Iteration   JminError   Elapsed Corral  Duality Gap Step1   Step2   
Step3   Step4
10  70  7.739525e-032.389529e-028   1.494829e+00
2.975209e-031.873082e-024.713600e-055.837200e-05
20  74  3.379192e-032.084753e-0215  3.372041e-01
0.00e+002.084637e-024.302400e-053.907800e-05
30  76  1.322821e-032.093204e-0221  1.018845e-01
0.00e+002.083170e-024.704100e-055.707100e-05
40  78  1.176950e-032.095179e-0228  2.447970e-02
0.00e+002.088284e-024.890700e-054.955100e-05
50  78  2.233669e-042.050571e-0235  1.573952e-02
0.00e+002.045954e-024.046600e-053.899000e-05
60  78  2.167956e-042.095130e-0239  8.362982e-03
0.00e+002.082586e-026.699700e-058.506400e-05
70  78  2.085968e-042.085355e-0246  5.135190e-03
0.00e+002.083204e-025.432900e-054.078600e-05
80  78  2.570800e-042.044932e-0251  5.470225e-04
0.00e+002.033571e-025.334200e-055.318400e-05
81  78  0.00e+002.099610e-0351  1.421085e-14
0.00e+002.100072e-039.147000e-062.324800e-05
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] [slightly OT] le: will a new point shift the solution question

2012-03-23 Thread Giovanni Azua
Hello,

Is there an R function that given a linear regression solution for a data set 
will answer in the most efficient way whether a new data point shifts the 
solution or not? or whether the new solution would differ by less than some 
error.

I need this in the context of an iterative method and such a function would 
spare a lot of time. 

The closest answer I can find to this, involves keeping track of the QR and 
updating it with a row append. I would like to only get a boolean answer 
because a 'no' answer would spare tons of flops.

Many thanks in advance,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] glm predict issue

2011-12-26 Thread Giovanni Azua
Hi Ben,

Yes thanks you are right, I was able to fix it but first I had to fix the data 
frame over which I built my model to use numeric for those and then making the 
grid values also numeric it finally worked thanks!

Thank you for your help!
Best regards,
Giovanni

On Dec 26, 2011, at 4:57 PM, Ben Bolker wrote:

> Giovanni Azua  gmail.com> writes:
> 
>> 
>> Hello,
>> 
>> I have tried reading the documentation and googling for the answer but
> reviewing the online matches I end up
>> more confused than before.
>> 
>> My problem is apparently simple. I fit a glm model (2^k experiment), and then
> I would like to predict the
>> response variable (Throughput) for unseen factor levels.
>> 
>> When I try to predict I get the following error:
>>> throughput.pred <- predict(throughput.fit,experiments,type="response")
>> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
> object$xlevels) : 
>>  factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000
>> 
>> Of course these are new factor levels, it is exactly what I am trying to
> achieve i.e. extrapolate the values
>> of Throughput.
>> 
>> Can anyone please advice? Below I include all details.
> 
>  Any predictors that you want to treat as continuous
> (which would be the only way you can extrapolate to unobserved
> values) should be numeric, not factor variables -- use 
> 
> mydata <- transform(mydata, var=as.numeric(var))
> 
> for example.
> 
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] glm predict issue

2011-12-26 Thread Giovanni Azua
Hello,

I have tried reading the documentation and googling for the answer but 
reviewing the online matches I end up more confused than before.

My problem is apparently simple. I fit a glm model (2^k experiment), and then I 
would like to predict the response variable (Throughput) for unseen factor 
levels.

When I try to predict I get the following error:
> throughput.pred <- predict(throughput.fit,experiments,type="response")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = 
object$xlevels) : 
  factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000

Of course these are new factor levels, it is exactly what I am trying to 
achieve i.e. extrapolate the values of Throughput.

Can anyone please advice? Below I include all details.

Thanks in advance,
Best regards,
Giovanni

> # define the extreme (factors and levels)
> experiments <- expand.grid(No_databases   = seq(1000,100,by=-200), 
+  Partitioning   = 
c("sharding", "replication"), 
+  No_middlewares = 
seq(500,100,by=-100), 
+  Queue_size = c(100))
> experiments$No_databases <- as.factor(experiments$No_databases)
> experiments$Partitioning <- as.factor(experiments$Partitioning)
> experiments$No_middlewares <- as.factor(experiments$No_middlewares)
> experiments$Queue_size <- as.factor(experiments$Queue_size)   
>
> str(experiments)
'data.frame':   50 obs. of  4 variables:
 $ No_databases  : Factor w/ 5 levels "200","400","600",..: 5 4 3 2 1 5 4 3 2 1 
...
 $ Partitioning  : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 2 2 2 
2 2 ...
 $ No_middlewares: Factor w/ 5 levels "100","200","300",..: 5 5 5 5 5 5 5 5 5 5 
...
 $ Queue_size: Factor w/ 1 level "100": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim : Named int  5 2 5 1
  .. ..- attr(*, "names")= chr  "No_databases" "Partitioning" "No_middlewares" 
"Queue_size"
  ..$ dimnames:List of 4
  .. ..$ No_databases  : chr  "No_databases=1000" "No_databases= 800" 
"No_databases= 600" "No_databases= 400" ...
  .. ..$ Partitioning  : chr  "Partitioning=sharding" "Partitioning=replication"
  .. ..$ No_middlewares: chr  "No_middlewares=500" "No_middlewares=400" 
"No_middlewares=300" "No_middlewares=200" ...
  .. ..$ Queue_size: chr "Queue_size=100"
> head(experiments)
  No_databases Partitioning No_middlewares Queue_size
1 1000 sharding500100
2  800 sharding500100
3  600 sharding500100
4  400 sharding500100
5  200 sharding500100
6 1000  replication500100
> # or
> throughput.fit <- 
> glm(log(Throughput)~(No_databases*No_middlewares)+Partitioning+Queue_size,
+   data=throughput)
> summary(throughput.fit)

Call:
glm(formula = log(Throughput) ~ (No_databases * No_middlewares) + 
Partitioning + Queue_size, data = throughput)

Deviance Residuals: 
Min   1Q   Median   3Q  Max  
-2.5966  -0.6612  -0.1944   0.5548   3.2136  

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
(Intercept)5.747010.09127  62.970  < 2e-16 ***
No_databases4  0.433090.10985   3.943 8.66e-05 ***
No_middlewares2   -1.993740.11035 -18.067  < 2e-16 ***
No_middlewares4   -1.230040.10969 -11.214  < 2e-16 ***
Partitioningreplication0.332910.06181   5.386 9.15e-08 ***
Queue_size100  0.158500.06181   2.564   0.0105 *  
No_databases4:No_middlewares2  2.715250.15262  17.791  < 2e-16 ***
No_databases4:No_middlewares4  1.941910.15226  12.754  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for gaussian family taken to be 0.8921778)

Null deviance: 2175.58  on 936  degrees of freedom
Residual deviance:  828.83  on 929  degrees of freedom
AIC: 2562.2

Number of Fisher Scoring iterations: 2

> throughput.pred <- predict(throughput.fit,experiments,type="response")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = 
object$xlevels) : 
  factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data frame and cumulative sum

2011-12-07 Thread Giovanni Azua
Thank you Michael,

indeed, my bad, I get so deep in trying to solve the problem that forget to try 
the most basic help first.  

Best regards,
Giovanni

On Dec 7, 2011, at 11:20 PM, R. Michael Weylandt wrote:
> ??"cumulative sum" would almost certainly lead you to cumsum with only
> a modicum of effort
> 
> Michael
> 
> On Wed, Dec 7, 2011 at 5:13 PM, Giovanni Azua  wrote:
>> Hello,
>> 
>> I have a data frame that looks like this (containing interarrival times):
>> 
>>> str(df)
>> 'data.frame':   18233 obs. of  1 variable:
>>  $ Interarrival: int  135 806 117 4 14 1 9 104 169 0 ...
>>> head(df)
>>  Interarrival
>> 1  135
>> 2  806
>> 3  117
>> 44
>> 5   14
>> 61
>>> 
>> 
>> This corresponds to the time differences (in ms) of a poisson arrival 
>> process where Interarrival{i+1} = time_{i+1} - time_{i}
>> 
>> I want to get the Time bin (in minutes) of every interarrival basically 
>> something like:
>> 
>> 1) df$Time <- sum(of all df$Interarrival up to "this rownum") # cumulative 
>> sum
>> 
>> 2) df$Time <- floor(df$Time / 6) + 1
>> 
>> then I should get the first minute of Interarrival having 1 and so forth. 
>> The problem is I have no idea how to accomplish 1) in R.
>> 
>> Can anyone advice?
>> 
>> Thanks in advance,
>> Best regards,
>> Giovanni
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] data frame and cumulative sum

2011-12-07 Thread Giovanni Azua
Hello,

I have a data frame that looks like this (containing interarrival times):

> str(df)
'data.frame':   18233 obs. of  1 variable:
 $ Interarrival: int  135 806 117 4 14 1 9 104 169 0 ...
> head(df)
  Interarrival
1  135
2  806
3  117
44
5   14
61
> 

This corresponds to the time differences (in ms) of a poisson arrival process 
where Interarrival{i+1} = time_{i+1} - time_{i}

I want to get the Time bin (in minutes) of every interarrival basically 
something like:

1) df$Time <- sum(of all df$Interarrival up to "this rownum") # cumulative sum
 
2) df$Time <- floor(df$Time / 6) + 1

then I should get the first minute of Interarrival having 1 and so forth. The 
problem is I have no idea how to accomplish 1) in R.

Can anyone advice? 

Thanks in advance,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] R-latex syntax highlighting?

2011-11-23 Thread Giovanni Azua
Hello,

Can anyone provide or point me to a good setup for the listings latex package 
that would produce nice R-syntax highlighting?

I am using an example I found in internet for setting up listings like this:
\lstset{
language=R,
basicstyle=\scriptsize\ttfamily,
commentstyle=\ttfamily\color{gray},
numbers=left,
numberstyle=\ttfamily\color{red}\footnotesize,
stepnumber=1,
numbersep=5pt,
backgroundcolor=\color{white},
showspaces=false,
showstringspaces=false,
showtabs=false,
frame=single,
tabsize=2,
captionpos=b,
breaklines=true,
breakatwhitespace=false,
title=\lstname,
escapeinside={},
keywordstyle={},
morekeywords={}
}

But I know that using the color latex package it can look a lot nicer and I 
guess some people might have configured that before and are willing to share it?

I'm also interested in the a similar problem best way to put R code with syntax 
highlighting into a presentation. I use Apple's Keynote. 

Many thanks in advance,
Best regards,
Giovanni
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] [OT] 1 vs 2-way anova technical question

2011-11-22 Thread Giovanni Azua

On Nov 22, 2011, at 3:52 PM, Liviu Andronic wrote:

> On Tue, Nov 22, 2011 at 2:09 PM, Giovanni Azua  wrote:
>> Mr. Gunter did not read/understand my problem, and there were no useful tips 
>> but only ad hominem attacks. By your side-taking I suspect you are in the 
>> same "party club" if you want to defend him maybe you should start by "tying 
>> better your dog" so to speak.
>> 
> I believe that most of the readers of this thread got put off by your
> offending and misplaced remarks. To echo other posters, it would be
> nice to get your e-mail address banned from the list.

If needed go ahead and do so, your blockade won't stop my learning efforts. I 
don't see any concrete reason why I should be taking bullets from random people 
who fancy themselves with superiority and arrogance. And as usual these bullies 
always seem to win.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] [OT] 1 vs 2-way anova technical question

2011-11-22 Thread Giovanni Azua

On Nov 22, 2011, at 10:35 AM, Joshua Wiley wrote:
> It is true the way you use general lists is not our business, but the
> R-help list is a community and there are community rules.  One of

I meant that my use of the lists is not of __his__ business I wasn't referring 
to you nor other people in this list. Ok the reading skills remark starts to 
get recursive ... and btw the OP even though marked in the subject as OT was 
not entirely so i.e. use of R formula etc. 

> those is not to ask questions that are primarily about a lack of
> statistical understanding (although they are not strictly prohibited).

The "lack of statistical understanding" was his own judgmental conclusion which 
he should have kept for himself, if he starts throwing stones around he should 
not expect to get flowers back. Previous to this I also received some totally 
out of place private emails from him and I am not the kind of person that takes 
B.S. from anyone, he got the wrong guy. And in fact his great fallacious 
conclusions originated from his lack of reading skills, besides I don't really 
think he read it at all but just try to run me down with his attacks and 
unwelcome remarks. 

> Your original post suggests that you knew this, "I know there is
> plenty of people in this group who can give me a good answer :)" but
> chose to ignore it.  Despite this, Bert was generous enough to give
> you some suggestions, perhaps not what you wanted but useful tips
> nonetheless.
> 

My original post only suggested that I know there is people with knowledge 
about this practical applied statistics problems, nothing else. Before 
addressing the list I talked to two TA's and one professor. Their help was 
generic but helpful nevertheless. I preferred to address the question to people 
with practical working knowledge of ANOVA (I don't think there is a huge 
population in this area) and the best place I can think of is the R list, the 
place where I would be subscribed if I worked on these problems every day. 
Statistics lists will be full of college students who will have equivalent 
knowledge to what I already have and there they will probably only agree to say 
yes your QQ looks non-normal and heavy tailed which is what I already knew ... 
this is a similar answer I got from a TA and couple of student friends doing 
the MSc in Statistics track.   

Mr. Gunter did not read/understand my problem, and there were no useful tips 
but only ad hominem attacks. By your side-taking I suspect you are in the same 
"party club" if you want to defend him maybe you should start by "tying better 
your dog" so to speak.

> Regarding your suggestion that the list be split into a "beginner" and
> "advanced" list, while that is one option, your original question was
> appropriate for neither.  It was, however, very appropriate for a
> statistics list (e.g., http://stats.stackexchange.com/).


Thank you for the link, it looks very promising. 

Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] [OT] 1 vs 2-way anova technical question

2011-11-21 Thread Giovanni Azua

On Nov 21, 2011, at 8:31 PM, Bert Gunter wrote:
> we disagree is that I think data analysts with limited statistical
> backgrounds should consult with local statisticians instead of trying
> to muddle through on their own thru lists like this. This is not meant

I think that people lacking reading skills should not be subscribed in lists 
like this one, bullying and creating confusion around. 

I will asks as many times as I want/need and the way I use lists if none of 
your f. b. 

> to be arrogance on my part -- though it may seem to come across that
> way -- but rather a plea for good science. I believe that bad
> statistics --> bad science, a problem that I see as pervasive and
> inimical to scientific progress, especially in today's data saturated
> world.

> But enough of my off topic B.S. Please reply privately to not waste
> yet more space here (positively or negatively -- stone throwers need
> to catch them, too).
> 

You are actually full of your off topic prime matter, you arrogant prick.
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] [OT] 1 vs 2-way anova technical question

2011-11-21 Thread Giovanni Azua
Hello Rob,

Thank you for your suggestions. I tried glm too without success. Anyhow I 
include all the information just in case someone with good knowledge can give 
me a hand with this. I take log of the response variable because: 
- its values span across multiple orders of magnitudes 
- the diagnostic plots e.g. QQ, residuals vs fitted etc do improve with that.

Below I include:
1) general summary of my data
2) 1-way anova and summary of the model
3) 4-way anova and summary of the model  

Attached:
a) Overview of the data (where main interactions occur i.e. No_databases and 
No_middlewares)
b) diagnostic plots for 2) Here the Normality assumption of the residuals looks 
reasonable
c) diagnostic plots for 3) Here the Normality assumption of the residuals does 
not seem to hold so it invalidates the 4-way aov model?

I tried glm and it delivers similar results as 3)

My impression is that my system is heavily polluted with outliers one can see 
that from plot a) how much the mean and the median differ due to the outliers. 
That's just the way the system I implemented behaves. Btw the system is a 
multi-tiered architecture that I developed in Java from scratch that includes 
XA and different data access and partitioning patterns. I need to 
quantitatively analyze and draw conclusion from this system. Most of my class 
mates just make it real simple: make 2^k experiments take one grand mean out of 
each experiment and do the ANOVA on those means i.e. 1-repetition, compute the 
fraction of variation and that's it. I am trying to model it more deeply by 
checking model assumptions, etc. 

Many thanks in advance,
Best regards,
Giovanni

> str(throughput)
'data.frame':   479 obs. of  9 variables:
 $ Time  : num  7 8 9 10 11 12 13 14 15 16 ...
 $ Throughput: int  155 155 154 157 155 214 4631 2118 136 132 ...
 $ Workload  : chr  "All" "All" "All" "All" ...
 $ No_databases  : Factor w/ 2 levels "1","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Partitioning  : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 1 
1 1 1 1 ...
 $ No_middlewares: Factor w/ 3 levels "1","2","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Queue_size: Factor w/ 2 levels "40","100": 1 1 1 1 1 1 1 1 1 1 ...
 $ No_clients: Factor w/ 1 level "64": 1 1 1 1 1 1 1 1 1 1 ...
 $ Experimental_error: Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...

> summary(throughput)
  Time Throughput   Workload No_databases  
Partitioning No_middlewares
 Min.   : 7.00   Min.   :  35.0   Length:479 1:239sharding   
:240   1:160 
 1st Qu.:11.50   1st Qu.:  50.5   Class :character   4:240
replication:239   2:159 
 Median :16.00   Median : 744.0   Mode  :character  
4:160 
 Mean   :16.48   Mean   : 830.3 
  
 3rd Qu.:21.00   3rd Qu.:1205.5 
  
 Max.   :26.00   Max.   :4631.0 
  
 Queue_size No_clients Experimental_error
 40 :24064:479 1:479 
 100:239   

## ###
##
##  ANOVA "one-way" interaction
##
## ###
> throughput.aov <- 
> aov(log(Throughput)~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)
> throughput.aov
Call:
   aov(formula = log(Throughput) ~ No_databases + Partitioning + 
No_middlewares + Queue_size, data = throughput)

Terms:
No_databases Partitioning No_middlewares Queue_size Residuals
Sum of Squares  521.5264   5.697150.5814 0.4628  476.6826
Deg. of Freedom11  2  1   473

Residual standard error: 1.003885 
Estimated effects may be unbalanced
> summary(throughput.aov)
Df Sum Sq Mean Sq  F valuePr(>F)
No_databases  1 521.53  521.53 517.4974 < 2.2e-16 ***
Partitioning   1   5.705.70   5.6530   0.01782 *  
No_middlewares   2  50.58   25.29  25.0953 4.381e-11 ***
Queue_size  1   0.460.46   0.4592   0.49833
Residuals  473 476.681.01   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
> 

## ###
##
##  ANOVA 4-way interaction
##
## ###

> throughput.aov <- 
> aov(log(Throughput)~No_databases*Partitioning*No_middlewares*Queue_size,data=throughput)
> throughput.aov
Call:
   aov(formula = log(Throughput) ~ No_databases * Partitioning * 
No_middlewares * Queue_size, data = throughput)

Terms:
No_databases Partitioning No_middlewares Queue_size 
No_databases:Partitioning
Sum of Squares  521.5264   5.697150.5814 0.4628 
  96.9198
Deg. of Freedom  

Re: [R] [OT] 1 vs 2-way anova technical question

2011-11-21 Thread Giovanni Azua
Hello Bert,

Thank you for taking the time to try to answer.

1) I know this, however if one is interested in only interaction between two 
specific factors then in R one uses I(A*B*C) meaning 3-way anova for that and 
not the implicit 2-ways that would otherwise be computed.

2) True, but it fails.

3) No, I don't have any factors with one level, I never said that. It would not 
be a 2^k experiment otherwise, my OP states this clearly, this is a 2^k 
experimental design ___2___

4) this is only your judgmental attitude that many people unfortunately have in 
some of these lists, focussing on ad-hominem judgements or even attacks to try 
to prove their superiority without actually answering nor adding any value to 
the question at hand. I have taken many graduate courses in subjects that have 
all Statistics in the title and passed all of them. However, as an experienced 
Software Engineer working for more than 10 years in the field, I can tell you 
that there is a huge difference between solving toy problems to implementing 
real-life complex projects.  Same rules apply here, one thing is the toy 
examples one finds in R books and course exercises and another totally 
different story is the real life data I am trying to model. I'm a student in 
the quantitative part and learning, so I do have some gaps, I am curious and 
trying to learn and I think there is no shame in that. If this makes you upset 
maybe you should ask to split the list in two or more: "Advanc!
 ed-PhD-black-belt-10th-dan-in-Statistics-and-R level" list and "newbies" list.

Best regards,
Giovanni

On Nov 21, 2011, at 3:55 PM, Bert Gunter wrote:

> Giovanni:
> 
> 1. Please read ?formula and/or An Introduction to R for how to specify
> linear models in R.
> 
> 2. Correct specification of what you want (if I understand correctly) is
> log(R) ~ A*B + C + D
> 
> 3. ... which presumably will also fail because some of your factors
> have only one level, which means that you cannot use them in your
> model.
> 
> 4. ... which, in turn, suggests you don't know what your doing
> statistically and should seek local assistance, especially in trying
> to interpret a fit to an unbalanced model (you can't do it as you
> probably think you can).
> 
> I should say in your defense that posts on this list indicate that
> point 4 is a widely shared problem among posters here.
> 
> Cheers,
> Bert
> 
> On Mon, Nov 21, 2011 at 5:02 AM, Giovanni Azua  wrote:
>> Hello,
>> 
>> Couple of clarifications:
>> - A,B,C,D are factors and I am also interested in possible interactions but 
>> the model that comes out from aov R~A*B*C*D violates the model assumptions
>> - My 2^k is unbalanced i.e. missing data and an additional level I also 
>> include in one of the factors i.e. C
>> - I was referring in the OP to the 4-way interactions and not 2-way, I'm 
>> sorry for my confusion.
>> - I tried to create an aov model with less interactions this way but I get 
>> the following error:
>> 
>> model.aov <- aov(log(R)~A+B+I(A*B)+C+D,data=throughput)
>> Error in `contrasts<-`(`*tmp*`, value = "contr.treatment") :
>>  contrasts can be applied only to factors with 2 or more levels
>> In addition: Warning message:
>> In Ops.factor(A, B) : * not meaningful for factors
>> 
>> Here I was trying to say: do a one-way anova except for the A and B factors 
>> for which I would like to get their 2-way interactions ...
>> 
>> Thanks in advance,
>> Best regards,
>> Giovanni
>> 
>> On Nov 21, 2011, at 12:04 PM, Giovanni Azua wrote:
>> 
>>> Hello,
>>> 
>>> I know there is plenty of people in this group who can give me a good 
>>> answer :)
>>> 
>>> I have a 2^k model where k=4 like this:
>>> Model 1) R~A*B*C*D
>>> 
>>> If I use the "*" in R among all elements it means to me to explore all 
>>> interactions and include them in the model i.e. I think this would be the 
>>> so called 2-way anova. However, if I do this, it leads to model violations 
>>> i.e. the homoscedasticity is violated, the normality assumption of the 
>>> sample errors i.e. residuals is violated etc. I tried correcting the issues 
>>> using different standard transformations: log, sqrt, Box-Cox forms etc but 
>>> none really improve the result. In this case even though the model 
>>> assumptions do not hold, some of the interactions are found to 
>>> significatively influence the response variable. But then shall I trust the 
>>> results of this Model 1) given that the assumptions do not hold?
>>> 
>>> Then I try this other model where I exclude

Re: [R] [OT] 1 vs 2-way anova technical question

2011-11-21 Thread Giovanni Azua
Hello,

Couple of clarifications: 
- A,B,C,D are factors and I am also interested in possible interactions but the 
model that comes out from aov R~A*B*C*D violates the model assumptions
- My 2^k is unbalanced i.e. missing data and an additional level I also include 
in one of the factors i.e. C
- I was referring in the OP to the 4-way interactions and not 2-way, I'm sorry 
for my confusion.
- I tried to create an aov model with less interactions this way but I get the 
following error:

model.aov <- aov(log(R)~A+B+I(A*B)+C+D,data=throughput)
Error in `contrasts<-`(`*tmp*`, value = "contr.treatment") : 
  contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In Ops.factor(A, B) : * not meaningful for factors

Here I was trying to say: do a one-way anova except for the A and B factors for 
which I would like to get their 2-way interactions ...

Thanks in advance,
Best regards,
Giovanni

On Nov 21, 2011, at 12:04 PM, Giovanni Azua wrote:

> Hello,
> 
> I know there is plenty of people in this group who can give me a good answer 
> :)
> 
> I have a 2^k model where k=4 like this:
> Model 1) R~A*B*C*D
> 
> If I use the "*" in R among all elements it means to me to explore all 
> interactions and include them in the model i.e. I think this would be the so 
> called 2-way anova. However, if I do this, it leads to model violations i.e. 
> the homoscedasticity is violated, the normality assumption of the sample 
> errors i.e. residuals is violated etc. I tried correcting the issues using 
> different standard transformations: log, sqrt, Box-Cox forms etc but none 
> really improve the result. In this case even though the model assumptions do 
> not hold, some of the interactions are found to significatively influence the 
> response variable. But then shall I trust the results of this Model 1) given 
> that the assumptions do not hold?
> 
> Then I try this other model where I exclude the interactions (is this the 
> 1-way anova?):
> Model 2) R~A+B+C+D
> 
> In this one the model assumptions hold except the existence of some outliers 
> and a slightly heavy tail in the QQ-plot.
> 
> Given that the assumptions for Model 1) do not hold, I assume I should ignore 
> the results altogether for Model 1) or? or instead can I safely use the Sum 
> Sq. of Model 1) to get my table of percent of variations?
> 
> This to me was a bit counter-intuitive since I assumed that if there was 
> collinearity among factors (and there is e.g. I(A*B*C)) the Model 1) and I 
> included those interactions, my model would be more accurate ... ok this 
> turned into a brand new topic of model selection but I am mostly interested 
> in the question: if model is violated can I or must I not use the results 
> e.g. Sum Sqr for that model?
> 
> Can anyone advice please?
> 
> btw I have bought most books on R and statistical analysis. I have researched 
> them all and the ANOVA coverage is very shallow in most of them specially in 
> the R-sy ones, they just offer a slightly pimped up version of the R-help. 
> 
> I am also unofficially following a course on ANOVA from the university I am 
> registered in and most examples are too simplistic and either the assumptions 
> just hold easily or the assumptions don't hold and nothing happens.  
> 
> Thanks in advance,
> Best regards,
> Giovanni
> 


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] [OT] 1 vs 2-way anova technical question

2011-11-21 Thread Giovanni Azua
Hello,

I know there is plenty of people in this group who can give me a good answer :)

I have a 2^k model where k=4 like this:
Model 1) R~A*B*C*D

If I use the "*" in R among all elements it means to me to explore all 
interactions and include them in the model i.e. I think this would be the so 
called 2-way anova. However, if I do this, it leads to model violations i.e. 
the homoscedasticity is violated, the normality assumption of the sample errors 
i.e. residuals is violated etc. I tried correcting the issues using different 
standard transformations: log, sqrt, Box-Cox forms etc but none really improve 
the result. In this case even though the model assumptions do not hold, some of 
the interactions are found to significatively influence the response variable. 
But then shall I trust the results of this Model 1) given that the assumptions 
do not hold?

Then I try this other model where I exclude the interactions (is this the 1-way 
anova?):
Model 2) R~A+B+C+D

In this one the model assumptions hold except the existence of some outliers 
and a slightly heavy tail in the QQ-plot.

Given that the assumptions for Model 1) do not hold, I assume I should ignore 
the results altogether for Model 1) or? or instead can I safely use the Sum Sq. 
of Model 1) to get my table of percent of variations?

This to me was a bit counter-intuitive since I assumed that if there was 
collinearity among factors (and there is e.g. I(A*B*C)) the Model 1) and I 
included those interactions, my model would be more accurate ... ok this turned 
into a brand new topic of model selection but I am mostly interested in the 
question: if model is violated can I or must I not use the results e.g. Sum Sqr 
for that model?

Can anyone advice please?

btw I have bought most books on R and statistical analysis. I have researched 
them all and the ANOVA coverage is very shallow in most of them specially in 
the R-sy ones, they just offer a slightly pimped up version of the R-help. 

I am also unofficially following a course on ANOVA from the university I am 
registered in and most examples are too simplistic and either the assumptions 
just hold easily or the assumptions don't hold and nothing happens.  

Thanks in advance,
Best regards,
Giovanni

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] anova to pca

2011-11-20 Thread Giovanni Azua
Hello,

I would like to reinforce my anova results using PCA i.e. which factor are most 
important because they explain most of the variance (i.e. signal) of my 2^k*r 
experiment. However, I get the following error while trying to run PCA:

> throughput.prcomp <- 
> prcomp(~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)
Error in prcomp.formula(~No_databases + Partitioning + No_middlewares +  : 
  PCA applies only to numerical variables

What is the most R-like concise way to map/transform those factor values into 
numerical values in a suitable way for PCA analysis? My first attempt would be:

# C++ "style"
throughput$No_databases_num <- (throughput$No_databases == 1) ? -1 : 1 
throughput$Partitioning_num <- (throughput$Partitioning == "sharding") ? -1 : 1 
etc.
How can I do this in the R way?

Would these -1, 1 be sensible for a PCA analysis or it just doesn't matter? How 
about a factor for which I have 3 levels? -1, 0 and 1? 

Many thanks in advance,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] aov how to get the SST?

2011-11-17 Thread Giovanni Azua
Hello,

I currently run aov in the following way:

> throughput.aov <- 
> aov(log(Throughput)~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)
> summary(throughput.aov)

Df Sum Sq Mean Sq  F valuePr(>F)
No_databases 1 184.68 184.675 136.6945 < 2.2e-16 ***
Partitioning 1  70.16  70.161  51.9321 2.516e-12 ***
No_middlewares   2  44.22  22.110  16.3654 1.395e-07 ***
Queue_size   1   0.40   0.395   0.29260.5888
Residuals  440 594.44   1.351   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In order to compute the fraction of variation I need to know the total Sum Sq. 
and I assume it is like this:

SST = SS-No_databases + SS-Partitioning + SS-No_middlewares + SS-Queue_size 
= 184.68 + 70.16 + 44.22 + 0.40
= 299.46

So the fraction of variation explained by the No_databases would be:

SST/SS-No_databases = 184.68/299.46 = 0.6167101 ... and finally I can say that 
the No_databases explains 61.6% of the variation in Throughput.

Is this correct? if so, how can I do the same calculations using R? I haven't 
found the way to extract the Sum Sq out of the throughput.aov Object. Is there 
a function to get this 0.6167101 and 61.6% results without having to do it 
manually? even better if I get a table containing all these fraction of 
variations?

Since this is a 2^k experiment, I cant see how the Residuals fit in the 
formula. When I introduce replications (blocking factor) then I can also 
include a SSE term into the SST calculation. 

TIA,
Best regards,
Giovanni


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] boxplot strange behavior

2011-11-16 Thread Giovanni Azua
Hello,

I generate box plots from my data like this:

qplot(x=xxx,y=column,data=data,geom="boxplot") + xlab("xxx") + ylab(ylabel) + 
theme_bw() + scale_y_log10() + geom_jitter(alpha=I(1/10))

The problem is that I see lot of points above the maximum at the same level as 
some outliers. It looks very weird as I expected the outliers to be "few" and 
specially not see any points other than the outliers below the minimum or above 
the maximum indicator. Can anyone explain what's going on?

Attached I send a snapshot, see the second level having some outliers, separate 
from the outliers there are also some points which seem not to be outliers and 
that are above the maximum indicator? is this a bug or am I missing anything?

TIA,
Best regards,
Giovanni

PS: I sent this to the ggplot2 list too (sorry for the double post but I am 
kind of under pressure with this)

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] aov output question

2011-11-14 Thread Giovanni Azua
Hello,

I currently get anova results out of the aov function (see below) I use the 
model.tables and I believe it gives me back the model parameters of the fit 
(betas), however I don't see the intercept (beta_0) and don't understand what 
the "rep" output means and there is no description in the documentation. 

Another question: is there a function that outputs the results in a more 
meaningful way e.g. show the percentage of variation of each factor towards the 
response I believe the formula would be something like:
for factor X:  (Sum_Sq_X / Sum_Sq_Total)*100 

Thanks in advance,
Best regards,
Giovanni

> #throughput.aov <- 
> aov(Throughput~No_databases*Partitioning*No_middlewares*Queue_size,data=throughput)
> throughput.aov
Call:
   aov(formula = Throughput ~ No_databases + Partitioning + No_middlewares + 
Queue_size, data = throughput)

Terms:
No_databases Partitioning No_middlewares Queue_size Residuals
Sum of Squares  43146975 73949061130  20710 195504055
Deg. of Freedom11  2  1   433

Residual standard error: 671.9453 
Estimated effects may be unbalanced
> summary(throughput.aov)
DfSum Sq  Mean Sq F valuePr(>F)
No_databases 1  43146975 43146975 95.5614 < 2.2e-16 ***
Partitioning 1  7394 7394  0.01640.8982
No_middlewares   2   9061130  4530565 10.0342 5.497e-05 ***
Queue_size   1 2071020710  0.04590.8305
Residuals  433 195504055   451511  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
> model.tables(throughput.aov,type="effects",se=TRUE)
Design is unbalanced - use se.contrast() for se's
Tables of effects

 No_databases 
 1   4
-317.1 310
rep  217.0 222

 Partitioning 
sharding replication
   4.303   -3.91
rep  209.000  230.00

 No_middlewares 
 1  2   4
-97.93 -108.2 199
rep 139.00  150.0 150

 Queue_size 
 40 100
 -6.852   6.883
rep 220.000 219.000
> 
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] 2^k*r (with replications) experimental design question

2011-11-13 Thread Giovanni Azua
Hi Denis,

Thank you again :) what do you exactly mean with "blocking factor", that it 
will be like the others? I'd prefer not to treat the replicates as random but 
rather account for the experimental error using the replicates. 

Ahhh I see what you mean, so the experimental error will show up as the SS of 
my new variable "Replicate" ... great!

Thank you!
Best regards,
Giovanni

On Nov 14, 2011, at 2:38 AM, Dennis Murphy wrote:

> I'm guessing you have nine replicates of a 2^5 factorial design with a
> couple of missing values. If so, define a variable to designate the
> replicates and use it as a blocking factor in the ANOVA. If you want
> to treat the replicates as a random rather than a fixed factor, then
> look into the nlme or lme4 packages.
> 
> HTH,
> Dennis
> 
> On Sun, Nov 13, 2011 at 4:33 PM, Giovanni Azua  wrote:
>> Hello,
>> 
>> I have one replication (r=1 of the 2^k*r) of a 2^k experimental design in 
>> the context of performance analysis i.e. my response variables are 
>> Throughput and Response Time. I use the "aov" function and the results look 
>> ok:
>> 
>>> str(throughput)
>> 'data.frame':   286 obs. of  7 variables:
>>  $ Time  : int  6 7 8 9 10 11 12 13 14 15 ...
>>  $ Throughput: int  42 44 33 41 43 40 37 40 42 37 ...
>>  $ No_databases  : Factor w/ 2 levels "1","4": 1 1 1 1 1 1 1 1 1 1 ...
>>  $ Partitioning  : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 1 
>> 1 1 1 1 ...
>>  $ No_middlewares: Factor w/ 2 levels "2","4": 1 1 1 1 1 1 1 1 1 1 ...
>>  $ Queue_size: Factor w/ 2 levels "40","100": 1 1 1 1 1 1 1 1 1 1 ...
>>  $ No_clients: Factor w/ 1 level "128": 1 1 1 1 1 1 1 1 1 1 ...
>>> head(throughput)
>>  Time Throughput No_databases Partitioning No_middlewares Queue_size
>> 16 421 sharding  2 40
>> 27 441 sharding  2 40
>> 38 331 sharding  2 40
>> 49 411 sharding  2 40
>> 5   10 431 sharding  2 40
>> 6   11 401 sharding  2 40
>>> 
>>> throughput.aov <- 
>>> aov(Throughput~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)
>>> summary(throughput.aov)
>>  DfSum Sq  Mean Sq F valuePr(>F)
>> No_databases   128488651 28488651 53.4981 2.713e-12 ***
>> Partitioning17168771687  0.1346  0.713966
>> No_middlewares   1 5624454  5624454 10.5620  0.001295 **
>> Queue_size  1 5089250892  0.0956  0.757443
>> Residuals 281 149637226   532517
>> ---
>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>> 
>> 
>> This is somehow what I expected and I am happy, it is saying that the 
>> Throughput is significatively affected firstly by the number of database 
>> instances and secondly by the number of middleware instances.
>> 
>> The problem is that I need to integrate multiple replications of this same 
>> 2^k so I can also account for experimental error i.e. the _r_ of 2^k*r but I 
>> can't see how to integrate the _r_ term into the data and into the aov 
>> function parameters. Can anyone advice?
>> 
>> TIA,
>> Best regards,
>> Giovanni
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>> 

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] 2^k*r (with replications) experimental design question

2011-11-13 Thread Giovanni Azua
Hello,

I have one replication (r=1 of the 2^k*r) of a 2^k experimental design in the 
context of performance analysis i.e. my response variables are Throughput and 
Response Time. I use the "aov" function and the results look ok:

> str(throughput)
'data.frame':   286 obs. of  7 variables:
 $ Time  : int  6 7 8 9 10 11 12 13 14 15 ...
 $ Throughput: int  42 44 33 41 43 40 37 40 42 37 ...
 $ No_databases  : Factor w/ 2 levels "1","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Partitioning  : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 1 1 1 
1 1 ...
 $ No_middlewares: Factor w/ 2 levels "2","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Queue_size: Factor w/ 2 levels "40","100": 1 1 1 1 1 1 1 1 1 1 ...
 $ No_clients: Factor w/ 1 level "128": 1 1 1 1 1 1 1 1 1 1 ...
> head(throughput)
  Time Throughput No_databases Partitioning No_middlewares Queue_size 
16 421 sharding  2 40 
27 441 sharding  2 40
38 331 sharding  2 40
49 411 sharding  2 40
5   10 431 sharding  2 40
6   11 401 sharding  2 40
> 
> throughput.aov <- 
> aov(Throughput~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)
> summary(throughput.aov)
  DfSum Sq  Mean Sq F valuePr(>F)
No_databases   128488651 28488651 53.4981 2.713e-12 ***
Partitioning17168771687  0.1346  0.713966
No_middlewares   1 5624454  5624454 10.5620  0.001295 ** 
Queue_size  1 5089250892  0.0956  0.757443
Residuals 281 149637226   532517  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
> 

This is somehow what I expected and I am happy, it is saying that the 
Throughput is significatively affected firstly by the number of database 
instances and secondly by the number of middleware instances.

The problem is that I need to integrate multiple replications of this same 2^k 
so I can also account for experimental error i.e. the _r_ of 2^k*r but I can't 
see how to integrate the _r_ term into the data and into the aov function 
parameters. Can anyone advice?  

TIA,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] issue plotting TukeyHSD

2011-11-13 Thread Giovanni Azua
Hello,

When I try to use TukeyHSD in the following way it shows the confidence 
interval corresponding to the last factor only.

> throughput.aov <- 
> aov(Throughput~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)

plot(TukeyHSD(throughput.aov)) # I expected here to see the confidence 
intervals for all factors but see only the last.

OTOH this one works but then it is unreadable due to the long labels of 
combined effects in the Y-axis ticks.

> throughput.aov <- 
> aov(Throughput~No_databases*Partitioning*No_middlewares*Queue_size,data=throughput)

TIA,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] 2^k experiment generator

2011-11-13 Thread Giovanni Azua

Never mind, found it, it is the expand.grid function.

On Nov 13, 2011, at 3:25 PM, Giovanni Azua wrote:
> Hello,
> 
> While looking for info on 2^k experimental design and anova I remember I saw 
> somewhere there was a function to generate all the experiments.  I can't find 
> the function anymore can anyone suggest?
> 
> The function takes as input the factors and levels and generates all the 
> experiments. I know I can do it myself using recursion but I want to avoid 1) 
> reinventing the wheel and 2) making mistakes while at it.
> 
> TIA,
> Best regards,
> Giovanni

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] dev.new() within a loop

2011-11-13 Thread Giovanni Azua

On Nov 13, 2011, at 3:23 PM, David Winsemius wrote:
>>> Please read both my comments and the FAQ more carefully . You are 
>>> inadequately considering the information that has been offered to you.
>>> 
>> Ok you wanted to make sure I have to read the FAQ well I didn't have to :) 
>> Googling using your suggestion found relevant matches and now it works.
> 
> Where does this resistance to reading the FAQ come from?

It is not resistance, the FAQ is very helpful for basic general questions but 
it can not cover all details. Sometimes it is very hard to find a specific 
detailed answer within a general FAQ. I have read the FAQ, what makes you think 
I didn't before? I just could not find the information I was looking for.


>> I had to wrap the ggplot call within a "print" for my loop to work which is 
>> IMHO one of the most obfuscated use-cases for using print I have bumped into.
> 
> It is a case of isolating the functional aspects of the plot-construction 
> process from the side-effects of interacting with a graphics device.
> 
>> So every user-defined function that try to plot anything has to be wrapped 
>> inside a print just in case it ever gets called from within a loop
> 
> That is not how I understand it. One reason is so there is an object in the 
> workspace that can be later modified. And I suspect the authors (and I am not 
> one of them)  imagined that there may be multiple steps in creation of the  
> object, not all of which should result in a plot appearing if this is being 
> done programatically. This is especially appropriate (it would seem to me) 
> for the ggplot plotting model, which adds a variety of layers to a core 
> object. It is also imagined that you may be sending this object to one of a 
> variety of devices.
> 
Ok bottom line is always wrap the plot call whatever it is within a print for 
the just in case.

Cheers,
Giovanni

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] 2^k experiment generator

2011-11-13 Thread Giovanni Azua
Hello,

While looking for info on 2^k experimental design and anova I remember I saw 
somewhere there was a function to generate all the experiments.  I can't find 
the function anymore can anyone suggest?

The function takes as input the factors and levels and generates all the 
experiments. I know I can do it myself using recursion but I want to avoid 1) 
reinventing the wheel and 2) making mistakes while at it.

TIA,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] dev.new() within a loop

2011-11-13 Thread Giovanni Azua
Hello David,

On Nov 13, 2011, at 5:20 AM, David Winsemius wrote:
>> However, when executing plot_raw which invokes dev.new(..) all windows come 
>> out blank whereas if I execute each file outside of a loop then I can see 
>> the plots properly.
> 
> Perhaps ...(you did not say what package this plot_raw function comes from) 
> ...  Read the FAQ about why lattice plot don't print. (It applies to all grid 
> based plotting functions.)
> 
plot_raw is my own function which just calls ggplot2. So basically I am not 
using Lattice.

What can I do differently to avoid the new windows coming back empty?

Thanks in advance,
Best regards,
Giovanni

plot_raw <- function(data,connect=TRUE,y_break=500,y_top=-1,label="") {
dev.new()
title <- paste(label, sep="")
if (y_top == -1) {
y_top <- max(data$Y)
}

if (!decouple) {
# add fake group
data$Workload <- 'All'
}

p <- 
ggplot(data,aes(x=Time,y=Y,group=Workload,shape=Workload,colour=Workload)) + 
geom_point(fill="white", size=3) + 
scale_y_continuous(breaks=seq(0,max(data$Y),y_break), limits=c(0, y_top)) +  
scale_y_continuous(breaks=seq(0,y_limit_top(data$Y,data$se), 
y_break_step(data$Y,data$se)), 
limits=c(0, y_limit_top(data$Y,data$se))) + 
opts(title=title) + theme_bw() +
scale_x_continuous(breaks=data$Time, labels=as.character(data$Time))
 

if (connect) {
p + geom_line()
} else {
p
}
}
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] dev.new() within a loop

2011-11-12 Thread Giovanni Azua
Hello,

I have a loop where I iterate performance data files within a folder, parse and 
plot them in one shot (see below).

However, when executing plot_raw which invokes dev.new(..) all windows come out 
blank whereas if I execute each file outside of a loop then I can see the plots 
properly. What's wrong here?

Thanks in advance,
Best regards,
Giovanni

# given a directory name, it will iterate all files that match the given pattern
#basedir <- "/Users/bravegag/code/asl11/data/2k-r1-test-2011_data/"
basedir <- "/Users/bravegag/code/asl11/data/nclients_2_128-2010_data/"
pattern <- paste("logs.*cl\\-.*mw\\-.*db\\-.*\\-client\\.dat",sep="")
all_files <- dir(path=basedir, pattern=pattern)

throughput <- NULL
response <- NULL

#file_name <- all_files[1]

# iterate all files
for (file_name in all_files) {
print(paste("processing", file_name, "..."))

df <- read.table(paste(basedir, file_name, sep=""))  # read 
the data as a data frame
names(df)[names(df)=="V1"] <- "Time"
names(df)[names(df)=="V2"] <- "Partitioning"
names(df)[names(df)=="V3"] <- "Workload"
names(df)[names(df)=="V4"] <- "Runtime"

# get rid of first and last n minutes 
df <- subset(df, df$Time > warmup_cooldown_minutes)
df <- subset(df, df$Time < (max(df$Time) - warmup_cooldown_minutes))

# 
=
# Throughput
# 
=
if (decouple) {
dft <- aggregate(x=df$Runtime, by=list(df$Time,df$Workload), 
FUN=length)
names(dft)[names(dft)=="Group.1"] <- "Time"   
names(dft)[names(dft)=="Group.2"] <- "Workload"
names(dft)[names(dft)=="x"] <- "Y"

} else {
dft <- aggregate(x=df$Runtime, by=list(df$Time), FUN=length)
names(dft)[names(dft)=="Group.1"] <- "Time"   
names(dft)[names(dft)=="x"] <- "Y"
}
dft$se <- 0
plot_raw(dft,connect=TRUE,label=file_name)
}
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] 2^k*r experimental design and anova

2011-11-10 Thread Giovanni Azua
Hello,

Can anyone point me to an online tutorial or book containing the easiest way to 
do ANOVA over the result data from a 2^k*r experiment. It is not clear to me if 
I can pass the raw data corresponding to each experiment or just the summarized 
data i.e. mean, sse, std, etc.

I would like to get the:
- box plot showing the effect for the different factors and levels
- plot showing whether there are interaction between every pair of factors
- see the results in tabular form showing the percent of influence of each 
factor and since this is "r" see also the percent due to experimental error.

I would very much prefer if the plotting is using ggplot2.

TIA,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] binning runtimes

2011-10-24 Thread Giovanni Azua
Hello,

Suppose I have the dataset shown below. The amount of observations is too 
massive to get a nice geom_point and smoother on top. What I would like to do 
is to bin the data first. The data is indexed by Time (minutes from 1 to 120 
i.e. two hours of System benchmarking).

Option 1) group the data by Time i.e. minute 1, minute 2, etc and within each 
group create bins of N consecutive observations and average them into one 
observation, the bins become the new data points to use for the geom_point 
plot. How can I do this? Shingle? how to do that? 

Option 2)  Another option is to again group by Time i.e. minute 1, minute 2, 
etc and within each group draw a random observation to be the representative 
for the corresponding bin. I could not clearly see how to use Random.

> dfs <- subset(df, Partitioning == "Sharding")
> head(dfs)
  Time Partitioning Workload Runtime
11 ShardingQuery3301
21 ShardingQuery3268
31 ShardingQuery2878
41 ShardingQuery2819
51 ShardingQuery3310
61 ShardingQuery3428
> str(dfs)
'data.frame':   102384 obs. of  4 variables:
 $ Time: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Partitioning: Factor w/ 2 levels "Replication",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Workload: Factor w/ 2 levels "Query","Refresh": 1 1 1 1 1 1 1 1 1 1 ...
 $ Runtime : int  3301 3268 2878 2819 3310 3428 2837 2954 2902 2936 ...
> 

Many thanks in advance,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] code review: is it too much to ask?

2011-10-23 Thread Giovanni Azua
Hello all,

I really appreciate how helpful the people in this list are. Would it be too 
much to ask to send a small script to have it peer-reviewed? to make sure I am 
not making blatant mistakes? The script takes an experiment.dat as input and 
generates system Throughput using ggplot2. It works now ... [sigh] but I have 
this nasty feeling that I might be doing something wrong :). Changing "samples" 
i.e. number of samples per group produces arbitrarily different results, I 
basically increased it (until 9) until there were no strongly deterministic 
periodicities. This is not a full-fledge experiment but just a preliminary 
report that will show I have implemented a healthy system. Proper experimental 
analysis comes after varying factors according to the 2^k*r experimental design 
etc 

Some key points I would like to find out:
- aggregation is not breaking the natural order of the measurements i.e. if 
there are 20 runtimes taken in that order, and I make groups of 10 measurements 
(to compute statistics on them) the first group must contain the first 10 
runtimes and the second group must contain the second 10 runtimes. I am not 
sure if the choice of aggregation etc is respecting this.
- I am not sure if it is best to do the binning by filling the bins by time 
intervals of by number of observations.

Your help will be greatly appreciated!

I have the data too and the plots look very nice but it is a 4mb file.

TIA
Best regards,
Giovanni

# 
=
# Advanced Systems Lab 
# Milestone 1
# Author: Giovanni Azua
# Date: 22 October 2011
# 
=

rm(list=ls())# clear 
workspace

library(boot)# use boot 
library
library(ggplot2) # use 
ggplot2 library
library(doBy)# use doBy 
library

# 
=
# ETL Step
# 
=

data_file <- file("/Users/bravegag/code/asl11/trunk/report/experiment.dat")
df <- read.table(data_file)  # reads 
the data as data frame
class(df)# show the 
class to be 'list' 
names(df)# data is 
prepared correcly in Python
str(df)
head(df)

names(df)[names(df)=="V1"] <- "Time" # change 
column names
names(df)[names(df)=="V2"] <- "Partitioning"
names(df)[names(df)=="V3"] <- "Workload"
names(df)[names(df)=="V4"] <- "Runtime"
str(df)
head(df)

# 
=
# Define utility functions
# 
=

se <- function(x) sqrt(var(x)/length(x))
sst <- function(x) sum(x-mean(x))^2

##  COPIED FROM 

## 
http://wiki.stdout.org/rcookbook/Graphs/Plotting%20means%20and%20error%20bars%20%28ggplot2%29
## 
*
## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and 
confidence interval (default 95%).
## If there are within-subject variables, calculate adjusted values using 
method from Morey (2008).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be 
summariezed
##   groupvars: a vector containing names of columns that contain grouping 
variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 
95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, 
conf.interval=.95) {
require(doBy)

# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else   length(x)
}

# Collapse the data
formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" + "), 
sep=" ~ "))
datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)

# Rename columns
names(datac)[ names(datac) == paste(measurevar, ".mean", sep="") ] <- 
measurevar
names(datac)[ names(datac) == paste(measurevar, ".sd", sep="&q

[R] summarizing a data frame i.e. count -> group by

2011-10-23 Thread Giovanni Azua
Hello,

This is one problem at the time :)

I have a data frame df that looks like this:

  time partitioning_mode workload runtime
1 1  shardingquery 607
2 1  shardingquery  85
3 1  shardingquery  52
4 1  shardingquery  79
5 1  shardingquery  77
6 1  shardingquery  67
7 1  shardingquery  98
8 1  sharding  refresh2932
9 1  sharding  refresh2870
101  sharding  refresh2877
111  sharding  refresh2868
121   replicationquery2891
131   replicationquery2907
141   replicationquery2922
151   replicationquery2937

and if I could use SQL ... omg! I really wish I could! I would do exactly this:

insert into throughput
  select time, partitioning_mode, count(*)
  from data.frame 
  group by time, partitioning_mode

My attempted R versions are wrong and produce very cryptic error message:

> throughput <- aggregate(x=df[,c("time", "partitioning_mode")], 
> by=list(df$time,df$partitioning_mode), count)
Error in `[.default`(df2, u_id, , drop = FALSE) : 
  incorrect number of dimensions

> throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode), count)
Error in `[.default`(df2, u_id, , drop = FALSE) : 
  incorrect number of dimensions

>throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning), FUN=count)
I cant comprehend what comes out from this one ... :(

and I thought C++ template errors were the most cryptic ;P

Many many thanks in advance,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] unfold list (variable number of columns) into a data frame

2011-10-23 Thread Giovanni Azua
Hi Dennis,

Thank you very nice :)

Best regards,
Giovanni

On Oct 23, 2011, at 6:55 PM, Dennis Murphy wrote:

> Hi:
> 
> Here's one approach:
> 
> # Function to process a list component into a data frame
> ff <- function(x) {
> data.frame(time = x[1], partitioning_mode = x[2], workload = x[3],
>runtime = as.numeric(x[4:length(x)]) )
>   }
> 
> # Apply it to each element of the list:
> do.call(rbind, lapply(data, ff))
> 
> or equivalently, using the plyr package,
> 
> library('plyr')
> ldply(data, ff)
> 
> # Example:
> L <- list(c("1", "sharding", "query", "607", "85", "52", "79", "77",
> "67", "98"),
>  c("1", "sharding", "refresh", "2932", "2870", "2877", "2868"),
>  c("1", "replication", "query", "2891", "2907", "2922", "2937"))
> do.call(rbind, lapply(L, ff))
>   time partitioning_mode workload runtime
> 1 1  shardingquery 607
> 2 1  shardingquery  85
> 3 1  shardingquery  52
> 4 1  shardingquery  79
> 5 1  shardingquery  77
> 6 1  shardingquery  67
> 7 1  shardingquery  98
> 8 1  sharding  refresh2932
> 9 1  sharding  refresh2870
> 101  sharding  refresh2877
> 111  sharding  refresh2868
> 121   replicationquery2891
> 131   replicationquery2907
> 141   replicationquery2922
> 151   replicationquery2937
> 
> HTH,
> Dennis

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] unfold list (variable number of columns) into a data frame

2011-10-23 Thread Giovanni Azua
Hello,

I used R a lot one year ago and now I am a bit rusty :)

I have my raw data which correspond to the list of runtimes per minute (minute 
"1" "2" "3" in two database modes "sharding" and "query" and two workload types 
"query" and "refresh") and as a list of char arrays that looks like this:

> str(data)
List of 122
 $ : chr [1:163] "1" "sharding" "query" "607" "85" "52" "79" "77" "67" "98"  ...
 $ : chr [1:313] "1" "sharding" "refresh" "2932" "2870" "2877" "2868" ...
 $ : chr [1:57] "1" "replication" "query" "2891" "2907" "2922" "2937" ...
 $ : chr [1:278] "1" "replication refresh "79" "79" "89" "79" "89" "79" "79" 
"79" ...
 $ : chr [1:163] "2" "sharding" "query" "607" "85" "52" "79" "77" "67" "98"  ...
 $ : chr [1:313] "2" "sharding" "refresh" "2932" "2870" "2877" "2868" ...
 $ : chr [1:57] "2" "replication" "query" "2891" "2907" "2922" "2937" ...
 $ : chr [1:278] "2" "replication refresh "79" "79" "89" "79" "89" "79" "79" 
"79" ...
 $ : chr [1:163] "3" "sharding" "query" "607" "85" "52" "79" "77" "67" "98"  ...
 $ : chr [1:313] "3" "sharding" "refresh" "2932" "2870" "2877" "2868" ...
 $ : chr [1:57] "3" "replication" "query" "2891" "2907" "2922" "2937" ...
 $ : chr [1:278] "3" "replication refresh "79" "79" "89" "79" "89" "79" "79" 
"79" ...
 
I would like to transform the one above into a data frame where this structure 
in unfolded in the following way:

'data.frame': N obs. of  3 variables:
 $ time : int  1 1 1 1 1 1 1 1 1 1 1 ...
 $ partitioning_mode : chr "sharding" "sharding" "sharding" "sharding" 
"sharding" "sharding" "sharding" "sharding" "sharding" "sharding" ...
 $ workload : chr "query" "query" "query" "query" "query" "query" "query" 
"refresh" "refresh" "refresh" "refresh" ...
 $ runtime : num  607 85 52 79 77 67 98 2932 2870 2877 2868...

So instead of having an associative array (variable number of columns) it 
should become a simple list where the group or factors are repeated for every 
occurrence of the  specific runtime. Basically my ultimate goal is to get a 
data frame structure that is "summarizeBy"-friendly and "ggplot2-friendly" i.e. 
using this data frame format.

Help greatly appreciated!

TIA,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] issue loading doBy library

2011-10-23 Thread Giovanni Azua
Hi Josh,

Thank you for your feedback, after lot of trial and error the problem is 
finally solved.

To solve this problem, I tried in this order:

1) uninstalling the two packages "Matrix" and "lme4" and reinstalling them.
2) uninstalling doBy and reinstalling it with and without 1)
3) upgrading to the latest R version and re-doing 1) 2)

And finally 4) wiped all R traces from my Mac OS X 10.7.2 and re-installed the 
latest version. The latest Matrix version seems to have changed in the 
meanwhile so I was lucky and now it works.

It seems to me that the whole concept of dependency analysis for installing 
packages in R is broken ... seems like the packages depend only on the package 
name and not on the specific versions which is wrong as in this case, chances 
are that a user will say in this "live lock" where will never find a "happy 
together" versions of Matrix and lme4 ... but well, you statisticians know 
better about chances :P
 
Thank you again.
Best regards,
Giovanni

On Oct 23, 2011, at 3:12 AM, Joshua Wiley wrote:

> Hi Giovanni,
> 
> This is a dependency issue between lme4 and Matrix.  There is
> substantial discussion of this on the R sig mixed models list.  A
> simple update may fix the problem, or you may need to be a little bit
> more precise about getting version of Matrix and lme4 that work with
> each other.
> 
> HTH,
> 
> Josh

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] issue loading doBy library

2011-10-22 Thread Giovanni Azua
Hello,

How can I fix this? I have the latest version of R 2.13.2 and I use Mac OS X 
10.7.2

> library(doBy)
Loading required package: lme4
Error in dyn.load(file, DLLpath = DLLpath, ...) : 
function 'cholmod_l_start' not provided by package 'Matrix'
Error: package 'lme4' could not be loaded
> library(lme4)
Error in dyn.load(file, DLLpath = DLLpath, ...) : 
function 'cholmod_l_start' not provided by package 'Matrix'
Error: package/namespace load failed for 'lme4'

TIA,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] update and rebuild all?

2010-08-24 Thread Giovanni Azua
Hello,

I upgraded my Mac R version to the newest 2.11.1, then I ran the option to 
update all packages but there was an error related to fetching one of those and 
the process stopped. I retried updating all packages but nothing happens. 
Although all my course project scripts work perfectly is there a way e.g. a 
command to manually fetch (most up to date version) and locally build all 
installed packages? to make sure it is all ok? I recall there was something 
like that but don't remember what command it was.

Thanks in advance,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] plot for linear discriminant

2010-05-16 Thread Giovanni Azua
Hello Hadley,

Thank you very much for your help! I have just received your book btw :)

On May 16, 2010, at 6:16 PM, Hadley Wickham wrote:
>Hi Giovanni,
>
>Have a look at the classifly package for an alternative approach that
>works for all classification algorithms.  If you provided a small
>reproducible example, I could step through it for you.
>
>Hadley

Please find below a self contained example. I managed to complete the task 
using the graphics package. I would be curious to see how to get one of those 
really nice ggplot2 graphs with decision boundaries and class regions :) 

Thank you!
Best regards,
Giovanni

# 
=
# (1) Generate sample labelled data   
# 
=

rm(list=ls())   
  # clear workspace
library(mvtnorm)
 # needed for rmvnorm
set.seed(11)
   # predictability of results

sigma <- cbind(c(0.5, 0.3), c(0.3, 0.5))
   # true covariance matrix
mu <- matrix(0,nrow=3,ncol=2)
mu[1,] <- c(3, 1.5) 
  # true mean vectors
mu[2,] <- c(4,   4)
mu[3,] <- c(8.5, 2)
x <- matrix(0, nrow = 300, ncol = 3)
x[,3] <- rep(1:3, each = 100)   
# class labels
x[1  :100,1:2] <- rmvnorm(n = 100, mean = mu[1,], sigma = sigma)   # simulate 
data
x[101:200,1:2] <- rmvnorm(n = 100, mean = mu[2,], sigma = sigma)
x[201:300,1:2] <- rmvnorm(n = 100, mean = mu[3,], sigma = sigma)

# 
=
# (2) Plot the labelled data   
# 
=

##
## Function for plotting the data separated by classes, hacked out of predplot:
## http://stat.ethz.ch/teaching/lectures/FS_2010/CompStat/predplot.R
##
plotclasses <- function(x, main = "", len = 200, ...) {
xp <- seq(min(x[,1]), max(x[,1]), length=len)
yp <- seq(min(x[,2]), max(x[,2]), length=len)
grid <- expand.grid(xp, yp)
colnames(grid) <- colnames(x)[-3]
plot(x[,1],x[,2],col=x[,3],pch=x[,3],main=main,xlab='x_1',ylab='x_2')
text(2.5,4.8,"Class 1",cex=.8)  
# class 1  
text(4.2,1.0,"Class 2",cex=.8)  
# class 2
text(8.0,0.5,"Class 3",cex=.8)  
# class 3
}

plotclasses(x)

# 
=
# (3) Functions needed: calculate separating hyperplane between two given 
# classes and converting hyperplanes to line equations for the p=2 case 
# 
=

##
## Returns the coefficients for the hyperplane that separates one class from 
another. 
## Computes the coefficients according to the formula: 
## $x^T\hat{\Sigma}^{-1}(\hat{\mu}_0-\hat{\mu}_1) - \frac{1}{2}(\hat{\mu}_0 + 
## 
\hat{\mu}_1)^T\hat{\Sigma}^{-1}(\hat{\mu}_0-\hat{\mu}_1)+\log(\frac{p_0}{p_1})$ 
 
##
## sigmainv(DxD) - precalculated sigma (covariance matrix) inverse
## mu1(1xD) - precalculated mu mean for class 1
## mu2(1xD) - precalculated mu mean for class 2
## prior1 - precalculated prior probability for class 1
## prior2 - precalculated prior probability for class 2
##
ownldahyperplane <- function(sigmainv,mu1,mu2,prior1,prior2) {
J <- nrow(mu)   
   # number of classes
b <- sigmainv%*%(mu1 - mu2)
c <- -(1/2)*t(mu1 + mu2)%*%sigmainv%*%(mu1 - mu2) + log(prior1/prior2) 
return(list(b=b,c=c))
}

##
## Returns linear betas (intersect and slopes) for the given hyperplane 
structure. 
## The structure is a list that matches the output of the function defined 
above. 
##
ownlinearize <- function(sephyp) {
return(list(beta0=-sephyp$c/sephyp$b[2],
   # line slope and intersect
beta1=-sephyp$b[1]/sephyp$b[2]))
}

# 
=
# (4) Run lda  
# 
=

library(MASS)   
   # needed for lda/qda

# read in a function 

[R] abline limit constrain x-range how?

2010-05-15 Thread Giovanni Azua
Hello,

I managed to "linearize" my LDA decision boundaries now I would like to call 
abline three times but be able to specify the exact x range. I was reading the 
doc but it doesn't seem to support this use-case? are there alternatives. The 
reason why I use abline is because I first call plot to plot all the three 
datasets and then call abline to "append" these decision boundary lines to the 
existing plot ...

Can anyone please advice?
Thanks in advance,
Best regards,
Giovanni 
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] plot for linear discriminant

2010-05-15 Thread Giovanni Azua
Hello,

I have a labelled dataset with three classes. I have computed manually the LDA 
hyperplane that separate the classes from each other i.e.

\hat{\delta}_j(x)=x^Tb_j + c_j where b_j \in \mathbb{R}^p and c_j \in \mathbb{R}

my concrete b_j looks like e.g. 
b_j <- rbind(1,2)
c_j <- 3

How can I plot y=x^Tb_j + c_j ??  two problems:

1- I need lines and the dimension of my x is 2
2- I would like the plotted lines to end when they intersect so they nicely 
show the decision boundaries

Any pointers? maybe an example with ggplot2 I could not find any from the 
showcase documentation page ...

Thanks in advance,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] plot formula 'x' is missing?

2010-05-14 Thread Giovanni Azua
Hi Jorge and Dennis,

Thank you for the hint! 

However, I'm still very intrigued as to why it does not work using plot ... 
what is special about this specific formula that plot doesn't like it?

Best regards,
Giovanni

On May 15, 2010, at 7:12 AM, Jorge Ivan Velez wrote:
> Hi Giovanni,
> 
> curve(1/(1+exp(5.0993-0.1084*x)), 0, 100)
> 
> HTH,
> Jorge
> 
> 
> On Sat, May 15, 2010 at 12:43 AM, Giovanni Azua <> wrote:
> Hello,
> 
> I'd like to plot the logistic function for a specific model like this:
> 
> > plot(formula=y~1/(1+exp(5.0993-0.1084*x)),data=data.frame(x=seq(0,100,length.out=1000)))
> Error in is.function(x) : 'x' is missing
> 
> However, I get the 'x' is missing error above and don't know how to fix it ...
> 
> Can anyone advice?
> Thanks in advance,
> Best regards,
> Giovanni
> 
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] plot formula 'x' is missing?

2010-05-14 Thread Giovanni Azua
Hello,

I'd like to plot the logistic function for a specific model like this:

> plot(formula=y~1/(1+exp(5.0993-0.1084*x)),data=data.frame(x=seq(0,100,length.out=1000)))
Error in is.function(x) : 'x' is missing 

However, I get the 'x' is missing error above and don't know how to fix it ...

Can anyone advice?
Thanks in advance,
Best regards,
Giovanni

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] plot with no default axis labels

2010-05-14 Thread Giovanni Azua
Hello Jim,

Very nice example! thank you!

Best regards,
Giovanni

On May 14, 2010, at 11:50 AM, Jim Lemon wrote:

> On 05/14/2010 07:31 PM, Giovanni Azua wrote:
>> Hello,
>> 
>> I could not find an easy way to have the plot function not display the 
>> default x and y-axis labels, I would like to customize it to show only 
>> points of interest ... I would like to:
>> 
>> 1- call plot that show no x-axis and y-axis labels
>> 2- call axis specifying the exact points of interest for the x and y-axis
>> 
>> Maybe they can both be achieved in the plot function call but I can't find 
>> the right way to do it ...
>> 
> Hi Giovanni
> Let's get creative:
> 
> # making up data can be fun!
> mydata<-sort(runif(20))
> plot(mydata,main="A most creative plot",
> xlab="",ylab="",type="b",axes=FALSE)
> box()
> axis(1,at=c(1,7,9,14,17))
> require(plotrix)
> staxlab(2,at=c(0,0.3279614,0.4431966,0.729113,0.937461))
> 
> Jim
> 

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] plot with no default axis labels

2010-05-14 Thread Giovanni Azua
Hello,

I found the answer here:
http://www.statmethods.net/advgraphs/axes.html

basically plot(...,axes=FALSE,...)  ## avoids default axis labels

Best regards,
Giovanni

On May 14, 2010, at 11:31 AM, Giovanni Azua wrote:

> Hello,
> 
> I could not find an easy way to have the plot function not display the 
> default x and y-axis labels, I would like to customize it to show only points 
> of interest ... I would like to:
> 
> 1- call plot that show no x-axis and y-axis labels
> 2- call axis specifying the exact points of interest for the x and y-axis
> 
> Maybe they can both be achieved in the plot function call but I can't find 
> the right way to do it ...
> 
> Thanks in advance,
> Best regards,
> Giovanni

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] plot with no default axis labels

2010-05-14 Thread Giovanni Azua
Hello,

I could not find an easy way to have the plot function not display the default 
x and y-axis labels, I would like to customize it to show only points of 
interest ... I would like to:

1- call plot that show no x-axis and y-axis labels
2- call axis specifying the exact points of interest for the x and y-axis

Maybe they can both be achieved in the plot function call but I can't find the 
right way to do it ...

Thanks in advance,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ggplot2's geom_errorbar legend

2010-05-02 Thread Giovanni Azua
Hello Ista,

On May 1, 2010, at 8:37 PM, Ista Zahn wrote:
> Hi Giovanni,
> A reproducible example would help. Also, since I think this will be
> tricky, it might be a good idea to post it to the ggplot2 mailing list
> (you can register at http://had.co.nz/ggplot2/ ).
> 
> Best,
> Ista

First, thank you so much for showing me the way ... the plots generated from 
ggplot2 are super nice! 

Please find below a self contained reproducible example nearly exactly as the 
ones I need to produce.

Thanks in advance,
Best regards,
Giovanni

rm(list=ls())  # clear workspace
library(ggplot2)   # use ggplot2 library

methodlabels <- c("Classic", "Own Bootstrap",  # prepare plot 
labels 
"R Bootstrap")   
errortypelabels <- c("'Normal'", "'Student-t'",# prepare plot labels
"'Exponential'")
betalabels <- c("'Beta_1'", "'Beta_2'", "'Beta_3'")# prepare plot labels
betas <- c(1, -2, 3)   # my real parameters
set.seed(11)
S <- 100
B <- 50
P <- length(betas)
classiccis <- owncis <- rbootcis <- array(data=0,dim=c(S,1,3,2))
for (p in 1:P) {
classiccis[,1,p,1] <- rnorm(100,mean=betas[p]-1.0,sd=0.5)
classiccis[,1,p,2] <- rnorm(100,mean=betas[p]+1.0,sd=0.5)
owncis[,1,p,1] <- rnorm(100,mean=betas[p]-1.0,sd=0.5)
owncis[,1,p,2] <- rnorm(100,mean=betas[p]+1.0,sd=0.5)
rbootcis[,1,p,1] <- rnorm(100,mean=betas[p]-1.0,sd=0.5)
rbootcis[,1,p,2] <- rnorm(100,mean=betas[p]+1.0,sd=0.5)
}

##
## Function that generates the ggplot of the CI for a given parameter
## and error type
##
ciplot <- function() {
x <- 1:S
y <- rep(betas[p],S)
data <- data.frame(x,y)
classiclimits <- aes(x=x[1:100],ymax = classiccis[1:100,e,p,1], 
ymin=classiccis[1:100,e,p,2]) 
ownlimits <- aes(x=x[1:100]+0.4,ymax = owncis[1:100,e,p,1], 
ymin=owncis[1:100,e,p,2]) 
rbootlimits <- aes(x=x[1:100]+0.8,ymax = rbootcis[1:100,e,p,1], 
ymin=rbootcis[1:100,e,p,2]) 
g1 <- ggplot(data, aes(x=x, y=y))
g1 + geom_errorbar(classiclimits, colour="red") + geom_errorbar(
 ownlimits, colour="green") + geom_errorbar(rbootlimits, 
 colour="blue") + geom_hline(yintercept = betas[p]) + xlab(
 "Simulation") + ylab(betalabels[p]) + opts(title = paste("CI for",
 errortypelabels[e],"error, ",betalabels[p],",",S/5,
 "simulations and ",B,"bootstrapped samples"))
}

##
## Execute one by one ... note they have to be global 
## otherwise seems that ggplot won't find them.
## 
## p - means parameter index
## e - means error type index
## 
p=1;e=1;ciplot()
p=1;e=2;ciplot()
p=1;e=3;ciplot()
p=2;e=1;ciplot()
p=2;e=2;ciplot()
p=2;e=3;ciplot()
p=3;e=1;ciplot()
p=3;e=2;ciplot()
p=3;e=3;ciplot() 
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] cbind and automatic type conversion

2010-05-01 Thread Giovanni Azua
Hello,

I have three method types and 100 generalization errors for each, all in the 
range [0.65,0.81]. I would like to make a stacked histogram plot using ggplot2 
with this data ...
 
Therefore I need a data frame of the form e.g.

Method   GE
--   --
"Classic"0.76
"Classic"0.79
"Own Bootstrap"   0.81
"Own Bootstrap"   0.79
"R Bootstrap"0.71
"R Bootstrap"0.75

So I combine the data in the following way:

normalerrors <- rbind(cbind(rep("Classic",S),classicge[,1]),
cbind(rep("Own Bootstrap",S),ownge[,1]),cbind(rep("R 
Bootstrap",S),rbootge[,1]))
normalerrors <- data.frame(method=factor(normalerrors[,1]),ge=normalerrors[,2])

But doing it in this way my GE coefficients get automatically converted to 
string type ... how can I avoid this conversion when doing the cbind?

TIA,
Best regards,
Giovanni


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] closest match in R to c-like struct?

2010-05-01 Thread Giovanni Azua

On May 1, 2010, at 6:48 PM, steven mosher wrote:
> I was talking with another guy on the list about this very topic.
> 
> A simple example would help.
> 
> first a sample C struct, and then how one would do the equivalent in R.
> 
> In the end i suppose one want to do a an 'array' of these structs, or list
> of the structs.

Or like in my use-case ... I needed a c-like struct to define the type for 
aggregating the data to return from a function. 

Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] closest match in R to c-like struct?

2010-05-01 Thread Giovanni Azua

On May 1, 2010, at 5:04 PM, (Ted Harding) wrote:
> Well, 'list' must be pretty close! The main difference would be
> that in C the structure type would be declared first, and then
> applied to create an object with that structure, whereas an R
> lists are created straight off. If you want to set up a generic
> list type for a certain purpose, you would wrap its definition
> in a function.
> 
> Another difference is that R lacks the "pointer" type, so that
> R's "mylist$component" is the equivalent of C's "mylist.component";
> I don't think you can do the equivalent in R of C's "mylist->component"
> (though I'm likely to be wrong about that, and to be promptly corrected)!
> 
> Hopingb this helps,
> Ted.

Thank you all for your help! this list is really great ... I mean the mailing 
list :)

Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] bootstrap generalization error

2010-05-01 Thread Giovanni Azua
Hello,

I use the following function "bootstrapge" to calculate (and compare) the 
generalization error of several bootstrap implementations:

##
## Calculates and returns a coefficient corresponding to the generalization 
## error. The formula for the bootstrap generalization error is:
## $N^{-1}\sum_{i=1}^n B^{-1}\sum_{j=1}^B |y_i - (\beta_n^{*j})^T x|$
## 
## x - mxn matrix where m is the number of model parameters and n is the 
## number of observations
## y - n column-vector containing true values
## theta_star - mxn matrix where m is the number of bootstrapped samples 
##  and n is the number of model parameters 
##
bootstrapge <- function(x,y,theta_star) {
B <- nrow(theta_star)
P <- ncol(theta_star)

t <- 0
for (b in 1:B) {
t <- t + abs(y - rbind(theta_star[b,])%*%x)
}
return(mean(t/B))
}  

Is there a nicer/faster way to accomplish the same using implicit loop 
functions e.g. apply, sapply etc I could not figure it out ...

Is there a way to get a similar coefficient using the boot library? I could not 
find any way to get such a "generalization error" so I can compare my 
implementation with that one ...

TIA,
Best regards,
Giovanni
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] closest match in R to c-like struct?

2010-05-01 Thread Giovanni Azua
Hello,

What would be in R the closest match to a c-struct? e.g. data.frame requires 
all elements to be of the same length ... or is there a way to circumvent this?

TIA,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] ggplot2's geom_errorbar legend

2010-05-01 Thread Giovanni Azua
Hello,

I create a simple ggplot that only shows a straight line. I then add three 
datasets of CI using the geom_errorbar function. The problem is that I can't 
find any way to have the legend showing up ... I need to show what each  color 
of the CIs corresponds to i.e. which method. 

Can anyone advice please?
TIA,
Best regards,
Giovanni

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] apply question

2010-04-30 Thread Giovanni Azua
Hello David,

On Apr 30, 2010, at 11:00 PM, David Winsemius wrote:
> Note: Loops may be just as fast or faster than apply calls.
> 
How come!? is this true also for other similar functions: lapply, tapply and 
sapply?

Then the only advantage of these above is only syntactic sugar? 

>> 
>> indices <- replicate(B,sample(1:N, size=N, replace=TRUE))
>> theta_star <- apply(indices,2,statistic,data)
> 
> Why not:
> 
> theta_star <- apply(indices,2, function(x) statistic(data, x) )
> 
> May end up as a data class that needs further work, depending on what 
> "statistic" return.
> 
Nice! thank you!

Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] apply question

2010-04-30 Thread Giovanni Azua
Hello,

I have a bootstrap implementation loop that I would like to replace by a faster 
"batch operator".

The loop is as follows:

for (b in 1:B) {   
  indices <- sample(1:N, size=N, replace=TRUE) # sample n elements with 
replacement 
  theta_star[b,] = statistic(data,indices) # execute statistic callback 
function
}

I would like to replace this by something like (does not work):

indices <- replicate(B,sample(1:N, size=N, replace=TRUE))
theta_star <- apply(indices,2,statistic,data)

This, however, does not work because the statistic function receives indices as 
second parameter and not as first parameter. While I could change the signature 
of my statistic function I would like to know how to achieve this without 
having to do so.

TIA,
Best regards,
Giovanni
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] ggplot2 legend how?

2010-04-30 Thread Giovanni Azua
Hello,

I have just ordered the "ggplot2: Elegant Graphics for Data Analysis (Use R)" 
but while it arrives :) can anyone please show me how to setup and add a simple 
legend to a ggplot?

This is my use case, I need a legend showing CI "Classic", "Own bootstrap", "R 
bootstrap":

library(ggplot2)

e <- 1
p <- 1
x <- 1:S
y <- rep(betas[p],S)
data <- data.frame(x,y)
classiclimits <- aes(x=x,ymax = classic[,e,p,1], ymin=classic[,e,p,2]) 
ownlimits <- aes(x=x+0.4,ymax = own[,e,p,1], ymin=own[,e,p,2]) 
rbootlimits <- aes(x=x+0.8,ymax = rboot[,e,p,1], ymin=rboot[,e,p,2]) 
g1 <- ggplot(data, aes(x, y))
g1 + geom_errorbar(classiclimits, colour = "red")  + geom_errorbar(ownlimits, 
colour = "green") + geom_errorbar(rbootlimits, colour = "blue") + 
geom_hline(yintercept = betas[p]) + xlab("Simulation") + ylab("beta_1") + 
opts(title = "CI for error 'normal' and beta_1") + opts(legend.position = 
c(10,2.5))

Many thanks in advance,
Best regards,
Giovanni
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] SOLVED plotting multiple CIs

2010-04-30 Thread Giovanni Azua
Hello,

After installing gfortran from http://r.research.att.com/gfortran-4.2.3.dmg it 
finally works! see below.

Thank you all.

@Ista Zahn: Looks fantastic! :) thank you so much! ... is there a way to have a 
small circle on the true value? 

Best regards,
Giovanni

> install.packages("Hmisc", dependencies=TRUE, type ="source")
trying URL 'http://cran.ch.r-project.org/src/contrib/Hmisc_3.7-0.tar.gz'
Content type 'application/x-gzip' length 563054 bytes (549 Kb)
opened URL
==
downloaded 549 Kb

* installing *source* package ‘Hmisc’ ...
** libs
** arch - i386
gcc -arch i386 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include 
-I/Library/Frameworks/R.framework/Resources/include/i386  -I/usr/local/include  
  -fPIC  -g -O2 -c Hmisc.c -o Hmisc.o
gfortran -arch i386   -fPIC  -g -O2 -c cidxcn.f -o cidxcn.o
gfortran -arch i386   -fPIC  -g -O2 -c cidxcp.f -o cidxcp.o
gfortran -arch i386   -fPIC  -g -O2 -c hoeffd.f -o hoeffd.o
gfortran -arch i386   -fPIC  -g -O2 -c jacklins.f -o jacklins.o
gfortran -arch i386   -fPIC  -g -O2 -c largrec.f -o largrec.o
gcc -arch i386 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include 
-I/Library/Frameworks/R.framework/Resources/include/i386  -I/usr/local/include  
  -fPIC  -g -O2 -c mChoice.c -o mChoice.o
gcc -arch i386 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include 
-I/Library/Frameworks/R.framework/Resources/include/i386  -I/usr/local/include  
  -fPIC  -g -O2 -c nstr.c -o nstr.o
gcc -arch i386 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include 
-I/Library/Frameworks/R.framework/Resources/include/i386  -I/usr/local/include  
  -fPIC  -g -O2 -c ranksort.c -o ranksort.o
gfortran -arch i386   -fPIC  -g -O2 -c rcorr.f -o rcorr.o
gcc -arch i386 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include 
-I/Library/Frameworks/R.framework/Resources/include/i386  -I/usr/local/include  
  -fPIC  -g -O2 -c string_box.c -o string_box.o
gfortran -arch i386   -fPIC  -g -O2 -c wclosest.f -o wclosest.o
gcc -arch i386 -std=gnu99 -dynamiclib -Wl,-headerpad_max_install_names 
-undefined dynamic_lookup -single_module -multiply_defined suppress 
-L/usr/local/lib -o Hmisc.so Hmisc.o cidxcn.o cidxcp.o hoeffd.o jacklins.o 
largrec.o mChoice.o nstr.o ranksort.o rcorr.o string_box.o wclosest.o 
-lgfortran -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework 
-Wl,CoreFoundation
installing to 
/Library/Frameworks/R.framework/Versions/2.11/Resources/library/Hmisc/libs/i386
** arch - x86_64
gcc -arch x86_64 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include 
-I/Library/Frameworks/R.framework/Resources/include/x86_64  
-I/usr/local/include-fPIC  -g -O2 -c Hmisc.c -o Hmisc.o
gfortran -arch x86_64   -fPIC  -g -O2 -c cidxcn.f -o cidxcn.o
gfortran -arch x86_64   -fPIC  -g -O2 -c cidxcp.f -o cidxcp.o
gfortran -arch x86_64   -fPIC  -g -O2 -c hoeffd.f -o hoeffd.o
gfortran -arch x86_64   -fPIC  -g -O2 -c jacklins.f -o jacklins.o
gfortran -arch x86_64   -fPIC  -g -O2 -c largrec.f -o largrec.o
gcc -arch x86_64 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include 
-I/Library/Frameworks/R.framework/Resources/include/x86_64  
-I/usr/local/include-fPIC  -g -O2 -c mChoice.c -o mChoice.o
gcc -arch x86_64 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include 
-I/Library/Frameworks/R.framework/Resources/include/x86_64  
-I/usr/local/include-fPIC  -g -O2 -c nstr.c -o nstr.o
gcc -arch x86_64 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include 
-I/Library/Frameworks/R.framework/Resources/include/x86_64  
-I/usr/local/include-fPIC  -g -O2 -c ranksort.c -o ranksort.o
gfortran -arch x86_64   -fPIC  -g -O2 -c rcorr.f -o rcorr.o
gcc -arch x86_64 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include 
-I/Library/Frameworks/R.framework/Resources/include/x86_64  
-I/usr/local/include-fPIC  -g -O2 -c string_box.c -o string_box.o
gfortran -arch x86_64   -fPIC  -g -O2 -c wclosest.f -o wclosest.o
gcc -arch x86_64 -std=gnu99 -dynamiclib -Wl,-headerpad_max_install_names 
-undefined dynamic_lookup -single_module -multiply_defined suppress 
-L/usr/local/lib -o Hmisc.so Hmisc.o cidxcn.o cidxcp.o hoeffd.o jacklins.o 
largrec.o mChoice.o nstr.o ranksort.o rcorr.o string_box.o wclosest.o 
-lgfortran -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework 
-Wl,CoreFoundation
installing to 
/Library/Frameworks/R.framework/Versions/2.11/Resources/library/Hmisc/libs/x86_64
** R
** inst
** preparing package for lazy loading
Loading required package: splines
** help
*** installing help indices
** building package indices ...
** testing if installed package can be loaded

* DONE (Hmisc)

The downloaded packages are in

‘/private/var/folders/5m/5mC0wYCSGMK8NbPZYhBEZE+++TI/-Tmp-/Rtmpz0R5nY/downloaded_packages’
Updating HTML index of packages in '.Library'
> 
__
R-help@r-project.org mailing list
https://stat.ethz.ch/m

Re: [R] plotting multiple CIs

2010-04-30 Thread Giovanni Azua

Hello David,

On Apr 30, 2010, at 6:00 PM, David Winsemius wrote:
> Looks like you do not have the RTools bundle and perhaps not the XCode 
> framework either?
> 
> I am not suggesting that you do so, since it appears you are not conversant 
> with compiling source code packages. If I am wrong about that point, then 
> this is where you would get the gfortran compiler that Simon uses:
> 
> http://r.research.att.com/tools/
> 
> My advice would be to go back to R.2.10.1 if you need Rcmdr.
> 
I do have Xcode latest  but not RTools. 

This below was my last failed attempt.

Best regards,
Giovanni

> install.packages("Hmisc", dependencies=TRUE, type ="source")
trying URL 'http://cran.ch.r-project.org/src/contrib/Hmisc_3.7-0.tar.gz'
Content type 'application/x-gzip' length 563054 bytes (549 Kb)
opened URL
==
downloaded 549 Kb

* installing *source* package ‘Hmisc’ ...
** libs
** arch - i386
gcc -arch i386 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include 
-I/Library/Frameworks/R.framework/Resources/include/i386  -I/usr/local/include  
  -fPIC  -g -O2 -c Hmisc.c -o Hmisc.o
gfortran -arch i386   -fPIC  -g -O2 -c cidxcn.f -o cidxcn.o
make: gfortran: No such file or directory
make: *** [cidxcn.o] Error 1
ERROR: compilation failed for package ‘Hmisc’
* removing 
‘/Library/Frameworks/R.framework/Versions/2.11/Resources/library/Hmisc’

The downloaded packages are in

‘/private/var/folders/5m/5mC0wYCSGMK8NbPZYhBEZE+++TI/-Tmp-/Rtmpz0R5nY/downloaded_packages’
Updating HTML index of packages in '.Library'
Warning message:
In install.packages("Hmisc", dependencies = TRUE, type = "source") :
  installation of package 'Hmisc' had non-zero exit status
> 
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] plotting multiple CIs

2010-04-30 Thread Giovanni Azua
Hello Zahn,

On Apr 30, 2010, at 4:35 PM, Ista Zahn wrote:
> Hi Giovanni,
> I think the ggplot2 package might help you out here. Do you want
> something like this?

Thank you for your suggestion however I could not give it a try since landed in 
the same  issue being reported about the Hmisc package:

> install.packages("Hmisc", dependencies=TRUE)
Warning message:
In getDependencies(pkgs, dependencies, available, lib) :
  package ‘Hmisc’ is not available
> library(ggplot2)
> library(Hmisc)
Error in library(Hmisc) : there is no package called 'Hmisc'
> 
> data(mtcars)
> 
> p <- ggplot(mtcars, aes(x=cyl, y=mpg))
> 
> p + stat_summary(fun.data = "mean_cl_boot", colour = "red", geom =
+ "errorbar") +  stat_summary(fun.data = "mean_cl_normal", colour =
+ "blue", geom = "errorbar")
Error: Hmisc package required for this functionality.  Please install and try 
again.
>

I am running Mac OS X Snow Leopard latest version and also have the latest R UI 
for Mac.

How can Hmisc be installed manually?

Best regards,
Giovanni  
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] plotting multiple CIs

2010-04-30 Thread Giovanni Azua
Hello,

I need to plot multiple confidence intervals for the same model parameter e.g. 
so for the same value of the parameter in point x_1 I would like to see four 
different confidence intervals so that I can compare the accuracy e.g. boot 
basic vs normal vs my own vs classic lm CI etc.   

I like very very much the plotCI implemented here:
http://cran.r-project.org/web/packages/plotrix/index.html

This is their example:
library(plotrix)
y <- runif(10)
err <- runif(10)
plotCI(x=1:10,y=y,uiw=err,liw=2*err,lwd=2,col="red",scol="blue",main="Add 
colors to the points and error bars")

but does not seem to support plotting multiple CIs but only one per point, is 
there a similar library somewhere but having the possibility to plot multiple 
CIs?

I know there is the function "segment" but I like this one above more :)

Thanks in advance,
Best regards,
Giovanni
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] function pointer question

2010-04-26 Thread Giovanni Azua
Hello Jan,

On Apr 26, 2010, at 8:56 AM, Jan van der Laan wrote:
> You can use the '...' for that, as in:
> 
> loocv <- function(data, fnc, ...) {
> n <- length(data.x)
> score <- 0
> for (i in 1:n) {
>   x_i <- data.x[-i]
>   y_i <- data.y[-i]
>   yhat <- fnc(x=x_i,y=y_i, ...)
>   score <- score + (y_i - yhat)^2
> }
> score <- score/n
> return(score)
> }
> 
> scoreks <- loocv(data,gaussiankernel, h=0.5)

It works nicely, thank you!

Best regards,
Giovanni

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] function pointer question

2010-04-25 Thread Giovanni Azua
The beauty of trial and error ... if I leave the non x, y parameters i.e. h as 
global parameters rather than formal parameters for gaussiankernel it works 
fine basically I don't pass anymore h=0.5 to gaussiankernel but consume it from 
a global variable. Ugly but works ...

Best regards,
Giovanni 

On Apr 26, 2010, at 1:38 AM, Giovanni Azua wrote:

> Hello,
> 
> I have the following function that receives a "function pointer" formal 
> parameter name "fnc":
> 
> loocv <- function(data, fnc) {
>   n <- length(data.x)
>   score <- 0
>   for (i in 1:n) {
> x_i <- data.x[-i]
> y_i <- data.y[-i]
> yhat <- fnc(x=x_i,y=y_i)
> score <- score + (y_i - yhat)^2  
>   }
>   score <- score/n
>   return(score)
> }
> 
> I would like to use it like this:
> 
> ##
> ## Estimator function using Gaussian Kernel
> ##
> gaussiankernel <- function(x,y,h) {
>   modelks <- ksmooth(x,y,kernel="normal",bandwidth=h,x.points=x)
>   yhat <- modelks$y
>   return(yhat)
> } 
> 
> scoreks <- loocv(data,gaussiankernel(h=0.5))
> 
> I expected this to work but it doesn't :( basically I wanted to take 
> advantage of the named parameters so I could pass the partially specified 
> function parameter "gaussiankernel" to loocv specifying only the h parameter 
> and then let loocv specify the remaining parameters as needed ... can this be 
> tweaked to work? The idea is to have loocv generic so it can work for any 
> estimator implementation ...
> 
> I have more than 6 books now in R and none explains this important concept.
> 
> Thanks in advance,
> Best regards,
> Giovanni 
> 


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] function pointer question

2010-04-25 Thread Giovanni Azua
Hello,

I have the following function that receives a "function pointer" formal 
parameter name "fnc":

loocv <- function(data, fnc) {
  n <- length(data.x)
  score <- 0
  for (i in 1:n) {
x_i <- data.x[-i]
y_i <- data.y[-i]
yhat <- fnc(x=x_i,y=y_i)
score <- score + (y_i - yhat)^2  
  }
  score <- score/n
  return(score)
}

I would like to use it like this:

##
## Estimator function using Gaussian Kernel
##
gaussiankernel <- function(x,y,h) {
  modelks <- ksmooth(x,y,kernel="normal",bandwidth=h,x.points=x)
  yhat <- modelks$y
  return(yhat)  
} 

scoreks <- loocv(data,gaussiankernel(h=0.5))

I expected this to work but it doesn't :( basically I wanted to take advantage 
of the named parameters so I could pass the partially specified function 
parameter "gaussiankernel" to loocv specifying only the h parameter and then 
let loocv specify the remaining parameters as needed ... can this be tweaked to 
work? The idea is to have loocv generic so it can work for any estimator 
implementation ...

I have more than 6 books now in R and none explains this important concept.

Thanks in advance,
Best regards,
Giovanni 


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] interpreting acf plot

2010-04-17 Thread Giovanni Azua
Hello Denis,

(1) I appreciate your feedback, however, I feel I have all the right to ask a 
specific question related R namely what's the interpretation of the acf 
function plot. I gave away the information that it is a homework because many 
times people before helping ask what's the context for the question at hand.  
If I don't understand something I will for sure ask. I don't have anything to 
hide so I don't care if there are professors subscribed to this list. My 
ultimate goal is to learn and it doesn't really matter whether it is studying a 
book, asking an assistant or asking in a forum. 

(2) After looking in many references and not finding any clue ... I Googled for 
information and found that I should be "looking for cyclic patterns" i.e. 
oscillations? There are none in this dataset so I presume there would not be 
any autocorrelation, oder?

(3) This is something very unfortunate ... the course Lectures are great, the 
course script is very comprehensive, however, the assignments many times 
include questions that are a bit off topic like in this case of Time Series and 
includes no actual reference ... so it is no surprise that even after 
diligently attending all lectures and doing all exercises I get stuck. Please 
recommend what's the best book in this topic of Time Series analysis maybe with 
R. I will buy it.

(4) Yes they mentioned something like this in the assignment "Dependency can be 
verified by showing that under the model, Cov(X_t^2,X_{t-h}^2) \neq 0, h > 0 
(complicated). Plot and interpret the autocorrelation functions of X_t and 
X_t^2 for the BMW-dataset." 
http://stat.ethz.ch/teaching/lectures/FS_2010/CompStat/series4.pdf

Thank you.
Best regards,
Giovanni

On Apr 17, 2010, at 7:25 PM, Dennis Murphy wrote:

> Hi:
> 
> (1) If you read the posting guide (as you were asked to do before posting), 
> you would find out
>   rather quickly that this is not a forum to help you with homework. 
> Moreover, several of your
>   professors may subscribe to this list and notice your request.
> (2) What 'trend' is in this data set? It has excessive variation at certain 
> points of the series,
>   but what trend?
> (3) None of your cited references is likely to have much that describes what 
> autocorrelation means.
>  (The only exception might be HSAUR, but it focuses more on the 
> programming than the concepts.)
>   I'd consult an actual time series text to learn the concepts you need 
> to make sense of the plots.
> (4) You can't 'prove' that the series in question is (or is not) 
> autocorrelated with R or any other
>   software; however, it can be used to provide empirical support for one 
> hypothesis or the other.
>   Proof is a mathematical construct involving deductive logic, whereas 
> statistical inference uses 
>   inductive logic. They represent different approaches to problem solving.
> 
> HTH,
> Dennis

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] interpreting acf plot

2010-04-17 Thread Giovanni Azua
Hello,

I am attending a course in Computational Statistics at ETH and in one of the 
assignments I am asked to prove that a time series is not autocorrelated using 
the R function "acf".

I tried out the acf function with the given data, according to what I found 
here: http://landshape.org/enm/options-for-acf-in-r/ this test data does not 
look IID but rather shows some trends so how can I then prove that it is not 
autocorrelated? maybe the trends are ok? 

I have bought several titles on R but none really explains autocorrelation or 
how to interpret the acf function ... the integrated help is also a bit dry. 
These are the books I have on R:
- Introductory Statistics with R (Springer)
- A handbook of Statistical Analyses using R (CRC)
- R in a Nutshell (Oreilly)
- Statistical Computing with R (CRC)

Thanks in advance,
Best regards,
Giovanni

# 
=
# Computational Statistics 
# Series 4
# Author: Giovanni Azua
# Date: 16 April 2010
# 
=
rm(list=ls())  # clear workspace

# 
=
# EXERCISE 1.(c)
# 
=

   # load dataset from web
#bmwlr <- scan("http://stat.ethz.ch/Teaching/Datasets/bmw.dat";)
   # load dataset from file
bmwlr <- scan("/Users/bravegag/code/compstats/bmw.dat")

par(mfrow=c(1,2))  # visualize two plots
acf(bmwlr, lag.max = 10)
acf(bmwlr^2, lag.max = 10)
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] create space for a matrix

2010-03-28 Thread Giovanni Azua Garcia
Hi Leo,

see the matrix function e.g. 

m <- matrix(0, nrow=1, ncol=3)

then you can use functions like rbind or cbind to create bigger ones.

I am a newbie so double check everything :)

HTH,
Best regards,
Giovanni

On Mar 29, 2010, at 8:37 AM, leobon wrote:

> 
> Hello all,
> I want to creat a space for a matrix. 
> For example, I have a matrix with the dimension of 1(row)*3(col), which is
> (2,3,4). Then I want to put this matrix into a new and larger matrix. The
> dimension of the new matrix is 3*3. So what I want is creating a new larger
> matrix which allows me to put each smaller matrix with the same columns into
> it. 
> I know 0*(1:3) is the way to creat a space for 3 numbers. Could any body
> tell me how to creat a space for a matrix?
> Appreciate very much for your help!
> 
> Leo
> -- 
> View this message in context: 
> http://n4.nabble.com/create-space-for-a-matrix-tp1694754p1694754.html
> Sent from the R help mailing list archive at Nabble.com.
> 
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] data fitting and confidence band

2010-03-27 Thread Giovanni Azua Garcia
Hello,

I am fitting data using different methods e.g. Local Polynomial and Smoothing 
splines. The data is generated out of a true function model with added normally 
distributed noise.

I would like to know "how often the confidence band for all points 
simultaneously contain all true values". I can answer the question for one 
point in the following way:

e.g. 

# 
=
# How many times the pointwise confidence interval at x=0.5 contains the true 
value at 0.5 
# i.e. what is the so called "coverage rate"?
# 
=
pos = which(x==0.5)
sum(abs(estlp[pos,] - m(x[pos])) <= 1.96*selp[pos,])   # equidistant x outputs 
946
   # non-equidistant x 
outputs 938
sum(abs(estss[pos,] - m(x[pos])) <= 1.96*sess[pos,])   # equidistant x outputs 
895
   # non-equidistant x 
outputs 936

This basically tells me that out of 1000 simulation runs with different random 
noise, 946 times the true value i.e. m(x) for x=0.5 is contained within the 95% 
confidence interval. The estlp Local Polynomial performs better than Smoothing 
Splines under this criteria ...

Now is there any specific way to answer  "how often the confidence band for all 
points simultaneously contain all true values" other than this below?

# 
=
# How often does the confidence band for all points simultaneously contain all 
true values?
# 
=
sum(abs(estlp[,] - m(x[])) <= 1.96*selp[,])# equidistant x outputs 
92560
   # non-equidistant x 
outputs 92109
sum(abs(estss[,] - m(x[])) <= 1.96*sess[,])# equidistant x outputs 
90804
   # non-equidistant x 
outputs 94641

Is there a dedicated function in R for this purpose i.e. to build confidence 
bands around a given fit ... maybe a way to plot it nicely too given that the 
Estimated SE are calculated.

Many thanks in advance,
Best regards,
Giovanni


  
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.