[R] Time Series prediction
Hello,

I would like to use a parametric TS model and predictor as a benchmark to compare against other ML methods I'm employing. I currently build a simple e.g. ARIMA model using the convenient auto.arima function like this:

library(forecast)
df <- read.table("/Users/bravegag/data/myts.dat")
# btw my test data doesn't have seasonality but periodicity, so the value
# 2 is arbitrarily set; using a freq of yearly or 1 would make some
# R ts functions unhappy
tsdata <- ts(df$signal, freq=2)
arimamodel <- auto.arima(tsdata, max.p=15, max.q=10, stationary=FALSE,
                         ic="bic", stepwise=TRUE, seasonal=FALSE,
                         parallel=FALSE, num.cores=4, trace=TRUE,
                         allowdrift=TRUE)
arimapred <- forecast.Arima(arimamodel, h=20)
plot(arimapred)

The problem is that the forecast.Arima function is apparently doing a "free run", i.e. it uses the forecast(t+1) value as input to compute forecast(t+2). I am instead interested in a prediction mode where it always uses the observed tsdata(t+1) value to predict forecast(t+2), the observed tsdata(t+2) to predict forecast(t+3), and so on.

Can anyone please advise how to achieve this?

TIA,
Best regards,
Giovanni

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
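[Editor's note] For rolling one-step-ahead forecasts that always condition on the observed series, one commonly used idiom in the forecast package is to freeze the fitted model and re-apply it to the data via the model= argument of Arima(); fitted() on the result then returns one-step-ahead predictions. A sketch — the training-window split at c(50, 1) is an arbitrary illustration:

```r
library(forecast)

# Fit on a training window only
fit <- auto.arima(window(tsdata, end = c(50, 1)))

# Re-apply the same model (frozen coefficients, no re-estimation)
# to the full series; fitted() then gives one-step-ahead forecasts,
# each conditioning on the observed values up to t, never on
# earlier forecasts.
refit   <- Arima(tsdata, model = fit)
onestep <- fitted(refit)
```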
Re: [R] Text mining
Hi Steve,

IMO this problem does not need a classifier but rather a database and a simple query. I would just build a database with all city names including the geo information, and then say exactly whether each is north or south. If there was such a "rule" (which I doubt) I would expect it to have many exceptions and therefore a bunch of false positives on both sides. Why overcomplicate a simple problem?

HTH,
Ciao,
Giovanni

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Steve Stephenson
Sent: Saturday, January 26, 2013 10:08 PM
To: r-help@r-project.org
Subject: [R] Text mining

Hello everybody,

I would like to perform an analysis but I don't know how to proceed, or whether R packages are available for my purpose or not, so I'm here to request your support.

*The idea is the following:* I noticed that the names of the towns and villages in northern Italy most of the time sound different from the names of towns in southern Italy. Just to give you an idea, "Caronno Pertusella" is a northern Italy village while Frascati is a central Italy town. Most of the time I am able to recognize where the town is located just by hearing the name, but I cannot say why, that is to say that I haven't found a "rule". What I would like to do is find a classification rule/engine that is able to "locate" the town starting from its name. *I think the classification method should be based on the sequence of letters belonging to the town's name*. But this is just an intuition, not yet formalized!

I know that mine is a strange request and idea; anyway, advice is very much appreciated and welcome! Many thanks in advance to all.

Steve
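[Editor's note] A minimal sketch of the lookup-table idea suggested above; the coordinates are approximate and the 44-degree latitude threshold is an arbitrary illustration, not an established boundary:

```r
# Hypothetical lookup table: town names paired with (approximate) latitude.
cities <- data.frame(
  name = c("Caronno Pertusella", "Frascati", "Palermo"),
  lat  = c(45.60, 41.81, 38.12)
)

# Classify by geography instead of by spelling: a simple latitude cut.
cities$region <- ifelse(cities$lat >= 44, "north", "center/south")
cities
```

With a complete gazetteer (e.g. a geonames dump) merged in by name, this answers the question exactly, with no false positives from spelling heuristics.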
Re: [R] use subset to trim data but include last per category
Hello,

This solves my problem in a horribly inelegant way that works:

df <- data.frame(n=newInput$n, iter=newInput$iter, Error=newInput$Error,
                 Duality_Gap=newInput$Duality, Runtime=newInput$Acc)
df_last <- aggregate(x=df$iter, by=list(df$n), FUN=max)
names(df_last)[names(df_last)=="Group.1"] <- "n"
names(df_last)[names(df_last)=="x"] <- "iter"
#      n  iter
# 1 1000  2518
# 2 2000  5700
# 3 3000 10026
# 4 4000 13916
# 5 5000 17962
df$last <- FALSE
df$last[df$n == 1000 & df$iter ==  2518] <- TRUE
df$last[df$n == 2000 & df$iter ==  5700] <- TRUE
df$last[df$n == 3000 & df$iter == 10026] <- TRUE
df$last[df$n == 4000 & df$iter == 13916] <- TRUE
df$last[df$n == 5000 & df$iter == 17962] <- TRUE
df <- subset(df, (iter %% 500 == 0) | (df$last == TRUE))

How can I do the same without hardwiring these numbers?

TIA,
Best regards,
Giovanni
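[Editor's note] A non-hardwired variant of the same idea: base R's ave() computes the per-group maximum of iter within each level of n and recycles it row-wise, so the "last iteration per category" flag can be derived directly:

```r
# Flag the last iteration within each category n without hardwiring values:
# ave() returns max(iter) per group of n, aligned with the original rows.
df$last <- df$iter == ave(df$iter, df$n, FUN = max)

# Keep every 500th iteration plus the last row of each category.
df_trim <- subset(df, iter %% 500 == 0 | last)
```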
Re: [R] use subset to trim data but include last per category
Hi Jeff,

Thanks for your help, but this doesn't work; there are two problems. First and most important, I need to keep the last _per category_, where my category is n, and not the last globally. Second, there seems to be an issue with the subset variation that ends up not filtering anything ... but this is a minor thing.

Best,
Giovanni

On Sep 9, 2012, at 5:59 PM, Jeff Newmiller wrote:

> dfthin <- df[ c(which(iter %% 500 == 0),nrow(df) ]
>
> or
>
> dfthin <- subset(df, (iter %% 500 == 0) | (seq.int(nrow(df)==nrow(df)))
>
> N.B. You should avoid using the name "df" for your variables, because it is
> the name of a built-in function that you are hiding by doing so. Others may
> be confused, and eventually you may want to use that function yourself. One
> solution is to use DF for your variables... another is to use more
> descriptive names.
> ---
> Jeff Newmiller
> Sent from my phone. Please excuse my brevity.
>
> Giovanni Azua wrote:
>
>> Hello,
>>
>> I bumped into the following funny use-case. I have too much data for a
>> given plot. I have the following data frame df:
>>
>>> str(df)
>> 'data.frame': 5015 obs. of 5 variables:
>>  $ n          : Factor w/ 5 levels "1000","2000",..: 1 1 1 1 1 1 1 1 1 1 ...
>>  $ iter       : int 10 20 30 40 50 60 70 80 90 100 ...
>>  $ Error      : num 1.05e-02 1.24e-03 3.67e-04 1.08e-04 4.05e-05 ...
>>  $ Duality_Gap: num 20080 3789 855 443 321 ...
>>  $ Runtime    : num 0.00536 0.01353 0.01462 0.01571 0.01681 ...
>>
>> But if I plot e.g. Runtime vs log(Duality_Gap) I have too many
>> observations due to taking a snapshot every 10 iterations rather than
>> say 500, and the plot looks very cluttered. So I would like to trim the
>> data frame, including only those records for which iter is a multiple of
>> 500, and so I do this:
>>
>> df <- subset(df, iter %% 500 == 0)
>>
>> This gives me almost exactly what I need, except that the last and most
>> important Duality_Gap observations are of course gone due to the
>> filtering ... I would like to change the subset clause to be iter %%
>> 500 _or_ the record is the last per n (n is my problem size and
>> category in this case) ... how can I do that?
>>
>> I thought of adding a new column that flags whether a given row is the
>> last element per category as a "last" Boolean, but this is a bit too
>> complicated ... is there a simpler condition construct that can be used
>> with the subset command?
>>
>> TIA,
>> Best regards,
>> Giovanni
[R] use subset to trim data but include last per category
Hello,

I bumped into the following funny use-case. I have too much data for a given plot. I have the following data frame df:

> str(df)
'data.frame': 5015 obs. of 5 variables:
 $ n          : Factor w/ 5 levels "1000","2000",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ iter       : int 10 20 30 40 50 60 70 80 90 100 ...
 $ Error      : num 1.05e-02 1.24e-03 3.67e-04 1.08e-04 4.05e-05 ...
 $ Duality_Gap: num 20080 3789 855 443 321 ...
 $ Runtime    : num 0.00536 0.01353 0.01462 0.01571 0.01681 ...

But if I plot e.g. Runtime vs log(Duality_Gap) I have too many observations, due to taking a snapshot every 10 iterations rather than, say, 500, and the plot looks very cluttered. So I would like to trim the data frame, including only those records for which iter is a multiple of 500, and so I do this:

df <- subset(df, iter %% 500 == 0)

This gives me almost exactly what I need, except that the last and most important Duality_Gap observations are of course gone due to the filtering ... I would like to change the subset clause to be "iter %% 500 == 0 _or_ the record is the last per n" (n is my problem size and category in this case) ... how can I do that?

I thought of adding a new column that flags whether a given row is the last element per category as a "last" Boolean, but this is a bit too complicated ... is there a simpler condition construct that can be used with the subset command?

TIA,
Best regards,
Giovanni
[R] Grid package: how to customize cell spacing?
Hello,

I am using the recipe below to place plots side by side:
http://wiki.stdout.org/rcookbook/Graphs/Multiple%20graphs%20on%20one%20page%20%28ggplot2%29/

How can I reduce or customize the horizontal spacing between the grid cells? I have researched the grid package but can't find a way to do so.

TIA,
Best regards,
Giovanni
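[Editor's note] For the grid.layout-based recipe linked above, one way to control the gap is to insert an explicit spacer column between the plot cells; its width is in absolute units while the plot cells share the rest. A sketch, assuming two ggplot objects p1 and p2 (hypothetical names):

```r
library(grid)

# Three-column layout: plot | 2 mm gap | plot.
# "null" units split the remaining space equally between the plot cells.
grid.newpage()
pushViewport(viewport(layout = grid.layout(
  nrow = 1, ncol = 3,
  widths = unit(c(1, 2, 1), c("null", "mm", "null")))))

print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 3))
```

Shrinking or growing the middle "mm" width adjusts the horizontal spacing directly.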
Re: [R] simplest way (set of functions) to parse a file
;Step1','Step2','Step3', 'Step4'))
> > x
>     n iter variable       value
> 1  10    2    Step1 0.001779780
> 2  20    4    Step1 0.000955757
> 3  64   10    Step1 0.001613035
> 4  64   20    Step1 0.0
> 5  64   30    Step1 0.0
> 6  64   40    Step1 0.0
> 7  64   50    Step1 0.0
> 8  64   53    Step1 0.0
> 9      10    Step1 0.002975209
> 10     20    Step1 0.0
> 11     30    Step1 0.0
> 12     40    Step1 0.0
> 13     50    Step1 0.0
> 14     60    Step1 0.0
> 15     70    Step1 0.0
> 16     80    Step1 0.0
> 17     81    Step1 0.0
> 18 10    2    Step2 0.72146
> 19 20    4    Step2 0.000230161
> 20 64   10    Step2 0.003956920
> 21 64   20    Step2 0.004390998
> 22 64   30    Step2 0.004326610
> 23 64   40    Step2 0.004329726
> 24 64   50    Step2 0.004327526
> 25 64   53    Step2 0.001327239
> 26     10    Step2 0.018730820
> 27     20    Step2 0.020846370
> 28     30    Step2 0.020831700
> 29     40    Step2 0.020882840
> 30     50    Step2 0.020459540
> 31     60    Step2 0.020825860
> 32     70    Step2 0.020832040
> 33     80    Step2 0.020335710
> 34     81    Step2 0.002100072
>
> On Mon, Aug 27, 2012 at 6:24 AM, Giovanni Azua wrote:
> Hello,
>
> What would be the best set of R functions to parse and transform a file?
>
> My file looks as shown below. I would like to plot this data and I need to
> parse it into a single data frame that sort of "transposes the data" with
> the following structure:
>
> > df <- data.frame(n=c(1,1,2,2),iter=c(1,2,1,2),step=as.factor(c('Step 1',
> > 'Step2', 'Step 1', 'Step 2')),value=c(10, 10, 10, 10))
> > str(df)
> 'data.frame': 4 obs. of 4 variables:
>  $ n    : num 1 1 2 2
>  $ iter : num 1 2 1 2
>  $ step : Factor w/ 3 levels "Step 1","Step 2",..: 1 3 1 2
>  $ value: num 10 10 10 10
>
> n=extracted from the file name "logdet_two_moons_n>>>>10<<<<.txt"
> iter=iter
> step=column Step1, Step2, Step3, Step4
> value=value of the specific Step column
>
> And this is one possible data frame variation to be able to plot the time
> proportions for the different steps of my algorithm.
>
> TIA,
> Best regards,
> Giovanni
>
> Iteration Jmin Error Elapsed Corral Duality Gap Step1 Step2 Step3 Step4
> 2 2 0.00e+00 1.912976e-03 1 0.00e+00 1.779780e-03 7.214600e-05 1.243600e-05 2.246700e-05
> ../test/genmoons_data/logdet_two_moons_n10.txt,2,2,1.754115e-02,0.00e+00,9.799000e+03,0.00e+00,5.586293e-01,0.00e+00
> Iteration Jmin Error Elapsed Corral Duality Gap Step1 Step2 Step3 Step4
> 4 9 0.00e+00 1.280841e-03 2 -7.105427e-15 9.557570e-04 2.301610e-04 1.571100e-05 2.177300e-05
> ../test/genmoons_data/logdet_two_moons_n20.txt,4,5,6.062756e-03,0.00e+00,1.365970e+05,0.00e+00,2.253051e+01,0.00e+00
> Iteration Jmin Error Elapsed Corral Duality Gap Step1 Step2 Step3 Step4
> 10 32 3.133476e-03 6.075853e-03 8 4.057531e-01 1.613035e-03 3.956920e-03 3.077200e-05 4.390900e-05
> 20 28 5.597685e-04 4.376530e-03 16 4.711146e-03 0.00e+00 4.390998e-03 2.229600e-05 2.517100e-05
> 30 27 1.148159e-04 4.357923e-03 22 8.408166e-06 0.00e+00 4.326610e-03 2.697700e-05 3.233200e-05
> 40 27 4.036778e-05 4.388260e-03 29 2.529294e-07 0.00e+00 4.329726e-03 2.713000e-05 3.558400e-05
> 50 27 1.840383e-05 4.357373e-03 36 1.34e-07 0.00e+00 4.327526e-03 3.097000e-05 3.255700e-05
> 53 27 0.00e+00 1.322382e-03 36 -2.842171e-14 0.00e+00 1.327239e-03 1.420400e-05 2.043500e-05
> ../test/genmoons_data/logdet_two_moons_n64.txt,53,69,3.330987e-02,0.00e+00,2.229830e+07,0.00e+00,6.694201e+02,0.00e+00
> Iteration Jmin Error Elapsed Corral Duality Gap Step1 Step2 Step3 Step4
> 10 70 7.739525e-03 2.389529e-02 8 1.494829e+00 2.975209e-03 1.873082e-02 4.713600e-05 5.837200e-05
> 20 74 3.379192e-03 2.084753e-02 15 3.372041e-01 0.00e+00 2.084637e-02 4.302400e-05 3.907800e-05
> 30 76 1.322821e-03 2.093204e-02 2
[R] simplest way (set of functions) to parse a file
Hello,

What would be the best set of R functions to parse and transform a file?

My file looks as shown below. I would like to plot this data and I need to parse it into a single data frame that sort of "transposes the data" with the following structure:

> df <- data.frame(n=c(1,1,2,2),iter=c(1,2,1,2),step=as.factor(c('Step 1',
> 'Step2', 'Step 1', 'Step 2')),value=c(10, 10, 10, 10))
> str(df)
'data.frame': 4 obs. of 4 variables:
 $ n    : num 1 1 2 2
 $ iter : num 1 2 1 2
 $ step : Factor w/ 3 levels "Step 1","Step 2",..: 1 3 1 2
 $ value: num 10 10 10 10

n=extracted from the file name "logdet_two_moons_n10.txt"
iter=iter
step=column Step1, Step2, Step3, Step4
value=value of the specific Step column

And this is one possible data frame variation to be able to plot the time proportions for the different steps of my algorithm.

TIA,
Best regards,
Giovanni

Iteration Jmin Error Elapsed Corral Duality Gap Step1 Step2 Step3 Step4
2 2 0.00e+00 1.912976e-03 1 0.00e+00 1.779780e-03 7.214600e-05 1.243600e-05 2.246700e-05
../test/genmoons_data/logdet_two_moons_n10.txt,2,2,1.754115e-02,0.00e+00,9.799000e+03,0.00e+00,5.586293e-01,0.00e+00
Iteration Jmin Error Elapsed Corral Duality Gap Step1 Step2 Step3 Step4
4 9 0.00e+00 1.280841e-03 2 -7.105427e-15 9.557570e-04 2.301610e-04 1.571100e-05 2.177300e-05
../test/genmoons_data/logdet_two_moons_n20.txt,4,5,6.062756e-03,0.00e+00,1.365970e+05,0.00e+00,2.253051e+01,0.00e+00
Iteration Jmin Error Elapsed Corral Duality Gap Step1 Step2 Step3 Step4
10 32 3.133476e-03 6.075853e-03 8 4.057531e-01 1.613035e-03 3.956920e-03 3.077200e-05 4.390900e-05
20 28 5.597685e-04 4.376530e-03 16 4.711146e-03 0.00e+00 4.390998e-03 2.229600e-05 2.517100e-05
30 27 1.148159e-04 4.357923e-03 22 8.408166e-06 0.00e+00 4.326610e-03 2.697700e-05 3.233200e-05
40 27 4.036778e-05 4.388260e-03 29 2.529294e-07 0.00e+00 4.329726e-03 2.713000e-05 3.558400e-05
50 27 1.840383e-05 4.357373e-03 36 1.34e-07 0.00e+00 4.327526e-03 3.097000e-05 3.255700e-05
53 27 0.00e+00 1.322382e-03 36 -2.842171e-14 0.00e+00 1.327239e-03 1.420400e-05 2.043500e-05
../test/genmoons_data/logdet_two_moons_n64.txt,53,69,3.330987e-02,0.00e+00,2.229830e+07,0.00e+00,6.694201e+02,0.00e+00
Iteration Jmin Error Elapsed Corral Duality Gap Step1 Step2 Step3 Step4
10 70 7.739525e-03 2.389529e-02 8 1.494829e+00 2.975209e-03 1.873082e-02 4.713600e-05 5.837200e-05
20 74 3.379192e-03 2.084753e-02 15 3.372041e-01 0.00e+00 2.084637e-02 4.302400e-05 3.907800e-05
30 76 1.322821e-03 2.093204e-02 21 1.018845e-01 0.00e+00 2.083170e-02 4.704100e-05 5.707100e-05
40 78 1.176950e-03 2.095179e-02 28 2.447970e-02 0.00e+00 2.088284e-02 4.890700e-05 4.955100e-05
50 78 2.233669e-04 2.050571e-02 35 1.573952e-02 0.00e+00 2.045954e-02 4.046600e-05 3.899000e-05
60 78 2.167956e-04 2.095130e-02 39 8.362982e-03 0.00e+00 2.082586e-02 6.699700e-05 8.506400e-05
70 78 2.085968e-04 2.085355e-02 46 5.135190e-03 0.00e+00 2.083204e-02 5.432900e-05 4.078600e-05
80 78 2.570800e-04 2.044932e-02 51 5.470225e-04 0.00e+00 2.033571e-02 5.334200e-05 5.318400e-05
81 78 0.00e+00 2.099610e-03 51 1.421085e-14 0.00e+00 2.100072e-03 9.147000e-06 2.324800e-05
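[Editor's note] One possible parsing sketch for the log layout shown above: data rows start with a digit, header rows start with "Iteration", and each block is closed by a summary line whose file name carries n. The column positions (iter first, the four Step timings last) are inferred from the header and would need checking against the real file; "mylog.txt" is a hypothetical name.

```r
lines   <- readLines("mylog.txt")   # hypothetical file name
pending <- list()                   # data rows seen since the last summary line
out     <- list()

for (ln in lines) {
  if (grepl("^[0-9]", trimws(ln))) {
    # Data row: Iteration Jmin Error Elapsed Corral DualityGap Step1..Step4
    pending[[length(pending) + 1]] <-
      as.numeric(strsplit(trimws(ln), "[[:space:]]+")[[1]])
  } else if (grepl("_n[0-9]+\\.txt", ln)) {
    # Summary line: extract n from the file name, flush the pending block
    n <- as.numeric(sub(".*_n([0-9]+)\\.txt.*", "\\1", ln))
    for (v in pending)
      out[[length(out) + 1]] <- data.frame(n = n, iter = v[1],
                                           step  = paste0("Step", 1:4),
                                           value = v[7:10])
    pending <- list()
  }
}

long <- do.call(rbind, out)   # long format, ready for ggplot
```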
[R] [slightly OT] will a new point shift the solution question
Hello,

Is there an R function that, given a linear regression solution for a data set, will answer in the most efficient way whether a new data point shifts the solution or not, or whether the new solution would differ by less than some error? I need this in the context of an iterative method, and such a function would save a lot of time. The closest answer I can find to this involves keeping track of the QR decomposition and updating it with a row append. I would like to get only a boolean answer, because a 'no' answer would spare tons of flops.

Many thanks in advance,
Best regards,
Giovanni
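[Editor's note] For reference, the coefficient change caused by appending one row can be computed in O(p^2) via the Sherman-Morrison identity (recursive least squares), without refitting. A sketch — A is the precomputed solve(crossprod(X)), beta the current coefficients, and tol an arbitrary threshold:

```r
# Test whether appending row (x, y) would shift the coefficients.
# Sherman-Morrison: (X'X + x x')^{-1} x = A x / (1 + x' A x), so
# beta_new - beta = A x (y - x' beta) / (1 + x' A x).
would_shift <- function(A, beta, x, y, tol = 1e-8) {
  Ax    <- A %*% x
  delta <- Ax * as.numeric(y - crossprod(x, beta)) /
               as.numeric(1 + crossprod(x, Ax))
  max(abs(delta)) > tol   # TRUE if any coefficient moves by more than tol
}
```

A and beta can themselves be updated with the same identity when the answer is TRUE, so the full QR never needs recomputing inside the loop.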
Re: [R] glm predict issue
Hi Ben,

Yes, thanks, you are right. I was able to fix it, but first I had to fix the data frame over which I built my model to use numeric for those variables; then, making the grid values also numeric, it finally worked. Thank you for your help!

Best regards,
Giovanni

On Dec 26, 2011, at 4:57 PM, Ben Bolker wrote:

> Giovanni Azua gmail.com> writes:
>
>> Hello,
>>
>> I have tried reading the documentation and googling for the answer but
>> reviewing the online matches I end up more confused than before.
>>
>> My problem is apparently simple. I fit a glm model (2^k experiment), and then
>> I would like to predict the response variable (Throughput) for unseen factor levels.
>>
>> When I try to predict I get the following error:
>>> throughput.pred <- predict(throughput.fit,experiments,type="response")
>> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev =
>> object$xlevels) :
>>   factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000
>>
>> Of course these are new factor levels, it is exactly what I am trying to
>> achieve i.e. extrapolate the values of Throughput.
>>
>> Can anyone please advise? Below I include all details.
>
> Any predictors that you want to treat as continuous
> (which would be the only way you can extrapolate to unobserved
> values) should be numeric, not factor variables -- use
>
> mydata <- transform(mydata, var=as.numeric(var))
>
> for example.
[R] glm predict issue
Hello,

I have tried reading the documentation and googling for the answer, but reviewing the online matches I end up more confused than before.

My problem is apparently simple. I fit a glm model (2^k experiment), and then I would like to predict the response variable (Throughput) for unseen factor levels.

When I try to predict I get the following error:

> throughput.pred <- predict(throughput.fit,experiments,type="response")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000

Of course these are new factor levels; it is exactly what I am trying to achieve, i.e. extrapolate the values of Throughput.

Can anyone please advise? Below I include all details.

Thanks in advance,
Best regards,
Giovanni

> # define the extreme (factors and levels)
> experiments <- expand.grid(No_databases = seq(1000,100,by=-200),
+                            Partitioning = c("sharding", "replication"),
+                            No_middlewares = seq(500,100,by=-100),
+                            Queue_size = c(100))
> experiments$No_databases <- as.factor(experiments$No_databases)
> experiments$Partitioning <- as.factor(experiments$Partitioning)
> experiments$No_middlewares <- as.factor(experiments$No_middlewares)
> experiments$Queue_size <- as.factor(experiments$Queue_size)
>
> str(experiments)
'data.frame': 50 obs. of 4 variables:
 $ No_databases  : Factor w/ 5 levels "200","400","600",..: 5 4 3 2 1 5 4 3 2 1 ...
 $ Partitioning  : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 2 2 2 2 2 ...
 $ No_middlewares: Factor w/ 5 levels "100","200","300",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Queue_size    : Factor w/ 1 level "100": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int 5 2 5 1
  .. ..- attr(*, "names")= chr "No_databases" "Partitioning" "No_middlewares" "Queue_size"
  ..$ dimnames:List of 4
  .. ..$ No_databases  : chr "No_databases=1000" "No_databases= 800" "No_databases= 600" "No_databases= 400" ...
  .. ..$ Partitioning  : chr "Partitioning=sharding" "Partitioning=replication"
  .. ..$ No_middlewares: chr "No_middlewares=500" "No_middlewares=400" "No_middlewares=300" "No_middlewares=200" ...
  .. ..$ Queue_size    : chr "Queue_size=100"
> head(experiments)
  No_databases Partitioning No_middlewares Queue_size
1         1000     sharding            500        100
2          800     sharding            500        100
3          600     sharding            500        100
4          400     sharding            500        100
5          200     sharding            500        100
6         1000  replication            500        100
> # or
> throughput.fit <- glm(log(Throughput)~(No_databases*No_middlewares)+Partitioning+Queue_size,
+                       data=throughput)
> summary(throughput.fit)

Call:
glm(formula = log(Throughput) ~ (No_databases * No_middlewares) +
    Partitioning + Queue_size, data = throughput)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5966  -0.6612  -0.1944   0.5548   3.2136

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                     5.74701    0.09127  62.970  < 2e-16 ***
No_databases4                   0.43309    0.10985   3.943 8.66e-05 ***
No_middlewares2                -1.99374    0.11035 -18.067  < 2e-16 ***
No_middlewares4                -1.23004    0.10969 -11.214  < 2e-16 ***
Partitioningreplication         0.33291    0.06181   5.386 9.15e-08 ***
Queue_size100                   0.15850    0.06181   2.564   0.0105 *
No_databases4:No_middlewares2   2.71525    0.15262  17.791  < 2e-16 ***
No_databases4:No_middlewares4   1.94191    0.15226  12.754  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.8921778)

    Null deviance: 2175.58 on 936 degrees of freedom
Residual deviance:  828.83 on 929 degrees of freedom
AIC: 2562.2

Number of Fisher Scoring iterations: 2

> throughput.pred <- predict(throughput.fit,experiments,type="response")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
  factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000
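[Editor's note] Following Ben Bolker's advice in the reply above, a sketch of the fix: keep the quantitative predictors numeric in both the model data and the prediction grid, so predict() can extrapolate. Note the as.character() step: as.numeric() applied directly to a factor returns the level indices, not the level values.

```r
# Convert the quantitative predictors from factor to numeric
# (factor -> character -> numeric preserves the actual values).
throughput  <- transform(throughput,
                         No_databases   = as.numeric(as.character(No_databases)),
                         No_middlewares = as.numeric(as.character(No_middlewares)))
experiments <- transform(experiments,
                         No_databases   = as.numeric(as.character(No_databases)),
                         No_middlewares = as.numeric(as.character(No_middlewares)))

# Refit and predict on the unseen grid; no "new factor levels" error now.
fit  <- glm(log(Throughput) ~ No_databases * No_middlewares + Partitioning,
            data = throughput)
pred <- predict(fit, newdata = experiments, type = "response")
```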
Re: [R] data frame and cumulative sum
Thank you Michael. Indeed, my bad; I get so deep into trying to solve the problem that I forget to try the most basic help first.

Best regards,
Giovanni

On Dec 7, 2011, at 11:20 PM, R. Michael Weylandt wrote:

> ??"cumulative sum" would almost certainly lead you to cumsum with only
> a modicum of effort
>
> Michael
>
> On Wed, Dec 7, 2011 at 5:13 PM, Giovanni Azua wrote:
>> Hello,
>>
>> I have a data frame that looks like this (containing interarrival times):
>>
>>> str(df)
>> 'data.frame': 18233 obs. of 1 variable:
>>  $ Interarrival: int 135 806 117 4 14 1 9 104 169 0 ...
>>> head(df)
>>   Interarrival
>> 1          135
>> 2          806
>> 3          117
>> 4            4
>> 5           14
>> 6            1
>>
>> This corresponds to the time differences (in ms) of a Poisson arrival
>> process, where Interarrival{i+1} = time_{i+1} - time_{i}
>>
>> I want to get the Time bin (in minutes) of every interarrival, basically
>> something like:
>>
>> 1) df$Time <- sum(of all df$Interarrival up to "this rownum") # cumulative sum
>>
>> 2) df$Time <- floor(df$Time / 6) + 1
>>
>> then I should get the first minute of Interarrival having 1 and so forth.
>> The problem is I have no idea how to accomplish 1) in R.
>>
>> Can anyone advise?
>>
>> Thanks in advance,
>> Best regards,
>> Giovanni
[R] data frame and cumulative sum
Hello,

I have a data frame that looks like this (containing interarrival times):

> str(df)
'data.frame': 18233 obs. of 1 variable:
 $ Interarrival: int 135 806 117 4 14 1 9 104 169 0 ...
> head(df)
  Interarrival
1          135
2          806
3          117
4            4
5           14
6            1

This corresponds to the time differences (in ms) of a Poisson arrival process, where Interarrival{i+1} = time_{i+1} - time_{i}

I want to get the Time bin (in minutes) of every interarrival, basically something like:

1) df$Time <- sum(of all df$Interarrival up to "this rownum") # cumulative sum

2) df$Time <- floor(df$Time / 6) + 1

then I should get the first minute of Interarrival having 1 and so forth. The problem is I have no idea how to accomplish 1) in R.

Can anyone advise?

Thanks in advance,
Best regards,
Giovanni
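[Editor's note] For step 1), cumsum() is the built-in answer. A sketch — note that with interarrival times in milliseconds, one minute is 60000 ms (the divisor "6" in step 2) above appears truncated in the archive):

```r
# Running total of interarrival times = absolute elapsed time in ms.
df$Time <- cumsum(df$Interarrival)

# 1-based minute bin (60000 ms per minute), as step 2) intends.
df$Bin <- floor(df$Time / 60000) + 1
```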
[R] R-latex syntax highlighting?
Hello,

Can anyone provide or point me to a good setup for the listings LaTeX package that produces nice R syntax highlighting? I am using an example I found on the internet for setting up listings like this:

\lstset{
  language=R,
  basicstyle=\scriptsize\ttfamily,
  commentstyle=\ttfamily\color{gray},
  numbers=left,
  numberstyle=\ttfamily\color{red}\footnotesize,
  stepnumber=1,
  numbersep=5pt,
  backgroundcolor=\color{white},
  showspaces=false,
  showstringspaces=false,
  showtabs=false,
  frame=single,
  tabsize=2,
  captionpos=b,
  breaklines=true,
  breakatwhitespace=false,
  title=\lstname,
  escapeinside={},
  keywordstyle={},
  morekeywords={}
}

But I know that using the color LaTeX package it can look a lot nicer, and I guess some people might have configured that before and are willing to share it?

I'm also interested in a similar problem: the best way to put R code with syntax highlighting into a presentation. I use Apple's Keynote.

Many thanks in advance,
Best regards,
Giovanni
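[Editor's note] One possible colorized variant of the \lstset above; it assumes \usepackage{xcolor} in the preamble, and the morekeywords list is a small illustrative sample that can be extended:

```latex
\lstset{
  language=R,
  basicstyle=\scriptsize\ttfamily,
  keywordstyle=\color{blue}\bfseries,     % R keywords in bold blue
  commentstyle=\color{gray}\itshape,      % comments in gray italics
  stringstyle=\color{red!60!black},       % string literals in dark red
  morekeywords={TRUE,FALSE,NULL,NA,function,library},
  showstringspaces=false,
  frame=single,
  breaklines=true
}
```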
Re: [R] [OT] 1 vs 2-way anova technical question
On Nov 22, 2011, at 3:52 PM, Liviu Andronic wrote:

> On Tue, Nov 22, 2011 at 2:09 PM, Giovanni Azua wrote:
>> Mr. Gunter did not read/understand my problem, and there were no useful tips
>> but only ad hominem attacks. By your side-taking I suspect you are in the
>> same "party club"; if you want to defend him, maybe you should start by
>> "tying your dog better", so to speak.
>>
> I believe that most of the readers of this thread got put off by your
> offending and misplaced remarks. To echo other posters, it would be
> nice to get your e-mail address banned from the list.

If needed, go ahead and do so; your blockade won't stop my learning efforts. I don't see any concrete reason why I should be taking bullets from random people who fancy themselves with superiority and arrogance. And as usual these bullies always seem to win.
Re: [R] [OT] 1 vs 2-way anova technical question
On Nov 22, 2011, at 10:35 AM, Joshua Wiley wrote:

> It is true the way you use general lists is not our business, but the
> R-help list is a community and there are community rules.

I meant that my use of the lists is not of __his__ business; I wasn't referring to you nor to other people on this list. OK, the reading-skills remark starts to get recursive ... and btw the OP, even though marked in the subject as OT, was not entirely so, i.e. use of the R formula interface etc.

> One of those is not to ask questions that are primarily about a lack of
> statistical understanding (although they are not strictly prohibited).

The "lack of statistical understanding" was his own judgmental conclusion, which he should have kept to himself; if he starts throwing stones around, he should not expect to get flowers back. Prior to this I also received some totally out-of-place private emails from him, and I am not the kind of person that takes B.S. from anyone; he got the wrong guy. And in fact his great fallacious conclusions originated from his lack of reading skills; besides, I don't really think he read it at all, but just tried to run me down with his attacks and unwelcome remarks.

> Your original post suggests that you knew this, "I know there is
> plenty of people in this group who can give me a good answer :)" but
> chose to ignore it. Despite this, Bert was generous enough to give
> you some suggestions, perhaps not what you wanted but useful tips
> nonetheless.

My original post only suggested that I know there are people with knowledge of these practical applied-statistics problems, nothing else. Before addressing the list I talked to two TAs and one professor. Their help was generic but helpful nevertheless. I preferred to address the question to people with practical working knowledge of ANOVA (I don't think there is a huge population in this area), and the best place I can think of is the R list, the place where I would be subscribed if I worked on these problems every day. Statistics lists will be full of college students with knowledge equivalent to what I already have, and there they will probably only agree to say "yes, your QQ plot looks non-normal and heavy-tailed", which is what I already knew ... this is a similar answer to the one I got from a TA and a couple of student friends doing the MSc in Statistics track.

Mr. Gunter did not read/understand my problem, and there were no useful tips but only ad hominem attacks. By your side-taking I suspect you are in the same "party club"; if you want to defend him, maybe you should start by "tying your dog better", so to speak.

> Regarding your suggestion that the list be split into a "beginner" and
> "advanced" list, while that is one option, your original question was
> appropriate for neither. It was, however, very appropriate for a
> statistics list (e.g., http://stats.stackexchange.com/).

Thank you for the link, it looks very promising.

Best regards,
Giovanni
Re: [R] [OT] 1 vs 2-way anova technical question
On Nov 21, 2011, at 8:31 PM, Bert Gunter wrote:
> we disagree is that I think data analysts with limited statistical
> backgrounds should consult with local statisticians instead of trying
> to muddle through on their own thru lists like this. This is not meant

I think that people lacking reading skills should not be subscribed to lists like this one, bullying and creating confusion. I will ask as many times as I want/need, and the way I use lists is none of your f. b.

> to be arrogance on my part -- though it may seem to come across that
> way -- but rather a plea for good science. I believe that bad
> statistics --> bad science, a problem that I see as pervasive and
> inimical to scientific progress, especially in today's data saturated
> world.
>
> But enough of my off topic B.S. Please reply privately to not waste
> yet more space here (positively or negatively -- stone throwers need
> to catch them, too).

You are actually full of your off-topic prime matter, you arrogant prick.
Re: [R] [OT] 1 vs 2-way anova technical question
Hello Rob,

Thank you for your suggestions. I tried glm too, without success. Anyhow, I include all the information just in case someone with good knowledge can give me a hand with this.

I take the log of the response variable because:
- its values span multiple orders of magnitude
- the diagnostic plots, e.g. QQ, residuals vs fitted etc., do improve with it.

Below I include:
1) general summary of my data
2) 1-way anova and summary of the model
3) 4-way anova and summary of the model

Attached:
a) overview of the data (where the main interactions occur, i.e. No_databases and No_middlewares)
b) diagnostic plots for 2). Here the Normality assumption of the residuals looks reasonable.
c) diagnostic plots for 3). Here the Normality assumption of the residuals does not seem to hold; does this invalidate the 4-way aov model? I tried glm and it delivers results similar to 3).

My impression is that my system is heavily polluted with outliers; one can see from plot a) how much the mean and the median differ due to the outliers. That's just the way the system I implemented behaves. Btw the system is a multi-tiered architecture that I developed in Java from scratch and that includes XA and different data access and partitioning patterns. I need to quantitatively analyze and draw conclusions from this system. Most of my classmates just keep it real simple: run the 2^k experiments, take one grand mean out of each experiment, do the ANOVA on those means (i.e. 1 repetition), compute the fraction of variation and that's it. I am trying to model it more deeply by checking model assumptions, etc.

Many thanks in advance,
Best regards,
Giovanni

> str(throughput)
'data.frame': 479 obs. of 9 variables:
 $ Time              : num 7 8 9 10 11 12 13 14 15 16 ...
 $ Throughput        : int 155 155 154 157 155 214 4631 2118 136 132 ...
 $ Workload          : chr "All" "All" "All" "All" ...
 $ No_databases      : Factor w/ 2 levels "1","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Partitioning      : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 1 1 1 1 1 ...
 $ No_middlewares    : Factor w/ 3 levels "1","2","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Queue_size        : Factor w/ 2 levels "40","100": 1 1 1 1 1 1 1 1 1 1 ...
 $ No_clients        : Factor w/ 1 level "64": 1 1 1 1 1 1 1 1 1 1 ...
 $ Experimental_error: Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...

> summary(throughput)
      Time         Throughput      Workload        No_databases    Partitioning  No_middlewares
 Min.   : 7.00   Min.   :  35.0   Length:479        1:239   sharding   :240      1:160
 1st Qu.:11.50   1st Qu.:  50.5   Class :character  4:240   replication:239      2:159
 Median :16.00   Median : 744.0   Mode  :character                               4:160
 Mean   :16.48   Mean   : 830.3
 3rd Qu.:21.00   3rd Qu.:1205.5
 Max.   :26.00   Max.   :4631.0

 Queue_size  No_clients  Experimental_error
 40 :240     64:479      1:479
 100:239

## ANOVA "one-way" (additive, no interactions)
> throughput.aov <- aov(log(Throughput)~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)
> throughput.aov
Call:
   aov(formula = log(Throughput) ~ No_databases + Partitioning +
    No_middlewares + Queue_size, data = throughput)

Terms:
                No_databases Partitioning No_middlewares Queue_size Residuals
Sum of Squares      521.5264       5.6971        50.5814     0.4628  476.6826
Deg. of Freedom            1            1              2          1       473

Residual standard error: 1.003885
Estimated effects may be unbalanced

> summary(throughput.aov)
                Df Sum Sq Mean Sq  F value    Pr(>F)
No_databases     1 521.53  521.53 517.4974 < 2.2e-16 ***
Partitioning     1   5.70    5.70   5.6530   0.01782 *
No_middlewares   2  50.58   25.29  25.0953 4.381e-11 ***
Queue_size       1   0.46    0.46   0.4592   0.49833
Residuals      473 476.68    1.01
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

## ANOVA 4-way (full interactions)
> throughput.aov <- aov(log(Throughput)~No_databases*Partitioning*No_middlewares*Queue_size,data=throughput)
> throughput.aov
Call:
   aov(formula = log(Throughput) ~ No_databases * Partitioning *
    No_middlewares * Queue_size, data = throughput)

Terms:
                No_databases Partitioning No_middlewares Queue_size No_databases:Partitioning
Sum of Squares      521.5264       5.6971        50.5814     0.4628                   96.9198
Deg. of Freedom
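The two fits and their residual diagnostics above can be reproduced along these lines. A minimal, untested sketch, assuming a `throughput` data frame with the columns shown in the `str()` output:

```r
# Additive model (no interactions), log response
fit_add <- aov(log(Throughput) ~ No_databases + Partitioning +
               No_middlewares + Queue_size, data = throughput)

# Full 4-way interaction model
fit_int <- aov(log(Throughput) ~ No_databases * Partitioning *
               No_middlewares * Queue_size, data = throughput)

# Residual diagnostics: residuals-vs-fitted and normal QQ for each fit
# (aov objects inherit from lm, so plot.lm applies)
op <- par(mfrow = c(2, 2))
plot(fit_add, which = 1:2)  # additive model
plot(fit_int, which = 1:2)  # interaction model
par(op)
```

Comparing the two QQ panels side by side is a quick way to see whether the interaction terms are what breaks the normality assumption.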
Re: [R] [OT] 1 vs 2-way anova technical question
Hello Bert,

Thank you for taking the time to try to answer.

1) I know this; however, if one is interested in only the interaction between two specific factors, then in R one uses I(A*B*C), meaning a 3-way anova for that, and not the implicit 2-way interactions that would otherwise be computed.

2) True, but it fails.

3) No, I don't have any factors with one level; I never said that. It would not be a 2^k experiment otherwise; my OP states clearly that this is a 2^k experimental design.

4) This is only the judgmental attitude that many people unfortunately have on some of these lists: focusing on ad hominem judgements or even attacks to try to prove their superiority, without actually answering or adding any value to the question at hand. I have taken many graduate courses in subjects that all have Statistics in the title and passed all of them. However, as an experienced Software Engineer working for more than 10 years in the field, I can tell you that there is a huge difference between solving toy problems and implementing real-life complex projects. The same rules apply here: one thing is the toy examples one finds in R books and course exercises, and a totally different story is the real-life data I am trying to model. I'm a student on the quantitative side and learning, so I do have some gaps; I am curious and trying to learn, and I think there is no shame in that. If this makes you upset maybe you should ask to split the list in two or more: an "Advanced-PhD-black-belt-10th-dan-in-Statistics-and-R level" list and a "newbies" list.

Best regards,
Giovanni

On Nov 21, 2011, at 3:55 PM, Bert Gunter wrote:
> Giovanni:
>
> 1. Please read ?formula and/or An Introduction to R for how to specify
> linear models in R.
>
> 2. Correct specification of what you want (if I understand correctly) is
> log(R) ~ A*B + C + D
>
> 3. ... which presumably will also fail because some of your factors
> have only one level, which means that you cannot use them in your
> model.
>
> 4. ...
> which, in turn, suggests you don't know what you're doing
> statistically and should seek local assistance, especially in trying
> to interpret a fit to an unbalanced model (you can't do it as you
> probably think you can).
>
> I should say in your defense that posts on this list indicate that
> point 4 is a widely shared problem among posters here.
>
> Cheers,
> Bert
>
> On Mon, Nov 21, 2011 at 5:02 AM, Giovanni Azua wrote:
>> Hello,
>>
>> A couple of clarifications:
>> - A,B,C,D are factors and I am also interested in possible interactions, but
>> the model that comes out of aov R~A*B*C*D violates the model assumptions
>> - My 2^k is unbalanced, i.e. missing data, and I also include an additional
>> level in one of the factors, i.e. C
>> - I was referring in the OP to the 4-way interactions and not 2-way; I'm
>> sorry for my confusion.
>> - I tried to create an aov model with fewer interactions this way, but I get
>> the following error:
>>
>> model.aov <- aov(log(R)~A+B+I(A*B)+C+D,data=throughput)
>> Error in `contrasts<-`(`*tmp*`, value = "contr.treatment") :
>>   contrasts can be applied only to factors with 2 or more levels
>> In addition: Warning message:
>> In Ops.factor(A, B) : * not meaningful for factors
>>
>> Here I was trying to say: do a one-way anova except for the A and B factors,
>> for which I would like to get their 2-way interaction ...
>>
>> Thanks in advance,
>> Best regards,
>> Giovanni
>>
>> On Nov 21, 2011, at 12:04 PM, Giovanni Azua wrote:
>>
>>> Hello,
>>>
>>> I know there are plenty of people in this group who can give me a good
>>> answer :)
>>>
>>> I have a 2^k model where k=4, like this:
>>> Model 1) R~A*B*C*D
>>>
>>> If I use the "*" in R among all elements it means to me to explore all
>>> interactions and include them in the model, i.e. I think this would be the
>>> so-called 2-way anova. However, if I do this, it leads to model violations:
>>> the homoscedasticity is violated, the normality assumption of the
>>> sample errors (i.e. residuals) is violated, etc. I tried correcting the issues
>>> using different standard transformations: log, sqrt, Box-Cox forms etc., but
>>> none really improves the result. In this case, even though the model
>>> assumptions do not hold, some of the interactions are found to
>>> significantly influence the response variable. But then shall I trust the
>>> results of this Model 1) given that the assumptions do not hold?
>>>
>>> Then I try this other model where I exclude
Re: [R] [OT] 1 vs 2-way anova technical question
Hello,

A couple of clarifications:
- A,B,C,D are factors and I am also interested in possible interactions, but the model that comes out of aov R~A*B*C*D violates the model assumptions
- My 2^k is unbalanced, i.e. missing data, and I also include an additional level in one of the factors, i.e. C
- I was referring in the OP to the 4-way interactions and not 2-way; I'm sorry for my confusion.
- I tried to create an aov model with fewer interactions this way, but I get the following error:

model.aov <- aov(log(R)~A+B+I(A*B)+C+D,data=throughput)
Error in `contrasts<-`(`*tmp*`, value = "contr.treatment") :
  contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In Ops.factor(A, B) : * not meaningful for factors

Here I was trying to say: do a one-way anova except for the A and B factors, for which I would like to get their 2-way interaction ...

Thanks in advance,
Best regards,
Giovanni

On Nov 21, 2011, at 12:04 PM, Giovanni Azua wrote:
> Hello,
>
> I know there are plenty of people in this group who can give me a good answer :)
>
> I have a 2^k model where k=4, like this:
> Model 1) R~A*B*C*D
>
> If I use the "*" in R among all elements it means to me to explore all
> interactions and include them in the model, i.e. I think this would be the
> so-called 2-way anova. However, if I do this, it leads to model violations:
> the homoscedasticity is violated, the normality assumption of the sample
> errors (i.e. residuals) is violated, etc. I tried correcting the issues using
> different standard transformations: log, sqrt, Box-Cox forms etc., but none
> really improves the result. In this case, even though the model assumptions do
> not hold, some of the interactions are found to significantly influence the
> response variable. But then shall I trust the results of this Model 1) given
> that the assumptions do not hold?
>
> Then I try this other model where I exclude the interactions (is this the
> 1-way anova?):
> Model 2) R~A+B+C+D
>
> In this one the model assumptions hold, except for the existence of some outliers
> and a slightly heavy tail in the QQ-plot.
>
> Given that the assumptions for Model 1) do not hold, I assume I should ignore
> the results altogether for Model 1), or? Or instead can I safely use the Sum
> Sq. of Model 1) to get my table of percent of variations?
>
> This to me was a bit counter-intuitive, since I assumed that if there was
> collinearity among factors (and there is, e.g. I(A*B*C)) and I included those
> interactions in Model 1), my model would be more accurate ... ok, this
> turned into a brand new topic of model selection, but I am mostly interested
> in the question: if the model assumptions are violated, can I or must I not use
> the results, e.g. Sum Sq, for that model?
>
> Can anyone advise please?
>
> btw I have bought most books on R and statistical analysis. I have researched
> them all and the ANOVA coverage is very shallow in most of them, especially in
> the R-sy ones; they just offer a slightly pimped-up version of the R help.
>
> I am also unofficially following a course on ANOVA at the university I am
> registered in, and most examples are too simplistic: either the assumptions
> just hold easily, or the assumptions don't hold and nothing happens.
>
> Thanks in advance,
> Best regards,
> Giovanni
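Regarding the `I(A*B)` error quoted earlier in this thread: in R model formulas the two-factor interaction is written `A:B` (or `A*B`, which expands to `A + B + A:B`), whereas `I(A*B)` requests literal arithmetic multiplication, which is undefined for factors and produces exactly the `Ops.factor` warning shown. A sketch with the generic names from the post (assuming `A`..`D` are factor columns of `throughput`):

```r
# Main effects of all four factors, plus only the A-by-B interaction
model.aov <- aov(log(R) ~ A + B + A:B + C + D, data = throughput)

# Equivalent shorthand: A*B expands to A + B + A:B
model.aov <- aov(log(R) ~ A*B + C + D, data = throughput)
```

This is the "one-way anova except for A and B" model the post was aiming for.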
[R] [OT] 1 vs 2-way anova technical question
Hello,

I know there are plenty of people in this group who can give me a good answer :)

I have a 2^k model where k=4, like this:
Model 1) R~A*B*C*D

If I use the "*" in R among all elements, it means to me to explore all interactions and include them in the model, i.e. I think this would be the so-called 2-way anova. However, if I do this, it leads to model violations: the homoscedasticity is violated, the normality assumption of the sample errors (i.e. residuals) is violated, etc. I tried correcting the issues using different standard transformations: log, sqrt, Box-Cox forms etc., but none really improves the result. In this case, even though the model assumptions do not hold, some of the interactions are found to significantly influence the response variable. But then, shall I trust the results of this Model 1) given that the assumptions do not hold?

Then I try this other model where I exclude the interactions (is this the 1-way anova?):
Model 2) R~A+B+C+D

In this one the model assumptions hold, except for the existence of some outliers and a slightly heavy tail in the QQ-plot.

Given that the assumptions for Model 1) do not hold, I assume I should ignore the results altogether for Model 1), or? Or instead can I safely use the Sum Sq. of Model 1) to get my table of percent of variations?

This to me was a bit counter-intuitive, since I assumed that if there was collinearity among factors (and there is, e.g. I(A*B*C)) and I included those interactions, my model would be more accurate ... ok, this turned into a brand new topic of model selection, but I am mostly interested in the question: if the model assumptions are violated, can I or must I not use the results, e.g. Sum Sq, for that model?

Can anyone advise please?

btw I have bought most books on R and statistical analysis. I have researched them all and the ANOVA coverage is very shallow in most of them, especially in the R-sy ones; they just offer a slightly pimped-up version of the R help.
I am also unofficially following a course on ANOVA at the university I am registered in, and most examples are too simplistic: either the assumptions just hold easily, or the assumptions don't hold and nothing happens.

Thanks in advance,
Best regards,
Giovanni
[R] anova to pca
Hello,

I would like to reinforce my anova results using PCA, i.e. to see which factors are most important because they explain most of the variance (i.e. signal) of my 2^k*r experiment. However, I get the following error while trying to run PCA:

> throughput.prcomp <- prcomp(~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)
Error in prcomp.formula(~No_databases + Partitioning + No_middlewares + :
  PCA applies only to numerical variables

What is the most R-like, concise way to map/transform those factor values into numerical values in a way suitable for PCA analysis? My first attempt would be:

# C++ "style"
throughput$No_databases_num <- (throughput$No_databases == 1) ? -1 : 1
throughput$Partitioning_num <- (throughput$Partitioning == "sharding") ? -1 : 1
etc.

How can I do this the R way? Would these -1, 1 be sensible for a PCA analysis, or does it just not matter? How about a factor for which I have 3 levels? -1, 0 and 1?

Many thanks in advance,
Best regards,
Giovanni
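Two common ways to code the factors numerically, sketched with the factor names from the prcomp call above (untested; the `_num` column names are just illustrative): for two-level factors a -1/+1 coding is the classical 2^k design convention, and `model.matrix` generalizes this to any number of levels via dummy columns.

```r
# Two-level factors: map to the classical -1/+1 design coding
throughput$No_databases_num <- ifelse(throughput$No_databases == "1", -1, 1)
throughput$Partitioning_num <- ifelse(throughput$Partitioning == "sharding", -1, 1)

# General case: let model.matrix build indicator (dummy) columns;
# "-1" drops the intercept column; works for factors with any number of levels
X <- model.matrix(~ No_databases + Partitioning + No_middlewares +
                    Queue_size - 1, data = throughput)

# PCA on the numeric design matrix; scale. = TRUE standardizes columns
pca <- prcomp(X, scale. = TRUE)
summary(pca)
```

Note that the single-level factors (No_clients, Experimental_error) must be left out, since a constant column cannot be scaled.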
[R] aov how to get the SST?
Hello,

I currently run aov in the following way:

> throughput.aov <- aov(log(Throughput)~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)
> summary(throughput.aov)
                Df Sum Sq Mean Sq  F value    Pr(>F)
No_databases     1 184.68 184.675 136.6945 < 2.2e-16 ***
Partitioning     1  70.16  70.161  51.9321 2.516e-12 ***
No_middlewares   2  44.22  22.110  16.3654 1.395e-07 ***
Queue_size       1   0.40   0.395   0.2926    0.5888
Residuals      440 594.44   1.351
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In order to compute the fraction of variation I need to know the total Sum Sq., and I assume it is like this:

SST = SS-No_databases + SS-Partitioning + SS-No_middlewares + SS-Queue_size
    = 184.68 + 70.16 + 44.22 + 0.40 = 299.46

So the fraction of variation explained by No_databases would be:

SS-No_databases / SST = 184.68 / 299.46 = 0.6167101

... and finally I can say that No_databases explains 61.7% of the variation in Throughput. Is this correct? If so, how can I do the same calculations using R? I haven't found a way to extract the Sum Sq out of the throughput.aov object. Is there a function to get this 0.6167101 (61.7%) result without having to do it manually? Even better if I can get a table containing all these fractions of variation. Since this is a 2^k experiment, I can't see how the Residuals fit in the formula. When I introduce replications (blocking factor), then I can also include an SSE term into the SST calculation.

TIA,
Best regards,
Giovanni
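The per-term sums of squares can be pulled out of the fitted object programmatically; `summary.aov` returns a list whose first element is a data frame with a "Sum Sq" column. A sketch (untested, reusing the model above; note that the total SS here also includes the Residuals term):

```r
fit <- aov(log(Throughput) ~ No_databases + Partitioning +
           No_middlewares + Queue_size, data = throughput)

tab  <- summary(fit)[[1]]        # ANOVA table as a data frame
ss   <- tab[, "Sum Sq"]          # one entry per term, incl. Residuals
sst  <- sum(ss)                  # total sum of squares
frac <- ss / sst                 # fraction of variation per term
names(frac) <- rownames(tab)
round(100 * frac, 1)             # table of percentages, in one shot
```

Whether Residuals belongs in SST is the statistical question raised in the post; dropping the last element of `ss` before summing reproduces the hand calculation instead.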
[R] boxplot strange behavior
Hello,

I generate box plots from my data like this:

qplot(x=xxx,y=column,data=data,geom="boxplot") + xlab("xxx") + ylab(ylabel) +
  theme_bw() + scale_y_log10() + geom_jitter(alpha=I(1/10))

The problem is that I see a lot of points above the maximum whisker, at the same level as some outliers. It looks very weird, as I expected the outliers to be "few", and especially I did not expect to see any points other than the outliers below the minimum or above the maximum indicator. Can anyone explain what's going on?

Attached I send a snapshot: see the second level having some outliers; separate from the outliers there are also some points which seem not to be outliers and that are above the maximum indicator. Is this a bug or am I missing something?

TIA,
Best regards,
Giovanni

PS: I sent this to the ggplot2 list too (sorry for the double post but I am kind of under pressure with this)
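A likely explanation: geom_jitter() draws every raw observation on top of the boxplot, so outliers get drawn twice and jittered non-outlier points can land above the whisker. A hedged sketch (the `data`, `xxx` and `column` names are the poster's placeholders) that suppresses the boxplot's own outlier glyphs so each point is drawn exactly once:

```r
library(ggplot2)

# outlier.shape = NA hides the boxplot's own outlier points;
# geom_jitter() then draws every observation exactly once
ggplot(data, aes(x = xxx, y = column)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(alpha = 1/10) +
  scale_y_log10() +
  theme_bw()
```

Note also that with scale_y_log10() the box statistics are computed on the transformed values, which changes which points count as outliers.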
[R] aov output question
Hello,

I currently get anova results out of the aov function (see below). I use model.tables, and I believe it gives me back the model parameters of the fit (betas); however, I don't see the intercept (beta_0), and I don't understand what the "rep" output means; there is no description of it in the documentation.

Another question: is there a function that outputs the results in a more meaningful way, e.g. showing the percentage of variation of each factor towards the response? I believe the formula would be something like, for factor X: (Sum_Sq_X / Sum_Sq_Total)*100.

Thanks in advance,
Best regards,
Giovanni

> #throughput.aov <- aov(Throughput~No_databases*Partitioning*No_middlewares*Queue_size,data=throughput)
> throughput.aov
Call:
   aov(formula = Throughput ~ No_databases + Partitioning +
    No_middlewares + Queue_size, data = throughput)

Terms:
                No_databases Partitioning No_middlewares Queue_size Residuals
Sum of Squares      43146975         7394        9061130      20710 195504055
Deg. of Freedom            1            1              2          1       433

Residual standard error: 671.9453
Estimated effects may be unbalanced

> summary(throughput.aov)
                Df    Sum Sq  Mean Sq F value    Pr(>F)
No_databases     1  43146975 43146975 95.5614 < 2.2e-16 ***
Partitioning     1      7394     7394  0.0164    0.8982
No_middlewares   2   9061130  4530565 10.0342 5.497e-05 ***
Queue_size       1     20710    20710  0.0459    0.8305
Residuals      433 195504055   451511
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> model.tables(throughput.aov,type="effects",se=TRUE)
Design is unbalanced - use se.contrast() for se's
Tables of effects

 No_databases
          1      4
     -317.1    310
rep   217.0    222

 Partitioning
     sharding replication
        4.303       -3.91
rep   209.000      230.00

 No_middlewares
          1      2    4
     -97.93 -108.2  199
rep  139.00  150.0  150

 Queue_size
         40     100
     -6.852   6.883
rep 220.000 219.000
>
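For reference: in model.tables output, "rep" is the number of observations (replications) at each factor level, and type="effects" reports deviations from the grand mean, which is why no intercept row appears. The intercept and the contrast coefficients are available from the underlying linear-model fit. A small sketch (untested, reusing the model above):

```r
fit <- aov(Throughput ~ No_databases + Partitioning +
           No_middlewares + Queue_size, data = throughput)

coef(fit)                           # (Intercept) plus treatment-contrast betas
model.tables(fit, type = "means")   # per-level cell means instead of effects
```

The percentage-of-variation table asked about in the post can be built from `summary(fit)[[1]][, "Sum Sq"]`, as there is no single built-in function for it.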
Re: [R] 2^k*r (with replications) experimental design question
Hi Dennis,

Thank you again :) What exactly do you mean by "blocking factor" — that it will be like the others? I'd prefer not to treat the replicates as random, but rather to account for the experimental error using the replicates. Ahhh, I see what you mean: the experimental error will show up as the SS of my new variable "Replicate" ... great! Thank you!

Best regards,
Giovanni

On Nov 14, 2011, at 2:38 AM, Dennis Murphy wrote:
> I'm guessing you have nine replicates of a 2^5 factorial design with a
> couple of missing values. If so, define a variable to designate the
> replicates and use it as a blocking factor in the ANOVA. If you want
> to treat the replicates as a random rather than a fixed factor, then
> look into the nlme or lme4 packages.
>
> HTH,
> Dennis
>
> On Sun, Nov 13, 2011 at 4:33 PM, Giovanni Azua wrote:
>> Hello,
>>
>> I have one replication (r=1 of the 2^k*r) of a 2^k experimental design in
>> the context of performance analysis, i.e. my response variables are
>> Throughput and Response Time. I use the "aov" function and the results look
>> ok:
>>
>>> str(throughput)
>> 'data.frame': 286 obs. of 7 variables:
>>  $ Time          : int 6 7 8 9 10 11 12 13 14 15 ...
>>  $ Throughput    : int 42 44 33 41 43 40 37 40 42 37 ...
>>  $ No_databases  : Factor w/ 2 levels "1","4": 1 1 1 1 1 1 1 1 1 1 ...
>>  $ Partitioning  : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 1 1 1 1 1 ...
>>  $ No_middlewares: Factor w/ 2 levels "2","4": 1 1 1 1 1 1 1 1 1 1 ...
>>  $ Queue_size    : Factor w/ 2 levels "40","100": 1 1 1 1 1 1 1 1 1 1 ...
>>  $ No_clients    : Factor w/ 1 level "128": 1 1 1 1 1 1 1 1 1 1 ...
>>> head(throughput)
>>   Time Throughput No_databases Partitioning No_middlewares Queue_size
>> 1    6         42            1     sharding              2         40
>> 2    7         44            1     sharding              2         40
>> 3    8         33            1     sharding              2         40
>> 4    9         41            1     sharding              2         40
>> 5   10         43            1     sharding              2         40
>> 6   11         40            1     sharding              2         40
>>
>>> throughput.aov <- aov(Throughput~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)
>>> summary(throughput.aov)
>>                 Df    Sum Sq  Mean Sq F value    Pr(>F)
>> No_databases     1  28488651 28488651 53.4981 2.713e-12 ***
>> Partitioning     1     71687    71687  0.1346  0.713966
>> No_middlewares   1   5624454  5624454 10.5620  0.001295 **
>> Queue_size       1     50892    50892  0.0956  0.757443
>> Residuals      281 149637226   532517
>> ---
>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> This is somewhat what I expected and I am happy: it is saying that the
>> Throughput is significantly affected firstly by the number of database
>> instances and secondly by the number of middleware instances.
>>
>> The problem is that I need to integrate multiple replications of this same
>> 2^k so I can also account for experimental error, i.e. the _r_ of 2^k*r, but
>> I can't see how to integrate the _r_ term into the data and into the aov
>> function parameters. Can anyone advise?
>>
>> TIA,
>> Best regards,
>> Giovanni
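Dennis's suggestion can be sketched as follows (untested; `rep1`..`rep3` are hypothetical data frames, one per replication run, with the columns shown in the thread):

```r
# Tag each replication, stack them, and use Replicate as a blocking factor
rep1$Replicate <- factor(1)
rep2$Replicate <- factor(2)
rep3$Replicate <- factor(3)
throughput <- rbind(rep1, rep2, rep3)

throughput.aov <- aov(Throughput ~ Replicate + No_databases + Partitioning +
                      No_middlewares + Queue_size, data = throughput)
summary(throughput.aov)  # the Replicate row absorbs run-to-run variation
```

With the replicates stacked, the Residuals line now reflects within-cell experimental error rather than being confounded with it.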
[R] 2^k*r (with replications) experimental design question
Hello,

I have one replication (r=1 of the 2^k*r) of a 2^k experimental design in the context of performance analysis, i.e. my response variables are Throughput and Response Time. I use the "aov" function and the results look ok:

> str(throughput)
'data.frame': 286 obs. of 7 variables:
 $ Time          : int 6 7 8 9 10 11 12 13 14 15 ...
 $ Throughput    : int 42 44 33 41 43 40 37 40 42 37 ...
 $ No_databases  : Factor w/ 2 levels "1","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Partitioning  : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 1 1 1 1 1 ...
 $ No_middlewares: Factor w/ 2 levels "2","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ Queue_size    : Factor w/ 2 levels "40","100": 1 1 1 1 1 1 1 1 1 1 ...
 $ No_clients    : Factor w/ 1 level "128": 1 1 1 1 1 1 1 1 1 1 ...

> head(throughput)
  Time Throughput No_databases Partitioning No_middlewares Queue_size
1    6         42            1     sharding              2         40
2    7         44            1     sharding              2         40
3    8         33            1     sharding              2         40
4    9         41            1     sharding              2         40
5   10         43            1     sharding              2         40
6   11         40            1     sharding              2         40

> throughput.aov <- aov(Throughput~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)
> summary(throughput.aov)
                Df    Sum Sq  Mean Sq F value    Pr(>F)
No_databases     1  28488651 28488651 53.4981 2.713e-12 ***
Partitioning     1     71687    71687  0.1346  0.713966
No_middlewares   1   5624454  5624454 10.5620  0.001295 **
Queue_size       1     50892    50892  0.0956  0.757443
Residuals      281 149637226   532517
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This is somewhat what I expected and I am happy: it is saying that the Throughput is significantly affected firstly by the number of database instances and secondly by the number of middleware instances.

The problem is that I need to integrate multiple replications of this same 2^k so I can also account for experimental error, i.e. the _r_ of 2^k*r, but I can't see how to integrate the _r_ term into the data and into the aov function parameters. Can anyone advise?
TIA,
Best regards,
Giovanni
[R] issue plotting TukeyHSD
Hello,

When I try to use TukeyHSD in the following way, it shows the confidence intervals corresponding to the last factor only:

> throughput.aov <- aov(Throughput~No_databases+Partitioning+No_middlewares+Queue_size,data=throughput)
> plot(TukeyHSD(throughput.aov)) # I expected to see the confidence intervals for all factors, but I see only the last

OTOH this one works, but then it is unreadable due to the long labels of the combined effects on the Y-axis ticks:

> throughput.aov <- aov(Throughput~No_databases*Partitioning*No_middlewares*Queue_size,data=throughput)

TIA,
Best regards,
Giovanni
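The likely cause: plot.TukeyHSD draws one page per factor, so on an interactive device each page overwrites the previous one and only the last factor remains visible. TukeyHSD's `which` argument restricts the comparisons to one factor at a time. A sketch (untested, reusing the additive model above):

```r
fit <- aov(Throughput ~ No_databases + Partitioning +
           No_middlewares + Queue_size, data = throughput)

# One factor at a time, so each page can actually be inspected
plot(TukeyHSD(fit, which = "No_databases"))
plot(TukeyHSD(fit, which = "No_middlewares"))

# Or keep all factors and pause between pages
devAskNewPage(TRUE)
plot(TukeyHSD(fit))
devAskNewPage(FALSE)
```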
Re: [R] 2^k experiment generator
Never mind, found it: it is the expand.grid function.

On Nov 13, 2011, at 3:25 PM, Giovanni Azua wrote:
> Hello,
>
> While looking for info on 2^k experimental design and anova, I remember I saw
> somewhere there was a function to generate all the experiments. I can't find
> the function anymore; can anyone suggest?
>
> The function takes as input the factors and levels and generates all the
> experiments. I know I can do it myself using recursion, but I want to avoid 1)
> reinventing the wheel and 2) making mistakes while at it.
>
> TIA,
> Best regards,
> Giovanni
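For the archives, expand.grid builds the full factorial design directly from the factor levels. A sketch using the levels that appear elsewhere in this thread:

```r
# One row per experiment: the full cross of all factor levels
design <- expand.grid(No_databases   = c(1, 4),
                      Partitioning   = c("sharding", "replication"),
                      No_middlewares = c(1, 2, 4),
                      Queue_size     = c(40, 100))
nrow(design)  # 2 * 2 * 3 * 2 = 24 experiments
head(design)
```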
Re: [R] dev.new() within a loop
On Nov 13, 2011, at 3:23 PM, David Winsemius wrote:
>>> Please read both my comments and the FAQ more carefully. You are
>>> inadequately considering the information that has been offered to you.
>>
>> Ok, you wanted to make sure I had read the FAQ; well, I didn't have to :)
>> Googling using your suggestion found relevant matches and now it works.
>
> Where does this resistance to reading the FAQ come from?

It is not resistance; the FAQ is very helpful for basic general questions, but it cannot cover all details. Sometimes it is very hard to find a specific, detailed answer within a general FAQ. I have read the FAQ; what makes you think I didn't before? I just could not find the information I was looking for.

>> I had to wrap the ggplot call within a "print" for my loop to work, which is
>> IMHO one of the most obfuscated use-cases for print I have bumped into.
>
> It is a case of isolating the functional aspects of the plot-construction
> process from the side-effects of interacting with a graphics device.
>
>> So every user-defined function that tries to plot anything has to be wrapped
>> inside a print just in case it ever gets called from within a loop
>
> That is not how I understand it. One reason is so there is an object in the
> workspace that can be later modified. And I suspect the authors (and I am not
> one of them) imagined that there may be multiple steps in creation of the
> object, not all of which should result in a plot appearing if this is being
> done programmatically. This is especially appropriate (it would seem to me)
> for the ggplot plotting model, which adds a variety of layers to a core
> object. It is also imagined that you may be sending this object to one of a
> variety of devices.

Ok, the bottom line is: always wrap the plot call, whatever it is, within a print, just in case.
Cheers,
Giovanni
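The resolution of this thread is R FAQ 7.22 in a nutshell: grid-based plots (lattice, ggplot2) are objects that are only drawn by auto-printing at the top level, which does not happen inside for loops or functions, so an explicit print() is needed. A minimal self-contained illustration using the built-in mtcars data:

```r
library(ggplot2)

for (v in unique(mtcars$cyl)) {
  p <- ggplot(subset(mtcars, cyl == v), aes(x = wt, y = mpg)) +
    geom_point()
  print(p)  # without print(), nothing appears inside the loop
}
```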
[R] 2^k experiment generator
Hello, While looking for info on 2^k experimental design and anova I remember I saw somewhere there was a function to generate all the experiments. I can't find the function anymore can anyone suggest? The function takes as input the factors and levels and generates all the experiments. I know I can do it myself using recursion but I want to avoid 1) reinventing the wheel and 2) making mistakes while at it. TIA, Best regards, Giovanni __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] dev.new() within a loop
Hello David,

On Nov 13, 2011, at 5:20 AM, David Winsemius wrote:
>> However, when executing plot_raw which invokes dev.new(..) all windows come out blank whereas if I execute each file outside of a loop then I can see the plots properly.
>
> Perhaps ... (you did not say what package this plot_raw function comes from) ... Read the FAQ about why lattice plots don't print. (It applies to all grid-based plotting functions.)
>

plot_raw is my own function, which just calls ggplot2, so basically I am not using lattice. What can I do differently to avoid the new windows coming back empty?

Thanks in advance,
Best regards,
Giovanni

plot_raw <- function(data, connect=TRUE, y_break=500, y_top=-1, label="")
{
  dev.new()
  title <- paste(label, sep="")
  if (y_top == -1) {
    y_top <- max(data$Y)
  }
  if (!decouple) {
    # add fake group
    data$Workload <- 'All'
  }
  p <- ggplot(data, aes(x=Time, y=Y, group=Workload, shape=Workload, colour=Workload)) +
    geom_point(fill="white", size=3) +
    scale_y_continuous(breaks=seq(0, max(data$Y), y_break), limits=c(0, y_top)) +
    # NB: this second scale_y_continuous overrides the one above
    scale_y_continuous(breaks=seq(0, y_limit_top(data$Y, data$se), y_break_step(data$Y, data$se)),
                       limits=c(0, y_limit_top(data$Y, data$se))) +
    opts(title=title) + theme_bw() +
    scale_x_continuous(breaks=data$Time, labels=as.character(data$Time))
  if (connect) {
    p + geom_line()
  } else {
    p
  }
}
[R] dev.new() within a loop
Hello,

I have a loop where I iterate performance data files within a folder, then parse and plot them in one shot (see below). However, when executing plot_raw, which invokes dev.new(..), all windows come out blank, whereas if I process each file outside of the loop then I can see the plots properly. What's wrong here?

Thanks in advance,
Best regards,
Giovanni

# given a directory name, it will iterate all files that match the given pattern
#basedir <- "/Users/bravegag/code/asl11/data/2k-r1-test-2011_data/"
basedir <- "/Users/bravegag/code/asl11/data/nclients_2_128-2010_data/"
pattern <- paste("logs.*cl\\-.*mw\\-.*db\\-.*\\-client\\.dat", sep="")
all_files <- dir(path=basedir, pattern=pattern)

throughput <- NULL
response <- NULL
#file_name <- all_files[1]

# iterate all files
for (file_name in all_files) {
  print(paste("processing", file_name, "..."))
  df <- read.table(paste(basedir, file_name, sep="")) # read the data as a data frame
  names(df)[names(df)=="V1"] <- "Time"
  names(df)[names(df)=="V2"] <- "Partitioning"
  names(df)[names(df)=="V3"] <- "Workload"
  names(df)[names(df)=="V4"] <- "Runtime"

  # get rid of first and last n minutes
  df <- subset(df, df$Time > warmup_cooldown_minutes)
  df <- subset(df, df$Time < (max(df$Time) - warmup_cooldown_minutes))

  # ==========
  # Throughput
  # ==========
  if (decouple) {
    dft <- aggregate(x=df$Runtime, by=list(df$Time, df$Workload), FUN=length)
    names(dft)[names(dft)=="Group.1"] <- "Time"
    names(dft)[names(dft)=="Group.2"] <- "Workload"
    names(dft)[names(dft)=="x"] <- "Y"
  } else {
    dft <- aggregate(x=df$Runtime, by=list(df$Time), FUN=length)
    names(dft)[names(dft)=="Group.1"] <- "Time"
    names(dft)[names(dft)=="x"] <- "Y"
  }
  dft$se <- 0
  plot_raw(dft, connect=TRUE, label=file_name)
}
[R] 2^k*r experimental design and anova
Hello,

Can anyone point me to an online tutorial or book containing the easiest way to do ANOVA over the result data from a 2^k*r experiment? It is not clear to me whether I can pass the raw data corresponding to each experiment or just the summarized data, i.e. mean, SSE, std, etc. I would like to get:

- a box plot showing the effect of the different factors and levels
- a plot showing whether there are interactions between every pair of factors
- the results in tabular form showing the percent of influence of each factor and, since this is "r" (replicated), also the percent due to experimental error.

I would very much prefer if the plotting used ggplot2.

TIA,
Best regards,
Giovanni
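For a concrete starting point: a minimal base-R sketch of a 2^2 design with r replicates. `aov` fits main effects plus the interaction on the raw (per-replicate) data, and the per-term sums of squares give the "percent of influence" table directly, with the Residuals row being the experimental-error share. The factor names and effect sizes below are made up, and base graphics stand in for ggplot2:

```r
set.seed(1)
r <- 4                                       # replicates per design point
d <- expand.grid(A = c(-1, 1), B = c(-1, 1)) # 2^2 design in coded levels
d <- d[rep(seq_len(nrow(d)), each = r), ]    # replicate each cell r times
d$y <- 10 + 3 * d$A - 2 * d$B + 0.5 * d$A * d$B + rnorm(nrow(d))

fit <- aov(y ~ A * B, data = d)              # main effects + A:B interaction
ss  <- summary(fit)[[1]][["Sum Sq"]]         # SS for A, B, A:B, Residuals
pct <- 100 * ss / sum(ss)                    # percent of variation per term
round(pct, 1)

interaction.plot(factor(d$A), factor(d$B), d$y)  # visual check for interaction
```

Passing the raw replicate data (not the summaries) is what lets `aov` estimate the error term.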
[R] binning runtimes
Hello,

Suppose I have the dataset shown below. The number of observations is too massive to get a nice geom_point and smoother on top, so what I would like to do is bin the data first. The data is indexed by Time (minutes from 1 to 120, i.e. two hours of system benchmarking).

Option 1) Group the data by Time, i.e. minute 1, minute 2, etc., and within each group create bins of N consecutive observations and average them into one observation; the bins become the new data points to use for the geom_point plot. How can I do this? A shingle? How?

Option 2) Another option is to again group by Time, i.e. minute 1, minute 2, etc., and within each group draw a random observation to be the representative for the corresponding bin. I could not clearly see how to do the random draw.

> dfs <- subset(df, Partitioning == "Sharding")
> head(dfs)
  Time Partitioning Workload Runtime
1    1     Sharding    Query    3301
2    1     Sharding    Query    3268
3    1     Sharding    Query    2878
4    1     Sharding    Query    2819
5    1     Sharding    Query    3310
6    1     Sharding    Query    3428
> str(dfs)
'data.frame': 102384 obs. of 4 variables:
 $ Time        : int 1 1 1 1 1 1 1 1 1 1 ...
 $ Partitioning: Factor w/ 2 levels "Replication",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Workload    : Factor w/ 2 levels "Query","Refresh": 1 1 1 1 1 1 1 1 1 1 ...
 $ Runtime     : int 3301 3268 2878 2819 3310 3428 2837 2954 2902 2936 ...
>

Many thanks in advance,
Best regards,
Giovanni
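Option 1 can be done in base R with `ave` to assign a bin index to each run of N consecutive observations within each Time group, and `aggregate` to average each bin; option 2 is then a random draw per (Time, bin) cell. A sketch on made-up data (`dfs` here stands in for the real subset):

```r
set.seed(7)
dfs <- data.frame(Time    = rep(1:3, each = 10),       # 3 minutes, 10 obs each
                  Runtime = rpois(30, lambda = 3000))
N <- 5                                                 # bin size in observations

# Option 1: bin index per N consecutive observations within each Time group,
# then average each bin into one data point
dfs$bin <- ave(dfs$Runtime, dfs$Time,
               FUN = function(x) (seq_along(x) - 1) %/% N)
binned <- aggregate(Runtime ~ Time + bin, data = dfs, FUN = mean)
head(binned)

# Option 2: one random representative observation per (Time, bin) cell
picked <- do.call(rbind, lapply(split(dfs, list(dfs$Time, dfs$bin)),
                                function(g) g[sample(nrow(g), 1), ]))
```

Because `ave` numbers observations in their original order within each group, binning by consecutive observations preserves the natural measurement order.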
[R] code review: is it too much to ask?
Hello all,

I really appreciate how helpful the people in this list are. Would it be too much to ask to send a small script to have it peer-reviewed, to make sure I am not making blatant mistakes? The script takes an experiment.dat as input and generates system throughput plots using ggplot2. It works now ... [sigh] but I have this nasty feeling that I might be doing something wrong :). Changing "samples", i.e. the number of samples per group, produces arbitrarily different results; I basically increased it (up to 9) until there were no strongly deterministic periodicities.

This is not a full-fledged experiment but just a preliminary report that will show I have implemented a healthy system. Proper experimental analysis comes after varying factors according to the 2^k*r experimental design etc.

Some key points I would like to find out:

- aggregation is not breaking the natural order of the measurements, i.e. if there are 20 runtimes taken in that order, and I make groups of 10 measurements (to compute statistics on them), the first group must contain the first 10 runtimes and the second group must contain the second 10 runtimes. I am not sure if the choice of aggregation etc. is respecting this.
- I am not sure if it is best to do the binning by filling the bins by time intervals or by number of observations.

Your help will be greatly appreciated! I have the data too and the plots look very nice, but it is a 4 MB file.
TIA
Best regards,
Giovanni

# =====================================================================
# Advanced Systems Lab
# Milestone 1
# Author: Giovanni Azua
# Date: 22 October 2011
# =====================================================================
rm(list=ls())    # clear workspace
library(boot)    # use boot library
library(ggplot2) # use ggplot2 library
library(doBy)    # use doBy library

# =====================================================================
# ETL Step
# =====================================================================
data_file <- file("/Users/bravegag/code/asl11/trunk/report/experiment.dat")
df <- read.table(data_file) # reads the data as data frame
class(df)                   # show the class to be 'list'
names(df)                   # data is prepared correctly in Python
str(df)
head(df)

names(df)[names(df)=="V1"] <- "Time" # change column names
names(df)[names(df)=="V2"] <- "Partitioning"
names(df)[names(df)=="V3"] <- "Workload"
names(df)[names(df)=="V4"] <- "Runtime"
str(df)
head(df)

# =====================================================================
# Define utility functions
# =====================================================================
se <- function(x) sqrt(var(x)/length(x))
sst <- function(x) sum((x - mean(x))^2) # NB: parentheses matter; sum(x-mean(x))^2 is (numerically) always 0

## COPIED FROM
## http://wiki.stdout.org/rcookbook/Graphs/Plotting%20means%20and%20error%20bars%20%28ggplot2%29
## *
## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
## If there are within-subject variables, calculate adjusted values using method from Morey (2008).
## data: a data frame.
## measurevar: the name of a column that contains the variable to be summarized
## groupvars: a vector containing names of columns that contain grouping variables
## na.rm: a boolean that indicates whether to ignore NA's
## conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {
  require(doBy)

  # New version of length which can handle NA's: if na.rm==T, don't count them
  length2 <- function (x, na.rm=FALSE) {
    if (na.rm) sum(!is.na(x))
    else length(x)
  }

  # Collapse the data
  formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" + "), sep=" ~ "))
  datac <- summaryBy(formula, data=data, FUN=c(length2, mean, sd), na.rm=na.rm)

  # Rename columns
  names(datac)[ names(datac) == paste(measurevar, ".mean", sep="") ] <- measurevar
  names(datac)[ names(datac) == paste(measurevar, ".sd", sep="") ] <- "sd"
[R] summarizing a data frame i.e. count -> group by
Hello,

This is one problem at a time :) I have a data frame df that looks like this:

   time partitioning_mode workload runtime
1     1          sharding    query     607
2     1          sharding    query      85
3     1          sharding    query      52
4     1          sharding    query      79
5     1          sharding    query      77
6     1          sharding    query      67
7     1          sharding    query      98
8     1          sharding  refresh    2932
9     1          sharding  refresh    2870
10    1          sharding  refresh    2877
11    1          sharding  refresh    2868
12    1       replication    query    2891
13    1       replication    query    2907
14    1       replication    query    2922
15    1       replication    query    2937

and if I could use SQL ... omg! I really wish I could! I would do exactly this:

insert into throughput
select time, partitioning_mode, count(*)
from data.frame
group by time, partitioning_mode

My attempted R versions are wrong and produce very cryptic error messages:

> throughput <- aggregate(x=df[,c("time", "partitioning_mode")], by=list(df$time,df$partitioning_mode), count)
Error in `[.default`(df2, u_id, , drop = FALSE) : incorrect number of dimensions
> throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode), count)
Error in `[.default`(df2, u_id, , drop = FALSE) : incorrect number of dimensions
> throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning), FUN=count)

I can't comprehend what comes out from this last one ... :( and I thought C++ template errors were the most cryptic ;P

Many many thanks in advance,
Best regards,
Giovanni
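The SQL `count(*) ... group by` translates to `aggregate` with `FUN=length`; part of the trouble above is that base R has no `count` function at all. A sketch on toy data mirroring the post:

```r
# Toy data mirroring the data frame in the post
df <- data.frame(time = rep(1, 15),
                 partitioning_mode = rep(c("sharding", "replication"), c(11, 4)),
                 runtime = 1:15)

# SELECT time, partitioning_mode, COUNT(*) ... GROUP BY time, partitioning_mode
throughput <- aggregate(runtime ~ time + partitioning_mode,
                        data = df, FUN = length)
names(throughput)[3] <- "count"
throughput
# equivalently: as.data.frame(table(df$time, df$partitioning_mode))
```

The formula interface groups by everything on the right-hand side, so no `by=list(...)` plumbing is needed.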
Re: [R] unfold list (variable number of columns) into a data frame
Hi Dennis,

Thank you, very nice :)

Best regards,
Giovanni

On Oct 23, 2011, at 6:55 PM, Dennis Murphy wrote:
> Hi:
>
> Here's one approach:
>
> # Function to process a list component into a data frame
> ff <- function(x) {
>   data.frame(time = x[1], partitioning_mode = x[2], workload = x[3],
>              runtime = as.numeric(x[4:length(x)]) )
> }
>
> # Apply it to each element of the list:
> do.call(rbind, lapply(data, ff))
>
> or equivalently, using the plyr package,
>
> library('plyr')
> ldply(data, ff)
>
> # Example:
> L <- list(c("1", "sharding", "query", "607", "85", "52", "79", "77", "67", "98"),
>           c("1", "sharding", "refresh", "2932", "2870", "2877", "2868"),
>           c("1", "replication", "query", "2891", "2907", "2922", "2937"))
> do.call(rbind, lapply(L, ff))
>    time partitioning_mode workload runtime
> 1     1          sharding    query     607
> 2     1          sharding    query      85
> 3     1          sharding    query      52
> 4     1          sharding    query      79
> 5     1          sharding    query      77
> 6     1          sharding    query      67
> 7     1          sharding    query      98
> 8     1          sharding  refresh    2932
> 9     1          sharding  refresh    2870
> 10    1          sharding  refresh    2877
> 11    1          sharding  refresh    2868
> 12    1       replication    query    2891
> 13    1       replication    query    2907
> 14    1       replication    query    2922
> 15    1       replication    query    2937
>
> HTH,
> Dennis
[R] unfold list (variable number of columns) into a data frame
Hello,

I used R a lot one year ago and now I am a bit rusty :) I have my raw data, which corresponds to the list of runtimes per minute (minutes "1" "2" "3", in two database modes "sharding" and "replication", and two workload types "query" and "refresh"), as a list of char arrays that looks like this:

> str(data)
List of 122
 $ : chr [1:163] "1" "sharding" "query" "607" "85" "52" "79" "77" "67" "98" ...
 $ : chr [1:313] "1" "sharding" "refresh" "2932" "2870" "2877" "2868" ...
 $ : chr [1:57] "1" "replication" "query" "2891" "2907" "2922" "2937" ...
 $ : chr [1:278] "1" "replication" "refresh" "79" "79" "89" "79" "89" "79" "79" "79" ...
 $ : chr [1:163] "2" "sharding" "query" "607" "85" "52" "79" "77" "67" "98" ...
 $ : chr [1:313] "2" "sharding" "refresh" "2932" "2870" "2877" "2868" ...
 $ : chr [1:57] "2" "replication" "query" "2891" "2907" "2922" "2937" ...
 $ : chr [1:278] "2" "replication" "refresh" "79" "79" "89" "79" "89" "79" "79" "79" ...
 $ : chr [1:163] "3" "sharding" "query" "607" "85" "52" "79" "77" "67" "98" ...
 $ : chr [1:313] "3" "sharding" "refresh" "2932" "2870" "2877" "2868" ...
 $ : chr [1:57] "3" "replication" "query" "2891" "2907" "2922" "2937" ...
 $ : chr [1:278] "3" "replication" "refresh" "79" "79" "89" "79" "89" "79" "79" "79" ...

I would like to transform the one above into a data frame where this structure is unfolded in the following way:

'data.frame': N obs. of 4 variables:
 $ time             : int 1 1 1 1 1 1 1 1 1 1 1 ...
 $ partitioning_mode: chr "sharding" "sharding" "sharding" "sharding" "sharding" ...
 $ workload         : chr "query" "query" "query" "query" "query" "query" "query" "refresh" "refresh" ...
 $ runtime          : num 607 85 52 79 77 67 98 2932 2870 2877 2868 ...

So instead of having an associative array (variable number of columns) it should become a simple list where the groups or factors are repeated for every occurrence of the specific runtime.
Basically my ultimate goal is to get a data frame structure that is "summaryBy"-friendly and "ggplot2"-friendly, i.e. using this data frame format. Help greatly appreciated!

TIA,
Best regards,
Giovanni
Re: [R] issue loading doBy library
Hi Josh,

Thank you for your feedback; after a lot of trial and error the problem is finally solved. To solve this problem, I tried, in this order:

1) uninstalling the two packages "Matrix" and "lme4" and reinstalling them.
2) uninstalling doBy and reinstalling it, with and without 1)
3) upgrading to the latest R version and re-doing 1) and 2)

And finally 4) wiped all R traces from my Mac OS X 10.7.2 and re-installed the latest version. The latest Matrix version seems to have changed in the meanwhile, so I was lucky and now it works.

It seems to me that the whole concept of dependency analysis for installing packages in R is broken ... it seems the packages depend only on the package name and not on specific versions, which is wrong: as in this case, chances are that a user will stay in this "live lock" where they will never find a "happy together" pair of Matrix and lme4 versions ... but well, you statisticians know better about chances :P

Thank you again.
Best regards,
Giovanni

On Oct 23, 2011, at 3:12 AM, Joshua Wiley wrote:
> Hi Giovanni,
>
> This is a dependency issue between lme4 and Matrix. There is substantial discussion of this on the R sig mixed models list. A simple update may fix the problem, or you may need to be a little bit more precise about getting versions of Matrix and lme4 that work with each other.
>
> HTH,
>
> Josh
[R] issue loading doBy library
Hello,

How can I fix this? I have the latest version of R 2.13.2 and I use Mac OS X 10.7.2

> library(doBy)
Loading required package: lme4
Error in dyn.load(file, DLLpath = DLLpath, ...) :
  function 'cholmod_l_start' not provided by package 'Matrix'
Error: package 'lme4' could not be loaded
> library(lme4)
Error in dyn.load(file, DLLpath = DLLpath, ...) :
  function 'cholmod_l_start' not provided by package 'Matrix'
Error: package/namespace load failed for 'lme4'

TIA,
Best regards,
Giovanni
[R] update and rebuild all?
Hello,

I upgraded my Mac R version to the newest 2.11.1, then ran the option to update all packages, but there was an error fetching one of them and the process stopped. I retried updating all packages but nothing happened. Although all my course project scripts work perfectly, is there a way, e.g. a command, to manually fetch (the most up-to-date versions) and locally rebuild all installed packages, to make sure everything is OK? I recall there was something like that but don't remember what command it was.

Thanks in advance,
Best regards,
Giovanni
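The command being half-remembered is probably `update.packages`; a sketch (the network-hitting calls are left commented out, since they need a configured CRAN mirror and a writable library):

```r
# Re-fetch and rebuild outdated packages; checkBuilt=TRUE also flags
# packages built under an older R version:
# update.packages(checkBuilt = TRUE, ask = FALSE)

# Heavier-handed alternative: reinstall every installed package
pkgs <- rownames(installed.packages())
# install.packages(pkgs)
head(pkgs)
```

`installed.packages()` is also a quick way to audit which library paths and versions are actually in use after an upgrade.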
Re: [R] plot for linear discriminant
Hello Hadley, Thank you very much for your help! I have just received your book btw :) On May 16, 2010, at 6:16 PM, Hadley Wickham wrote: >Hi Giovanni, > >Have a look at the classifly package for an alternative approach that >works for all classification algorithms. If you provided a small >reproducible example, I could step through it for you. > >Hadley Please find below a self contained example. I managed to complete the task using the graphics package. I would be curious to see how to get one of those really nice ggplot2 graphs with decision boundaries and class regions :) Thank you! Best regards, Giovanni # = # (1) Generate sample labelled data # = rm(list=ls()) # clear workspace library(mvtnorm) # needed for rmvnorm set.seed(11) # predictability of results sigma <- cbind(c(0.5, 0.3), c(0.3, 0.5)) # true covariance matrix mu <- matrix(0,nrow=3,ncol=2) mu[1,] <- c(3, 1.5) # true mean vectors mu[2,] <- c(4, 4) mu[3,] <- c(8.5, 2) x <- matrix(0, nrow = 300, ncol = 3) x[,3] <- rep(1:3, each = 100) # class labels x[1 :100,1:2] <- rmvnorm(n = 100, mean = mu[1,], sigma = sigma) # simulate data x[101:200,1:2] <- rmvnorm(n = 100, mean = mu[2,], sigma = sigma) x[201:300,1:2] <- rmvnorm(n = 100, mean = mu[3,], sigma = sigma) # = # (2) Plot the labelled data # = ## ## Function for plotting the data separated by classes, hacked out of predplot: ## http://stat.ethz.ch/teaching/lectures/FS_2010/CompStat/predplot.R ## plotclasses <- function(x, main = "", len = 200, ...) 
{ xp <- seq(min(x[,1]), max(x[,1]), length=len) yp <- seq(min(x[,2]), max(x[,2]), length=len) grid <- expand.grid(xp, yp) colnames(grid) <- colnames(x)[-3] plot(x[,1],x[,2],col=x[,3],pch=x[,3],main=main,xlab='x_1',ylab='x_2') text(2.5,4.8,"Class 1",cex=.8) # class 1 text(4.2,1.0,"Class 2",cex=.8) # class 2 text(8.0,0.5,"Class 3",cex=.8) # class 3 } plotclasses(x) # = # (3) Functions needed: calculate separating hyperplane between two given # classes and converting hyperplanes to line equations for the p=2 case # = ## ## Returns the coefficients for the hyperplane that separates one class from another. ## Computes the coefficients according to the formula: ## $x^T\hat{\Sigma}^{-1}(\hat{\mu}_0-\hat{\mu}_1) - \frac{1}{2}(\hat{\mu}_0 + ## \hat{\mu}_1)^T\hat{\Sigma}^{-1}(\hat{\mu}_0-\hat{\mu}_1)+\log(\frac{p_0}{p_1})$ ## ## sigmainv(DxD) - precalculated sigma (covariance matrix) inverse ## mu1(1xD) - precalculated mu mean for class 1 ## mu2(1xD) - precalculated mu mean for class 2 ## prior1 - precalculated prior probability for class 1 ## prior2 - precalculated prior probability for class 2 ## ownldahyperplane <- function(sigmainv,mu1,mu2,prior1,prior2) { J <- nrow(mu) # number of classes b <- sigmainv%*%(mu1 - mu2) c <- -(1/2)*t(mu1 + mu2)%*%sigmainv%*%(mu1 - mu2) + log(prior1/prior2) return(list(b=b,c=c)) } ## ## Returns linear betas (intersect and slopes) for the given hyperplane structure. ## The structure is a list that matches the output of the function defined above. ## ownlinearize <- function(sephyp) { return(list(beta0=-sephyp$c/sephyp$b[2], # line slope and intersect beta1=-sephyp$b[1]/sephyp$b[2])) } # = # (4) Run lda # = library(MASS) # needed for lda/qda # read in a function
[R] abline limit constrain x-range how?
Hello,

I managed to "linearize" my LDA decision boundaries; now I would like to call abline three times but be able to specify the exact x range. I was reading the doc but it doesn't seem to support this use case; are there alternatives? The reason why I use abline is that I first call plot to plot all three datasets and then call abline to "append" these decision boundary lines to the existing plot ...

Can anyone please advise?

Thanks in advance,
Best regards,
Giovanni
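abline always spans the whole plot region; `segments` (or `lines`) is the usual base-graphics alternative when a line should cover only a given x range, and like abline it appends to the existing plot. A sketch with made-up intercept/slope values:

```r
# Empty plot standing in for the scatter of the three classes
plot(c(0, 10), c(0, 10), type = "n", xlab = "x1", ylab = "x2")

beta0 <- 1; beta1 <- 0.8   # intercept/slope of a linearized boundary (made up)
x0 <- 2; x1 <- 6           # draw the boundary only over [2, 6]
segments(x0, beta0 + beta1 * x0,
         x1, beta0 + beta1 * x1, col = "red")
```

Computing the pairwise boundary intersections first gives the x0/x1 endpoints at which each segment should stop.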
[R] plot for linear discriminant
Hello,

I have a labelled dataset with three classes. I have computed manually the LDA hyperplanes that separate the classes from each other, i.e. \hat{\delta}_j(x) = x^T b_j + c_j where b_j \in \mathbb{R}^p and c_j \in \mathbb{R}. My concrete b_j looks like e.g.

b_j <- rbind(1,2)
c_j <- 3

How can I plot y = x^T b_j + c_j? Two problems:

1- I need lines, and the dimension of my x is 2
2- I would like the plotted lines to end when they intersect so they nicely show the decision boundaries

Any pointers? Maybe an example with ggplot2; I could not find any in the showcase documentation page ...

Thanks in advance,
Best regards,
Giovanni
Re: [R] plot formula 'x' is missing?
Hi Jorge and Dennis,

Thank you for the hint! However, I'm still very intrigued as to why it does not work using plot ... what is special about this specific formula that plot doesn't like?

Best regards,
Giovanni

On May 15, 2010, at 7:12 AM, Jorge Ivan Velez wrote:
> Hi Giovanni,
>
> curve(1/(1+exp(5.0993-0.1084*x)), 0, 100)
>
> HTH,
> Jorge
>
>
> On Sat, May 15, 2010 at 12:43 AM, Giovanni Azua <> wrote:
> Hello,
>
> I'd like to plot the logistic function for a specific model like this:
>
> > plot(formula=y~1/(1+exp(5.0993-0.1084*x)),data=data.frame(x=seq(0,100,length.out=1000)))
> Error in is.function(x) : 'x' is missing
>
> However, I get the 'x' is missing error above and don't know how to fix it ...
>
> Can anyone advise?
> Thanks in advance,
> Best regards,
> Giovanni
[R] plot formula 'x' is missing?
Hello,

I'd like to plot the logistic function for a specific model like this:

> plot(formula=y~1/(1+exp(5.0993-0.1084*x)),data=data.frame(x=seq(0,100,length.out=1000)))
Error in is.function(x) : 'x' is missing

However, I get the 'x' is missing error above and don't know how to fix it ...

Can anyone advise?
Thanks in advance,
Best regards,
Giovanni
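The formula interface of plot expects both sides of `y ~ x` to be columns resolvable in `data`; it does not evaluate the right-hand side as a definition of y, hence the cryptic error. Besides `curve`, computing y explicitly works:

```r
# Evaluate the logistic function on a grid and plot it directly
x <- seq(0, 100, length.out = 1000)
y <- 1 / (1 + exp(5.0993 - 0.1084 * x))
plot(x, y, type = "l", xlab = "x", ylab = "y")
```

`curve(1/(1+exp(5.0993-0.1084*x)), 0, 100)` is the one-liner equivalent.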
Re: [R] plot with no default axis labels
Hello Jim,

Very nice example! Thank you!

Best regards,
Giovanni

On May 14, 2010, at 11:50 AM, Jim Lemon wrote:
> On 05/14/2010 07:31 PM, Giovanni Azua wrote:
>> Hello,
>>
>> I could not find an easy way to have the plot function not display the default x and y-axis labels; I would like to customize it to show only points of interest ... I would like to:
>>
>> 1- call plot so that it shows no x-axis and y-axis labels
>> 2- call axis specifying the exact points of interest for the x and y-axis
>>
>> Maybe they can both be achieved in the plot function call but I can't find the right way to do it ...
>>
> Hi Giovanni
> Let's get creative:
>
> # making up data can be fun!
> mydata <- sort(runif(20))
> plot(mydata, main="A most creative plot",
>      xlab="", ylab="", type="b", axes=FALSE)
> box()
> axis(1, at=c(1,7,9,14,17))
> require(plotrix)
> staxlab(2, at=c(0,0.3279614,0.4431966,0.729113,0.937461))
>
> Jim
>
Re: [R] plot with no default axis labels
Hello,

I found the answer here: http://www.statmethods.net/advgraphs/axes.html

Basically:

plot(..., axes=FALSE, ...) ## avoids default axis labels

Best regards,
Giovanni

On May 14, 2010, at 11:31 AM, Giovanni Azua wrote:
> Hello,
>
> I could not find an easy way to have the plot function not display the default x and y-axis labels; I would like to customize it to show only points of interest ... I would like to:
>
> 1- call plot so that it shows no x-axis and y-axis labels
> 2- call axis specifying the exact points of interest for the x and y-axis
>
> Maybe they can both be achieved in the plot function call but I can't find the right way to do it ...
>
> Thanks in advance,
> Best regards,
> Giovanni
[R] plot with no default axis labels
Hello,

I could not find an easy way to have the plot function not display the default x and y-axis labels; I would like to customize it to show only points of interest ... I would like to:

1- call plot so that it shows no x-axis and y-axis labels
2- call axis specifying the exact points of interest for the x and y-axis

Maybe they can both be achieved in the plot function call but I can't find the right way to do it ...

Thanks in advance,
Best regards,
Giovanni
Re: [R] ggplot2's geom_errorbar legend
Hello Ista, On May 1, 2010, at 8:37 PM, Ista Zahn wrote: > Hi Giovanni, > A reproducible example would help. Also, since I think this will be > tricky, it might be a good idea to post it to the ggplot2 mailing list > (you can register at http://had.co.nz/ggplot2/ ). > > Best, > Ista First, thank you so much for showing me the way ... the plots generated from ggplot2 are super nice! Please find below a self contained reproducible example nearly exactly as the ones I need to produce. Thanks in advance, Best regards, Giovanni rm(list=ls()) # clear workspace library(ggplot2) # use ggplot2 library methodlabels <- c("Classic", "Own Bootstrap", # prepare plot labels "R Bootstrap") errortypelabels <- c("'Normal'", "'Student-t'",# prepare plot labels "'Exponential'") betalabels <- c("'Beta_1'", "'Beta_2'", "'Beta_3'")# prepare plot labels betas <- c(1, -2, 3) # my real parameters set.seed(11) S <- 100 B <- 50 P <- length(betas) classiccis <- owncis <- rbootcis <- array(data=0,dim=c(S,1,3,2)) for (p in 1:P) { classiccis[,1,p,1] <- rnorm(100,mean=betas[p]-1.0,sd=0.5) classiccis[,1,p,2] <- rnorm(100,mean=betas[p]+1.0,sd=0.5) owncis[,1,p,1] <- rnorm(100,mean=betas[p]-1.0,sd=0.5) owncis[,1,p,2] <- rnorm(100,mean=betas[p]+1.0,sd=0.5) rbootcis[,1,p,1] <- rnorm(100,mean=betas[p]-1.0,sd=0.5) rbootcis[,1,p,2] <- rnorm(100,mean=betas[p]+1.0,sd=0.5) } ## ## Function that generates the ggplot of the CI for a given parameter ## and error type ## ciplot <- function() { x <- 1:S y <- rep(betas[p],S) data <- data.frame(x,y) classiclimits <- aes(x=x[1:100],ymax = classiccis[1:100,e,p,1], ymin=classiccis[1:100,e,p,2]) ownlimits <- aes(x=x[1:100]+0.4,ymax = owncis[1:100,e,p,1], ymin=owncis[1:100,e,p,2]) rbootlimits <- aes(x=x[1:100]+0.8,ymax = rbootcis[1:100,e,p,1], ymin=rbootcis[1:100,e,p,2]) g1 <- ggplot(data, aes(x=x, y=y)) g1 + geom_errorbar(classiclimits, colour="red") + geom_errorbar( ownlimits, colour="green") + geom_errorbar(rbootlimits, colour="blue") + geom_hline(yintercept = 
betas[p]) +
    xlab("Simulation") + ylab(betalabels[p]) +
    opts(title = paste("CI for", errortypelabels[e], "error,", betalabels[p], ",", S/5,
                       "simulations and", B, "bootstrapped samples"))
}

##
## Execute one by one ... note they have to be global,
## otherwise it seems that ggplot won't find them.
##
## p - parameter index
## e - error type index
##
p=1;e=1;ciplot()
p=1;e=2;ciplot()
p=1;e=3;ciplot()
p=2;e=1;ciplot()
p=2;e=2;ciplot()
p=2;e=3;ciplot()
p=3;e=1;ciplot()
p=3;e=2;ciplot()
p=3;e=3;ciplot()
[R] cbind and automatic type conversion
Hello,

I have three method types and 100 generalization errors for each, all in the range [0.65, 0.81]. I would like to make a stacked histogram plot using ggplot2 with this data. Therefore I need a data frame of the form e.g.

Method           GE
---------------  ----
"Classic"        0.76
"Classic"        0.79
"Own Bootstrap"  0.81
"Own Bootstrap"  0.79
"R Bootstrap"    0.71
"R Bootstrap"    0.75

So I combine the data in the following way:

normalerrors <- rbind(cbind(rep("Classic",S), classicge[,1]),
                      cbind(rep("Own Bootstrap",S), ownge[,1]),
                      cbind(rep("R Bootstrap",S), rbootge[,1]))
normalerrors <- data.frame(method=factor(normalerrors[,1]), ge=normalerrors[,2])

But doing it this way my GE coefficients get automatically converted to string type ... how can I avoid this conversion when doing the cbind?

TIA,
Best regards,
Giovanni
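The coercion happens because cbind on mixed types builds a character matrix; building the data frame directly avoids it entirely. A sketch with made-up stand-ins for classicge/ownge/rbootge (plain vectors here rather than matrix columns):

```r
# Toy stand-ins for the three error vectors (S = 100 in the real data)
S <- 3
classicge <- c(0.76, 0.79, 0.81)
ownge     <- c(0.81, 0.79, 0.70)
rbootge   <- c(0.71, 0.75, 0.72)

# Build the data frame directly: each column keeps its own type
normalerrors <- data.frame(
  method = factor(rep(c("Classic", "Own Bootstrap", "R Bootstrap"), each = S)),
  ge     = c(classicge, ownge, rbootge))
str(normalerrors$ge)  # num, not chr
```

data.frame, unlike a matrix, holds columns of different types, so there is no common-type promotion.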
Re: [R] closest match in R to c-like struct?
On May 1, 2010, at 6:48 PM, steven mosher wrote:
> I was talking with another guy on the list about this very topic.
>
> A simple example would help.
>
> First a sample C struct, and then how one would do the equivalent in R.
>
> In the end I suppose one would want an 'array' of these structs, or a list of the structs.

Or, as in my use case: I needed a C-like struct to define the type for aggregating the data to return from a function.

Best regards,
Giovanni
Re: [R] closest match in R to c-like struct?
On May 1, 2010, at 5:04 PM, (Ted Harding) wrote:
> Well, 'list' must be pretty close! The main difference would be
> that in C the structure type would be declared first, and then
> applied to create an object with that structure, whereas R
> lists are created straight off. If you want to set up a generic
> list type for a certain purpose, you would wrap its definition
> in a function.
>
> Another difference is that R lacks the "pointer" type, so that
> R's "mylist$component" is the equivalent of C's "mylist.component";
> I don't think you can do the equivalent in R of C's "mylist->component"
> (though I'm likely to be wrong about that, and to be promptly corrected)!
>
> Hoping this helps,
> Ted.

Thank you all for your help! This list is really great ... I mean the mailing list :)

Best regards,
Giovanni
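A runnable sketch of Ted's suggestion: a named list as the struct, with a constructor function playing the role of the struct declaration. The `point` type and its fields are illustrative, not from the thread.

```r
# A C-like struct via a named list; the constructor function is the
# closest analogue to declaring the struct type first.
point <- function(x, y, label = "") {
  list(x = x, y = y, label = label)
}

p <- point(1, 2, "origin-ish")
p$x       # field access, analogous to C's p.x
p$label

# An "array of structs" is then simply a list of such lists:
points <- list(point(0, 0), point(3, 4))
points[[2]]$y
```

Fields can hold vectors of different lengths, which is the limitation of `data.frame` the original question ran into.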
[R] bootstrap generalization error
Hello,

I use the following function "bootstrapge" to calculate (and compare) the generalization error of several bootstrap implementations:

##
## Calculates and returns a coefficient corresponding to the generalization
## error. The formula for the bootstrap generalization error is:
## $N^{-1}\sum_{i=1}^N B^{-1}\sum_{j=1}^B |y_i - (\beta_n^{*j})^T x_i|$
##
## x          - mxn matrix where m is the number of model parameters and
##              n is the number of observations
## y          - n column-vector containing true values
## theta_star - Bxm matrix where B is the number of bootstrapped samples
##              and m is the number of model parameters
##
bootstrapge <- function(x, y, theta_star) {
  B <- nrow(theta_star)
  t <- 0
  for (b in 1:B) {
    t <- t + abs(y - rbind(theta_star[b,]) %*% x)
  }
  return(mean(t / B))
}

Is there a nicer/faster way to accomplish the same using implicit loop functions, e.g. apply, sapply etc.? I could not figure it out.

Is there a way to get a similar coefficient using the boot library? I could not find any way to get such a "generalization error", so I cannot compare my implementation against it.

TIA,
Best regards,
Giovanni
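One way to drop the `b`-loop entirely: the B predictions are a single matrix product `theta_star %*% x`, after which the double average is one `mean()`. This is a sketch with simulated inputs, checked against the loop version from the post.

```r
# Vectorised version: theta_star (B x m) %*% x (m x n) gives all B
# prediction vectors at once; sweep() subtracts y column-wise.
bootstrapge_vec <- function(x, y, theta_star) {
  preds <- theta_star %*% x        # B x n matrix of fitted values
  mean(abs(sweep(preds, 2, y)))    # mean over b of |pred_bi - y_i|, then over i
}

# Loop version from the post, for comparison
bootstrapge <- function(x, y, theta_star) {
  B <- nrow(theta_star)
  t <- 0
  for (b in 1:B) t <- t + abs(y - rbind(theta_star[b, ]) %*% x)
  mean(t / B)
}

# Simulated stand-ins for the real data
set.seed(42)
m <- 3; n <- 50; B <- 200
x <- matrix(rnorm(m * n), m, n)
y <- rnorm(n)
theta_star <- matrix(rnorm(B * m), B, m)

all.equal(bootstrapge(x, y, theta_star), bootstrapge_vec(x, y, theta_star))
```

Because `mean()` of the full B x n error matrix equals the nested average, the two agree to numerical precision while avoiding B separate matrix products.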
[R] closest match in R to c-like struct?
Hello,

What would be in R the closest match to a C struct? E.g. data.frame requires all elements to be of the same length ... or is there a way to circumvent this?

TIA,
Best regards,
Giovanni
[R] ggplot2's geom_errorbar legend
Hello,

I create a simple ggplot that only shows a straight line. I then add three datasets of CIs using the geom_errorbar function. The problem is that I can't find any way to have the legend show up ... I need to show which method each CI color corresponds to. Can anyone advise, please?

TIA,
Best regards,
Giovanni
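A sketch of the standard ggplot2 answer: legends appear automatically when colour is *mapped* inside `aes()` to a grouping variable, instead of being set as a fixed colour per layer. That means reshaping the three CI datasets into one long-format data frame with a `method` column (the data below is simulated; the ggplot2 call is guarded so the data-reshaping part runs even without the package).

```r
# Long-format data: one row per (simulation, method) with CI bounds.
# Simulated stand-ins for the three methods in the post.
set.seed(7)
S <- 20
d <- data.frame(
  x      = rep(1:S, times = 3),
  method = rep(c("Classic", "Own bootstrap", "R bootstrap"), each = S),
  lower  = rnorm(3 * S, mean = -1, sd = 0.3),
  upper  = rnorm(3 * S, mean =  1, sd = 0.3)
)

# Mapping colour = method inside aes() makes ggplot2 draw the legend itself.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(d, aes(x = x, ymin = lower, ymax = upper, colour = method)) +
    geom_errorbar(position = position_dodge(width = 0.4))
}
```

Setting `colour = "red"` as a layer argument (as in the earlier posts) bypasses the scale machinery, which is exactly why no legend shows up.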
Re: [R] apply question
Hello David,

On Apr 30, 2010, at 11:00 PM, David Winsemius wrote:
> Note: Loops may be just as fast or faster than apply calls.

How come!? Is this also true for the other similar functions lapply, tapply and sapply? Is the only advantage of these then syntactic sugar?

>> indices <- replicate(B, sample(1:N, size=N, replace=TRUE))
>> theta_star <- apply(indices, 2, statistic, data)
>
> Why not:
>
> theta_star <- apply(indices, 2, function(x) statistic(data, x))
>
> May end up as a data class that needs further work, depending on what
> "statistic" returns.

Nice! Thank you!

Best regards,
Giovanni
[R] apply question
Hello,

I have a bootstrap implementation loop that I would like to replace by a faster "batch operator". The loop is as follows:

for (b in 1:B) {
  indices <- sample(1:N, size=N, replace=TRUE)  # sample N elements with replacement
  theta_star[b,] <- statistic(data, indices)    # execute statistic callback function
}

I would like to replace this by something like (which does not work):

indices <- replicate(B, sample(1:N, size=N, replace=TRUE))
theta_star <- apply(indices, 2, statistic, data)

This does not work because apply passes each column of indices as the first argument, whereas my statistic function expects the indices as its second parameter. While I could change the signature of my statistic function, I would like to know how to achieve this without having to do so.

TIA,
Best regards,
Giovanni
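The argument-order mismatch can be fixed with an anonymous wrapper, as David's reply above suggests. A runnable sketch, using `mean` of the resampled values as a hypothetical stand-in for the real `statistic` callback:

```r
# Hypothetical data and statistic; the wrapper swaps the argument order
# so "statistic" keeps its (data, indices) signature.
set.seed(123)
N <- 100; B <- 500
data <- rnorm(N)
statistic <- function(data, indices) mean(data[indices])

indices <- replicate(B, sample(1:N, size = N, replace = TRUE))  # N x B matrix
theta_star <- apply(indices, 2, function(idx) statistic(data, idx))
# theta_star is now one bootstrap statistic per column of indices
```

Each column of `indices` becomes `idx` inside the anonymous function, which then calls `statistic(data, idx)` in the expected order.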
[R] ggplot2 legend how?
Hello,

I have just ordered "ggplot2: Elegant Graphics for Data Analysis (Use R)", but while it arrives :) can anyone please show me how to set up and add a simple legend to a ggplot? This is my use case: I need a legend showing the CIs "Classic", "Own bootstrap" and "R bootstrap":

library(ggplot2)
e <- 1
p <- 1
x <- 1:S
y <- rep(betas[p], S)
data <- data.frame(x, y)
classiclimits <- aes(x=x,     ymax=classic[,e,p,1], ymin=classic[,e,p,2])
ownlimits     <- aes(x=x+0.4, ymax=own[,e,p,1],     ymin=own[,e,p,2])
rbootlimits   <- aes(x=x+0.8, ymax=rboot[,e,p,1],   ymin=rboot[,e,p,2])
g1 <- ggplot(data, aes(x, y))
g1 + geom_errorbar(classiclimits, colour="red") +
     geom_errorbar(ownlimits, colour="green") +
     geom_errorbar(rbootlimits, colour="blue") +
     geom_hline(yintercept=betas[p]) +
     xlab("Simulation") + ylab("beta_1") +
     opts(title="CI for error 'normal' and beta_1") +
     opts(legend.position=c(10, 2.5))

Many thanks in advance,
Best regards,
Giovanni
Re: [R] SOLVED plotting multiple CIs
Hello,

After installing gfortran from http://r.research.att.com/gfortran-4.2.3.dmg it finally works! See the (abridged) build output below. Thank you all.

@Ista Zahn: Looks fantastic! :) Thank you so much! ... Is there a way to have a small circle on the true value?

Best regards,
Giovanni

> install.packages("Hmisc", dependencies=TRUE, type="source")
trying URL 'http://cran.ch.r-project.org/src/contrib/Hmisc_3.7-0.tar.gz'
Content type 'application/x-gzip' length 563054 bytes (549 Kb)
opened URL
==
downloaded 549 Kb

* installing *source* package 'Hmisc' ...
** libs
** arch - i386
gfortran -arch i386 -fPIC -g -O2 -c cidxcn.f -o cidxcn.o
[... gcc/gfortran compile and link lines for the remaining i386 sources ...]
** arch - x86_64
[... the same compile and link lines repeated for the x86_64 architecture ...]
** R
** inst
** preparing package for lazy loading
Loading required package: splines
** help
*** installing help indices
** building package indices ...
** testing if installed package can be loaded
* DONE (Hmisc)
Re: [R] plotting multiple CIs
Hello David, On Apr 30, 2010, at 6:00 PM, David Winsemius wrote: > Looks like you do not have the RTools bundle and perhaps not the XCode > framework either? > > I am not suggesting that you do so, since it appears you are not conversant > with compiling source code packages. If I am wrong about that point, then > this is where you would get the gfortran compiler that Simon uses: > > http://r.research.att.com/tools/ > > My advice would be to go back to R.2.10.1 if you need Rcmdr. > I do have Xcode latest but not RTools. This below was my last failed attempt. Best regards, Giovanni > install.packages("Hmisc", dependencies=TRUE, type ="source") trying URL 'http://cran.ch.r-project.org/src/contrib/Hmisc_3.7-0.tar.gz' Content type 'application/x-gzip' length 563054 bytes (549 Kb) opened URL == downloaded 549 Kb * installing *source* package Hmisc ... ** libs ** arch - i386 gcc -arch i386 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include -I/Library/Frameworks/R.framework/Resources/include/i386 -I/usr/local/include -fPIC -g -O2 -c Hmisc.c -o Hmisc.o gfortran -arch i386 -fPIC -g -O2 -c cidxcn.f -o cidxcn.o make: gfortran: No such file or directory make: *** [cidxcn.o] Error 1 ERROR: compilation failed for package Hmisc * removing /Library/Frameworks/R.framework/Versions/2.11/Resources/library/Hmisc The downloaded packages are in /private/var/folders/5m/5mC0wYCSGMK8NbPZYhBEZE+++TI/-Tmp-/Rtmpz0R5nY/downloaded_packages Updating HTML index of packages in '.Library' Warning message: In install.packages("Hmisc", dependencies = TRUE, type = "source") : installation of package 'Hmisc' had non-zero exit status > [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] plotting multiple CIs
Hello Ista,

On Apr 30, 2010, at 4:35 PM, Ista Zahn wrote:
> Hi Giovanni,
> I think the ggplot2 package might help you out here. Do you want
> something like this?

Thank you for your suggestion; however, I could not give it a try since I ran into the same issue being reported about the Hmisc package:

> install.packages("Hmisc", dependencies=TRUE)
Warning message:
In getDependencies(pkgs, dependencies, available, lib) :
  package 'Hmisc' is not available
> library(ggplot2)
> library(Hmisc)
Error in library(Hmisc) : there is no package called 'Hmisc'
> data(mtcars)
> p <- ggplot(mtcars, aes(x=cyl, y=mpg))
> p + stat_summary(fun.data = "mean_cl_boot", colour = "red", geom =
+   "errorbar") + stat_summary(fun.data = "mean_cl_normal", colour =
+   "blue", geom = "errorbar")
Error: Hmisc package required for this functionality. Please install and try again.

I am running the latest Mac OS X Snow Leopard and also have the latest R GUI for Mac. How can Hmisc be installed manually?

Best regards,
Giovanni
[R] plotting multiple CIs
Hello,

I need to plot multiple confidence intervals for the same model parameter, e.g. for the same parameter value at point x_1 I would like to see four different confidence intervals so that I can compare their accuracy: boot basic vs. normal vs. my own vs. the classic lm CI, etc.

I like very much the plotCI implemented here: http://cran.r-project.org/web/packages/plotrix/index.html

This is their example:

library(plotrix)
y <- runif(10)
err <- runif(10)
plotCI(x=1:10, y=y, uiw=err, liw=2*err, lwd=2, col="red", scol="blue",
       main="Add colors to the points and error bars")

But it does not seem to support plotting multiple CIs, only one per point. Is there a similar library somewhere that can plot multiple CIs per point? I know there is the function "segments" but I like this one above more :)

Thanks in advance,
Best regards,
Giovanni
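One package-free workaround is to offset the x positions slightly and draw each method's error bars with base `arrows()` (which is what most CI-plotting helpers use underneath). This is a sketch with simulated half-widths; it writes to a temporary PDF so it also runs non-interactively.

```r
# Three hypothetical CI half-widths per point, drawn side by side
set.seed(5)
x <- 1:10
y <- runif(10)
ci1 <- runif(10, 0.1, 0.3)
ci2 <- runif(10, 0.1, 0.3)
ci3 <- runif(10, 0.1, 0.3)

f <- tempfile(fileext = ".pdf")
pdf(f)
plot(x, y, ylim = c(-0.5, 1.5), pch = 19, main = "Three CIs per point")
# angle = 90, code = 3 turns arrows() into flat-capped error bars
arrows(x - 0.15, y - ci1, x - 0.15, y + ci1, angle = 90, code = 3,
       length = 0.03, col = "red")
arrows(x,        y - ci2, x,        y + ci2, angle = 90, code = 3,
       length = 0.03, col = "blue")
arrows(x + 0.15, y - ci3, x + 0.15, y + ci3, angle = 90, code = 3,
       length = 0.03, col = "darkgreen")
dev.off()
```

The same offset trick works with repeated `plotCI(..., add=TRUE)` calls if you prefer plotrix's styling.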
Re: [R] function pointer question
Hello Jan,

On Apr 26, 2010, at 8:56 AM, Jan van der Laan wrote:
> You can use the '...' for that, as in:
>
> loocv <- function(data, fnc, ...) {
>   n <- length(data.x)
>   score <- 0
>   for (i in 1:n) {
>     x_i <- data.x[-i]
>     y_i <- data.y[-i]
>     yhat <- fnc(x=x_i, y=y_i, ...)
>     score <- score + (y_i - yhat)^2
>   }
>   score <- score/n
>   return(score)
> }
>
> scoreks <- loocv(data, gaussiankernel, h=0.5)

It works nicely, thank you!

Best regards,
Giovanni
Re: [R] function pointer question
The beauty of trial and error ... if I leave the non x, y parameters i.e. h as global parameters rather than formal parameters for gaussiankernel it works fine basically I don't pass anymore h=0.5 to gaussiankernel but consume it from a global variable. Ugly but works ... Best regards, Giovanni On Apr 26, 2010, at 1:38 AM, Giovanni Azua wrote: > Hello, > > I have the following function that receives a "function pointer" formal > parameter name "fnc": > > loocv <- function(data, fnc) { > n <- length(data.x) > score <- 0 > for (i in 1:n) { > x_i <- data.x[-i] > y_i <- data.y[-i] > yhat <- fnc(x=x_i,y=y_i) > score <- score + (y_i - yhat)^2 > } > score <- score/n > return(score) > } > > I would like to use it like this: > > ## > ## Estimator function using Gaussian Kernel > ## > gaussiankernel <- function(x,y,h) { > modelks <- ksmooth(x,y,kernel="normal",bandwidth=h,x.points=x) > yhat <- modelks$y > return(yhat) > } > > scoreks <- loocv(data,gaussiankernel(h=0.5)) > > I expected this to work but it doesn't :( basically I wanted to take > advantage of the named parameters so I could pass the partially specified > function parameter "gaussiankernel" to loocv specifying only the h parameter > and then let loocv specify the remaining parameters as needed ... can this be > tweaked to work? The idea is to have loocv generic so it can work for any > estimator implementation ... > > I have more than 6 books now in R and none explains this important concept. > > Thanks in advance, > Best regards, > Giovanni > [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
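A cleaner alternative to the global-variable workaround above is a closure: a factory function that fixes `h` and returns a function of `x` and `y` only, i.e. the partial application the original post was looking for. A sketch with simulated data:

```r
# Closure that fixes the bandwidth h and returns an (x, y) -> yhat estimator
make_gaussiankernel <- function(h) {
  function(x, y) {
    modelks <- ksmooth(x, y, kernel = "normal", bandwidth = h, x.points = x)
    modelks$y
  }
}

# loocv can then call fnc(x=..., y=...) exactly as written in the post:
#   scoreks <- loocv(data, make_gaussiankernel(h = 0.5))
set.seed(1)
x <- sort(runif(50))
y <- sin(2 * pi * x) + rnorm(50, sd = 0.1)
fnc <- make_gaussiankernel(h = 0.5)
yhat <- fnc(x, y)   # smoothed values at the sorted x points
```

The returned function carries `h` in its enclosing environment, so nothing leaks into the global workspace and different bandwidths can coexist.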
[R] function pointer question
Hello,

I have the following function, which receives a "function pointer" as the formal parameter "fnc":

loocv <- function(data, fnc) {
  n <- length(data.x)
  score <- 0
  for (i in 1:n) {
    x_i <- data.x[-i]
    y_i <- data.y[-i]
    yhat <- fnc(x=x_i, y=y_i)
    score <- score + (y_i - yhat)^2
  }
  score <- score/n
  return(score)
}

I would like to use it like this:

##
## Estimator function using a Gaussian kernel
##
gaussiankernel <- function(x, y, h) {
  modelks <- ksmooth(x, y, kernel="normal", bandwidth=h, x.points=x)
  yhat <- modelks$y
  return(yhat)
}

scoreks <- loocv(data, gaussiankernel(h=0.5))

I expected this to work but it doesn't :( Basically I wanted to take advantage of named parameters so I could pass the partially specified function "gaussiankernel" to loocv, specifying only the h parameter, and then let loocv supply the remaining parameters as needed. Can this be tweaked to work? The idea is to keep loocv generic so it can work with any estimator implementation.

I now have more than six books on R and none explains this important concept.

Thanks in advance,
Best regards,
Giovanni
Re: [R] interpreting acf plot
Hello Dennis,

(1) I appreciate your feedback; however, I feel I have every right to ask a specific R-related question, namely how to interpret the acf function's plot. I gave away the fact that it is homework because people often ask for the context of a question before helping. If I don't understand something I will certainly ask. I have nothing to hide, so I don't care whether professors are subscribed to this list. My ultimate goal is to learn, and it doesn't really matter whether that happens by studying a book, asking an assistant or asking in a forum.

(2) After looking in many references and not finding any clue, I Googled for information and found that I should be "looking for cyclic patterns", i.e. oscillations? There are none in this dataset, so I presume there would not be any autocorrelation, right?

(3) This is very unfortunate ... the course lectures are great and the course script is very comprehensive; however, the assignments often include questions that are a bit off topic, like in this case of time series, and include no actual references ... so it is no surprise that even after diligently attending all lectures and doing all exercises I get stuck. Please recommend the best book on time series analysis, ideally with R; I will buy it.

(4) Yes, they mentioned something like this in the assignment: "Dependency can be verified by showing that under the model, Cov(X_t^2, X_{t-h}^2) \neq 0, h > 0 (complicated). Plot and interpret the autocorrelation functions of X_t and X_t^2 for the BMW dataset."
http://stat.ethz.ch/teaching/lectures/FS_2010/CompStat/series4.pdf

Thank you.
Best regards,
Giovanni

On Apr 17, 2010, at 7:25 PM, Dennis Murphy wrote:
> Hi:
>
> (1) If you read the posting guide (as you were asked to do before posting),
> you would find out rather quickly that this is not a forum to help you with homework.
> Moreover, several of your professors may subscribe to this list and notice
> your request.
> (2) What 'trend' is in this data set? It has excessive variation at certain
> points of the series, but what trend?
> (3) None of your cited references is likely to have much that describes
> what autocorrelation means. (The only exception might be HSAUR, but it
> focuses more on the programming than the concepts.) I'd consult an actual
> time series text to learn the concepts you need to make sense of the plots.
> (4) You can't 'prove' that the series in question is (or is not)
> autocorrelated with R or any other software; however, it can be used to
> provide empirical support for one hypothesis or the other. Proof is a
> mathematical construct involving deductive logic, whereas statistical
> inference uses inductive logic. They represent different approaches to
> problem solving.
>
> HTH,
> Dennis
[R] interpreting acf plot
Hello,

I am attending a course in Computational Statistics at ETH, and in one of the assignments I am asked to prove that a time series is not autocorrelated using the R function "acf". I tried the acf function with the given data; according to what I found here, http://landshape.org/enm/options-for-acf-in-r/, this test data does not look IID but rather shows some trends. So how can I then prove that it is not autocorrelated? Or maybe the trends are ok?

I have bought several titles on R but none really explains autocorrelation or how to interpret the acf function; the integrated help is also a bit dry. These are the books I have on R:

- Introductory Statistics with R (Springer)
- A Handbook of Statistical Analyses Using R (CRC)
- R in a Nutshell (O'Reilly)
- Statistical Computing with R (CRC)

Thanks in advance,
Best regards,
Giovanni

# =====================================================================
# Computational Statistics
# Series 4
# Author: Giovanni Azua
# Date: 16 April 2010
# =====================================================================
rm(list=ls())  # clear workspace

# =====================================================================
# EXERCISE 1.(c)
# =====================================================================
# load dataset from web
#bmwlr <- scan("http://stat.ethz.ch/Teaching/Datasets/bmw.dat")
# load dataset from file
bmwlr <- scan("/Users/bravegag/code/compstats/bmw.dat")

par(mfrow=c(1,2))  # visualize two plots
acf(bmwlr, lag.max = 10)
acf(bmwlr^2, lag.max = 10)
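For calibrating what "no autocorrelation" looks like in an acf plot, it helps to compare against known processes. This illustrative sketch (not the BMW data) contrasts i.i.d. noise, whose sample acf at lags >= 1 should stay inside the +/- 1.96/sqrt(n) band acf() draws, with an AR(1) series, whose acf decays geometrically:

```r
# White noise vs AR(1): extract acf values numerically with plot = FALSE
set.seed(10)
n <- 1000
wn  <- rnorm(n)
ar1 <- arima.sim(model = list(ar = 0.7), n = n)

a_wn  <- acf(wn,  lag.max = 10, plot = FALSE)
a_ar1 <- acf(ar1, lag.max = 10, plot = FALSE)

band <- 1.96 / sqrt(n)     # the dashed confidence lines acf() draws
max(abs(a_wn$acf[-1]))     # white noise: small, roughly within the band
a_ar1$acf[2]               # AR(1): large spike near 0.7 at lag 1
```

Note `a$acf[1]` is lag 0 (always 1), so lag 1 is the second element. A few white-noise lags poking slightly past the band are expected by chance (about 1 in 20).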
Re: [R] create space for a matrix
Hi Leo,

See the matrix function, e.g.

m <- matrix(0, nrow=1, ncol=3)

Then you can use functions like rbind or cbind to create bigger ones. I am a newbie, so double check everything :)

HTH,
Best regards,
Giovanni

On Mar 29, 2010, at 8:37 AM, leobon wrote:
>
> Hello all,
> I want to create space for a matrix.
> For example, I have a matrix with the dimensions 1 (row) * 3 (col), which
> is (2,3,4). Then I want to put this matrix into a new and larger matrix.
> The dimension of the new matrix is 3*3. So what I want is to create a new
> larger matrix which allows me to put each smaller matrix with the same
> columns into it.
> I know 0*(1:3) is the way to create space for 3 numbers. Could anybody
> tell me how to create space for a matrix?
> Appreciate very much your help!
>
> Leo
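A runnable sketch of the preallocate-then-fill idiom for Leo's 3x3 case (filler rows beyond his (2,3,4) example are made up): allocate the full matrix once, then assign each 1x3 matrix into a row. Growing with `rbind()` in a loop also works but re-copies the matrix on every iteration.

```r
small <- matrix(c(2, 3, 4), nrow = 1)   # Leo's 1 x 3 matrix
big <- matrix(0, nrow = 3, ncol = 3)    # preallocated "space" for three such rows

big[1, ] <- small                        # row assignment drops the 1 x 3 dim
big[2, ] <- c(5, 6, 7)                   # hypothetical further rows
big[3, ] <- c(8, 9, 10)
```

`matrix(0, nrow=3, ncol=3)` is the matrix analogue of `0*(1:3)` for vectors, without the arithmetic trick.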
[R] data fitting and confidence band
Hello,

I am fitting data using different methods, e.g. local polynomials and smoothing splines. The data is generated from a true function model with added normally distributed noise. I would like to know how often the confidence band for all points simultaneously contains all true values.

I can answer the question for one point in the following way:

# =====================================================================
# How many times does the pointwise confidence interval at x=0.5 contain
# the true value at 0.5, i.e. what is the so-called "coverage rate"?
# =====================================================================
pos <- which(x==0.5)
sum(abs(estlp[pos,] - m(x[pos])) <= 1.96*selp[pos,])  # equidistant x outputs 946
                                                      # non-equidistant x outputs 938
sum(abs(estss[pos,] - m(x[pos])) <= 1.96*sess[pos,])  # equidistant x outputs 895
                                                      # non-equidistant x outputs 936

This basically tells me that out of 1000 simulation runs with different random noise, 946 times the true value m(x) for x=0.5 is contained within the 95% confidence interval. The estlp local polynomial performs better than the smoothing splines under this criterion.

Now, is there any specific way to answer "how often does the confidence band for all points simultaneously contain all true values" other than this below?

# =====================================================================
# How often does the confidence band for all points simultaneously
# contain all true values?
# =====================================================================
sum(abs(estlp[,] - m(x[])) <= 1.96*selp[,])  # equidistant x outputs 92560
                                             # non-equidistant x outputs 92109
sum(abs(estss[,] - m(x[])) <= 1.96*sess[,])  # equidistant x outputs 90804
                                             # non-equidistant x outputs 94641

Is there a dedicated function in R for this purpose, i.e. to build confidence bands around a given fit, and maybe a way to plot it nicely too, given that the estimated SEs are calculated?

Many thanks in advance,
Best regards,
Giovanni
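The second code block above counts covered (point, simulation) pairs, which is not the same as simultaneous coverage: for the band to cover simultaneously, *every* point in a given simulation must be covered, so the per-simulation check is an `all()` over points. A sketch with simulated stand-ins for the `est`/`se` matrices (points x simulations):

```r
# Simulated stand-ins: truth is the true curve, est its noisy estimates,
# se the (here constant) standard errors per point and simulation.
set.seed(3)
npts <- 25; nsim <- 1000
truth <- sin(seq(0, pi, length.out = npts))
est <- truth + matrix(rnorm(npts * nsim, sd = 0.1), npts, nsim)
se  <- matrix(0.1, npts, nsim)

inside <- abs(est - truth) <= 1.96 * se           # npts x nsim logical matrix

pointwise    <- mean(inside[13, ])                # coverage at one x: about 0.95
simultaneous <- mean(apply(inside, 2, all))       # all points covered at once:
                                                  # well below 0.95
```

The gap between the two is why simultaneous bands (e.g. Bonferroni-adjusted, replacing 1.96 with `qnorm(1 - 0.05/(2*npts))`) must be wider than pointwise ones.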