On Friday, 16 April 2010 at 00:15 -0800, Kay Cichini wrote:
> thanks thierry,
>
> i considered these transformations already, but variance is not stabilized
> and/or normality is not achieved.
> i guess i'll have to look out for non-parametrics?
Or (maybe) a model based on a non-Gaussian likelihood? A beta distribution comes to mind, fitted either by maximum likelihood or (if relevant prior information is available) in a Bayesian framework.

But beware: you have a not-so-small problem. Your data contain zeroes and ones which, if you have no information on a "sample size", are "sharp" zeroes and ones, and are therefore theoretically bound to infinite linear predictors (in plain English: bloody unlikely). These values make a "fixed effect" analysis impossible: these points "at infinity" make regression essentially impossible. Consider:

> logit <- function(x) log(x/(1-x))
> ilogit <- function(x) 1/(1+exp(-x))
> ilogit(coef(lm(logit(MH.Index)~0+stage, data=similarity)))
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
  NA/NaN/Inf in foreign function call (arg 4)

You need some information on the precision of your zeroes and ones, and must use it to get some "possible" values of MH.Index. One might be tempted to "unround" them by a small amount (representing a "reasonable guess" of your precision):

> epsilon <- 0.01
> ilogit(coef(lm(logit(MH.Index)~0+stage,
+                data=within(similarity,
+                            {MH.Index<-pmax(epsilon, pmin(1-epsilon, MH.Index))}))))
   stageA    stageB    stageC    stageD
0.6490997 0.2914323 0.5087639 0.5721789

BUT:

> epsilon <- 0.000001
> ilogit(coef(lm(logit(MH.Index)~0+stage,
+                data=within(similarity,
+                            {MH.Index<-pmax(epsilon, pmin(1-epsilon, MH.Index))}))))
   stageA    stageB    stageC    stageD
0.6588222 0.1020177 0.5087639 0.5721789

The estimate for stageB depends critically on the amount of "unrounding" chosen.
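(To see in base R alone, without the poster's `similarity` data, why both failures above happen — this illustration is mine, not part of the original post:)

```r
logit  <- function(x) log(x / (1 - x))
ilogit <- function(x) 1 / (1 + exp(-x))

## Sharp values map to infinite linear predictors,
## hence the NA/NaN/Inf error from lm.fit():
logit(c(0, 1))                      # -Inf, Inf

## And "unrounded" values keep drifting as epsilon shrinks,
## which is why the stageB estimate depends on epsilon:
sapply(c(0.01, 1e-4, 1e-6), logit)  # roughly -4.6, -9.2, -13.8
```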
I tried a small fixed-effect logit model in JAGS: it won't initialize with the original data (0 and 1 are effectively impossible for any possible values of the beta coefficients), and it seems to exhibit the same kind of sensitivity to the "unrounding" amount as the linear model:

LogitModFix <- local({
    Modele <- function() {
        for (k in 1:nobs) {
            ## logit(MH.Index[k]) <- lpi.i[k]
            lpi.i[k] ~ dnorm(lpi[stage[k]], tau.lpi[stage[k]])
        }
        for (i in 1:nstage) {
            lpi[i] ~ dnorm(0, 1.0E-6)
            tau.lpi[i] <- pow(sigma.lpi[i], -2)
            sigma.lpi[i] ~ dunif(0, 100)
            pi[i] <- ilogit(lpi[i])
        }
    }
    Data <- function() {
        for (i in 1:nobs) {
            lpi.i[i] <- logit(MH.Index[i])
        }
    }
    tmpf <- tempfile()
    ## write.model has been shoplifted from R2WinBUGS and adapted to JAGS
    ## by allowing a "data" argument for transformations
    write.model(Modele, tmpf, data=Data)
    epsilon <- 0.00001
    Modele.jags <- jags.model(tmpf,
                              data=with(similarity,
                                        list(MH.Index=pmax(epsilon, pmin(1-epsilon, MH.Index)),
                                             ## MH.Index=MH.Index,
                                             stage=stage,
                                             nobs=nrow(similarity),
                                             nstage=nlevels(stage))),
                              n.chains=3)
    unlink(tmpf)
    Modele.coda <- coda.samples(Modele.jags,
                                variable.names=c("deviance", "pi", "sigma.lpi"),
                                n.iter=1000)
    list(Modele.jags, Modele.coda)
})

## Convergence (not shown) is quite acceptable

> summary(LogitModFix[[2]])

Iterations = 1001:2000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 1000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                 Mean      SD  Naive SE Time-series SE
deviance     660.5656 4.71697 0.0861196      0.1672079
pi[1]          0.6304 0.21666 0.0039557      0.0040347
pi[2]          0.1506 0.08036 0.0014671      0.0014460
pi[3]          0.5082 0.03956 0.0007222      0.0006777
pi[4]          0.5696 0.07473 0.0013644      0.0012952
sigma.lpi[1]   6.9409 0.86442 0.0157821      0.0287626
sigma.lpi[2]   4.2775 0.49281 0.0089974      0.0132247
sigma.lpi[3]   1.0875 0.11506 0.0021006      0.0026048
sigma.lpi[4]   0.8801 0.29371 0.0053623      0.0118630

2. Quantiles for each variable:

                 2.5%       25%      50%      75%    97.5%
deviance     654.0216 657.13069 659.6788 662.9535 672.2101
pi[1]          0.1695   0.47600   0.6681   0.8107   0.9485
pi[2]          0.0413   0.09206   0.1345   0.1914   0.3564
pi[3]          0.4320   0.48079   0.5078   0.5366   0.5851
pi[4]          0.4154   0.52518   0.5722   0.6168   0.7135
sigma.lpi[1]   5.5553   6.33366   6.8287   7.4412   8.8207
sigma.lpi[2]   3.5082   3.93440   4.2274   4.5598   5.3497
sigma.lpi[3]   0.8833   1.00397   1.0805   1.1565   1.3412
sigma.lpi[4]   0.5304   0.67740   0.8189   1.0067   1.5987

The same model re-fitted with epsilon=0.01 gives:

> summary(LogitModFix[[2]])

Iterations = 1001:2000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 1000

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

                 Mean      SD  Naive SE Time-series SE
deviance     534.2639 4.42164 0.0807278       0.122396
pi[1]          0.6396 0.10877 0.0019859       0.001892
pi[2]          0.2964 0.06592 0.0012036       0.001163
pi[3]          0.5083 0.04120 0.0007522       0.000805
pi[4]          0.5702 0.07170 0.0013091       0.001221
sigma.lpi[1]   3.0387 0.36356 0.0066376       0.008803
sigma.lpi[2]   2.0519 0.22665 0.0041381       0.005358
sigma.lpi[3]   1.0928 0.12531 0.0022879       0.003305
sigma.lpi[4]   0.8699 0.26650 0.0048656       0.010146

2. Quantiles for each variable:

                 2.5%      25%      50%      75%    97.5%
deviance     527.9142 530.9711 533.4347 536.6295 544.7738
pi[1]          0.4130   0.5683   0.6429   0.7195   0.8251
pi[2]          0.1826   0.2511   0.2909   0.3373   0.4390
pi[3]          0.4266   0.4809   0.5093   0.5359   0.5892
pi[4]          0.4183   0.5254   0.5701   0.6150   0.7106
sigma.lpi[1]   2.4362   2.7772   2.9965   3.2600   3.8395
sigma.lpi[2]   1.6606   1.8935   2.0320   2.1811   2.5598
sigma.lpi[3]   0.8803   1.0035   1.0754   1.1711   1.3670
sigma.lpi[4]   0.5022   0.6901   0.8198   0.9906   1.5235

One should also note that the deviance is *extremely* sensitive to the choice of epsilon, which tends to indicate that, at small epsilon values, the "sharp" "impossible" values dominate the evaluation of the coefficients they are involved in. Beta models (not shown) give similar results, to a lesser extent.
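(A minimal base-R sketch of why the beta models hit the same wall — simulated data of my own, not your `similarity` set, fitted by maximum likelihood with optim():)

```r
set.seed(42)
y <- rbeta(30, 2, 5)   # simulated similarities strictly inside (0, 1)

## Negative log-likelihood of a beta(a, b) sample, parameterized on
## the log scale so optim() searches over unconstrained values:
negll <- function(par, y) -sum(dbeta(y, exp(par[1]), exp(par[2]), log = TRUE))

fit <- optim(c(0, 0), negll, y = y)   # default Nelder-Mead
exp(fit$par)                          # estimated (a, b), near the true (2, 5)

## A single sharp zero wrecks the likelihood: even at the true
## parameter values the log-likelihood is -Inf, so the objective is Inf:
negll(c(log(2), log(5)), c(y, 0))     # Inf
```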
In short, your "sharp" data are essentially incompatible with your model, which can't be fitted unless you get at least a rough idea of their precision.

HTH,

Emmanuel Charpentier, DDS, MSc

> best regards,
> kay

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.