Re: [R-sig-phylo] best fit vs normality of residuals
You got it perfectly, Luke. Many thanks for your detailed answer!

Cheers,
Agus
Re: [R-sig-phylo] best fit vs normality of residuals
Luke Matthews, Tue, 03 Dec 2013 08:33:36 -0800

Hi Agus,

If I understand your post correctly, you fitted the two models with exactly the same formula and phylogenetic tree, and varied only the transform applied to the variables. In one case the transform was to scale and center both the dependent and independent variables, while in the other you rank-ordered all the variables. Please correct me if I have this wrong.

In this case you would have to compare the normality of the residuals, because when you transform the dependent variable in different ways I think the likelihoods will necessarily shift simply because the distribution of the data being explained has shifted. I don't think you can compare likelihoods, AICs, or BICs in this way when you alter the dependent variable, just as you can't compare likelihood-based values for models that include mostly the same data points but also some different ones. You can only compare likelihood-based values for models that have exactly the same dependent data but differ in the formulation of the model's independent variables, autocorrelation parameters, etc.

As another point, although ranking seemed to fix your normality problem in this case, it may be introducing other issues. Ranked variables are usually too flat for linear regression analysis, and perhaps the test of residual normality you are using only detects some deviations, such as skewness (I'm not familiar with the Lilliefors test).

I would recommend that, instead of ranking, you plot the distribution of your dependent and independent variables and try the transforms (log, arcsine, etc.) that a standard stats book would recommend given the appearance of each distribution. Some variables may not need transforming at all. It may be just one or two that are causing the problem in the model residuals, in which case you should transform only those variables. Remember that since it's the normality of the residuals that matters, it may be more appropriately fixed by transforming an independent variable than a dependent one. If the dependent variable itself is normal, then non-normality might be introduced in the residuals because of the distribution of a particular independent variable.

Best,
Luke

Luke J. Matthews | Senior Scientific Director | Activate Networks
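The point about likelihoods shifting with the scale of the dependent variable can be illustrated with a minimal sketch. It is not the phylogenetic GLS from the thread: it assumes a plain intercept-only normal model fitted by maximum likelihood, and the response values and the divisor `s` are made up for illustration. Dividing the response by a constant leaves the quality of the fit unchanged, yet the maximised log-likelihood moves by n * log(s), and the AIC moves with it.

```python
import math


def gaussian_max_loglik(y):
    """Maximised log-likelihood of an intercept-only normal model (ML variance)."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n  # ML estimate: divides by n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def aic(loglik, k):
    """AIC under the usual definition 2k - 2 * logLik."""
    return 2 * k - 2 * loglik


# Hypothetical response values, invented for this illustration only.
y = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 4.9, 3.7]
s = 10.0                          # arbitrary rescaling constant (assumption)
y_rescaled = [v / s for v in y]   # same data, different units

ll_raw = gaussian_max_loglik(y)
ll_rescaled = gaussian_max_loglik(y_rescaled)

# Same model, same information, but the log-likelihood differs by n * log(s).
shift = ll_rescaled - ll_raw
aic_gap = aic(ll_rescaled, k=2) - aic(ll_raw, k=2)  # k = mean + variance

print(f"log-likelihood shift: {shift:.4f}")
print(f"n * log(s):           {len(y) * math.log(s):.4f}")
print(f"AIC gap:              {aic_gap:.4f}")
```

For a linear rescaling like this one, the shift is a known constant and could be corrected for. Ranking is not a smooth one-to-one transform of the original values, so no comparable correction is available, which is consistent with the advice to judge the two models by their residuals rather than by AIC or BIC.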
Re: [R-sig-phylo] best fit vs normality of residuals
Good advice!

Cheers,
Ted
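One part of the advice in this thread is that ranked variables are "too flat" for linear regression. A minimal sketch of what that means, using two made-up samples with very different shapes (the sample size of 87 only echoes the number of species mentioned in the thread; the values are hypothetical and assumed to have no ties): after ranking, both collapse to the same set of integers 1..n, a uniform distribution with zero skewness and an excess kurtosis near -1.2.

```python
def ranks(values):
    """Ranks 1..n of distinct values (ties are not handled; assumed absent)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def central_moment(x, p):
    """p-th central moment, dividing by n."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** p for v in x) / len(x)


def skewness(x):
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5


def excess_kurtosis(x):
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3.0


n = 87
# Two hypothetical variables: one strongly right-skewed, one heavy-tailed.
right_skewed = [1.5 ** i for i in range(n)]
heavy_tailed = [(i - n // 2) ** 3 for i in range(n)]

ranked_a = ranks(right_skewed)
ranked_b = ranks(heavy_tailed)

print("skewness before ranking:", round(skewness(right_skewed), 3))
print("skewness after ranking: ", round(skewness(ranked_a), 3))
print("excess kurtosis after ranking:", round(excess_kurtosis(ranked_a), 3))
print("same ranked values:", sorted(ranked_a) == sorted(ranked_b))
```

Whatever information the original spacing between values carried is discarded by ranking, so a residual normality test passing on ranked data does not by itself show that the linear model is a good description of the original variables.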
[R-sig-phylo] best fit vs normality of residuals
Agus Camacho agus.cama...@gmail.com, Fri, 29 Nov 2013 12:29:06 -0200

Dear colleagues,

I'm having difficulty deciding between a phylogenetic GLS model with a better fit (lower AIC and BIC) in which the normality of the residuals, after accounting for phylogenetic signal, is compromised, and a model with a worse fit whose residuals are normal. The number of species is reasonably high (87), but I don't know whether that would justify allowing a highly significant deviation from normality.

When using scaled and centered data, I get:

            AIC          BIC        logLik
    255.2029505  269.5696455  -121.6014753

    Correlation Structure: corPagel
     Formula: ~1
     Parameter estimate(s):
             lambda
     -0.03313647856

    Coefficients:
                        Value      Std.Error        t-value  p-value
    (Intercept)  0.0432999895  0.06632562733   0.6528395024   0.5157
    X            0.0425258358  0.03552018760   1.1972300459   0.2347
    X2           0.4620358585  0.18471478739   2.5013474287   0.0144
    X1:X2       -0.1211398020  0.04969892007  -2.4374735269   0.0170

The Lilliefors test (thanks, Liam, for posting on this) gave me:

    D = 0.1815, p-value = 2.558e-07

I ranked both the response variable and the factors. My variables had some zeros and in some cases negative values, so I thought that would be the simplest and most robust way. But I might be wrong.

When ranking all variables:

            AIC          BIC        logLik
    766.7826784  781.1493734  -377.3913392

    Correlation Structure: corPagel
     Formula: ~1
     Parameter estimate(s):
           lambda
     0.1434096557

    Coefficients:
                        Value    Std.Error       t-value  p-value
    (Intercept)   5.615576688  9.195445483   0.610691097   0.5431
    X1            0.477054032  0.200882571   2.374790556   0.0199
    X2            0.771914482  0.208616720   3.700156356   0.0004
    X1:x2        -0.007371999  0.004035148  -1.826946400   0.0714

    Lilliefors (Kolmogorov-Smirnov) normality test
    data: chol(solve(vcv(tree))) %*% residuals(M2)
    D = 0.0545, p-value = 0.7709

Would anybody have a hint on this?

Gracias!
Agus

--
Agustín Camacho Guerrero.
Doutor em Zoologia.
Laboratório de Herpetologia, Departamento de Zoologia, Instituto de Biociências, USP.
Rua do Matão, trav. 14, nº 321, Cidade Universitária, São Paulo - SP, CEP: 05508-090, Brasil.
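As a quick arithmetic check on the two fits reported in this message, the sketch below back-calculates the number of estimated parameters and the sample size implied by the printed AIC, BIC and logLik values. It assumes the usual definitions AIC = 2k - 2 logLik and BIC = k log(n) - 2 logLik; whether the fitting software applied exactly these definitions is an assumption, and the function names are hypothetical.

```python
import math


def implied_k(aic, loglik):
    """Parameter count implied by AIC = 2k - 2 * logLik."""
    return (aic + 2 * loglik) / 2


def implied_n(bic, loglik, k):
    """Sample size implied by BIC = k * log(n) - 2 * logLik."""
    return math.exp((bic + 2 * loglik) / k)


# Values copied from the two model summaries in the message.
fits = {
    "scaled_centered": {"aic": 255.2029505, "bic": 269.5696455, "loglik": -121.6014753},
    "ranked": {"aic": 766.7826784, "bic": 781.1493734, "loglik": -377.3913392},
}

summary = {}
for name, fit in fits.items():
    k = round(implied_k(fit["aic"], fit["loglik"]))
    n = implied_n(fit["bic"], fit["loglik"], k)
    summary[name] = (k, n)
    print(f"{name}: implied k = {k}, implied n = {n:.2f}")

# Both fits carry the same penalty, so the AIC gap is purely a logLik gap.
aic_gap = fits["ranked"]["aic"] - fits["scaled_centered"]["aic"]
loglik_gap = fits["scaled_centered"]["loglik"] - fits["ranked"]["loglik"]
print(f"AIC gap = {aic_gap:.4f}, 2 * logLik gap = {2 * loglik_gap:.4f}")
```

Under these assumptions both models have the same number of parameters, so the whole AIC and BIC difference comes from the log-likelihoods, which were computed on differently transformed responses. The implied sample size comes out at about 81 rather than the 87 species mentioned; the message does not say whether some species were dropped or how the software counts observations, so that figure is only an observation, not a correction.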
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/