Re: [R] Consistency of Logistic Regression
On 12.11.2010 20:11, Marc Schwartz wrote:

> You are not creating your data set properly.
>
> What you really want is:
>
>     DF <- data.frame(y = c(1,0,1,0,0,1,0,0,1,1),
>                      x = c(5,4,1,6,3,6,5,3,7,9))

Actually it is in general safer to have a factor y rather than numeric y for classification tasks.

Best,
Uwe

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
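The factor-response suggestion can be sketched as follows, using the ten observations posted in the thread. The object names `DF_fac`, `fit_num` and `fit_fac` are made up for illustration. With a factor response, `glm()` with `family = binomial` treats the first level as failure and all other levels as success, so the level order is set explicitly here.

```r
# Data from the thread
DF <- data.frame(y = c(1, 0, 1, 0, 0, 1, 0, 0, 1, 1),
                 x = c(5, 4, 1, 6, 3, 6, 5, 3, 7, 9))

# Numeric 0/1 response
fit_num <- glm(y ~ x, data = DF, family = binomial)

# Factor response: first level ("0") is failure, second level ("1") is success
DF_fac <- transform(DF, y = factor(y, levels = c(0, 1)))
fit_fac <- glm(y ~ x, data = DF_fac, family = binomial)

# Both fits model P(y = 1), so the coefficients should agree
all.equal(coef(fit_num), coef(fit_fac))
```

Stating the levels makes it visible which class is being modelled as the event, which is the part that is easy to get wrong when comparing against other software.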
Re: [R] Consistency of Logistic Regression
I think it is likely I am missing something. Here is a very simple example:

R code:

    mat <- matrix(nrow = 10, ncol = 2, c(1,0,1,0,0,1,0,0,1,1),
                  c(5,4,1,6,3,6,5,3,7,9),
                  dimnames = list(c(1,2,3,4,5,6,7,8,9,10), c("column1","column2")))
    g <- glm(mat[1:10] ~ mat[11:20], family = binomial (link = logit))
    g$converged

SAS code:

    data mat;
    input col1 col2;
    datalines;
    1 5
    0 4
    1 1
    0 6
    0 3
    1 6
    0 5
    0 3
    1 7
    1 9
    ;
    proc logistic data=mat descending;
    model col1 = col2 / link=logit;
    run;

SAS output (in case you don't have access to SAS):

    Convergence criterion satisfied

                Estimate   SE
    Intercept   -1.6118    1.7833
    col2         0.3293    0.3383

Of course, with an example this small, it is not so surprising that the two methods differ; and they hardly differ by a single SE. But as the datasets get larger, the difference is more pronounced. Let me know if you would like me to send you a large dataset.

I get the feeling I am doing something wrong in R, so please let me know what you think.

Thank you!

Ben Godlove

On Thu, Nov 11, 2010 at 1:59 PM, Albyn Jones jo...@reed.edu wrote:

> do you have factors (categorical variables) in the model? it could be
> just a parameterization difference.
>
> albyn
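For readers following the example: one way to build the two-column matrix that the R code above appears to be aiming for is `cbind()`, selecting columns by name rather than by linear index. This is only a sketch of the apparent intent, not the poster's code; the vector names `y` and `x` are assumptions.

```r
# Response and predictor, taken from the SAS datalines in the post
y <- c(1, 0, 1, 0, 0, 1, 0, 0, 1, 1)
x <- c(5, 4, 1, 6, 3, 6, 5, 3, 7, 9)

# cbind() places each vector in its own named column
mat <- cbind(column1 = y, column2 = x)

# Select columns by name so it is explicit which one is the response
g <- glm(mat[, "column1"] ~ mat[, "column2"],
         family = binomial(link = "logit"))
g$converged
```

Printing `mat` before fitting is a cheap way to confirm that the rows match the SAS `datalines` block.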
Re: [R] Consistency of Logistic Regression
You are not creating your data set properly. Your 'mat' is:

    mat
       column1 column2
    1        1       0
    2        1       0
    3        0       1
    4        0       0
    5        1       1
    6        1       0
    7        1       0
    8        0       1
    9        0       0
    10       1       1

What you really want is:

    DF <- data.frame(y = c(1,0,1,0,0,1,0,0,1,1),
                     x = c(5,4,1,6,3,6,5,3,7,9))

    DF
       y x
    1  1 5
    2  0 4
    3  1 1
    4  0 6
    5  0 3
    6  1 6
    7  0 5
    8  0 3
    9  1 7
    10 1 9

    MOD <- glm(y ~ x, data = DF, family = binomial)

    summary(MOD)

    Call:
    glm(formula = y ~ x, family = binomial, data = DF)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.3353  -1.0229  -0.1239   0.9956   1.7477

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -1.6118     1.7833  -0.904    0.366
    x             0.3293     0.3383   0.973    0.330

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 13.863  on 9  degrees of freedom
    Residual deviance: 12.767  on 8  degrees of freedom
    AIC: 16.767

    Number of Fisher Scoring iterations: 4

HTH,

Marc Schwartz

On Nov 12, 2010, at 12:56 PM, Benjamin Godlove wrote:

> I think it is likely I am missing something. Here is a very simple example:
>
> R code:
>
>     mat <- matrix(nrow = 10, ncol = 2, c(1,0,1,0,0,1,0,0,1,1),
>                   c(5,4,1,6,3,6,5,3,7,9),
>                   dimnames = list(c(1,2,3,4,5,6,7,8,9,10), c("column1","column2")))
>     g <- glm(mat[1:10] ~ mat[11:20], family = binomial (link = logit))
>     g$converged
>
> SAS output (in case you don't have access to SAS):
>
>     Convergence criterion satisfied
>
>                 Estimate   SE
>     Intercept   -1.6118    1.7833
>     col2         0.3293    0.3383
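To check the agreement between the two packages programmatically rather than by eye, the estimates and standard errors can be pulled out of the fitted model and compared with the SAS figures quoted in the thread. A minimal sketch; the object names `est` and `sas` are made up here, and the SAS numbers are simply typed in from the post.

```r
# Data frame as given in the reply above
DF <- data.frame(y = c(1, 0, 1, 0, 0, 1, 0, 0, 1, 1),
                 x = c(5, 4, 1, 6, 3, 6, 5, 3, 7, 9))
MOD <- glm(y ~ x, data = DF, family = binomial)

# Estimates and standard errors from the coefficient table
est <- coef(summary(MOD))[, c("Estimate", "Std. Error")]

# SAS figures as posted in the thread (filled column by column)
sas <- matrix(c(-1.6118, 0.3293, 1.7833, 0.3383), ncol = 2,
              dimnames = list(c("(Intercept)", "x"),
                              c("Estimate", "Std. Error")))

# Largest absolute difference between the two sets of figures
max(abs(est - sas))
```

Since the SAS output is rounded to four decimals, any difference on the order of the rounding error means the two fits agree.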
Re: [R] Consistency of Logistic Regression
Is the algorithm converging? Is there separation (i.e., a perfect predictor) in the model? Are you getting a warning about fitted probabilities of 0 or 1, etc.? We would need much more information (preferably a reproducible example) before we can help.

Benjamin Godlove wrote:

> Dear R developers,
>
> I have noticed a discrepancy between the coefficients returned by R's
> glm() for logistic regression and SAS's PROC LOGISTIC. I am using
> dist = binomial and link = logit for both R and SAS. I believe R uses
> IRLS whereas SAS uses Fisher's scoring, but the difference is something
> like 100 SE on the intercept. What accounts for such a huge difference?
>
> Thank you for your time.
>
> Ben Godlove
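The checks listed above can be run in a few lines. The sketch below wraps `glm()` so that any warnings are recorded alongside the convergence flag; `check_logit()` is a hypothetical helper written for this illustration, not part of R or any package, and the separated data set `sep` is a toy example.

```r
# Hypothetical helper: fit a logistic regression and collect diagnostics
check_logit <- function(formula, data) {
  msgs <- character(0)
  fit <- withCallingHandlers(
    glm(formula, data = data, family = binomial),
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))  # record the warning text
      invokeRestart("muffleWarning")         # then let the fit continue
    }
  )
  list(converged = fit$converged, iterations = fit$iter, warnings = msgs)
}

# Data posted later in the thread: classes overlap, so no separation
ok <- data.frame(y = c(1, 0, 1, 0, 0, 1, 0, 0, 1, 1),
                 x = c(5, 4, 1, 6, 3, 6, 5, 3, 7, 9))
res_ok <- check_logit(y ~ x, ok)

# Toy data with complete separation: y is 0 for x <= 5 and 1 otherwise
sep <- data.frame(y = rep(c(0, 1), each = 5), x = 1:10)
res_sep <- check_logit(y ~ x, sep)

str(res_ok)
str(res_sep)
```

A non-empty `warnings` element, or very large coefficients with huge standard errors, is the usual sign that the maximum likelihood estimate does not exist and that different software may stop at different places.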
Re: [R] Consistency of Logistic Regression
do you have factors (categorical variables) in the model? it could be just a parameterization difference.

albyn

On Thu, Nov 11, 2010 at 12:41:03PM -0500, Benjamin Godlove wrote:

> Dear R developers,
>
> I have noticed a discrepancy between the coefficients returned by R's
> glm() for logistic regression and SAS's PROC LOGISTIC. I am using
> dist = binomial and link = logit for both R and SAS. I believe R uses
> IRLS whereas SAS uses Fisher's scoring, but the difference is something
> like 100 SE on the intercept. What accounts for such a huge difference?
>
> Thank you for your time.
>
> Ben Godlove

-- 
Albyn Jones
Reed College
jo...@reed.edu
Re: [R] Consistency of Logistic Regression
Albyn Jones jones at reed.edu writes:

> do you have factors (categorical variables) in the model? it could be
> just a parameterization difference.
>
> albyn
>
> On Thu, Nov 11, 2010 at 12:41:03PM -0500, Benjamin Godlove wrote:
>
> > Dear R developers,
> >
> > I have noticed a discrepancy between the coefficients returned by R's
> > glm() for logistic regression and SAS's PROC LOGISTIC. I am using
> > dist = binomial and link = logit for both R and SAS. I believe R uses
> > IRLS whereas SAS uses Fisher's scoring, but the difference is something
> > like 100 SE on the intercept. What accounts for such a huge difference?

As previous posters said. Specifically:

* a huge change in the intercept is very unlikely to be caused by a change in underlying algorithm unless the data are pathological (separation, convergence issues, etc.)

* R's default 'intercept' is the value of the first factor level; SAS's is the value of the last factor level. See ?contr.SAS ...
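The parameterization point can be illustrated by fitting the same model under both reference-level conventions. This is a sketch under stated assumptions: the response is the one posted in the thread, but the three-level factor `g` is invented for the example and does not come from the original poster's data.

```r
# Thread response plus a hypothetical three-level factor g
DF <- data.frame(
  y = c(1, 0, 1, 0, 0, 1, 0, 0, 1, 1),
  g = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c"))
)

# R default: treatment contrasts, first level ("a") is the reference
fit_r <- glm(y ~ g, data = DF, family = binomial,
             contrasts = list(g = "contr.treatment"))

# SAS-style contrasts: last level ("c") is the reference
fit_sas <- glm(y ~ g, data = DF, family = binomial,
               contrasts = list(g = "contr.SAS"))

# The intercepts differ: each is the log-odds of y = 1 in its reference group
coef(fit_r)[["(Intercept)"]]
coef(fit_sas)[["(Intercept)"]]

# The fitted probabilities are the same either way
all.equal(fitted(fit_r), fitted(fit_sas), tolerance = 1e-6)
```

Here the intercept moves from the log-odds in group "a" (2 events out of 3) to the log-odds in group "c" (2 out of 4), while the model itself is unchanged. Comparing fitted values or predictions, rather than raw coefficients, sidesteps the coding question when checking one package against another.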