Re: [R] logistic regression: wls and unbalanced samples
On Apr 27, 2011, at 00:22, Andre Guimaraes wrote:

> Greetings from Rio de Janeiro, Brazil. I am looking for advice / references on binary logistic regression with weighted least squares (using lrm weights), in the following context:
> 1) unbalanced sample (n0=1, n1=700);
> 2) sampling weights used to rebalance the sample (w0=1, w1=14.29); and
> 3) after modelling, adjust the intercept in order to reflect the expected % of 1’s in the population (e.g., circa 7%, as opposed to 50%).

?? If the proportion of 1s in the population is about 7%, how exactly is the sample unbalanced? I don't see a reason to use weights at all if the sample is representative of the population.

The opposite situation, where the sample is balanced (e.g. case-control), the population is not, and you are interested in the population values, _that_ might require weighting, with some care, because case weighting and sample weighting are two different things, so the s.e. will be wrong. That sort of stuff is handled by the survey package.

However, what you seem to be doing is to create results for an artificial 50/50 population, then project back to the population you were sampling from all along. I don't think this makes sense at all.

--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk Priv: pda...@gmail.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] logistic regression: wls and unbalanced samples
On Wed, 27 Apr 2011, peter dalgaard wrote:

> On Apr 27, 2011, at 00:22, Andre Guimaraes wrote:
>
>> Greetings from Rio de Janeiro, Brazil. I am looking for advice / references on binary logistic regression with weighted least squares (using lrm weights), in the following context:
>> 1) unbalanced sample (n0=1, n1=700);
>> 2) sampling weights used to rebalance the sample (w0=1, w1=14.29); and
>> 3) after modelling, adjust the intercept in order to reflect the expected % of 1’s in the population (e.g., circa 7%, as opposed to 50%).
>
> ?? If the proportion of 1s in the population is about 7%, how exactly is the sample unbalanced? I don't see a reason to use weights at all if the sample is representative of the population.
>
> The opposite situation, where the sample is balanced (e.g. case-control), the population is not, and you are interested in the population values, _that_ might require weighting, with some care, because case weighting and sample weighting are two different things, so the s.e. will be wrong. That sort of stuff is handled by the survey package.
>
> However, what you seem to be doing is to create results for an artificial 50/50 population, then project back to the population you were sampling from all along. I don't think this makes sense at all.

There are circumstances where it might. It is quite common in pattern recognition for the proportions in the training set not to reflect the population. And if the misclassification costs are asymmetric, you may want to weight the fit.

The case I encountered was SGA (small for gestational age) births. By definition there are about 10% 'successes', but false negatives are far more important than false positives (or one would simply predict all births as normal). This means that you want accurate estimation of probabilities in the right tail of the population distribution, and plug-in estimation of logistic regression is biased. One of many ways to reduce that bias is to re-weight the training set so the estimated probabilities of marginal cases are in the middle of the range.

Note that logistic regression is not normally fitted by 'weighted least squares' (not even by 'lrm' from some unstated package).

This is not a list for tutorials in advanced statistics, but one reference is my Pattern Recognition and Neural Networks book.

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
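To make the re-weighting idea concrete, here is a minimal sketch (plain Python with no packages, not R, and not lrm's actual algorithm) of a logistic fit in which case weights simply multiply each observation's log-likelihood term, maximised by Newton-Raphson. The grouped counts are made up for illustration, and fit_weighted_logit and make_rows are hypothetical helper names. If the model is adequate, weighting the 1s by w1 = n0/n1 should leave the slope roughly unchanged and shift the intercept by about log(w1), so fitted probabilities that sat in the tail move toward the middle of the range.

```python
import math

def fit_weighted_logit(rows, iterations=30):
    """Fit P(y=1|x) = 1/(1+exp(-(a+b*x))) by Newton-Raphson on the
    weighted log-likelihood. rows holds (x, y, weight) triples."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in rows:
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            r = w * (y - p)           # weighted score contribution
            v = w * p * (1.0 - p)     # weighted information contribution
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * step = score.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Made-up grouped data: (x, number of 0s, number of 1s); 126 ones in 1000.
counts = [(-2, 198, 2), (-1, 195, 5), (0, 186, 14), (1, 166, 34), (2, 129, 71)]
n0 = sum(c[1] for c in counts)
n1 = sum(c[2] for c in counts)
w1 = n0 / n1  # weight on each 1 that balances the weighted class totals

def make_rows(weight_for_ones):
    """Expand the grouped counts into (x, y, weight) rows."""
    rows = []
    for x, zeros, ones in counts:
        rows.append((x, 0, float(zeros)))
        rows.append((x, 1, ones * weight_for_ones))
    return rows

a_u, b_u = fit_weighted_logit(make_rows(1.0))  # unweighted fit
a_w, b_w = fit_weighted_logit(make_rows(w1))   # rebalanced fit

print("unweighted intercept, slope:", round(a_u, 2), round(b_u, 2))
print("rebalanced intercept, slope:", round(a_w, 2), round(b_w, 2))
print("log(w1):", round(math.log(w1), 2))
```

The standard errors from such a weighted fit should not be taken at face value, which is the caveat about case weights versus sampling weights raised earlier in the thread.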
Re: [R] logistic regression: wls and unbalanced samples
Many thanks for your messages. I will take a look at the survey package.

I was concerned with the issues raised by Cramer (1999) in "Predictive performance of the binary logit model in unbalanced samples". In this particular case, misclassification costs are much higher for the smaller group (defaults) than for the larger group (non-defaults). However, I have no specific guidelines for how much higher. If I understood correctly, using sampling weights would help improve accuracy on the smaller group and, at least, I would be able to explain the rationale for the different weights.

To cite properly, I was referring to lrm in the Design package (Harrell, 2008).

Sorry to have intruded on the list with such a question, but - once again - thank you for your answers.

On Wed, Apr 27, 2011 at 7:29 AM, Prof Brian Ripley rip...@stats.ox.ac.uk wrote:

>> However, what you seem to be doing is to create results for an artificial 50/50 population, then project back to the population you were sampling from all along. I don't think this makes sense at all.
>
> There are circumstances where it might. It is quite common in pattern recognition for the proportions in the training set not to reflect the population. And if the misclassification costs are asymmetric, you may want to weight the fit.
>
> The case I encountered was SGA births. By definition there are about 10% 'successes', but false negatives are far more important than false positives (or one would simply predict all births as normal). This means that you want accurate estimation of probabilities in the right tail of the population distribution, and plug-in estimation of logistic regression is biased. One of many ways to reduce that bias is to re-weight the training set so the estimated probabilities of marginal cases are in the middle of the range.
>
> Note that logistic regression is not normally fitted by 'weighted least squares' (not even by 'lrm' from some unstated package).
>
> This is not a list for tutorials in advanced statistics, but one reference is my Pattern Recognition and Neural Networks book.
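Since the cost ratio between a missed default and a false alarm is not known here, one way to explore it is to see which probability cut-off each candidate ratio implies. The sketch below (plain Python) uses the standard expected-cost argument: predicting "default" is cheaper when p * cost_false_negative > (1 - p) * cost_false_positive. It assumes the fitted probabilities are calibrated to the population; the function name and the ratios are hypothetical, chosen only for illustration.

```python
def cost_threshold(cost_false_positive, cost_false_negative):
    """Cut-off on P(default) above which predicting 'default'
    has the lower expected misclassification cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)

# Hypothetical ratios: cost of a missed default relative to a false alarm.
for ratio in (1, 5, 10, 20):
    print("cost ratio", ratio, "-> cut-off", round(cost_threshold(1.0, float(ratio)), 3))
```

If the decision is stable over the plausible range of ratios, the exact value matters less than it first appears.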
[R] logistic regression: wls and unbalanced samples
Greetings from Rio de Janeiro, Brazil.

I am looking for advice / references on binary logistic regression with weighted least squares (using lrm weights), in the following context:

1) unbalanced sample (n0=1, n1=700);
2) sampling weights used to rebalance the sample (w0=1, w1=14.29); and
3) after modelling, adjust the intercept in order to reflect the expected % of 1’s in the population (e.g., circa 7%, as opposed to 50%).

I have identified references that deal with the last point, but no conclusive article or book dealing with this specific use of weights in unbalanced samples.

The area under the ROC curve is about 0.70, and the estimated probabilities are close to the frequencies of 1’s in different ranges, which looks satisfactory. The Hosmer-Lemeshow test is not significant, as expected.

Can someone comment on the adopted strategy, or suggest some specific bibliography that might address the issue of weights and unbalanced samples in logistic regression?

Thanks in advance,
André Guimarães
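The arithmetic behind steps 2 and 3 can be sketched as follows (plain Python; the function names are hypothetical). The n0 given in the post does not reproduce w1 = 14.29, so the sketch assumes n0 = 10000, which does (10000/700 is about 14.29) and gives close to 7% of 1’s; that value is a guess, not something stated in the thread. The intercept adjustment is the usual shift on the logit scale between the fraction of 1’s the model was fitted to and the fraction in the population, and it is only appropriate if the slopes are unaffected by the re-weighting.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def rebalancing_weight(n0, n1):
    """Weight on each 1 (with w0 = 1) that equalises the weighted class totals."""
    return n0 / n1

def intercept_offset(fit_fraction, pop_fraction):
    """Amount to subtract from an intercept fitted where the share of 1's is
    fit_fraction so that it reflects a population share of pop_fraction."""
    return logit(fit_fraction) - logit(pop_fraction)

def adjust_probability(p_fitted, fit_fraction, pop_fraction):
    """Move one fitted probability from the rebalanced scale to the population scale."""
    return inv_logit(logit(p_fitted) - intercept_offset(fit_fraction, pop_fraction))

n0, n1 = 10000, 700   # n0 is an assumption, see the note above
w1 = rebalancing_weight(n0, n1)
offset = intercept_offset(0.5, n1 / (n0 + n1))

print("w1:", round(w1, 2))
print("intercept offset:", round(offset, 3), "log(w1):", round(math.log(w1), 3))
print("0.5 on the balanced scale maps to:", round(adjust_probability(0.5, 0.5, 0.07), 3))
```

With w0 = 1 and a 50/50 weighted sample, the offset equals log(w1), so step 3 amounts to subtracting log(w1) from the weighted intercept.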