Re: [R] Logistic regression for large data

2022-11-14 Thread Bill Dunlap
summary(Base)

would show if one of the columns of Base was read as character data instead of
the expected numeric.  That could cause an explosion in the number of dummy
variables, hence a huge design matrix.
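
For instance (illustrative data only): a numeric column that arrives as
character is treated as a factor, and model.matrix() then builds one dummy
column per distinct value:

x_num <- rnorm(1000)
x_chr <- as.character(x_num)         # numbers read in as text
ncol(model.matrix(~ x_num))          # 2: intercept + slope
ncol(model.matrix(~ x_chr))          # ~1000: one dummy per unique value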

-Bill


On Fri, Nov 11, 2022 at 11:30 PM George Brida wrote:

> Dear R users,
>
> I have a database called Base.csv (attached to this email) which
> contains 13 columns and 8257 rows, of which the first 8 columns are dummy
> variables taking the value 1 or 0. The problem is that when I run the
> following instructions to fit a logistic regression, R runs for hours and
> hours without giving an output:
>
> Base=read.csv("C:\\Users\\HP\\Desktop\\New\\Base.csv",header=FALSE,sep=";")
>
> fit_1=glm(Base[,2]~Base[,1]+Base[,10]+Base[,11]+Base[,12]+Base[,13],family=binomial(link="logit"))
>
> Apparently, there is not enough memory to produce the requested output. Is
> there any other function for logistic regression that handles large data
> and returns output in reasonable time?
>
> Many thanks
>
> Kind regards
>
> George




Re: [R] Logistic regression for large data

2022-11-12 Thread Ebert,Timothy Aaron
Hi George,
   I did not get an attachment.
   My first step would be to try simplifying things. Do all of these work?

fit_1=glm(Base[,2]~Base[,1],family=binomial(link="logit"))
fit_1=glm(Base[,2]~Base[,10],family=binomial(link="logit"))
fit_1=glm(Base[,2]~Base[,11],family=binomial(link="logit"))
fit_1=glm(Base[,2]~Base[,12],family=binomial(link="logit"))
fit_1=glm(Base[,2]~Base[,13],family=binomial(link="logit"))

This is not a large dataset. That said, if your computer is nearly out of 
memory, even a small dataset might be too much. The machine might have plenty 
of physical memory but also many open files, browser data, and other 
applications eating into it.

Regards,
Tim



-Original Message-
From: R-help  On Behalf Of George Brida
Sent: Friday, November 11, 2022 4:17 PM
To: r-help@r-project.org
Subject: [R] Logistic regression for large data


Dear R users,

I have a database called Base.csv (attached to this email) which contains 13 
columns and 8257 rows, of which the first 8 columns are dummy variables taking 
the value 1 or 0. The problem is that when I run the following instructions to 
fit a logistic regression, R runs for hours and hours without giving an output:

Base=read.csv("C:\\Users\\HP\\Desktop\\New\\Base.csv",header=FALSE,sep=";")
fit_1=glm(Base[,2]~Base[,1]+Base[,10]+Base[,11]+Base[,12]+Base[,13],family=binomial(link="logit"))

Apparently, there is not enough memory to produce the requested output. Is 
there any other function for logistic regression that handles large data and 
returns output in reasonable time?

Many thanks

Kind regards

George



Re: [R] Logistic regression for large data

2022-11-11 Thread David Winsemius
That’s not a large data set. Something else besides memory limits is going on. 
You should post output of summary(Base). 

— 
David
Sent from my iPhone

> On Nov 11, 2022, at 11:29 PM, George Brida  wrote:
> 
> Dear R users,
> 
> I have a database called Base.csv (attached to this email) which
> contains 13 columns and 8257 rows, of which the first 8 columns are dummy
> variables taking the value 1 or 0. The problem is that when I run the
> following instructions to fit a logistic regression, R runs for hours and
> hours without giving an output:
> 
> Base=read.csv("C:\\Users\\HP\\Desktop\\New\\Base.csv",header=FALSE,sep=";")
> fit_1=glm(Base[,2]~Base[,1]+Base[,10]+Base[,11]+Base[,12]+Base[,13],family=binomial(link="logit"))
> 
> Apparently, there is not enough memory to produce the requested output. Is
> there any other function for logistic regression that handles large data
> and returns output in reasonable time?
> 
> Many thanks
> 
> Kind regards
> 
> George



Re: [R] Logistic Regression with Panel Data

2019-04-23 Thread Marc Schwartz via R-help


> On Apr 23, 2019, at 8:26 AM, Paul Bernal  wrote:
> 
> Dear friends, hope you are all doing great,
> 
> I would like to know if there is any R package that allows fitting of
> logistic regression to panel data.
> 
> I installed and loaded package plm, but from what I have read so far, plm
> only allows fitting of linear regression to panel data, not logistic.
> 
> Any help and/or guidance will be greatly appreciated,
> 
> Best regards,
> 
> Paul


Hi Paul,

You might look at the pglm package on CRAN.

Since there can be multiple methodologies that are apropos in this setting, 
such as GEE, mixed-effects, robust covariance matrix, etc., you might want to 
look at the CRAN task view that covers this domain:

  https://cran.r-project.org/web/views/Econometrics.html
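
For instance, a random-effects panel logit with pglm might look like the
sketch below; 'mydata', the variables, and the index names are placeholders,
and the exact arguments should be checked against ?pglm:

library(pglm)
fit <- pglm(y ~ x1 + x2, data = mydata,
            family = binomial("logit"),
            model = "random", index = c("id", "year"))
summary(fit)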

Regards,

Marc Schwartz



Re: [R] Logistic regression and robust standard errors

2016-07-01 Thread Achim Zeileis

On Fri, 1 Jul 2016, Faradj Koliev wrote:


Dear Achim Zeileis, 
Many thanks for your quick and informative answer. 

I'm sure that vcovCL should work; however, I experience some problems. 


> coeftest(model, vcov=vcovCL(model, cluster=mydata$ID))

First I got this error: 

Error in vcovCL(model, cluster = mydata$ID) : 
  length of 'cluster' does not match number of observations


This probably means that not all rows of "mydata" have been used for the 
estimation of the model, e.g., due to missing values or something like 
that. Hence, the mismatch seems to occur.
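
One way to align the cluster variable with the rows actually used in the fit
(a sketch, assuming the model frame was stored with the fit):

used <- rownames(model.frame(model))   # rows kept after na.omit etc.
coeftest(model, vcov = vcovCL(model, cluster = mydata[used, "ID"]))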



After checking the observations I got this error: 

Error in vcovCL(model, cluster = mydata$ID) : object 'tets' not found
Called from: vcovCL(model, cluster = mydata$ID)
Browse[1]> 


The variable "tets" is presumably one of the regressors in your "model" 
and apparently it cannot be found when extracting the model scores. Maybe 
you haven't stored the model frame and have since deleted the data.


Hard to say without a simple reproducible example.
Z



Re: [R] Logistic regression and robust standard errors

2016-07-01 Thread Faradj Koliev
Dear Achim Zeileis, 

Many thanks for your quick and informative answer. 

I'm sure that vcovCL should work; however, I experience some problems. 


> coeftest(model, vcov=vcovCL(model, cluster=mydata$ID))

First I got this error: 

Error in vcovCL(model, cluster = mydata$ID) : 
  length of 'cluster' does not match number of observations

After checking the observations I got this error: 

Error in vcovCL(model, cluster = mydata$ID) : object 'tets' not found
Called from: vcovCL(model, cluster = mydata$ID)
Browse[1]> 

What can I do to fix it? What am I doing wrong now?

Re: [R] Logistic regression and robust standard errors

2016-07-01 Thread Achim Zeileis

On Fri, 1 Jul 2016, Faradj Koliev wrote:

Dear all, 



I use the 'polr' command (library: MASS) to estimate an ordered logistic regression.

My model:   summary(model <- polr(y ~ x1 + x2 + x3 + x4 + x1*x2, data = mydata, Hess = TRUE))

But how do I get robust clustered standard errors? 

I've tried coeftest(resA, vcov=vcovHC(resA, cluster=lipton$ID))


The vcovHC() function currently does not (yet) have a "cluster" argument. 
We are working on it but it's not finished yet.


As an alternative I include the vcovCL() function below that computes the 
usual simple clustered sandwich estimator. This can be applied to "polr" 
objects and plugged into coeftest(). So


coeftest(resA, vcov=vcovCL(resA, cluster=lipton$ID))

should work.


and summary(a <- robcov(model,mydata$ID)).


The robcov() function does in principle what you want, but I'm not sure 
whether it works with polr(). But for sure it works with lrm() from the 
"rms" package.


Hope that helps,
Z

vcovCL <- function(object, cluster = NULL, adjust = NULL)
{
  stopifnot(require("sandwich"))

  ## cluster specification
  if(is.null(cluster)) cluster <- attr(object, "cluster")
  if(is.null(cluster)) stop("no 'cluster' specification found")
  cluster <- factor(cluster)

  ## estimating functions and dimensions
  ef <- estfun(object)
  n <- NROW(ef)
  k <- NCOL(ef)
  if(n != length(cluster))
stop("length of 'cluster' does not match number of observations")
  m <- length(levels(cluster))

  ## aggregate estimating functions by cluster and compute meat
  ef <- sapply(levels(cluster), function(i) colSums(ef[cluster == i, ,
drop = FALSE]))
  ef <- if(NCOL(ef) > 1L) t(ef) else matrix(ef, ncol = 1L)
  mt <- crossprod(ef)/n

  ## bread
  br <- try(bread(object), silent = TRUE)
  if(inherits(br, "try-error")) br <- vcov(object) * n

  ## put together sandwich
  vc <- 1/n * (br %*% mt %*% br)

  ## adjustment
  if(is.null(adjust)) adjust <- class(object)[1L] == "lm"
  adj <- if(adjust) m/(m - 1L) * (n - 1L)/(n - k) else m/(m - 1L)

  ## return
  return(adj * vc)
}



Neither works for me. So I wonder, what am I doing wrong here?

All suggestions are welcome. Thank you!



Re: [R] Logistic Regression output baseline (reference) category

2016-03-26 Thread David Winsemius

> On Mar 25, 2016, at 10:19 PM, Michael Artz  wrote:
> 
> Hi,
>  I have now read an introductory text on regression and I think I do
> understand what the intercept is doing.  However, my original question is
> still unanswered.  I understand that the intercept term is the constant
> that each other term is measured against.  I think baseline is a good word
> for it.  However, it does not represent any one of the x variables by
> itself.

It represents all of the X variables at their reference levels. There are no 
individual intercepts on a variable-by-variable basis. You accepted the notion 
of "baseline". You cannot parcel out single-variable intercepts. What would 
they actually mean anyhow?



>  Is there a way in R to extrapolate the individual x-variable
> intercepts from the equation somehow?

If you describe the process by which that could be calculated, we might have 
a basis for discussion, but as it is I think you still need to be studying the 
theory more. I don't intend any further responses on R-help, where these 
questions are off-topic, so you should direct any further questions to 
stats.stackexchange.com

David Winsemius
Alameda, CA, USA



Re: [R] Logistic Regression output baseline (reference) category

2016-03-25 Thread Michael Artz
Hi,
  I have now read an introductory text on regression and I think I do
understand what the intercept is doing.  However, my original question is
still unanswered.  I understand that the intercept term is the constant
that each other term is measured against.  I think baseline is a good word
for it.  However, it does not represent any one of the x variables by
itself.  Is there a way in R to extrapolate the individual x-variable
intercepts from the equation somehow?




Re: [R] Logistic Regression output baseline (reference) category

2016-03-15 Thread David Winsemius

> On Mar 15, 2016, at 1:27 PM, Michael Artz  wrote:
> 
> Hi,
>   I am trying to use the summary from the glm function as a data source. I
> am using the call sink() then
> summary(logisticRegModel)$coefficients then sink().

Since it's a matrix you may need to locate a function that writes matrices to 
files. I seem to remember that the MASS package has one. 
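
A sketch, assuming the function meant is MASS::write.matrix (base write.csv
also works, since the coefficients are a matrix):

library(MASS)
cf <- summary(logisticRegModel)$coefficients
write.matrix(cf, file = "coefs.txt")   # MASS
write.csv(cf, file = "coefs.csv")      # base R, keeps row names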

>  The independent
> variables are categorical and thus there is always a baseline value for
> every category that is omitted from the glm output.

Well, it's not really omitted, so much as shared among all variables. For 
further reading in the help pages consult:

?model.matrix
?contrasts
?contr.treatment

But you probably need to supplement that with an introductory text that covers 
R regression defaults.

>  I am interested in how
> to get the Z column for all of the categorical values.

The Z column? You meant the "z value" column. Again, since it's a matrix you 
need to use column indexing with "["

summary(logisticRegModel)$coefficients[  , "z value"]

Read up on the summary function for glm objects at:

?summary.glm


>  I don't see any row
> for the reference category.

What do you imagine the (Intercept) row to be doing? If you are having 
difficulty understanding this (which is not really an R-specific issue) there 
are probably already several explanations to similar questions on:

http://stats.stackexchange.com/

 
> 
> How can I get this Z value in the output?

Asked and answered.

> 

David Winsemius
Alameda, CA, USA



Re: [R] Logistic Regression output baseline (reference) category

2016-03-15 Thread Bert Gunter
The reference category is aliased with the constant term in the
default contr.treatment contrasts.

See ?contr.treatment , ?C, ?contrasts
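
A small illustration of the aliasing:

f <- factor(c("a", "b", "c"))
contrasts(f)        # contr.treatment: level "a" has no column of its own
model.matrix(~ f)   # the (Intercept) column absorbs the reference level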

If you don't know what this means, you should probably consult a local
statistical resource or ask about linear model contrasts at a
statistical help website like stats.stackexchange.com. This list is
for R programming questions.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Tue, Mar 15, 2016 at 1:27 PM, Michael Artz  wrote:
> Hi,
>I am trying to use the summary from the glm function as a data source. I
> am using the call sink() then
> summary(logisticRegModel)$coefficients then sink().  The independent
> variables are categorical and thus there is always a baseline value for
> every category that is omitted from the glm output.  I am interested in how
> to get the Z column for all of the categorical values.  I don't see any row
> for the reference category.  How can I get this Z value in the output?


Re: [R] Logistic Regression

2016-01-25 Thread Greg Snow
Do you have the sample sizes that the sample proportions were computed
from (e.g. 0.5 could be 1 out of 2 or 100 out of 200)?

If you do then you can specify the model with the proportions as the y
variable and the corresponding sample sizes as the weights argument to
glm.

If you only have proportions without an integer sample size then you
may want to switch to using beta regression instead of logistic
regression.
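
A sketch of the first approach, with 'prop' (= successes/n) and 'n' as
placeholder column names:

fit <- glm(prop ~ x1 + x2, family = binomial, weights = n, data = mydata)
summary(fit)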

On Sat, Jan 23, 2016 at 1:41 PM, pari hesabi  wrote:
> Hello everybody,
>
> I am trying to fit a logistic regression model by using glm() function in R. 
> My response variable is a sample proportion NOT binary numbers(0,1).
>
> Regarding glm() function, I receive this error:  non integer # successes in a 
> binomial glm!
>
> I would appreciate it if anybody could guide me.
>
>
> Regards,
>
> Pari



-- 
Gregory (Greg) L. Snow Ph.D.
538...@gmail.com



Re: [R] Logistic Regression

2016-01-25 Thread Wensui Liu
But beta regression can only model the open interval between zero and one.

On Monday, January 25, 2016, Greg Snow <538...@gmail.com> wrote:



-- 
WenSui Liu
https://statcompute.wordpress.com/



Re: [R] Logistic Regression

2016-01-23 Thread John C Frain
Alternatively you might use log(p/(1-p)) as your dependent variable and use
OLS with robust standard errors. Much of your inference would be analogous
to a logistic regression.
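
A sketch with placeholder names; the +0.5 adjustment (an empirical logit)
guards against proportions of exactly 0 or 1:

library(lmtest); library(sandwich)
mydata$elogit <- log((mydata$p * mydata$n + 0.5) /
                     ((1 - mydata$p) * mydata$n + 0.5))
fit <- lm(elogit ~ x1 + x2, data = mydata)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # robust standard errors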

John C Frain
3 Aranleigh Park
Rathfarnham
Dublin 14
Ireland
www.tcd.ie/Economics/staff/frainj/home.html
mailto:fra...@tcd.ie
mailto:fra...@gmail.com



Re: [R] Logistic Regression

2016-01-23 Thread Wensui Liu
with glm(), you might try the quasibinomial family
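
For example (placeholder names; the quasi-binomial family accepts a
non-integer proportion response):

fit <- glm(prop ~ x1 + x2, family = quasibinomial, data = mydata)
summary(fit)   # coefficients as with binomial; dispersion is estimated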



-- 
WenSui Liu
https://statcompute.wordpress.com/



Re: [R] Logistic Regression

2016-01-23 Thread David Winsemius

> On Jan 23, 2016, at 12:41 PM, pari hesabi  wrote:
> 
> Hello everybody,
> 
> I am trying to fit a logistic regression model by using glm() function in R. 
> My response variable is a sample proportion NOT binary numbers(0,1).

So multiply the sample proportions (and 1-proportions) by the number of 
samples, round to integers, you will have an appropriate response variable and 
complements, and you can fit a binomial model.
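
A sketch of that, with 'prop' and 'n' as placeholders for the proportion
and sample size columns:

succ <- round(mydata$prop * mydata$n)
fail <- mydata$n - succ
fit  <- glm(cbind(succ, fail) ~ x1 + x2, family = binomial, data = mydata)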

> 
> Regarding glm() function, I receive this error:  non integer # successes in a 
> binomial glm!
> 
> I would appreciate it if anybody could guide me.
> 
> 
> Regards,
> 
> Pari
> 
-- 

David Winsemius
Alameda, CA, USA



Re: [R] Logistic regression R and Stata grouping variable

2015-05-27 Thread Therneau, Terry M., Ph.D.
You were not completely clear, but it appears that you have data where each subject has 
results from 8 trials, as a pair of variables is changed.  If that is correct, then you 
want to have a variance that corrects for the repeated measures.  In R the glm command 
handles the simple case but not the repeated-measures one.  Statistically you can use a 
generalized estimating equations approach (package gee) or a random effect per subject 
approach (lme or lmer package).
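
A sketch of both, with placeholder names (for a binomial mixed model the
lme4 function is glmer rather than lmer; the gee example assumes a 0/1
response per trial in long format):

library(lme4)   # random intercept per subject
fit_mix <- glmer(cbind(succ, miss) ~ INDEP_A + INDEP_B + INDEP_C + (1 | ID),
                 data = mydata, family = binomial)

library(gee)    # generalized estimating equations
fit_gee <- gee(resp ~ INDEP_A + INDEP_B + INDEP_C, id = ID,
               data = longdata, family = binomial, corstr = "exchangeable")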


Terry T.


On 05/27/2015 05:00 AM, r-help-requ...@r-project.org wrote:

I mostly use Stata 13 for my regression analysis. I want to conduct a logistic 
regression on a proportion/number of successes. Because I receive errors in 
Stata I did not expect nor understand (if there are Stata experts who want to 
know more about the problems I face and can potentially help me solve them, I 
would be glad to give more details), I want to repeat the analysis in R. In 
Stata I would use the command: xtlogit DEP_PROP INDEP_A INDEP_B INDEP_C, i(ID). 
ID is the identifier for each subject. There are eight lines with data for each 
subject because there are three within factors (INDEP_A, B, C) with two levels 
each (0 and 1). I can repeat this analysis in R by using the command: 
glm(DEP_SUC ~ INDEP_A + INDEP_B + INDEP_C, family = "binomial"). DEP_SUC is 
here a table with the successes and misses per row. Again, there are eight rows 
for each subject. But while I know how to group these lines in Stata by using 
the option i(ID), I do not know what to do in R. I have searched for more 
information about the i() command, but did not find any useful information.

So, to summarize: I want to find out how three variables (binary) influence a 
proportion and use logistic regression. In Stata I can group multiple lines per 
subject using the i( ) command in logistic regression. What is the equivalent 
in R?

Thank you in advance!




Re: [R] logistic regression for data with repeated measures

2014-07-01 Thread Mitchell Maltenfort
http://stats.stackexchange.com/questions/62225/conditional-logistic-regression-vs-glmm-in-r
might be a good start

Ersatzistician and Chutzpahthologist
I can answer any question.  "I don't know" is an answer. "I don't know
yet" is a better answer.


On Tue, Jul 1, 2014 at 10:24 AM, Suzon Sepp suzon.s...@gmail.com wrote:
 Hi,

 It seems that I'm quite lost in this wide and powerful R's universe, so I
 permit myself to ask your help about issues with which I'm struggling.
 Thank you,

 I would like to know if the answers' accuracy (correct = 1; incorrect = 0)
 varies depending on 2 categorical variables, which are the group (A and B)
 and the condition (a, b and c), knowing that I have n subjects and 12
 trials per condition for each subject (i.e. 12 repetitions).

 To do that, I'm focusing on logistic regression analysis. I've had no
 problem with this kind of analysis until now (logistic regression with
 numeric predictor variables and/or categorical predictors with 2 levels
 only) but, in this new context, I think I have to focus more specifically
 on logistic regression including *nested (or random?) factors* in a
 *repeated measures design* (because of the variables "Subject" and "Trial")
 with a categorical predictor variable with *more than 2 levels* (the
 variable "Condition"), and I have never done such a thing... yet.

 mydata =
 mydata$Subject: Factor w/38 levels: i01, i02, i03, i04
 mydata$Group: Factor w/2 levels: A, B
 mydata$Condition: Factor w/3 levels: a, b, c
 mydata$Trial: Factor w/12 levels: t01, t02, ...t12
 mydata$Accuracy: Factor w/2 levels: 0, 1

 Subject  Group  Trial  Condition  Accuracy
i01  A   t01 a   0
i01  A   t02 a   1
 ...
i01  A   t12 a   1
i01  A   t01 b   1
i01  A   t02 b   1
 ...
i01  A   t12 b   0
i01  A   t01 c   0
i01  A   t02 c   1
 ...
i01  A   t12 c   1
i02  B   t01 a   1
 ...

 First, I'm wondering if I have to calculate a % of accuracy for each
 subject and each condition, and thus "remove" the variable "Trial" but
 "lose" data (power?) at the same time... or to take this variable into
 account in the analysis, and in that case, how to do that?

 Second, I don't know which function I have to choose (lmer, glm, glmer...)?

 Third, I’m not sure I proceed correctly to specify in this analysis that
 the responses all come from the same subject: within-subject design =
 …+(1|Subject) as I can do for a repeated measures ANOVA to analyze the
 effect of my different variables on a numeric one such as the response
 time: 
 test=aov(Int~Group*Condition+*Error(Subject/(Group*Condition)*),data=mydata)
 and here again how can I add the variable Trial if I don't work on an
 average reaction time for each subject in the different conditions?

 Below, examples of models I can write with glmer(),

 fit.1=glmer(Accuracy~Group* Condition
 +(1|Subject),data=mydata,family=binomial)

 fit.2=glmer(Accuracy~Group* Condition
 +(1|Subject)-1,data=mydata,family=binomial)   (“without intercept”)

 fit.3=glmer(Accuracy~Group* Condition +(1|Subject)+(1|Trial)...??


 I believed the analysis I have to conduct would be within the range of my
 qualifications, but now I realize it could be more complicated than that
 (e.g. GLMMs). I keep hearing "do it as we usually do" (= repeated-measures
 ANOVA on a percentage of correct answers for each subject?), as if there
 were only one way to follow, but I think there are many, and I want to
 find the one that is relevant for my data.

 Hope you can put me on the track,

 Best

 Suzon



Re: [R] Logistic Regression

2014-06-14 Thread Suzen, Mehmet
You might want to read this vignette:
http://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_logistic_regression_glm.pdf
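
For the model in the question, the Zelig call would be along these lines (a
sketch; Zelig's interface has changed across versions, so check your
version's documentation):

library(Zelig)
z1 <- zelig(kod ~ Curv + Elev + Out.c + Slope + Aspect,
            model = "logit", data = data)
summary(z1)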

On 14 June 2014 19:53, javad bayat j.bayat...@gmail.com wrote:
 Dear all, I have to use the Zelig package for doing logistic regression.
 How can I use the Zelig package for logistic regression?

 I did this code with the glm function:

 glm1 = glm(kod~Curv+Elev+Out.c+Slope+Aspect,data=data,
family=binomial)
 summary(glm1)

 But the results were not appropriate for my data.

 Many thanks for your helps.







 --
 Best Regards
 Javad Bayat
 M.Sc. Environment Engineering
 Shahid Beheshti (National) University (SBU)
 Alternative Mail: bayat...@yahoo.com



Re: [R] Logistic Regression with 200K features in R?

2013-12-12 Thread Eik Vettorazzi
it is simply because you can't do a regression with more predictors than
observations.

Cheers.

On 12.12.2013 09:00, Romeo Kienzler wrote:
 Dear List,
 
 I'm quite new to R and want to do logistic regression with a 200K
 feature data set (around 150 training examples).
 
 I'm aware that I should use Naive Bayes but I have a more general
 question about the capability of R handling very high dimensional data.
 
 Please consider the following R code where mygenestrain.tab is a 150
 by 200K matrix:
 
 traindata <- read.table('mygenestrain.tab');
 mylogit <- glm(V1 ~ ., data = traindata, family = binomial);
 
 When executing this code I get the following error:
 
 Error in terms.formula(formula, data = data) :
   allocMatrix: too many elements specified
 Calls: glm ... model.frame -> model.frame.default -> terms -> terms.formula
 Execution halted
 
 Is this because R can't handle 200K features or am I doing something
 completely wrong here?
 
 Thanks a lot for your help!
 
 best Regards,
 
 Romeo
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

-- 
Eik Vettorazzi

Department of Medical Biometry and Epidemiology
University Medical Center Hamburg-Eppendorf



Re: [R] Logistic Regression with 200K features in R?

2013-12-12 Thread Eik Vettorazzi
I thought so (with all the limitations due to collinearity and so on),
but actually there is a limit for the maximum size of an array which is
independent of your memory size and is due to the way arrays are
indexed. You can't create an object with more than 2^31-1 = 2147483647
elements.

https://stat.ethz.ch/pipermail/r-help/2007-June/133238.html

cheers

On 12.12.2013 12:34, Romeo Kienzler wrote:
 ok, so 200K predictors and 10M observations would work?
 
 
 On 12/12/2013 12:12 PM, Eik Vettorazzi wrote:
 it is simply because you can't do a regression with more predictors than
 observations.

 Cheers.

 Am 12.12.2013 09:00, schrieb Romeo Kienzler:
 Dear List,

 I'm quite new to R and want to do logistic regression with a 200K
 feature data set (around 150 training examples).

 I'm aware that I should use Naive Bayes but I have a more general
 question about the capability of R handling very high dimensional data.

 Please consider the following R code where mygenestrain.tab is a 150
 by 20 matrix:

 traindata - read.table('mygenestrain.tab');
 mylogit - glm(V1 ~ ., data = traindata, family = binomial);

 When executing this code I get the following error:

 Error in terms.formula(formula, data = data) :
allocMatrix: too many elements specified
 Calls: glm ... model.frame - model.frame.default - terms -
 terms.formula
 Execution halted

 Is this because R can't handle 200K features or am I doing something
 completely wrong here?

 Thanks a lot for your help!

 best Regards,

 Romeo

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 

-- 
Eik Vettorazzi

Department of Medical Biometry and Epidemiology
University Medical Center Hamburg-Eppendorf



Re: [R] Logistic Regression with 200K features in R?

2013-12-12 Thread Duncan Murdoch

On 13-12-12 6:51 AM, Eik Vettorazzi wrote:

I thought so (with all the limitations due to collinearity and so on),
but actually there is a limit for the maximum size of an array which is
independent of your memory size and is due to the way arrays are
indexed. You can't create an object with more than 2^31-1 = 2147483647
elements.

https://stat.ethz.ch/pipermail/r-help/2007-June/133238.html


That post is from 2007.  The limits were raised considerably when R 
3.0.0 was released, and it is now 2^48 for disk-based operations, 2^52 
for working in memory.


Duncan Murdoch






Re: [R] Logistic Regression with 200K features in R?

2013-12-12 Thread Eik Vettorazzi
thanks Duncan for this clarification.
A double precision matrix with 2e11 elements (as the op wanted) would
need about 1.5 TB of memory; that's more than a standard (Windows 64-bit)
computer can handle.

Cheers.
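
Checking the arithmetic in R (note that 200K predictors times 10M
observations would actually be 2e12 cells, ten times larger still):

2e11 * 8 / 2^40   # ~1.46 TiB for 2e11 doubles
2e12 * 8 / 2^40   # ~14.6 TiB for a full 200K x 10M matrix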

 

-- 
Eik Vettorazzi

Department of Medical Biometry and Epidemiology
University Medical Center Hamburg-Eppendorf



Re: [R] Logistic Regression with 200K features in R?

2013-12-12 Thread Duncan Murdoch

On 12/12/2013 7:08 AM, Eik Vettorazzi wrote:

thanks Duncan for this clarification.
A double precision matrix with 2e11 elements (as the op wanted) would
need about 1.5 TB memory, that's more than a standard (windows 64bit)
computer can handle.


According to Microsoft's Memory Limits web page (currently at 
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366778%28v=vs.85%29.aspx#memory_limits, 
but these things tend to move around), the limit is 8 TB for virtual 
memory. (The same page lists a variety of smaller physical memory 
limits, depending on the Windows version, but R doesn't need physical 
memory; virtual is good enough.)


R would be very slow if it were working with objects bigger than physical 
memory, but it could conceivably work.


Duncan Murdoch




Re: [R] Logistic Regression with 200K features in R?

2013-12-12 Thread Romeo Kienzler

ok, so 200K predictors and 10M observations would work?


On 12/12/2013 12:12 PM, Eik Vettorazzi wrote:

it is simply because you can't do a regression with more predictors than
observations.

Cheers.

Am 12.12.2013 09:00, schrieb Romeo Kienzler:

Dear List,

I'm quite new to R and want to do logistic regression with a 200K
feature data set (around 150 training examples).

I'm aware that I should use Naive Bayes but I have a more general
question about the capability of R handling very high dimensional data.

Please consider the following R code where mygenestrain.tab is a 150
by 20 matrix:

traindata - read.table('mygenestrain.tab');
mylogit - glm(V1 ~ ., data = traindata, family = binomial);

When executing this code I get the following error:

Error in terms.formula(formula, data = data) :
   allocMatrix: too many elements specified
Calls: glm ... model.frame - model.frame.default - terms - terms.formula
Execution halted

Is this because R can't handle 200K features or am I doing something
completely wrong here?

Thanks a lot for your help!

best Regards,

Romeo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
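
An aside not raised in the thread: when p is far larger than n, the usual
workaround is penalized (regularized) logistic regression rather than plain
glm(). A minimal sketch with the glmnet package, on a scaled-down stand-in
for the 150-by-200K problem:

library(glmnet)
set.seed(1)
n <- 150; p <- 2000                  # stand-in dimensions, same p >> n shape
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1]))    # outcome driven by the first feature
fit <- cv.glmnet(x, y, family = "binomial")       # cross-validated lasso
sum(as.vector(coef(fit, s = "lambda.min")) != 0)  # nonzero coefficients kept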


Re: [R] Logistic Regression with 200K features in R?

2013-12-12 Thread Romeo Kienzler

Dear Eik,

thank you so much for your help!

best Regards,

Romeo

On 12/12/2013 12:51 PM, Eik Vettorazzi wrote:

I thought so (with all the limitations due to collinearity and so on),
but actually there is a limit for the maximum size of an array which is
independent of your memory size and is due to the way arrays are
indexed. You can't create an object with more than 2^31-1 = 2147483647
elements.

https://stat.ethz.ch/pipermail/r-help/2007-June/133238.html

cheers

On 12.12.2013 12:34, Romeo Kienzler wrote:

 ok, so 200K predictors and 10M observations would work?


On 12/12/2013 12:12 PM, Eik Vettorazzi wrote:

it is simply because you can't do a regression with more predictors than
observations.

Cheers.

On 12.12.2013 09:00, Romeo Kienzler wrote:

Dear List,

I'm quite new to R and want to do logistic regression with a 200K
feature data set (around 150 training examples).

I'm aware that I should use Naive Bayes but I have a more general
question about the capability of R handling very high dimensional data.

Please consider the following R code where mygenestrain.tab is a 150
by 20 matrix:

traindata <- read.table('mygenestrain.tab');
mylogit <- glm(V1 ~ ., data = traindata, family = binomial);

When executing this code I get the following error:

Error in terms.formula(formula, data = data) :
allocMatrix: too many elements specified
Calls: glm ... model.frame -> model.frame.default -> terms ->
terms.formula
Execution halted

Is this because R can't handle 200K features or am I doing something
completely wrong here?

Thanks a lot for your help!

best Regards,

Romeo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression

2013-04-19 Thread David Winsemius

On Apr 19, 2013, at 11:45 AM, Endy Black wrote:

 Please read the attached file.

Please re-read : http://www.r-project.org/mail.html#instructions

 Thank you
 Endy
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic regression

2013-04-14 Thread Jose Iparraguirre
Endy,

See the package ResourceSelection for the HL test and the package caret for the 
sensitivity and specificity measures.
Regards,

Jose Iparraguirre
Chief Economist
Age UK, London


From: r-help-boun...@r-project.org [r-help-boun...@r-project.org] On Behalf Of 
Endy Black [pert...@gmail.com]
Sent: 14 April 2013 19:05
To: R-Help
Subject: [R] Logistic regression

I have a data set to be analyzed using binary logistic regression. The
data set is in grouped form. My question is: how can I compute the
Hosmer-Lemeshow test and measures like sensitivity and specificity? Any
suggestion will be greatly appreciated.

Thank you

Endy

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
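
To make those package pointers concrete, a hedged sketch on made-up data (the
data frame and variable names here are invented for illustration):

library(ResourceSelection)   # hoslem.test()
library(caret)               # sensitivity(), specificity()

set.seed(42)
mydata <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
mydata$y <- rbinom(200, 1, plogis(0.5 * mydata$x1 - mydata$x2))

fit <- glm(y ~ x1 + x2, data = mydata, family = binomial)
hoslem.test(fit$y, fitted(fit), g = 10)  # Hosmer-Lemeshow goodness of fit

pred <- factor(ifelse(fitted(fit) > 0.5, 1, 0), levels = c(1, 0))
obs  <- factor(mydata$y, levels = c(1, 0))
sensitivity(pred, obs)   # true positive rate at a 0.5 cut-off
specificity(pred, obs)   # true negative rate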


Re: [R] Logistic regression/Cut point? predict ??

2012-10-20 Thread Simon Knapp
What do you mean by "at x equal zero"?

On Sun, Oct 21, 2012 at 8:37 AM, Adel Powell powella...@gmail.com wrote:
 I am new to R and I am trying to do a monte carlo simulation where I
 generate data and interject error then test various cut points; however, my
 output was garbage (at x equal zero, I did not get .50)
 I am basically testing the performance of classifiers.

 Here is the code:
 n <- 1000; # Sample size

 fitglm <- function(sigma, tau){
   x <- rnorm(n, 0, sigma)
   intercept <- 0
   beta <- 5
   ystar <- intercept + beta*x
   z <- rbinom(n, 1, plogis(ystar))  # I believe plogis accepts the a + b*x
   # arguments and returns e^x/(1+e^x), which is then used to generate
   # 0 and 1 data
   xerr <- x + rnorm(n, 0, tau)      # error is added here
   model <- glm(z ~ xerr, family = binomial("logit"))
   int <- coef(model)[1]
   slope <- coef(model)[2]
   pred <- predict(model)  # this gives me the a+bx data for new error? I
   # know I can add type = "response" to get the probab. but only e^x,
   # not e^x/(1+e^x)

   pi1hat <- length(which(z == 1))/length(z)  # my cut point is calculated
   # from the proportion of 0s to 1s
   pi0hat <- length(which(z == 0))/length(z)

   cutmid <- log(pi0hat/pi1hat)
   pred <- ifelse(pred > cutmid, 1, 0)  # I am not sure if I need to compare
   # these two. I think this is an error.

   accuracy <- length(which(pred == z))/length(z)
   accuracy

   rocpreds <- prediction(pred, z)              # from the ROCR package
   auc <- performance(rocpreds, "auc")@y.values

   output <- c(int, slope, cutmid, accuracy, auc)
   names(output) <- c("Intercept", "Slope", "CutPoint", "Accuracy", "AUC")
   return(output)
 }

 y <- fitglm(.05, 1)
 y

 nreps <- 500
 output <- data.frame(matrix(NA, nrow = nreps, ncol = 6))

 mysigma <- .5
 mytau <- .1

 i <- 1

 for(j in 1:nreps) {
   output[j, 1:5] <- fitglm(mysigma, mytau)
   output[j, 6] <- j
 }

 names(output) <- c("Intercept", "Slope", "CutPoint", "Accuracy", "AUC",
                    "Iteration")

 apply(output, 2, mean)
 apply(output, 2, var)

 [[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic regression X^2 test with large sample size (fwd)

2012-07-31 Thread Marc Schwartz
On Jul 31, 2012, at 10:35 AM, M Pomati marco.pom...@bristol.ac.uk wrote:

 
 
 Does anyone know of any X^2 tests to compare the fit of logistic models 
 which factor out the sample size? I'm dealing with a very large sample and 
 I fear the significant X^2 test I get when adding a variable to the model 
 is simply a result of the sample size (200,000 cases).
 
 I'd rather use the whole dataset instead of taking (small) random samples 
 as it is highly skewed. I've seen things like Phi and Cramer's V for 
 crosstabs but I'm not sure whether they have been used before on logistic 
 regression, if there are better ones and if there are any packages.
 
 
 Many thanks
 
 Marco



Sounds like you are bordering on some type of stepwise approach to including or 
not including covariates in the model. You can search the list archives for a 
myriad of discussions as to why that is a poor approach.

You have the luxury of a large sample. You also have the challenge of 
interpreting covariates that appear to be statistically significant, but may 
have a rather small *effect size* in context. That is where subject matter 
experts need to provide input as to interpretation of the contextual 
significance of the variable, as opposed to the statistical significance of 
that same variable.

A general approach, is to simply pre-specify your model based upon rather 
simple considerations. Also, you need to determine if your goal for the model 
is prediction or explanation. 

What is the incidence of your 'event' in the sample? If it is say 10%, then you 
should have around 20,000 events. The rule of thumb for logistic regression is 
to have around 20 events per covariate degree of freedom (df) to minimize the 
risk of over-fitting the model to your dataset. A continuous covariate is 1 df, 
a k-level factor is k-1 df. So with 20,000 events, your model could feasibly 
have 1,000 covariate df's. I am guessing that you don't have that much 
independent data to begin with.

So, pre-specify your model on the full dataset and stick with it. Interact with 
subject matter experts on the interpretation of the model.

BTW, this question is really about statistical modeling generally, not really R 
specific. Such queries are best posed to general statistical lists/forums such 
as Stack Exchange. I would also point you to Frank Harrell's book, Regression 
Modeling Strategies.

Regards,

Marc Schwartz

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
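
For illustration, the events-per-variable arithmetic from the paragraph
above, spelled out:

n <- 200000; incidence <- 0.10
events <- n * incidence   # 20,000 events
events / 20               # ~1,000 covariate df under the 20-per-df rule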


Re: [R] Logistic regression X^2 test with large sample size (fwd)

2012-07-31 Thread M Pomati
Marc, thank you very much for your help.
I've posted it on

http://math.stackexchange.com/questions/177252/x2-tests-to-compare-the-fit-of-large-samples-logistic-models

and added details.

Many thanks

Marco

--On 31 July 2012 11:50 -0500 Marc Schwartz marc_schwa...@me.com wrote:

 On Jul 31, 2012, at 10:35 AM, M Pomati marco.pom...@bristol.ac.uk wrote:

 
 
  Does anyone know of any X^2 tests to compare the fit of logistic models
  which factor out the sample size? I'm dealing with a very large sample 
and
  I fear the significant X^2 test I get when adding a variable to the 
model
  is simply a result of the sample size (200,000 cases).
 
  I'd rather use the whole dataset instead of taking (small) random 
samples
  as it is highly skewed. I've seen things like Phi and Cramer's V for
  crosstabs but I'm not sure whether they have been used before on 
logistic
  regression, if there are better ones and if there are any packages.
 
 
  Many thanks
 
  Marco



 Sounds like you are bordering on some type of stepwise approach to 
including or not including covariates in the model. You can search the list 
archives for a myriad of discussions as to why that is a poor approach.

 You have the luxury of a large sample. You also have the challenge of 
interpreting covariates that appear to be statistically significant, but 
may have a rather small *effect size* in context. That is where subject 
matter experts need to provide input as to interpretation of the contextual 
significance of the variable, as opposed to the statistical significance of 
that same variable.

 A general approach, is to simply pre-specify your model based upon rather 
simple considerations. Also, you need to determine if your goal for the 
model is prediction or explanation.

 What is the incidence of your 'event' in the sample? If it is say 10%, 
then you should have around 20,000 events. The rule of thumb for logistic 
regression is to have around 20 events per covariate degree of freedom (df) 
to minimize the risk of over-fitting the model to your dataset. A 
continuous covariate is 1 df, a k-level factor is k-1 df. So with 20,000 
events, your model could feasibly have 1,000 covariate df's. I am guessing 
that you don't have that much independent data to begin with.

 So, pre-specify your model on the full dataset and stick with it. Interact 
with subject matter experts on the interpretation of the model.

 BTW, this question is really about statistical modeling generally, not 
really R specific. Such queries are best posed to general statistical 
lists/forums such as Stack Exchange. I would also point you to Frank 
Harrell's book, Regression Modeling Strategies.

 Regards,

 Marc Schwartz






--
M Pomati
University of Bristol
School for Policy Studies
8 Priory Road
Office:10B
Bristol BS8 1TZ, UK
http://www.bristol.ac.uk/sps/research/centres/poverty

 
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic regression X^2 test with large sample size (fwd)

2012-07-31 Thread David Winsemius


On Jul 31, 2012, at 10:25 AM, M Pomati wrote:


Marc, thank you very much for your help.
I've posted it on

http://math.stackexchange.com/questions/177252/x2-tests-to-compare-the-fit-of-large-samples-logistic-models 



and added details.


I think you might have gotten a more statistically knowledgeable  
audience at:


http://stats.stackexchange.com/

(And I suggested to the moderators at math-SE that it be migrated.)

--
David.



Many thanks

Marco

--On 31 July 2012 11:50 -0500 Marc Schwartz marc_schwa...@me.com  
wrote:


On Jul 31, 2012, at 10:35 AM, M Pomati marco.pom...@bristol.ac.uk  
wrote:


Does anyone know of any X^2 tests to compare the fit of logistic  
models
which factor out the sample size? I'm dealing with a very large  
sample and
I fear the significant X^2 test I get when adding a variable to  
the model

is simply a result of the sample size (200,000 cases).

I'd rather use the whole dataset instead of taking (small) random  
samples

as it is highly skewed. I've seen things like Phi and Cramer's V for
crosstabs but I'm not sure whether they have been used before on  
logistic

regression, if there are better ones and if there are any packages.


Many thanks

Marco



Sounds like you are bordering on some type of stepwise approach to
including or not including covariates in the model. You can search  
the list
archives for a myriad of discussions as to why that is a poor  
approach.


You have the luxury of a large sample. You also have the challenge of
interpreting covariates that appear to be statistically significant,  
but
may have a rather small *effect size* in context. That is where  
subject
matter experts need to provide input as to interpretation of the  
contextual
significance of the variable, as opposed to the statistical  
significance of

that same variable.


A general approach, is to simply pre-specify your model based upon  
rather
simple considerations. Also, you need to determine if your goal for  
the

model is prediction or explanation.


What is the incidence of your 'event' in the sample? If it is say  
10%,
then you should have around 20,000 events. The rule of thumb for  
logistic
regression is to have around 20 events per covariate degree of  
freedom (df)

to minimize the risk of over-fitting the model to your dataset. A
continuous covariate is 1 df, a k-level factor is k-1 df. So with  
20,000
events, your model could feasibly have 1,000 covariate df's. I am  
guessing

that you don't have that much independent data to begin with.


So, pre-specify your model on the full dataset and stick with it.  
Interact

with subject matter experts on the interpretation of the model.


BTW, this question is really about statistical modeling generally,  
not

really R specific. Such queries are best posed to general statistical
lists/forums such as Stack Exchange. I would also point you to Frank
Harrell's book, Regression Modeling Strategies.


Regards,

Marc Schwartz


--
M Pomati
University of Bristol




David Winsemius, MD
Alameda, CA, USA

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression

2012-04-04 Thread David Winsemius


On Apr 3, 2012, at 9:25 PM, Melrose2012 wrote:

I am trying to plot the logistic regression of a dataset (# of  
living flies
vs days the flies are alive) and then fit a best-fit line to this  
data.


Here is my code:
plot(fflies$living ~ fflies$day, xlab = "Number of Days",
     ylab = "Number of Fruit Flies",
     main = "Number of Living Fruit Flies vs Day", pch = 16)
alive <- fflies$living
dead <- fflies$living[1] - alive
glm.fit <- glm(cbind(alive, dead) ~ fflies$day, family = binomial)
summary(glm.fit)
lines(sort(fflies$day), fitted(glm.fit)[order(fflies$day)], col =
"red", lwd = 2)
lines(glm.fit, col = "red", lwd = 2)

My problem is that, while I am pretty sure that I did the 'glm'  
command
correctly, when I try to plot this as a best-fit line on my data, it  
does

not fit (clearly on a different scale somehow).

Can anyone enlighten me about what I am doing wrong?  Should I be  
scaling
one of these somehow?  I've tried various types of scaling but  
nothing plots

the line on top of the plot (no matter which I scale).


The parameter estimate is on the log(odds) scale. Two problems: A)  
Your plot is not even on the odds scale, much less the log(odds)  
scale. B) Your model assumes that the log(odds) for an event will  
increase linearly with days. However your event == alive seems  
reversed from what I would have expected it to be. I would have  
expected dead within an interval to be the event that might increase  
in probability as time increased. I also speculated that the problem  
(which was not described at all) might better fit within the framework  
of survival analysis rather than logistic regression?


--
David.



Thanks so much for whatever help you can give me!

Cheers,
Melissa

--
View this message in context: 
http://r.789695.n4.nabble.com/logistic-regression-tp4530651p4530651.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
West Hartford, CT

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
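
For what it's worth, a sketch of the scale issue with invented numbers shaped
like the fflies data: fitted() from a binomial glm returns probabilities, so
the fitted curve belongs on a plot of the proportion alive, not the raw count.

set.seed(7)
day    <- 1:40
total  <- 150
living <- rbinom(40, total, plogis(3 - 0.15 * day))  # declining survival
alive  <- living
dead   <- total - alive
fit <- glm(cbind(alive, dead) ~ day, family = binomial)

plot(day, alive / total, pch = 16,
     xlab = "Day", ylab = "Proportion alive")
lines(day, fitted(fit), col = "red", lwd = 2)  # both on the 0-1 scale now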


Re: [R] logistic regression

2012-04-03 Thread David Winsemius


On Apr 3, 2012, at 9:25 PM, Melrose2012 wrote:

I am trying to plot the logistic regression of a dataset (# of  
living flies
vs days the flies are alive) and then fit a best-fit line to this  
data.


Here is my code:
plot(fflies$living ~ fflies$day, xlab = "Number of Days",
     ylab = "Number of Fruit Flies",
     main = "Number of Living Fruit Flies vs Day", pch = 16)
alive <- fflies$living
dead <- fflies$living[1] - alive
glm.fit <- glm(cbind(alive, dead) ~ fflies$day, family = binomial)
summary(glm.fit)
lines(sort(fflies$day), fitted(glm.fit)[order(fflies$day)], col =
"red", lwd = 2)
lines(glm.fit, col = "red", lwd = 2)


Is this homework?



PLEASE do read the posting guide http://www.R-project.org/posting-guide.html


--

David Winsemius, MD
West Hartford, CT

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression

2012-04-03 Thread Melrose2012
No, this is related to my own data.

Yes, 'fflies' is the dataset - here I am working with two columns: # of fruit
flies alive vs # of days these flies are alive.

There is no error, it's just that the best-fit line does not plot nicely on
top of my data (see figure attached).
http://r.789695.n4.nabble.com/file/n4530804/fflies1.jpeg 

I then tried to log the raw data as follows and plot:
log.living <- log(fflies$living)
plot(log.living ~ fflies$day, xlab = "Number of Days",
     ylab = "Number of Fruit Flies",
     main = "Number of Living Fruit Flies vs Day", pch = 16)
http://r.789695.n4.nabble.com/file/n4530804/fflies2.jpeg 

Any help you can give me would be greatly appreciated.  I am new to R, so
I'm sorry for my beginner's ignorance in learning this program!  :-)

Cheers,
Melissa




--
View this message in context: 
http://r.789695.n4.nabble.com/logistic-regression-tp4530651p4530804.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression

2012-04-03 Thread Bert Gunter
Get help. You do not understand glm's. What do you think the fitted
values are? -- Hint: they are *not* an estimate of the number of
living fruit flies.

-- Bert

On Tue, Apr 3, 2012 at 6:25 PM, Melrose2012
melissa.patric...@stonybrook.edu wrote:
 I am trying to plot the logistic regression of a dataset (# of living flies
 vs days the flies are alive) and then fit a best-fit line to this data.

 Here is my code:
 plot(fflies$living ~ fflies$day, xlab = "Number of Days",
      ylab = "Number of Fruit Flies",
      main = "Number of Living Fruit Flies vs Day", pch = 16)
 alive <- fflies$living
 dead <- fflies$living[1] - alive
 glm.fit <- glm(cbind(alive, dead) ~ fflies$day, family = binomial)
 summary(glm.fit)
 lines(sort(fflies$day), fitted(glm.fit)[order(fflies$day)], col =
 "red", lwd = 2)
 lines(glm.fit, col = "red", lwd = 2)

 My problem is that, while I am pretty sure that I did the 'glm' command
 correctly, when I try to plot this as a best-fit line on my data, it does
 not fit (clearly on a different scale somehow).

 Can anyone enlighten me about what I am doing wrong?  Should I be scaling
 one of these somehow?  I've tried various types of scaling but nothing plots
 the line on top of the plot (no matter which I scale).

 Thanks so much for whatever help you can give me!

 Cheers,
 Melissa

 --
 View this message in context: 
 http://r.789695.n4.nabble.com/logistic-regression-tp4530651p4530651.html
 Sent from the R help mailing list archive at Nabble.com.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression

2012-04-03 Thread Melrose2012
Thank you for your reply.  

I do understand that I am working in log space with the default link
function of the binomial being logit.  My problem is, I thought that they
way I had written the code, when I did the lines command, it should plot
the best-fit line (found by 'glm') on top of my graph.  As you can see,
however, this clearly did not work and I am off by several factors of
magnitude.  I am not sure if the problem lies in the way I have done the
'glm' or in the plotting.  

If I understand the function correctly, when I do 'glm' of family=binomial,
I am working in log space (hence the link function default being logit.) 
If this is correct, then doesn't either my data have to be converted to log
space also or the results of my 'glm.fit' be converted to linear space (i.e.
using 'inv.logit' from the boot library)?  This was my thought process
behind plotting the data as log (I tried several other obviously useless
iterations of various scalings.)  It appears that you can not do an
inv.logit on the result from the 'glm' command.

If I try to do this:   inv.logit(glm.fit)  
I get an error saying:  Error in plogis(q, location, scale, lower.tail,
log.p) : Non-numeric argument to mathematical function

I'm pretty sure this is a limitation in my understanding of exactly what R
is doing in this logistic regression and I am having a hard time finding
help online or in the R help files that helps me to decipher this problem.  

Can anyone tell me if I have set up the 'glm' wrong or if the error is in
how I am trying to plot the best-fit line on the data?

I am new to these sorts of statistics and data manipulation, so I
acknowledge my ignorance and appreciate whatever help anyone can give me. 

Thank you again!

Cheers,
Melissa

--
View this message in context: 
http://r.789695.n4.nabble.com/logistic-regression-tp4530651p4530959.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
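
A side note on the inv.logit error above, sketched on toy data: the inverse
link applies to the linear predictor (the output of predict()), not to the
fitted model object itself.

x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)
all.equal(plogis(predict(fit)), fitted(fit))  # TRUE: the same probabilities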


Re: [R] logistic regression

2012-04-03 Thread Jorge I Velez
Hi Melissa,

I would highly encourage you to read [1].  It would be extremely beneficial
for your understanding of the type of models you should use in a situation
like yours (count data).

Best regards,
Jorge.-

[1]  cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf



On Wed, Apr 4, 2012 at 1:10 AM, Melrose2012  wrote:

 Thank you for your reply.

 I do understand that I am working in log space with the default link
 function of the binomial being logit.  My problem is, I thought that the
 way I had written the code, when I did the lines command, it should plot
 the best-fit line (found by 'glm') on top of my graph.  As you can see,
 however, this clearly did not work and I am off by several factors of
 magnitude.  I am not sure if the problem lies in the way I have done the
 'glm' or in the plotting.

 If I understand the function correctly, when I do 'glm' of family=binomial,
 I am working in log space (hence the link function default being logit.)
 If this is correct, then doesn't either my data have to be converted to log
 space also or the results of my 'glm.fit' be converted to linear space
 (i.e.
 using 'inv.logit' from the boot library)?  This was my thought process
 behind plotting the data as log (I tried several other obviously useless
 iterations of various scalings.)  It appears that you can not do an
 inv.logit on the result from the 'glm' command.

 If I try to do this:   inv.logit(glm.fit)
 I get an error saying:  Error in plogis(q, location, scale, lower.tail,
 log.p) : Non-numeric argument to mathematical function

 I'm pretty sure this is a limitation in my understanding of exactly what R
 is doing in this logistic regression and I am having a hard time finding
 help online or in the R help files that helps me to decipher this problem.

 Can anyone tell me if I have set up the 'glm' wrong or if the error is in
 how I am trying to plot the best-fit line on the data?

 I am new to these sorts of statistics and data manipulation, so I
 acknowledge my ignorance and appreciate whatever help anyone can give me.

 Thank you again!

 Cheers,
 Melissa

 --
 View this message in context:
 http://r.789695.n4.nabble.com/logistic-regression-tp4530651p4530959.html
 Sent from the R help mailing list archive at Nabble.com.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression

2012-04-03 Thread Jack Tanner
Melrose2012 melissa.patrician at stonybrook.edu writes:

 
 alive <- fflies$living
 dead <- fflies$living[1] - alive
 glm.fit <- glm(cbind(alive, dead) ~ fflies$day, family = binomial)
 summary(glm.fit)

Your call to glm() is not doing what you think it's doing. What you want to do
is probably closer to 

glm.fit <- glm(living ~ day, data = fflies, family = binomial)

Where fflies$living has exactly two values, e.g., "alive" or "dead" or some
other pair.

After you do this, you want to examine glm.fit very carefully. Make sure you can
interpret all of the output of summary(glm.fit). Until you do, there's no point
in trying to plot.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic regression

2012-03-28 Thread Sarah Goslee
On Wed, Mar 28, 2012 at 11:49 AM, carlb1 carl19...@hotmail.com wrote:
 Hi

 does anyone know how to do a logistic regression in R?

I do. And doubtless many other folks on R-help do too.

 any help appreciated

You should probably start here:
http://www.rseek.org/?cx=010923144343702598753%3Aboaz1reyxd4&q=logistic+regression&sa=Search+functions%2C+lists%2C+and+more&cof=FORID%3A11&siteurl=rseek.org%2F&ref=

And then move on to here:
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


-- 
Sarah Goslee
http://www.functionaldiversity.org

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression

2012-02-07 Thread Terry Therneau
I'm surprised not to see the simple answer: glm models return the MLE
estimate.
fit <- glm(y ~ x1 + x2 + ..., family = 'binomial')

There is no need for special packages, this is a standard part of R.

Terry Therneau

---- begin included message ----
On 02/06/2012 03:08 PM, Ana wrote:
 I am looking for R packages that can make a Logistic Regression model
 with parameter estimation by Maximum Likelihood Estimation.


 Many thanks for helping out.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
Hi Ana,

I worked out some maximum likelihood estimates for logistic regression
in
R on my website: http://www.netstorm.be/home/lrm

Best regards

Thomas

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
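
To underline the point, a small sketch on simulated data checking that
glm()'s coefficients do maximize the binomial log-likelihood:

set.seed(1)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-1 + 2 * x))
fit <- glm(y ~ x, family = binomial)

negll <- function(b) -sum(dbinom(y, 1, plogis(b[1] + b[2] * x), log = TRUE))
opt <- optim(c(0, 0), negll)
rbind(glm = coef(fit), optim = opt$par)  # the two should agree closely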


Re: [R] Logistic Regression

2012-02-06 Thread Sarah Goslee
Hi,

On Mon, Feb 6, 2012 at 10:08 AM, Ana rrast...@gmail.com wrote:
 I am looking for R packages that can make a Logistic Regression model
 with parameter estimation by Maximum Likelihood Estimation.

How are you looking? Did you perhaps try Google

https://www.google.com/#q=r-help+Logistic+Regression+Maximum+Likelihood+Estimation

or rseek.org

http://www.rseek.org/?cx=010923144343702598753%3Aboaz1reyxd4&newwindow=1&q=Logistic+Regression+Maximum+Likelihood+Estimation&sa=Search+functions%2C+lists%2C+and+more&cof=FORID%3A11&siteurl=rseek.org%2F

or one of the other R-centric search functions out there, like the
ones built right into R?

How did what you found not meet your needs?

Sarah


 Many thanks for helping out.



-- 
Sarah Goslee
http://www.functionaldiversity.org

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression

2012-02-06 Thread tw
On 02/06/2012 03:08 PM, Ana wrote:
 I am looking for R packages that can make a Logistic Regression model
 with parameter estimation by Maximum Likelihood Estimation.


 Many thanks for helping out.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
Hi Ana,

I worked out some maximum likelihood estimates for logistic regression in
R on my website: http://www.netstorm.be/home/lrm

Best regards

Thomas

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression with genetic component

2011-12-04 Thread Ben Bolker
Danielle Duncan dlduncan2 at alaska.edu writes:

 Greetings, I have a question that I'd like to get input on. I have a
 classic toxicology study where I artificially fertilized and exposed
 embryos to a chemical and counted defects. In addition, I kept track of
 male-female pairs that I used to artificially fertilize and generate
 embryos with. I need to use logistic regression to model the response, but
 also check that the genetics of the pairings did or did not have an effect
 on the response. My data looks a bit like this:
 
 response matrix     chemical concentration   Genetic Matrix
 Present   Absent                             Male   Female
 2         152       0.13                     a      1
 6         121       1                        a      2
 21        92        2                        b      3
 24        89        5                        b      4
 0         141       10                       c      5
 5         95        15                       c      6
 
 R code:
 
 DA <- cbind(Present, Absent)
 fit <- glm(DA ~ conc, family = binomial)
 
 If I do fit <- glm(DA ~ conc + Male + Female, family = binomial), I get
 every possible combination, but I only want specific pairs. So, I am
 thinking about doing:
 
 MF <- cbind(Male, Female)
 fit <- glm(DA ~ conc + MF, family = binomial)


You're on the right track.  paste() is probably what
you want, although you can also use interaction() to
get the interactions and then droplevels() to get
rid of the unobserved crosses.

d <- read.table(textConnection("
Present Absent   conc   Male Female
2 152   0.13 a 1
6 121  1 a 2
21 92  2 b 3
24 89  5 b 4
0 141 10 c 5
5 95 15 c  6"),
header=TRUE)

Either of these should give you what you want:
  
d <- droplevels(transform(d, cross=interaction(Male, Female)))
levels(d$cross)

d <- transform(d, cross=paste(Male, Female, sep="."))
levels(d$cross)

You should be a little careful -- if each cross is exposed
only to a single concentration, and if you treat cross as
a fixed effect, you will overparameterize your model.  If
you treat it as a random effect, e.g. using glmer in the
lme4 package:

glmer(cbind(Present,Absent) ~ conc + (1|cross), family=binomial, data=d)

you will effectively be fitting a model for overdispersion
(see the example in ?cbpp).

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression - glm.fit: fitted probabilities numerically 0 or 1 occurred

2011-12-02 Thread Patrick Breheny

On 12/01/2011 08:00 PM, Ben quant wrote:

The data I am using is the last file called l_yx.RData at this link (the
second file contains the plots from earlier):
http://scientia.crescat.net/static/ben/


The logistic regression model you are fitting assumes a linear 
relationship between x and the log odds of y; that does not seem to be 
the case for your data.  To illustrate:


x <- l_yx[, "x"]
y <- l_yx[, "y"]
ind1 <- x <= .002
ind2 <- (x > .002 & x <= .0065)
ind3 <- (x > .0065 & x <= .13)
ind4 <- (x > .13)

> summary(glm(y[ind1] ~ x[ind1], family=binomial))
...
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.79174    0.02633 -106.03   <2e-16 ***
x[ind1]     354.98852   22.78190   15.58   <2e-16 ***

> summary(glm(y[ind2] ~ x[ind2], family=binomial))
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.15805    0.02966 -72.766   <2e-16 ***
x[ind2]     -59.92934    6.51650  -9.197   <2e-16 ***

> summary(glm(y[ind3] ~ x[ind3], family=binomial))
...
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.367206   0.007781 -304.22   <2e-16 ***
x[ind3]     18.104314   0.346562   52.24   <2e-16 ***

> summary(glm(y[ind4] ~ x[ind4], family=binomial))
...
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.31511    0.08549 -15.383   <2e-16 ***
x[ind4]      0.06261    0.08784   0.713    0.476

To summarize, the relationship between x and the log odds of y appears 
to vary dramatically in both magnitude and direction depending on which 
interval of x's range we're looking at.  Trying to summarize this 
complicated pattern with a single line is leading to the fitted 
probabilities near 0 and 1 you are observing (note that only 0.1% of the 
data is in region 4 above, although region 4 accounts for 99.1% of the 
range of x).


--
Patrick Breheny
Assistant Professor
Department of Biostatistics
Department of Statistics
University of Kentucky

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
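
One standard follow-up, not pursued in the thread: if a single line cannot
track the log odds across x's range, let it bend, for instance with a natural
spline (a sketch assuming the same l_yx object as above):

library(splines)
d <- as.data.frame(l_yx)
fit_ns <- glm(y ~ ns(x, df = 4), data = d, family = binomial)
summary(fit_ns)  # piecewise-cubic log odds instead of a single slope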


Re: [R] logistic regression - glm.fit: fitted probabilities numerically 0 or 1 occurred

2011-12-01 Thread peter dalgaard

On Dec 1, 2011, at 18:54 , Ben quant wrote:

 Sorry if this is a duplicate: This is a re-post because the pdf's mentioned
 below did not go through.

Still not there. Sometimes it's because your mailer doesn't label them with the 
appropriate mime-type (e.g. as application/octet-stream, which is arbitrary 
binary). Anyways, see below

[snip]
 
 With the above data I do:
 l_logit = glm(y~x, data=as.data.frame(l_yx),
 family=binomial(link="logit"))
 Warning message:
 glm.fit: fitted probabilities numerically 0 or 1 occurred
 
 Why am I getting this warning when I have data points of varying values for
 y=1 and y=0?  In other words, I don't think I have the linear separation
 issue discussed in one of the links I provided.

I bet that you do... You can get the warning without that effect (one of my own 
examples is  the probability of menarche in a data set that includes infants 
and old age pensioners), but not with a huge odds ratio as well. Take a look at 

d <- as.data.frame(l_yx)
with(d, y[order(x)])

if it comes out as all zeros followed by all ones or vice versa, then you have 
the problem.


 
 PS - Then I do this and I get an odds ratio of a crazy size:
 l_sm = summary(l_logit) # coef pval is $coefficients[8], log odds
 $coefficients[2]
 l_exp_coef = exp(l_logit$coefficients)[2] # exponentiate the
 coefficients
 l_exp_coef
   x
 3161.781
 
 So for one unit increase in the predictor variable I get 3160.781%
 (3161.781 - 1 = 3160.781) increase in odds? That can't be correct either.
 How do I correct for this issue? (I tried multiplying the predictor
 variables by a constant and the odds ratio goes down, but the warning above
 still persists and shouldn't the odds ratio be predictor variable size
 independent?)


-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
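
A minimal, self-contained illustration (invented data) of the situation Peter
describes: the warning fires when the 0s and 1s separate cleanly along x.

x <- 1:10
y <- rep(c(0, 1), each = 5)   # all 0s below x = 6, all 1s from x = 6 up
fit <- glm(y ~ x, family = binomial)
# warns: fitted probabilities numerically 0 or 1 occurred (and the
# algorithm may not converge)
coef(fit)  # enormous slope: the MLE is drifting off to infinity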


Re: [R] logistic regression - glm.fit: fitted probabilities numerically 0 or 1 occurred

2011-12-01 Thread Ben quant
Thank you for the feedback, but my data looks fine to me. Please tell me if
I'm not understanding.

I followed your instructions and here is a sample of the first 500 values :
(info on 'd' is below that)

  d <- as.data.frame(l_yx)
 x = with(d, y[order(x)])
 x[1:500] # I have 1's and 0's dispersed throughout
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0
[101] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[201] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0
[301] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[401] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0

# I get the warning still
 l_df = as.data.frame(l_yx)
 l_logit = glm(y~x, data=l_df, family=binomial(link="logit"))

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred

# some info on 'd' above:

 d[1:500,1]
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[101] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[201] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[301] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[401] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 d[1:500,2]
  [1] 3.023160e-03 7.932130e-02 0.00e+00 1.779657e-02 1.608374e-01
0.00e+00 5.577064e-02 7.753926e-03 4.018553e-03 4.760918e-02
2.080511e-02 1.642404e-01 3.703720e-03 8.901981e-02 1.260415e-03
 [16] 2.202523e-02 3.750940e-02 4.441975e-04 9.351171e-03 8.374567e-03
0.00e+00 8.440448e-02 5.081017e-01 2.538640e-05 1.806017e-02
2.954641e-04 1.434859e-03 6.964976e-04 0.00e+00 1.202162e-02
 [31] 3.420300e-03 4.276100e-02 1.457324e-02 4.140121e-03 1.349180e-04
1.525292e-03 4.817502e-02 9.515717e-03 2.232953e-02 1.227908e-01
3.293581e-02 1.454352e-02 1.176011e-03 6.274138e-02 2.879205e-02
 [46] 6.900926e-03 1.414648e-04 3.446349e-02 8.807174e-03 3.549332e-02
2.828509e-03 2.935003e-02 7.162872e-03 5.650050e-03 1.221191e-02
0.00e+00 2.126334e-02 2.052020e-02 7.542409e-02 2.586269e-04
 [61] 5.258664e-02 1.213126e-02 1.493275e-02 8.152657e-03 1.774966e-02
2.433198e-02 1.371257e-02 6.732339e-02 9.747677e-03 8.180594e-03
1.882473e-01 1.682596e-03 1.410170e-02 8.429676e-03 3.520589e-01
 [76] 1.865160e-02 2.389941e-02 3.301558e-02 4.896605e-03 6.826244e-03
9.151337e-02 1.722270e-02 1.764039e-01 3.554063e-02 7.338532e-03
9.577793e-03 3.632366e-03 1.422400e-02 3.020497e-02 1.797096e-02
 [91] 2.577899e-03 1.338365e-01 8.202223e-03 1.088335e-02 2.137259e-02
6.829797e-03 3.104860e-04 8.927160e-03 5.626881e-02 1.197682e-02
6.140553e-02 2.729364e-02 1.377913e-02 3.018277e-03 1.188304e-02
[106] 2.029268e-03 2.108815e-02 1.765503e-03 2.449253e-02 3.677046e-03
1.013463e-02 4.286642e-02 1.238931e-02 3.072090e-03 1.509613e-02
3.609885e-02 5.537268e-03 3.151614e-02 3.703148e-03 1.409401e-01
[121] 1.473365e-02 9.160461e-03 1.035099e-01 3.005865e-02 1.332199e-02
6.936535e-03 2.433539e-02 4.935373e-03 4.822740e-03 1.597643e-03
3.805619e-03 1.092683e-02 1.760635e-01 5.565614e-03 7.739213e-04
[136] 4.529198e-03 5.110359e-03 7.391258e-02 5.018207e-03 2.071417e-02
1.825787e-02 9.141405e-03 1.078386e-01 2.171470e-04 7.164583e-03
2.522077e-02 1.994428e-03 6.044653e-03 1.778078e-04 5.797706e-03
[151] 1.526778e-02 1.595897e-02 1.995627e-01 1.946735e-03 6.711582e-02
2.722350e-02 3.127499e-02 1.074904e-01 2.477414e-03 5.015375e-02
3.417810e-02 2.046643e-02 1.644832e-02 5.220166e-03 1.217752e-02
[166] 1.187352e-02 1.634016e-02 2.349568e-03 3.451811e-02 2.593540e-03
6.868595e-03 1.311236e-02 6.457757e-03 2.942391e-04 1.628352e-02
8.288831e-03 3.170856e-04 

Re: [R] logistic regression - glm.fit: fitted probabilities numerically 0 or 1 occurred

2011-12-01 Thread peter dalgaard

On Dec 1, 2011, at 21:32 , Ben quant wrote:

 Thank you for the feedback, but my data looks fine to me. Please tell me if 
 I'm not understanding.

Hum, then maybe it really is a case of a transition region being short relative 
to the range of your data. Notice that the warning is just that: a warning. I 
do notice that the distribution of your x values is rather extreme -- you 
stated a range of 0--14 and a mean of 0.01. And after all, an odds ratio of 
3000 per unit is only a tad over a doubling per 0.1 units.

Have a look at  range(x[y==0]) and range(x[y==1]). 
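
For the record, the per-0.1-unit arithmetic behind that remark:

3161.781^0.1   # ~2.24, i.e. a bit more than a doubling of the odds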


 

Re: [R] logistic regression - glm.fit: fitted probabilities numerically 0 or 1 occurred

2011-12-01 Thread Ben quant
Here you go:

 attach(as.data.frame(l_yx))
 range(x[y==1])
[1] -22500.746.
  range(x[y==0])
[1] -10076.5303653.0228

How do I know what is acceptable?

Also, here are the screen shots of my data that I tried to send earlier
(two screen shots, two pages):
http://scientia.crescat.net/static/ben/warn%20num%200%20or%201.pdf

Thank you,

Ben

On Thu, Dec 1, 2011 at 2:24 PM, peter dalgaard pda...@gmail.com wrote:


 On Dec 1, 2011, at 21:32 , Ben quant wrote:

  Thank you for the feedback, but my data looks fine to me. Please tell me
 if I'm not understanding.

 Hum, then maybe it really is a case of a transition region being short
 relative to the range of your data. Notice that the warning is just that: a
 warning. I do notice that the distribution of your x values is rather
 extreme -- you stated a range of 0--14 and a mean of 0.01. And after all,
 an odds ratio of 3000 per unit is only a tad over a doubling per 0.1 units.

 Have a look at  range(x[y==0]) and range(x[y==1]).


 

Re: [R] logistic regression - glm.fit: fitted probabilities numerically 0 or 1 occurred

2011-12-01 Thread Ben quant
Oops! Please ignore my last post. I mistakenly gave you different data I
was testing with. This is the correct data:

Here you go:

 attach(as.data.frame(l_yx))
  range(x[y==0])
[1]  0.0 14.66518
 range(x[y==1])
[1]  0.0 13.49791

How do I know what is acceptable?

Also, here are the screen shots of my data that I tried to send earlier
(two screen shots, two pages):
http://scientia.crescat.net/static/ben/warn%20num%200%20or%201.pdf

Thank you,

Ben

On Thu, Dec 1, 2011 at 3:07 PM, Ben quant ccqu...@gmail.com wrote:

 Here you go:

  attach(as.data.frame(l_yx))
  range(x[y==1])
 [1] -22500.746.
   range(x[y==0])
 [1] -10076.5303653.0228

 How do I know what is acceptable?

 Also, here are the screen shots of my data that I tried to send earlier
 (two screen shots, two pages):
 http://scientia.crescat.net/static/ben/warn%20num%200%20or%201.pdf

 Thank you,

 Ben


 On Thu, Dec 1, 2011 at 2:24 PM, peter dalgaard pda...@gmail.com wrote:


 On Dec 1, 2011, at 21:32 , Ben quant wrote:

  Thank you for the feedback, but my data looks fine to me. Please tell
 me if I'm not understanding.

 Hum, then maybe it really is a case of a transition region being short
 relative to the range of your data. Notice that the warning is just that: a
 warning. I do notice that the distribution of your x values is rather
 extreme -- you stated a range of 0--14 and a mean of 0.01. And after all,
 an odds ratio of 3000 per unit is only a tad over a doubling per 0.1 units.

 Have a look at  range(x[y==0]) and range(x[y==1]).


 
  I followed your instructions and here is a sample of the first 500
 values : (info on 'd' is below that)

 [quoted code and output snipped; identical to the listing above]

Re: [R] logistic regression - glm.fit: fitted probabilities numerically 0 or 1 occurred

2011-12-01 Thread Ben quant
I'm not proposing this as a permanent solution, just investigating the
warning. I zeroed out the three outliers and received no warning. Can
someone tell me why I am getting no warning now?

I did this 3 times to get rid of the 3 outliers:
mx_dims = arrayInd(which.max(l_yx), dim(l_yx))
l_yx[mx_dims] = 0

Now this does not produce a warning:
l_logit = glm(y~x, data=as.data.frame(l_yx), family=binomial(link="logit"))

Can someone tell me why this occurred?

Also, again, here are the screen shots of my data that I tried to send
earlier (two screen shots, two pages):
http://scientia.crescat.net/static/ben/warn%20num%200%20or%201.pdf

Thank you for your help,

Ben

On Thu, Dec 1, 2011 at 3:25 PM, Ben quant ccqu...@gmail.com wrote:

 [earlier messages quoted in full snipped; see above]

Re: [R] logistic regression - glm.fit: fitted probabilities numerically 0 or 1 occurred

2011-12-01 Thread peter dalgaard

On Dec 1, 2011, at 23:43 , Ben quant wrote:

 I'm not proposing this as a permanent solution, just investigating the 
 warning. I zeroed out the three outliers and received no warning. Can someone 
 tell me why I am getting no warning now?

It's easier to explain why you got the warning before. If the OR for a one unit 
change is 3000, the OR for a 14 unit change is on the order of 10^48 and that 
causes over/underflow in the conversion to probabilities.
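
To see the over/underflow concretely, here is a small sketch of the
arithmetic, using only the numbers quoted in this thread:

log(3000)               # per-unit log-odds, about 8.0
14 * log(3000)          # linear predictor at x = 14, about 112
plogis(14 * log(3000))  # inverse logit: exactly 1 in double precision
plogis(-14 * log(3000)) # about 2.4e-49, i.e. numerically 0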

I'm still baffled at how you can get that model fitted to your data, though. 
One thing is that you can have situations where there are fitted probabilities 
of one corresponding to data that are all one and/or fitted zeros where data 
are zero, but you seem to have cases where you have both zeros and ones at both 
ends of the range of x. Fitting a zero to a one or vice versa would make the 
likelihood zero, so you'd expect that the algorithm would find a better set of 
parameters rather quickly. Perhaps the extremely large number of observations 
that you have has something to do with it? 

You'll get the warning if the fitted zeros or ones occur at any point of the 
iterative procedure. Maybe it isn't actually true for the final model, but that 
wouldn't seem consistent with the OR that you cited.

Anyways, your real problem lies with the distribution of the x values. I'd want 
to try transforming it to something more sane. Taking logarithms is the obvious 
idea, but you'd need to find out what to do about the zeros -- perhaps log(x + 
1e-4) ? Or maybe just cut the outliers down to size with pmin(x,1).
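
As code, the two suggested repairs would look like this (a sketch; it
assumes l_yx has columns named y and x, as set later in the thread):

x <- as.data.frame(l_yx)$x
x_log <- log(x + 1e-4)  # logs compress the heavy right tail; offset handles zeros
x_cap <- pmin(x, 1)     # or simply cut the outliers down to 1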

 
 I did this 3 times to get rid of the 3 outliers:
 mx_dims = arrayInd(which.max(l_yx), dim(l_yx))
 l_yx[mx_dims] = 0
 
 Now this does not produce a warning:
 l_logit = glm(y~x, data=as.data.frame(l_yx), family=binomial(link="logit"))
 
 Can someone tell me why this occurred?
 
 Also, again, here are the screen shots of my data that I tried to send 
 earlier (two screen shots, two pages):
 http://scientia.crescat.net/static/ben/warn%20num%200%20or%201.pdf
 
 Thank you for your help,
 
 Ben
 
 On Thu, Dec 1, 2011 at 3:25 PM, Ben quant ccqu...@gmail.com wrote:
  [earlier messages quoted in full snipped; see above]

Re: [R] logistic regression - glm.fit: fitted probabilities numerically 0 or 1 occurred

2011-12-01 Thread Ben quant
Thank you so much for your help.

The data I am using is the last file called l_yx.RData at this link (the
second file contains the plots from earlier):
http://scientia.crescat.net/static/ben/

Seems like the warning went away with pmin(x,1) but now the OR is over
15k.  If I multiply my x's by 1000 I get a much more realistic OR. So I
guess this brings me to a much different question: aren't ORs comparable
between factors/data? In this case they don't seem to be. However, with
different data the ORs only change a very small amount (+8.0e-4) when I
multiply the x's by 1000. I don't understand.
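
A note on the puzzle above: an odds ratio is "per one unit of x", so
rescaling x moves it on the exponential scale. Illustrative arithmetic,
reusing the slope from the pmin(x, 1) fit shown below:

b <- 9.662     # hypothetical slope, taken from the output below
exp(b)         # about 15710: OR for a full 1-unit change in x
exp(b / 1000)  # about 1.0097: OR per unit after multiplying x by 1000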

Anyways, here is a run with the raw data and a run with your suggestion
(pmin(x,1)) that removed the warning:

 l_logit = glm(y~x, data=as.data.frame(l_yx),
family=binomial(link="logit"))

 l_logit

Call:  glm(formula = y ~ x, family = binomial(link = "logit"), data =
as.data.frame(l_yx))

Coefficients:
(Intercept)            x
     -2.293        8.059

Degrees of Freedom: 690302 Total (i.e. Null);  690301 Residual
Null Deviance:  448800
Residual Deviance: 447100   AIC: 447100

 l_exp_coef = exp(l_logit$coefficients)[2]

 l_exp_coef
   x
3161.781

 dim(l_yx)
[1] 690303  2

 l_yx = cbind(l_yx[,1],pmin(l_yx[,2],1))

 dim(l_yx)
[1] 690303  2

 colnames(l_yx) = c('y','x')

 mean(l_yx[,2])
[1] 0.01117248

 range(l_yx[,2])
[1] 0 1

 head(l_yx[,2])
[1] 0.00302316 0.07932130 0.00000000 0.01779657 0.16083735 0.00000000

 unique(l_yx[,1])
[1] 0 1

 l_logit = glm(y~x, data=as.data.frame(l_yx),
family=binomial(link="logit"))

 l_logit

Call:  glm(formula = y ~ x, family = binomial(link = "logit"), data =
as.data.frame(l_yx))

Coefficients:
(Intercept)            x
     -2.312        9.662

Degrees of Freedom: 690302 Total (i.e. Null);  690301 Residual
Null Deviance:  448800
Residual Deviance: 446800   AIC: 446800

 l_exp_coef = exp(l_logit$coefficients)[2]

 l_exp_coef
   x
15709.52


Thanks,

Ben



On Thu, Dec 1, 2011 at 4:32 PM, peter dalgaard pda...@gmail.com wrote:


 [Peter Dalgaard's message quoted in full snipped; see above]

Re: [R] logistic regression by glm

2011-11-20 Thread Uwe Ligges



On 20.11.2011 12:46, tujchl wrote:

Hi

I use glm in R to do logistic regression, and treat both the response and
the predictors as factors.
In my first try:

***
Call:
glm(formula = as.factor(diagnostic) ~ as.factor(7161521) +
as.factor(2281517), family = binomial())

Deviance Residuals:
Min 1Q Median 3Q Max
-1.5370 -1.0431 -0.9416 1.3065 1.4331

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.58363 0.27948 -2.088 0.0368 *
as.factor(7161521)2 1.39811 0.66618 2.099 0.0358 *
as.factor(7161521)3 0.28192 0.83255 0.339 0.7349
as.factor(2281517)2 -1.11284 0.63692 -1.747 0.0806 .
as.factor(2281517)3 -0.02286 0.80708 -0.028 0.9774
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 678.55 on 498 degrees of freedom
Residual deviance: 671.20 on 494 degrees of freedom
AIC: 681.2

Number of Fisher Scoring iterations: 4
***

And I remodel it and *want no intercept*:
***
Call:
glm(formula = as.factor(diagnostic) ~ as.factor(2281517) +
as.factor(7161521) - 1, family = binomial())

Deviance Residuals:
Min 1Q Median 3Q Max
-1.5370 -1.0431 -0.9416 1.3065 1.4331

Coefficients:
Estimate Std. Error z value Pr(>|z|)
as.factor(2281517)1 -0.5836 0.2795 -2.088 0.0368 *
as.factor(2281517)2 -1.6965 0.6751 -2.513 0.0120 *
as.factor(2281517)3 -0.6065 0.8325 -0.728 0.4663
as.factor(7161521)2 1.3981 0.6662 2.099 0.0358 *
as.factor(7161521)3 0.2819 0.8325 0.339 0.7349
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 691.76 on 499 degrees of freedom
Residual deviance: 671.20 on 494 degrees of freedom
AIC: 681.2

Number of Fisher Scoring iterations: 4
***

As shown above, my second model returns no intercept, but look at this:
Model1:
(Intercept) -0.58363 0.27948 -2.088 0.0368 *
Model2:
as.factor(2281517)1 -0.5836 0.2795 -2.088 0.0368 *

They are exactly the same. Could you please tell me what happened?


Actually it does not make sense to estimate the model without an 
intercept unless you assume that it is exactly zero for the first levels 
of your factors. Think about the contrasts you are interested in; it 
looks like the default is not what you want.


Uwe Ligges
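
The coincidence is no accident: dropping the intercept only reparameterizes
the model, so the dummies of the first factor become level-specific
intercepts, and as.factor(2281517)1 reproduces the old intercept exactly.
A toy illustration (invented data):

f <- factor(rep(c("a", "b"), each = 4))
y <- c(0, 0, 1, 1,  0, 1, 1, 1)
coef(glm(y ~ f, family = binomial))      # (Intercept) = log-odds in level "a"
coef(glm(y ~ f - 1, family = binomial))  # "fa" equals that intercept exactly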




Thank you in advance


--
View this message in context: 
http://r.789695.n4.nabble.com/logistic-regression-by-glm-tp4088471p4088471.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression by glm

2011-11-20 Thread Uwe Ligges



On 20.11.2011 16:58, 屠鞠传礼 wrote:

Thank you Ligges :)
one more question:
my response value diagnostic has 4 levels (0, 1, 2 and 3), so I use it like 
this:
as.factor(diagnostic) ~ as.factor(7161521) +as.factor(2281517)
Is it all right?



Uhh. 4 levels? Then I doubt logistic regression is the right tool for 
you. Please revisit the theory first: it is intended for 2 levels...



Uwe Ligges










On 2011-11-20 23:45:23, Uwe Ligges lig...@statistik.tu-dortmund.de wrote:

[earlier exchange quoted in full snipped; see above]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression by glm

2011-11-20 Thread Uwe Ligges



On 20.11.2011 17:27, 屠鞠传礼 wrote:

I worried about it too. Do you have any idea what tools I can use?



Depends on your aims - what you want to do with the fitted model.
A multinomial model, some kind of discriminant analysis (lda, qda), tree 
based methods, svm and so on come to mind. You probably want to discuss 
this on some statistics mailing list/forum or among local experts rather 
than on the R list, since this is actually not that R-related.


Uwe Ligges








On 2011-11-21 00:13:26, Uwe Ligges lig...@statistik.tu-dortmund.de wrote:

[earlier exchange quoted in full snipped; see above]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression by glm

2011-11-20 Thread 屠鞠传礼
I worried about it too. Do you have any idea what tools I can use?





On 2011-11-21 00:13:26, Uwe Ligges lig...@statistik.tu-dortmund.de wrote:


On 20.11.2011 16:58, 屠鞠传礼 wrote:
 Thank you Ligges :)
 one more question:
 my response value diagnostic has 4 levels (0, 1, 2 and 3), so I use it 
 like this:
 as.factor(diagnostic) ~ as.factor(7161521) +as.factor(2281517)
 Is it all right?


Uhh. 4 levels? Then I doubt logistic regression is the right tool for 
you. Please revisit the theory first: it is intended for 2 levels...


Uwe Ligges









 On 2011-11-20 23:45:23, Uwe Ligges lig...@statistik.tu-dortmund.de wrote:

 [earlier exchange quoted in full snipped; see above]

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression by glm

2011-11-20 Thread 屠鞠传礼
Thank you Ligges :)
one more question:
my response value diagnostic has 4 levels (0, 1, 2 and 3), so I use it like 
this:
as.factor(diagnostic) ~ as.factor(7161521) +as.factor(2281517) 
Is it all right?
 



On 2011-11-20 23:45:23, Uwe Ligges lig...@statistik.tu-dortmund.de wrote:

[Uwe's reply and the original message quoted in full snipped; see above]

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression by glm

2011-11-20 Thread 屠鞠传礼
Thank you very much :)
I searched on the net and found that the response value in a logistic model 
can sometimes have more than 2 values; this kind of regression is called 
Ordinal Logistic Regression, and we can even fit it the same way, I mean 
with glm in R.
Here are some references:
1. http://en.wikipedia.org/wiki/Ordered_logit
2. http://www.stat.ubc.ca/~rollin/teach/643w04/lec/node62.html
The two above explain what Ordinal Logistic Regression is.
3. http://www.ats.ucla.edu/stat/r/dae/ologit.htm
This one shows that we can use glm to model it.
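
For comparison, the usual R route for an ordered response is a
proportional-odds model rather than glm() itself (see David Winsemius's
follow-up below); a minimal sketch with MASS::polr and its built-in
housing data:

library(MASS)
## Sat is an ordered factor (Low < Medium < High)
fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(fit)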





On 2011-11-21 00:56:33, Uwe Ligges lig...@statistik.tu-dortmund.de wrote:


On 20.11.2011 17:27, 屠鞠传礼 wrote:
 I worried about it too. Do you have any idea what tools I can use?


Depends on your aims - what you want to do with the fitted model.
A multinomial model, some kind of discriminant analysis (lda, qda), tree 
based methods, svm and so on come to mind. You probably want to discuss 
this on some statistics mailing list/forum or among local experts rather 
than on the R list, since this is actually not that R-related.

Uwe Ligges







 On 2011-11-21 00:13:26, Uwe Ligges lig...@statistik.tu-dortmund.de wrote:

 [earlier exchange quoted in full snipped; see above]

Re: [R] logistic regression by glm

2011-11-20 Thread David Winsemius


On Nov 20, 2011, at 7:26 PM, 屠鞠传礼 wrote:


Thank you very much :)
I searched on the net and found that the response value in a logistic model  
can sometimes have more than 2 values; this kind of regression  
is called Ordinal Logistic Regression, and we can even fit it  
the same way, I mean with glm in R.

Here are some references:
1. http://en.wikipedia.org/wiki/Ordered_logit
2. http://www.stat.ubc.ca/~rollin/teach/643w04/lec/node62.html
The two above explain what Ordinal Logistic Regression is.
3. http://www.ats.ucla.edu/stat/r/dae/ologit.htm
This one shows that we can use glm to model it.


When I looked through the UCLA code it appeared they were using the  
Design package (now superseded by the `rms` package) and that the  
function was `lrm` rather than `glm`. In addition to Harrell's  
excellent text, which has a full chapter on this topic, you might also  
want to look at Laura Thompson's Companion to Agresti's text:


https://home.comcast.net/~lthompson221/Splusdiscrete2.pdf

--
David.
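
A minimal lrm() sketch along those lines (simulated ordinal data; assumes
the rms package is installed):

library(rms)
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- cut(d$x + rlogis(200), breaks = 3, labels = 0:2, ordered_result = TRUE)
lrm(y ~ x, data = d)  # proportional-odds fit; handles 2+ ordered levels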



On 2011-11-21 00:56:33, Uwe Ligges lig...@statistik.tu-dortmund.de wrote:

[earlier exchange quoted in full snipped; see above]

Re: [R] logistic Regression using lmer

2011-11-12 Thread Ben Bolker
arunkumar akpbond007 at gmail.com writes:

 
 Hi
 
 can we perform logistic regression using the lmer function? Please help me
 with this.
 

  Yes.  
  
  library(lme4)
  glmer([response]~[fixed effects (covariates)]+(1|[grouping variable]),
data=[data frame], family=binomial)

  Further questions on this topic should probably go to the
r-sig-mixed-models mailing list.
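
  Filling in that template with simulated data (a self-contained sketch;
all object names are invented):

  library(lme4)
  set.seed(1)
  d <- data.frame(x = rnorm(200), g = factor(rep(1:20, each = 10)))
  d$y <- rbinom(200, 1, plogis(-1 + 0.8 * d$x))
  summary(glmer(y ~ x + (1 | g), data = d, family = binomial))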

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression - Variable Selection Methods With Prediction

2011-10-26 Thread RAJ
Can I at least get help with what package to use for logistic
regression with all possible models and do prediction? I know I can
use regsubsets but I am not sure if it has any prediction functions to
go with it.

Thanks

On Oct 25, 6:54 pm, RAJ dheerajathr...@gmail.com wrote:
 Hello,

 I am pretty new to R; I have always used SAS and SAS products. My
 target variable is binary ('Y' and 'N') and I have about 14 predictor
 variables. My goal is to compare different variable selection methods
 like Forward, Backward, All possible subsets. I am using the
 misclassification rate to pick the winner method.

 This is what I have as of now,

 Reg <- glm(Graduation ~ ., DFtrain, family=binomial(link="logit"))
                 step <- extractAIC(Reg, direction="forward")
                 pred <- predict(Reg, DFtest, type="response")
                 mis <- mean({pred > 0.5} != {DFtest[,"Graduation"] == "Y"})
 This program actually works but I needed to check to make sure I am
 doing this right. Also, I am getting the same misclassification rates
 for all different methods.

 I also tried to use

 Reg <- leaps(Graduation ~ ., DFtrain)
                 pred <- predict(Reg, DFtest, type="response")
                 mis <- mean({pred > 0.5} != {DFtest[,"Graduation"] == "Y"})
                 #print(summary(mis))
 which doesn't work

 and

 Reg <- regsubsets(Graduation ~ ., DFtrain)
                 pred <- predict(Reg, DFtest, type="response")
                 mis <- mean({pred > 0.5} != {DFtest[,"Graduation"] == "Y"})
                 #print(summary(mis))

 The Regsubsets will work but the 'predict' function does not work with
 it. Is there any other way to do predictions when using regsubsets?

 Any help is appreciated.

 Thanks,

 __
 r-h...@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
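
A likely reason the misclassification rates never change: extractAIC()
only reports the AIC of an already-fitted model and performs no selection,
so predict() is always applied to the same full model. A hedged sketch of
actual forward selection with step(), reusing the names from the post:

null <- glm(Graduation ~ 1, data = DFtrain, family = binomial("logit"))
full <- glm(Graduation ~ ., data = DFtrain, family = binomial("logit"))
fwd  <- step(null, scope = formula(full), direction = "forward")
pred <- predict(fwd, DFtest, type = "response")
mean((pred > 0.5) != (DFtest$Graduation == "Y"))  # misclassification rate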

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression - Variable Selection Methods With Prediction

2011-10-26 Thread Steve_Friedman
Try the glm package

Steve Friedman Ph. D.
Ecologist  / Spatial Statistical Analyst
Everglades and Dry Tortugas National Park
950 N Krome Ave (3rd Floor)
Homestead, Florida 33034

steve_fried...@nps.gov
Office (305) 224 - 4282
Fax (305) 224 - 4147

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression - Variable Selection Methods With Prediction

2011-10-26 Thread Steve Lianoglou
Hi,

On Wed, Oct 26, 2011 at 12:35 PM, RAJ dheerajathr...@gmail.com wrote:
 Can I at least get help with what package to use for logistic
 regression with all possible models and do prediction? I know I can
 use regsubsets but I am not sure if it has any prediction functions to
 go with it.

Maybe you could try glmnet instead.

It doesn't give you all possible models, but rather the best one at
a given value for the penalty (lambda) parameter.
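
A minimal sketch (simulated data; 14 predictors as in the original post):

library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 14), 100, 14)
y <- rbinom(100, 1, plogis(X[, 1] - X[, 2]))
cvfit <- cv.glmnet(X, y, family = "binomial")  # lambda chosen by cross-validation
predict(cvfit, newx = X[1:5, ], s = "lambda.min", type = "response")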

HTH,

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression - Variable Selection Methods With Prediction

2011-10-26 Thread Weidong Gu
Check glmulti package for all subset selection.

Weidong Gu

On Wed, Oct 26, 2011 at 12:35 PM, RAJ dheerajathr...@gmail.com wrote:
 Can I at least get help with what package to use for logistic
 regression with all possible models and do prediction? I know I can
 use regsubsets but I am not sure if it has any prediction functions to
 go with it.

 Thanks

 On Oct 25, 6:54 pm, RAJ dheerajathr...@gmail.com wrote:

 [original message quoted in full snipped; see above]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression - Variable Selection Methods With Prediction

2011-10-26 Thread Bert Gunter
You mean the glm()  _function_ in the stats package.

?glm

(just to avoid confusion)

-- Bert

On Wed, Oct 26, 2011 at 10:31 AM, steve_fried...@nps.gov wrote:

 Try the glm package

 Steve Friedman Ph. D.
 Ecologist  / Spatial Statistical Analyst
 Everglades and Dry Tortugas National Park
 950 N Krome Ave (3rd Floor)
 Homestead, Florida 33034

 steve_fried...@nps.gov
 Office (305) 224 - 4282
 Fax (305) 224 - 4147

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression - Variable Selection Methods With Prediction

2011-10-26 Thread Marc Schwartz
The reason that you are not likely getting replies is that what you propose to 
do is considered a poor way of building models. 

You need to get out of the SAS Mindset.

I would suggest you obtain a copy of Frank Harrell's book:

  http://www.amazon.com/exec/obidos/ASIN/0387952322/

and then consider using his 'rms' package on CRAN to engage in model 
building strategies and validation.

Regards,

Marc Schwartz

On Oct 26, 2011, at 11:35 AM, RAJ wrote:

 Can I at least get help with what package to use for logistic
 regression with all possible models and do prediction? I know I can
 use regsubsets but I am not sure if it has any prediction functions to
 go with it.
 
 Thanks
 
 On Oct 25, 6:54 pm, RAJ dheerajathr...@gmail.com wrote:
 Hello,
 
 I am pretty new to R; I have always used SAS and SAS products. My
 target variable is binary ('Y' and 'N') and I have about 14 predictor
 variables. My goal is to compare different variable selection methods
 like forward, backward, and all possible subsets. I am using the
 misclassification rate to pick the winning method.

 This is what I have as of now:

 Reg  <- glm(Graduation ~ ., DFtrain, family = binomial(link = "logit"))
 step <- extractAIC(Reg, direction = "forward")
 pred <- predict(Reg, DFtest, type = "response")
 mis  <- mean({pred > 0.5} != {DFtest[, "Graduation"] == "Y"})
 This program actually works, but I needed to check to make sure I am
 doing this right. Also, I am getting the same misclassification rates
 for all the different methods.
 
 I also tried to use

 Reg <- leaps(Graduation ~ ., DFtrain)
 pred <- predict(Reg, DFtest, type = "response")
 mis <- mean({pred > 0.5} != {DFtest[, "Graduation"] == "Y"})
 #print(summary(mis))
 which doesn't work

 and

 Reg <- regsubsets(Graduation ~ ., DFtrain)
 pred <- predict(Reg, DFtest, type = "response")
 mis <- mean({pred > 0.5} != {DFtest[, "Graduation"] == "Y"})
 #print(summary(mis))

 The regsubsets() call will work, but the 'predict' function does not work
 with it. Is there any other way to do predictions when using regsubsets?
 
 Any help is appreciated.
 
 Thanks,
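
As an aside to the question above: regsubsets() in the leaps package ships
no predict() method, so predictions have to be rebuilt by hand. Below is a
minimal sketch of such a helper (hypothetical code, not part of leaps; it
assumes the fit was created with a formula, and note that regsubsets() does
least-squares subset selection, so these are linear predictions, not
probabilities):

predict.regsubsets <- function(object, newdata, id, ...) {
  form  <- as.formula(object$call[[2]])  # formula used in the original fit
  mat   <- model.matrix(form, newdata)   # design matrix for the new data
  coefi <- coef(object, id = id)         # coefficients of the id-th subset
  drop(mat[, names(coefi)] %*% coefi)    # linear predictor
}
# e.g.: pred <- predict(Reg, DFtest, id = 3)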

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression: default computed probability

2011-09-21 Thread Marc Schwartz
On Sep 21, 2011, at 10:25 AM, n wrote:

 Hello all,
 
 Suppose in a logistic regression model, the binary outcome is coded as
 0 or 1.
 In SAS, the default probability computed is for Y = 0 (smaller of the
 two values) . However, in SPSS the probability computed is for Y = 1
 (greater of the two values).
 
 Which one does R compute, the probability for the smaller or the
 greater value?
 
 I went through the documentation in a hurry but couldn't find a clue.
 
 Thanks
 
 Nikhil


From ?glm in the Details section:

For binomial and quasibinomial families the response can also be specified as a 
factor (when the first level denotes failure and all others success) or as a 
two-column matrix with the columns giving the numbers of successes and failures.

So P(Y = 1) if using 0/1.
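
A quick toy check of this (illustrative code, not from the original message):
fitted values from a 0/1 response estimate P(Y = 1), and with a factor
response the first level counts as failure.

set.seed(1)
x   <- rnorm(100)
y   <- rbinom(100, 1, plogis(x))    # P(Y = 1) rises with x
fit <- glm(y ~ x, family = binomial)
head(fitted(fit))                   # estimated P(Y = 1 | x)
yf  <- factor(y, levels = c(0, 1))  # level "0" = failure, "1" = success
all.equal(coef(fit), coef(glm(yf ~ x, family = binomial)))  # TRUE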

HTH,

Marc Schwartz

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic regression with 2 factors and 3 covariates

2011-06-14 Thread Uwe Ligges



On 13.06.2011 19:37, Alal wrote:

Hello

I'm trying to write a model for my data, but Im not sure it's statistically
correct.

I have one variable (2 levels: A or B). To explain it, I've got 2 factors
and 3 continuous variables. I need to do a logistic regression, but...

First: can I actually do a logistic regression with all of this? I know it's
a basic question (sorry) but I just cant find a clear answer.


Yes, you can do that, but we cannot know if it is the most appropriate 
method given your problem and your data.




Then: if yes, what function do I use? I tried with a binomial glm,


Yes.



but I
also found the function 'mlogit()' that seems appropriate. Is it incorrect
to use glm() in that case?


If you want a simple logistic regression, just use the glm() approach.



Finally: any advice, things I should pay attention to?


The problem, the data and the method.

Since your question is rather basic, it'd suggest to ask for local 
advice in any case,


Uwe Ligges





Thanks

B.




--
View this message in context: 
http://r.789695.n4.nabble.com/Logistic-regression-with-2-factors-and-3-covariates-tp3594377p3594377.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression

2011-06-10 Thread Frank Harrell
Which statistical principles are you invoking on which to base such analyses?
Frank

Sergio Della Franca wrote:
 
 Dear R-Helpers,
 
 I want to perform a logistic regression on my dataset (y).
 
 I used the following code:

 logistic <- glm(formula = interest_variable ~ .,
                 family = binomial(link = "logit"), data = y)

 This ran correctly.
 Then I wanted to develop the logistic regression with three different
 methods:
 - forward
 - backward
 - stepwise

 I used these procedures:
 forward  <- step(logistic, direction = "forward")
 backward <- step(logistic, direction = "backward")
 stepwise <- step(logistic, direction = "both")

 These also ran correctly, but I obtained the same results with the three
 different procedures.

 Then I thought I had made some mistakes.

 My questions are:

 Is what I did correct?
 Is it correct that the three different methods return the same results?

 If I made some mistakes, what is the correct way to perform the three
 different logistic regressions?
 
 
 Thank you in advance.
 
 
 Sergio Della Franca.
 
 
 __
 r-h...@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-
Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context: 
http://r.789695.n4.nabble.com/Logistic-Regression-tp821301p3588135.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression

2011-06-10 Thread John Sorkin
First a word of caution: Forward, backward, and stepwise regression analyses 
are not well received among statisticians. There are many reasons for this. 
Some of the reasons include:
(1) The p value computed at each step is computed ignoring all the previous
steps. This can lead to incorrect inferences. The p value should be conditioned
on (i.e. computed taking into account) all the previous steps, i.e. the path
used to get to the current model.
(2) When entering interaction terms, the three methods do not make sure that 
the main effects included in the interaction are in the model before the 
interaction is added.
(3) The analysis strategy substitutes the modeler's knowledge of the problem at 
hand for a thoughtless mechanical procedure.
(4) When your data are colinear (i.e. there is significant correlation among 
your independent variables) the three techniques you used may choose different 
models not because one model is better than the other, but rather because of 
colinearity.
(5) None of the three techniques gives absolute truth; each can give a 
glimpse of truth but they can also lead to a false sense of truth, a trip down 
a rabbit hole if you will.

Given these caveats, it remains to explain why your three analyses gave the 
same results. Fortunately the explanation is simple. The three analyses (one 
which starts with no terms in the model and then adds one at a time [forward], 
one that starts with all terms in the model and then removes terms one at a 
time [backward], and one that starts with no terms in the model and then adds 
and removes terms one at a time [stepwise]) all wound up with the same results. 
This is not a problem; it simply reflects the relations in your data. In fact,
that the three methods all give the same result makes me, at least,
feel a bit better about the results you obtained with a method that is far from
optimal.

In general it is considered better to model your data using standard modeling 
techniques that make use of your knowledge of the field rather than using one 
of the three techniques you used. This being said, sometimes when one has many 
independent variables, the three techniques can help one understand what is 
happening, but any inference drawn from the models must be taken with many a 
grain of salt.
John


John David Sorkin M.D., Ph.D.
Chief, Biostatistics and Informatics
University of Maryland School of Medicine Division of Gerontology
Baltimore VA Medical Center
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
(Phone) 410-605-7119
(Fax) 410-605-7913 (Please call phone number above prior to faxing)

 Frank Harrell f.harr...@vanderbilt.edu 6/10/2011 7:06 AM 
Which statistical principles are you invoking on which to base such analyses?
Frank

Sergio Della Franca wrote:
 
 Dear R-Helpers,
 
 I want to perform a logistic regression on my dataset (y).
 
  I used the following code:

  logistic <- glm(formula = interest_variable ~ .,
                  family = binomial(link = "logit"), data = y)

  This ran correctly.
  Then I wanted to develop the logistic regression with three different
  methods:
  - forward
  - backward
  - stepwise

  I used these procedures:
  forward  <- step(logistic, direction = "forward")
  backward <- step(logistic, direction = "backward")
  stepwise <- step(logistic, direction = "both")

  These also ran correctly, but I obtained the same results with the three
  different procedures.

  Then I thought I had made some mistakes.

  My questions are:

  Is what I did correct?
  Is it correct that the three different methods return the same results?

  If I made some mistakes, what is the correct way to perform the three
  different logistic regressions?
 
 
 Thank you in advance.
 
 
 Sergio Della Franca.
 
 
 __
 r-h...@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help 
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html 
 and provide commented, minimal, self-contained, reproducible code.
 


-
Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context: 
http://r.789695.n4.nabble.com/Logistic-Regression-tp821301p3588135.html 
Sent from the R help mailing list archive at Nabble.com.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression

2011-06-07 Thread Mike Marchywka





 Date: Tue, 7 Jun 2011 01:38:32 -0700
 From: farah.farid@student.aku.edu
 To: r-help@r-project.org
 Subject: [R] Logistic Regression

 I am working on my thesis in which I have a couple of independent variables
 that are categorical in nature and the dependent variable is dichotomous.
 Initially I ran univariate analyses and added the variables with significant
 p-values (p < 0.25) to my full model.
 I have three confusions. Firstly, I am looking for confounding variables by

I'm not sure what your thesis is about, some system that you are studying
via statistics or maybe the thesis is about statistics, but
according to this disputed Wikipedia entry,

http://en.wikipedia.org/wiki/Confounding

confounding or extraneous is determined by the reality of your system.
It may help to consider factors related to that and use the statistics
to avoid fooling yourself. Look at the pictures (a non-pompous way of saying
look at graphs and scatter plots for some ideas to test) and then test various
ideas. You see bad cause/effect inferences all the time in many fields,
from econ to biotech (although anecdotes suggest these mistakes usually
favour the sponsors, LOL). Consider some mundane known examples of what
your data would look like and see if that relates to what you have.
If you were naively measuring car velocity at a single point in front of a
traffic light, along with the color of the light, what might you observe?
(Much like an earlier example on iron in patients, there are a number of
more precisely defined measurements you could take on a given thing.)

If your concern is that "I ran test A and it said B, but test C said D and
D seems inconsistent with B", it generally helps to look at the assumptions
and detailed equations for each model and explore what those mean with your
data. With continuous variables anyway, non-monotonic relationships can
easily destroy a correlation even with strong causality.


 using the formula (crude beta-coefficient - adjusted beta-coefficient) / crude
 beta-coefficient x 100; as per the rule, if the percentage for any variable is
 >10% then I have considered it a confounder. I wanted to know: from the
 initial model I have dropped one variable with an insignificant p-value to
 form the adjusted model. Now how will I know whether the variable that I
 dropped from the initial model was a confounder or not?
 Secondly, I wanted to know, if the percentage comes out negative, like
 (-17.84%), will it be considered a confounder or not? I also wanted to
 know whether confounders should be removed from the model or kept in the
 model.
 Lastly, I am running a likelihood ratio test to identify whether the value
 falls in the critical region or not. If the value does not fall in the
 critical region, what does that show? What should I do in this case? In my
 final reduced model all p-values are significant, but the value from the
 likelihood ratio test still does not fall in the critical region. So what
 does that show?


 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Logistic-Regression-tp3578962p3578962.html
 Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression

2011-06-07 Thread Frank Harrell
The 10% change idea was never a good one and has not been backed up by
simulations.  It is quite arbitrary and results in optimistic standard
errors of remaining variables.  In fact a paper presented at the Joint
Statistical Meetings about 3 years ago (I'm sorry I've forgotten the names
of the authors) showed that conflicting results are obtained according to
whether you apply the 10% to the coefficients or to the odds ratios, and
there is no theory to guide the choice.  Why risk residual confounding?
Form a good model a priori and adjust for all potential confounders; don't
base the choice on P-values.  Use propensity scores if overfitting is an
issue.
Frank


farahnazlakhani wrote:
 
  I am working on my thesis in which I have a couple of independent variables
  that are categorical in nature and the dependent variable is dichotomous.
  Initially I ran univariate analyses and added the variables with
  significant p-values (p < 0.25) to my full model.
  I have three confusions. Firstly, I am looking for confounding variables
  by using the formula (crude beta-coefficient - adjusted beta-coefficient) /
  crude beta-coefficient x 100; as per the rule, if the percentage for any
  variable is >10% then I have considered it a confounder. I wanted to know:
  from the initial model I have dropped one variable with an insignificant
  p-value to form the adjusted model. Now how will I know whether the
  variable that I dropped from the initial model was a confounder or not?
  Secondly, I wanted to know, if the percentage comes out negative, like
  (-17.84%), will it be considered a confounder or not? I also wanted to know
  whether confounders should be removed from the model or kept in the model.
  Lastly, I am running a likelihood ratio test to identify whether the value
  falls in the critical region or not. If the value does not fall in the
  critical region, what does that show? What should I do in this case? In my
  final reduced model all p-values are significant, but the value from the
  likelihood ratio test still does not fall in the critical region. So what
  does that show?
 


-
Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context: 
http://r.789695.n4.nabble.com/Logistic-Regression-tp3578962p3579392.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression

2011-06-07 Thread Bert Gunter
IMHO, you evidence considerable confusion and misunderstanding of
statistical methods. I would say that most of what you describe is
nonsense. Of course, maybe I'm just the one who's confused, but I
would strongly suggest you consult with a local statistician. This
list is unlikely to be able to provide you the level of support and
understanding that you need.

Cheers,
Bert

On Tue, Jun 7, 2011 at 1:38 AM, farahnazlakhani
farah.farid@student.aku.edu wrote:
 I am working on my thesis in which I have a couple of independent variables
 that are categorical in nature and the dependent variable is dichotomous.
 Initially I ran univariate analyses and added the variables with significant
 p-values (p < 0.25) to my full model.
 I have three confusions. Firstly, I am looking for confounding variables by
 using the formula (crude beta-coefficient - adjusted beta-coefficient) /
 crude beta-coefficient x 100; as per the rule, if the percentage for any
 variable is >10% then I have considered it a confounder. I wanted to know:
 from the initial model I have dropped one variable with an insignificant
 p-value to form the adjusted model. Now how will I know whether the variable
 that I dropped from the initial model was a confounder or not?
 Secondly, I wanted to know, if the percentage comes out negative, like
 (-17.84%), will it be considered a confounder or not? I also wanted to know
 whether confounders should be removed from the model or kept in the model.
 Lastly, I am running a likelihood ratio test to identify whether the value
 falls in the critical region or not. If the value does not fall in the
 critical region, what does that show? What should I do in this case? In my
 final reduced model all p-values are significant, but the value from the
 likelihood ratio test still does not fall in the critical region. So what
 does that show?


 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Logistic-Regression-tp3578962p3578962.html
 Sent from the R help mailing list archive at Nabble.com.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 
Men by nature long to get on to the ultimate truths, and will often
be impatient with elementary studies or fight shy of them. If it were
possible to reach the ultimate truths without the elementary studies
usually prefixed to them, these would not be preparatory studies but
superfluous diversions.

-- Maimonides (1135-1204)

Bert Gunter
Genentech Nonclinical Biostatistics

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression lrm() output

2011-05-18 Thread Frank Harrell
Why is a one unit change in x an interesting range for the purpose of
estimating an odds ratio?

The default in summary() is the inter-quartile-range odds ratio as clearly
stated in the rms documentation.
Frank
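
A sketch of the mechanics (assuming the rms/Design summary interface, with
the lrm fit from the quoted message below): a one-unit range can be requested
explicitly, or the one-unit odds ratio taken straight from the coefficient.

summary(fit, x = c(4, 5))   # effect and odds ratio for x going from 4 to 5
exp(coef(fit)["x"])         # 1.7589, the one-unit odds ratio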

array chip wrote:
 
 Hi, I am trying to run a simple logistic regression using lrm() to
 calculate an odds ratio. I found confusing output when I used summary()
 on the fit object, which gave an OR that is totally different from simply
 taking exp(coefficient); see below:
 
 library(rms)
 dat <- read.table("dat.txt", sep = '\t', header = TRUE, row.names = NULL)

 d <- datadist(dat)
 options(datadist = 'd')
 (fit <- lrm(response ~ x, data = dat, x = TRUE, y = TRUE))
 
 Logistic Regression Model
 lrm(formula = response ~ x, data = dat, x = T, y = T)

                       Model Likelihood    Discrimination    Rank Discrim.
                          Ratio Test          Indexes          Indexes
 Obs         150    LR chi2      17.11    R2       0.191    C       0.763
  0          128    d.f.             1    g        1.209    Dxy     0.526
  1           22    Pr(> chi2) <0.0001    gr       3.350    gamma   0.528
 max |deriv| 1e-11                        gp       0.129    tau-a   0.132
                                          Brier    0.111

           Coef    S.E.   Wald Z Pr(>|Z|)
 Intercept -5.0059 0.9813 -5.10  <0.0001
 x          0.5647 0.1525  3.70   0.0002
 
 As you can see, the odds ratio for x is exp(0.5647)=1.75892.
 
 But if I run the following using summary():
 
 > summary(fit)
              Effects              Response : response

  Factor      Low    High   Diff.  Effect S.E. Lower 0.95 Upper 0.95
  x           3.9003 6.2314 2.3311 1.32   0.36 0.62       2.01
   Odds Ratio 3.9003 6.2314 2.3311 3.73     NA 1.86       7.49
 
 What are these outputs? None of the numbers is the odds ratio (1.75892)
 that I calculated by using exp().

 Can anyone explain?
 
 Thanks
 
 John
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


-
Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context: 
http://r.789695.n4.nabble.com/logistic-regression-lrm-output-tp3533223p3533278.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression with glm: cooks distance and dfbetas are different compared to SPSS output

2011-05-02 Thread Uwe Ligges



On 29.04.2011 18:29, Biedermann, Jürgen wrote:

Hi there,

I have the problem that I'm not able to reproduce the SPSS residual
statistics (dfbeta and Cook's distance) with a simple binary logistic
regression model obtained in R via the glm function.

I tried the following:

fit <- glm(y ~ x1 + x2 + x3, data, family = binomial)

cooks.distance(fit)


Just type stats:::cooks.distance.glm and see the definition in R yourself:

function (model, infl = influence(model, do.coef = FALSE),
          res = infl$pear.res, dispersion = summary(model)$dispersion,
          hat = infl$hat, ...)
{
    p <- model$rank
    res <- (res/(1 - hat))^2 * hat/(dispersion * p)
    res[is.infinite(res)] <- NaN
    res
}
<environment: namespace:stats>

Now you can dig further yourself. I do not know how to find the
algorithm actually used by SPSS, hence I cannot tell what is different.
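
A minimal check of that formula (toy data; a sketch of R's computation, not
SPSS's algorithm): recompute the quantity by hand and confirm it matches
cooks.distance().

set.seed(1)
dd   <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
dd$y <- rbinom(50, 1, plogis(dd$x1))
fit  <- glm(y ~ x1 + x2 + x3, data = dd, family = binomial)
infl <- influence(fit, do.coef = FALSE)
cd   <- (infl$pear.res / (1 - infl$hat))^2 *
        infl$hat / (summary(fit)$dispersion * fit$rank)
all.equal(unname(cd), unname(cooks.distance(fit)))   # TRUE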


Uwe Ligges




dfbetas(fit)

When I compare the returned values with the values that I get in SPSS,
they are different, although the same model is calculated (the
coefficients are the same, etc.).

It seems that different calculation formulas are used for cooks.distance
and dfbetas in SPSS compared to R.

Unfortunately I didn't find out what the difference in the calculation is,
or how I could get R to calculate the same statistics that SPSS uses.
Or is this an unknown SPSS bug?

Greetings
Jürgen

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression: wls and unbalanced samples

2011-04-27 Thread peter dalgaard

On Apr 27, 2011, at 00:22 , Andre Guimaraes wrote:

 Greetings from Rio de Janeiro, Brazil.
 
 I am looking for advice / references on binary logistic regression
 with weighted least squares (using lrm  weights), on the following
 context:
 
 1) unbalanced sample (n0=10000, n1=700);
 2) sampling weights used to rebalance the sample (w0=1, w1=14.29); and
 3) after modelling, adjust the intercept in order to reflect the
 expected % of 1's in the population (e.g., circa 7%, as opposed to
 50%).

??

If the proportion of 1s in the population is about 7%, how exactly is the sample
unbalanced? I don't see a reason to use weights at all if the sample is
representative of the population. The opposite situation, where the sample is
balanced (e.g. case-control) and the population is not, and you are interested
in the population values, _that_ might require weighting, with some care,
because case weighting and sample weighting are two different things and the
s.e. will be wrong. That sort of stuff is handled by the survey package.

However what you seem to be doing is to create results for an artificial 50/50 
population, then project back to the population you were sampling from all 
along. I don't think this makes sense at all.

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression: wls and unbalanced samples

2011-04-27 Thread Prof Brian Ripley

On Wed, 27 Apr 2011, peter dalgaard wrote:


On Apr 27, 2011, at 00:22 , Andre Guimaraes wrote:


Greetings from Rio de Janeiro, Brazil.

I am looking for advice / references on binary logistic regression
with weighted least squares (using lrm  weights), on the following
context:

1) unbalanced sample (n0=10000, n1=700);
2) sampling weights used to rebalance the sample (w0=1, w1=14.29); and
3) after modelling, adjust the intercept in order to reflect the
expected % of 1's in the population (e.g., circa 7%, as opposed to
50%).


??

If the proportion of 1s in the population is about 7%, how exactly is
the sample unbalanced? I don't see a reason to use weights at all
if the sample is representative of the population. The opposite
situation, where the sample is balanced (e.g. case-control) and the
population is not, and you are interested in the population values,
_that_ might require weighting, with some care, because case
weighting and sample weighting are two different things and the s.e.
will be wrong. That sort of stuff is handled by the survey package.


However what you seem to be doing is to create results for an 
artificial 50/50 population, then project back to the population you 
were sampling from all along. I don't think this makes sense at all.


There are circumstances where it might.  It is quite common in pattern 
recognition for the proportions in the training set to not reflect the 
population.  And if the misclassification costs are asymmetric, you 
may want to weight the fit.


The case I encountered was SGA births.  By definition there are about 
10% 'successes', but false negatives are far more important than false 
positives (or one would simply predict all births as normal).  This 
means that you want accurate estimation of probabilities in the right 
tail of the population distribution, and plug-in estimation of 
logistic regression is biased.  One of many ways to reduce that bias 
is to re-weight the training set so the estimated probabilities of 
marginal cases are in the middle of the range.


Note that logistic regression is not normally fitted by 'weighted 
least squares' (not even by 'lrm' from some unstated package).


This is not a list for tutorials in advanced statistics, but one 
reference is my Pattern Recognition and Neural Networks book.




--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression: wls and unbalanced samples

2011-04-27 Thread Andre Guimaraes
Many thanks for your messages.

I will take a look at the survey package.
I was concerned with the issues raised by Cramer (1999) in "Predictive
performance of the binary logit model in unbalanced samples".

In this particular case, misclassification costs are much higher for
the smaller group (defaults) than for the larger group (non-defaults).
However, I have no specific guidelines for how much higher. If I
understood correctly, using sampling weights would help improve
accuracy on the smaller group and, at least, I would be able to
explain the rationale for the different weights.

To cite properly, I was referring to lrm in the Design package
(Harrell, 2008). Sorry to have intruded on the list with such a question,
but - once again - thank you for your answers.

On Wed, Apr 27, 2011 at 7:29 AM, Prof Brian Ripley
rip...@stats.ox.ac.uk wrote:
 On Wed, 27 Apr 2011, peter dalgaard wrote:

 On Apr 27, 2011, at 00:22 , Andre Guimaraes wrote:

 Greetings from Rio de Janeiro, Brazil.

 I am looking for advice / references on binary logistic regression
 with weighted least squares (using lrm  weights), on the following
 context:

 1) unbalanced sample (n0=10000, n1=700);
 2) sampling weights used to rebalance the sample (w0=1, w1=14.29); and
 3) after modelling, adjust the intercept in order to reflect the
 expected % of 1's in the population (e.g., circa 7%, as opposed to
 50%).

 ??

  If the proportion of 1s in the population is about 7%, how exactly is the
  sample unbalanced? I don't see a reason to use weights at all if the
  sample is representative of the population. The opposite situation, where
  the sample is balanced (e.g. case-control) and the population is not, and
  you are interested in the population values, _that_ might require
  weighting, with some care, because case weighting and sample weighting are
  two different things and the s.e. will be wrong. That sort of stuff is
  handled by the survey package.

 However what you seem to be doing is to create results for an artificial
 50/50 population, then project back to the population you were sampling from
 all along. I don't think this makes sense at all.

 There are circumstances where it might.  It is quite common in pattern
 recognition for the proportions in the training set to not reflect the
 population.  And if the misclassification costs are asymmetric, you may want
 to weight the fit.

 The case I encountered was SGA births.  By definition there are about 10%
 'successes', but false negatives are far more important than false positives
 (or one would simply predict all births as normal).  This means that you
 want accurate estimation of probabilities in the right tail of the
 population distribution, and plug-in estimation of logistic regression is
 biased.  One of many ways to reduce that bias is to re-weight the training
 set so the estimated probabilities of marginal cases are in the middle of
 the range.

 Note that logistic regression is not normally fitted by 'weighted least
 squares' (not even by 'lrm' from some unstated package).

 This is not a list for tutorials in advanced statistics, but one reference
 is my Pattern Recognition and Neural Networks book.


 --
 Peter Dalgaard
 Center for Statistics, Copenhagen Business School
 Solbjerg Plads 3, 2000 Frederiksberg, Denmark
 Phone: (+45)38153501
 Email: pd@cbs.dk  Priv: pda...@gmail.com

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


 --
 Brian D. Ripley,                  rip...@stats.ox.ac.uk
 Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
 University of Oxford,             Tel:  +44 1865 272861 (self)
 1 South Parks Road,                     +44 1865 272866 (PA)
 Oxford OX1 3TG, UK                Fax:  +44 1865 272595

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression Fitting with EM-Algorithm

2011-01-10 Thread Robin Aly

Dear Ted,

sorry for being unclear. Let me try again.

I indeed have no knowledge about the value of the response variable for 
any object.

Instead, I have a data frame of explanatory variables for
each object. For example,

x1   x2   x3
1   4.409974 2.348745 1.9845313
2   3.809249 2.281260 1.9170466
3   4.229544 2.610347 0.9127431
4   4.259644 1.866025 1.5982859
5   4.001306 2.225069 1.2551570
...

, and I want to fit a regression model of the form y ~ x1 + x2 + x3.

From prior information I know that all coefficients are approximately
Gaussian distributed around one, and likewise that the intercept is around
-10. Now I think there must be a package which estimates the
coefficients more precisely by fitting the logistic regression function
to the data without knowledge of the response variable (similar to
fitting Gaussians in a mixture model where the class labels are unknown).


I looked at the flexmix package, but this seems only to find
dependencies in the data assuming the presence of some training data.
I also found some evidence in Magder & Hughes (1997) (see below) that such
an algorithm exists; however, from the documented math I can't apply the
method to my problem.


Thanks in advance,
Best Regards
Robin

Magder, L. S. & Hughes, J. P. (1997). Logistic Regression When the Outcome
Is Measured with Uncertainty. American Journal of Epidemiology, 146,
195-203.





On 01/04/2011 12:36 AM, (Ted Harding) wrote:

On 03-Jan-11 14:02:21, Robin Aly wrote:

Hi all,
is there any package which can do an EM algorithm fitting of
logistic regression coefficients given only the explanatory
variables? I tried to realize this using the Design package,
but I didn't find a way.

 Thanks a lot & Kind regards
Robin Aly

As written, this is a strange question! You imply that you
do not have data on the response (0/1) variable at all,
only on the explanatory variables. In that case there is
no possible estimate, because that would require data on
at least some of the values of the response variable.

I think you should explain more clearly and explicitly what
the information is that you have for all the variables.

Ted.


E-Mail: (Ted Harding)ted.hard...@wlandres.net
Fax-to-email: +44 (0)870 094 0861
Date: 03-Jan-11   Time: 23:36:56
-- XFMail --


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Logistic Regression Fitting with EM-Algorithm

2011-01-10 Thread Ted Harding
In view of your further explanation, Robin, the best I can offer
is the following.

[1] Theoretical frame.
*IF* variables (X1,X2,X3) are distributed according to a
mixture of two multivariate normal distributions, i.e. as
two groups, each with a multivariate normal distribution,
*AND* the members of one group are labelled Y=0 and the
members of the other group are labelled Y=1, *THEN* for
a unit chosen at random from the two groups (pooled) the
probability that Y=1 conditional on (X1=x1,X2=x2,X3=x3)
follows a logistic regression. This regression will be
linear in (x1,x2,x3) if the two multivariate normals have
the same covariance matrix; it will be quadratic if the
two covariance matrices are different. The coefficients
in the regression will be algebraic expressions involving
these parameters of the two multivariates normals, together
with the two proportions p1 and p2 of the two groups.

This result is a straightforward algebraic consequence of
applying Bayes's Theorem.

[2] Practical application
If you can identify that the data on (X1,X2,X3) correspond
to a mixture of two multivariate normal distributions whose
parameters (two multivariate mean vectors, one or two
covariance matrices, proportions in the two groups) you can
estimate, *AND* *IF* you are justified in assuming that the
*unobserved* response variable Y takes the value 0 for one
group and 1 for the other, *THEN* you can apply logistic
regression to the results (but you will not learn anything by
doing so that was not already available from the estimated
parameters, and the algebraic expression of the logistic
coefficients, as found in [1] above).

[3] Caveat
Being able to perform the identification and estimation of
the two multivariate normals as in [2], by using some mixture
identification method, does *NOT* of itself justify making
the assumption in [2] that the unobserved response variable
Y takes values 0 and 1 according to group membership *UNLESS*
that is precisely what you mean by Y (i.e. index of group
membership in one or other of the two multivariate normals).
If the meaning of variable Y is different, then success with
a mixture algorithm may have nothing to do with what the values
of Y are likely to be.

[4] Comment
Many algorithms for identifying mixtures are based on the
EM algorithm. Your additional prior information about how
the coefficients are distributed could be incorporated into
the EM algorithm, but I can't think explicitly of an R function
which would enable this (though the MCMC methods associated
with BRugs -- the R interface to OpenBUGS -- may allow you to
set this up). Probably others can offer more help on this aspect
of the matter.

I think it is necessary to be absolutely clear about what
your model represents!
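
To make points [1]-[2] concrete, here is a hedged sketch (an editorial
illustration only, assuming the mclust package, the equal-covariance case,
and ignoring the prior information on the coefficients; X is a matrix built
from Robin's data frame of x1, x2, x3, and which mixture component gets
labelled Y = 1 is itself an assumption):

library(mclust)
X   <- as.matrix(df[, c("x1", "x2", "x3")])  # df: the explanatory data
mix <- Mclust(X, G = 2, modelNames = "EEE")  # EM fit, common covariance
mu0 <- mix$parameters$mean[, 1]              # component taken as Y = 0
mu1 <- mix$parameters$mean[, 2]              # component taken as Y = 1
S   <- mix$parameters$variance$Sigma         # shared covariance matrix
p1  <- mix$parameters$pro[2]                 # mixing proportion for Y = 1
Sinv  <- solve(S)
beta  <- Sinv %*% (mu1 - mu0)                # slope coefficients
beta0 <- log(p1 / (1 - p1)) -                # intercept, via Bayes's Theorem
         0.5 * (t(mu1) %*% Sinv %*% mu1 - t(mu0) %*% Sinv %*% mu0)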

Hoping this helps,
Ted.

On 10-Jan-11 20:08:09, Robin Aly wrote:
 Dear Ted,
 
 sorry for being unclear. Let me try again.
 
 I indeed have no knowledge about the value of the response
 variable for any object.
 Instead, I have a data frame of explanatory variables for
 each object. For example,
 
  x1   x2   x3
 1   4.409974 2.348745 1.9845313
 2   3.809249 2.281260 1.9170466
 3   4.229544 2.610347 0.9127431
 4   4.259644 1.866025 1.5982859
 5   4.001306 2.225069 1.2551570
 ...
 
 , and I want to fit a regression model of the form
  y ~ x1 + x2 + x3.
 
 From prior information I know that all coefficients are
 approximately Gaussian distributed around one, and likewise
 that the intercept is around -10. Now I think there must be a
 package which estimates the coefficients more precisely by
 fitting the logistic regression function to the data without
 knowledge of the response variable (similar to fitting
 Gaussians in a mixture model where the class labels are
 unknown).
 
 I looked at the flexmix package, but this seems only to
 find dependencies in the data assuming the presence of some
 training data.
 I also found some evidence in Magder & Hughes (1997) (see below) that
 such an algorithm exists; however, from the documented math
 I can't apply the method to my problem.
 
 Thanks in advance,
 Best Regards
 Robin
 
 Magder, L. S. & Hughes, J. P. (1997). Logistic Regression When the Outcome
 Is Measured with Uncertainty. American Journal of Epidemiology, 146,
 195-203.
 
 
 
 
 On 01/04/2011 12:36 AM, (Ted Harding) wrote:
 On 03-Jan-11 14:02:21, Robin Aly wrote:
 Hi all,
 is there any package which can do an EM algorithm fitting of
 logistic regression coefficients given only the explanatory
 variables? I tried to realize this using the Design package,
 but I didn't find a way.

  Thanks a lot & Kind regards
 Robin Aly
 As written, this is a strange question! You imply that you
 do not have data on the response (0/1) variable at all,
 only on the explanatory variables. In that case there is
 no possible estimate, because that would require data on
 at least some of the values of the response variable.

 I think you should explain more clearly and explicitly what
 the information is that you have for all the variables.

Re: [R] Logistic Regression Fitting with EM-Algorithm

2011-01-03 Thread Ted Harding
On 03-Jan-11 14:02:21, Robin Aly wrote:
 Hi all,
 is there any package which can do an EM algorithm fitting of
 logistic regression coefficients given only the explanatory
 variables? I tried to realize this using the Design package,
 but I didn't find a way.
 
  Thanks a lot & Kind regards
 Robin Aly

As written, this is a strange question! You imply that you
do not have data on the response (0/1) variable at all,
only on the explanatory variables. In that case there is
no possible estimate, because that would require data on
at least some of the values of the response variable.

I think you should explain more clearly and explicitly what
the information is that you have for all the variables.

Ted.


E-Mail: (Ted Harding) ted.hard...@wlandres.net
Fax-to-email: +44 (0)870 094 0861
Date: 03-Jan-11   Time: 23:36:56
-- XFMail --

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression with response 0,1

2010-12-29 Thread Dennis Murphy
Hi:

I think you created a problem for yourself in the way you generated your
data.

y <- rbinom(2000, 1, .7)
euro <- rnorm(2000, m = 300 * y + 50 * (1 - y), s = 20 * y + 12 * (1 - y))
# Create a 2000 x 2 matrix of probabilities
prmat <- cbind(0.8 * y + 0.2 * (1 - y), 0.2 * y + 0.8 * (1 - y))
# sample sex from each row of prmat, with the rows comprising the distribution
sex <- apply(prmat, 1, function(x) sample(c('m', 'f'), 1, prob = x))

df <- data.frame(euro, sex, y)

# Histogram of euro: notice the separation in distributions
hist(euro, nclass = 50)
# Generate an indicator between the two clusters of euro
spl <- euro > 150
# Now show a table of that split vs. response
table(spl, y)

This is what I get for my simulation:

> table(spl, y)
       y
spl        0    1
  FALSE  572    0
  TRUE     0 1428

which in turn leads to

m <- glm(y ~ euro + sex, data = df, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred

This is what is known as 'complete data separation' in the logistic
regression literature. Basically, you've generated data so that all the
successes are associated with a N(300, 20) distribution and all the failures
with a N(50, 12) distribution. If this is a classification problem,
congratulations - the margin on the support vector will be huge :)  OTOH, if
you're trying to fit a logistic model for purposes of explanation, you've
created a problem, especially with respect to prediction.

...and it does matter whether this is a regression problem or a
classification problem. In the latter, separation is a good thing; in the
former, it creates convergence problems.
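
An editorial aside, naming a remedy not used in this thread: under complete
separation, Firth's bias-reduced logistic regression still returns finite
estimates, e.g. via the logistf package.

library(logistf)
m_firth <- logistf(y ~ euro + sex, data = df)  # df as simulated above
summary(m_firth)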

Since you have a continuous predictor in the model, there is an additional
complication: the logistic regression null deviance does not have an
asymptotic chi-square distribution, so tests involving reductions of
deviance from the null model are not guaranteed to have asymptotic
chi-square distributions *when the predictor x is truly continuous*.

More below.

On Wed, Dec 29, 2010 at 9:48 AM, Federico Bonofiglio bonori...@gmail.com wrote:

 Dear Masters,
 first I'd like to wish you all a great 2011 and happy holidays,

 second (here comes the boring stuff), I have a question which I hope you
 can answer:

 I run a logistic regression by glm() on data of the following type:
 (y1=1,x1=x1); (y2=0,x2=x2); ... (yn=0,xn=xn), where the response (y) is
 a binary outcome in {0,1} and x is any explanatory variable (continuous or
 not) observed with the i-th outcome.

 This is indeed one of the most frequent cases when challenged with binary
 responses, though I know R manages such responses slightly differently (a
 vector for the success counts and one for the failures), and I'm not sure
 whether my summary.glm gives me any sensible answer at all.

 For the purpose I have tried to emphasize the numbers so as to obtain
 significant results:

 y <- rbinom(2000, 1, .7)                        # response

 euro <- numeric(2000); sex <- character(2000)   # (initialization added)
 for (i in 1:2000) {
   euro[i] <- if (y[i] == 1) rnorm(1, 300, 20)   # explanatory 1
              else rnorm(1, 50, 12)
 }

 for (i in 1:2000) {
   sex[i] <- if (y[i] == 1) sample(c("m", "f"), 1, prob = c(.8, .2))  # explanatory 2
             else sample(c("m", "f"), 1, prob = c(.2, .8))
 }

 m <- glm(y ~ euro + factor(sex), family = binomial)

 summary(m)




 My worries:

   - are the estimates correct?


The people who wrote the glm() routine were smart enough to anticipate the
data separation case and are warning you of potential instability in the
model estimates/predictions as a result of producing predictions of exactly
0 or 1. This is a warning to take seriously -  your generated data produced
these based on x alone.

   - degrees of freedom increase dramatically (one per cell), so do I
    risk never obtaining a significant result?


When using grouped or ungrouped data, comparisons between nested models will
be the same whether the data are grouped or ungrouped in non-pathological
situations.


 I also take the chance to ask whether you know of any implemented method to
 plot logistic curves directly from a glm() model.



The following is an example to illustrate some of the questions you raised.


# Example to illustrate the difference between grouped and ungrouped
# logistic regression analyses

library(reshape2)
library(lattice)

# Sample 50 distinct x values 300 times
x <- sample(1:50, 300, replace = TRUE)
# P(Y = 1 | X) increases with x
y <- rbinom(300, 1, (10 + x)/80)
ind <- x > 25
# males sampled more heavily when x > 25
p <- cbind(0.7 * ind + 0.3 * (1 - ind), 0.3 * ind + 0.7 * (1 - ind))
sex <- apply(p, 1, function(x) sample(c('m', 'f'), 1, prob = x))
df <- data.frame(sex, x, y)

# Ungrouped logistic regression
# treat x as a continuous covariate
m1 <- glm(y ~ sex + x, data = df, family = binomial)

# Group the data by x * sex combinations
u <- as.data.frame(xtabs(~ y + sex + x, data = df))
# cast() reshapes the data so that the 0/1 frequencies become separate columns
# cast() comes from package reshape(2)
# In this case, sex and x are ID variables, and y is reshaped

Re: [R] logistic regression or not?

2010-12-21 Thread Ben Bolker
array chip arrayprofile at yahoo.com writes:

[snip]

 I can think of analyzing this data using glm() with the attached dataset:
 
 test <- read.table('test.txt', sep = '\t')
 fit <- glm(cbind(positive, total - positive) ~ treatment, test, family = binomial)
 summary(fit)
 anova(fit, test = 'Chisq')
 
 First, is this still called logistic regression or something else? I thought 
 with logistic regression, the response variable is a binary factor?

  Sometimes I've seen it called binomial regression, or just 
a binomial generalized linear model

 Second, then summary(fit) and anova(fit, test='Chisq') gave me different p 
 values, why is that? which one should I use?

  summary(fit) gives you p-values from a Wald test.
  anova() gives you tests based on the Likelihood Ratio Test.
  In general the LRT is more accurate.
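
  A sketch of where the two kinds of p-values come from, using the fit
above (drop1() is an alternative that gives a per-term LRT):

  summary(fit)$coefficients     # Wald z statistics and p-values
  anova(fit, test = 'Chisq')    # sequential likelihood-ratio tests
  drop1(fit, test = 'Chisq')    # per-term LRT; with one term, same as anova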

 Third, is there an equivalent model where I can use variable percentage 
 instead of positive & total?

  glm(percentage ~ treatment, weights = total, data = test, family = binomial)

 is equivalent to the model you fitted above.
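
  A quick check of the equivalence (the thread's test.txt is not attached
here, so this uses a made-up toy dataset of the same shape):

  test <- data.frame(treatment = factor(rep(c("A", "B"), each = 3)),
                     positive  = c(12, 15, 9, 30, 28, 33),
                     total     = rep(40, 6))
  fit1 <- glm(cbind(positive, total - positive) ~ treatment,
              data = test, family = binomial)
  fit2 <- glm(positive / total ~ treatment, weights = total,
              data = test, family = binomial)
  all.equal(coef(fit1), coef(fit2))   # TRUE: the two forms give the same fit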
 
 Finally, what is the best way to analyze this kind of dataset 
 where it's almost the same as ANOVA except that the response variable
  is a proportion (or success and failure)?

  Don't quite know what you mean here.  How is the situation "almost
the same as ANOVA" different from the situation you described above?
Do you mean when there are multiple factors? or ???

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression or not?

2010-12-21 Thread S Ellison
A possible caveat here.

Traditionally, logistic regression was performed on the
logit-transformed proportions, with the standard errors based on the
residuals for the resulting linear fit. This accommodates overdispersion
naturally, but without telling you that you have any.

glm with a binomial family does not allow for overdispersion unless
you use the quasibinomial family. If you have overdispersion, standard
errors from glm will be unrealistically small. Make sure your model fits
in glm before you believe the standard errors, or use the quasibinomial
family.
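
A sketch of that check (reusing the made-up toy 'test' data from earlier in
this thread):

fit <- glm(cbind(positive, total - positive) ~ treatment,
           data = test, family = binomial)
deviance(fit) / df.residual(fit)   # values well above 1 hint at overdispersion
fitq <- update(fit, family = quasibinomial)
summary(fitq)$dispersion           # SEs inflate by the square root of this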

Steve Ellison
LGC



 Ben Bolker bbol...@gmail.com 21/12/2010 13:08:34 
array chip arrayprofile at yahoo.com writes:

[snip]

 I can think of analyzing this data using glm() with the attached dataset:

 test <- read.table('test.txt', sep = '\t')
 fit <- glm(cbind(positive, total - positive) ~ treatment, test, family = binomial)
 summary(fit)
 anova(fit, test = 'Chisq')
 
 First, is this still called logistic regression or something else? I thought
 with logistic regression, the response variable is a binary factor?

  Sometimes I've seen it called binomial regression, or just 
a binomial generalized linear model

 Second, then summary(fit) and anova(fit, test='Chisq') gave me different p
 values, why is that? which one should I use?

  summary(fit) gives you p-values from a Wald test.
  anova() gives you tests based on the Likelihood Ratio Test.
  In general the LRT is more accurate.

 Third, is there an equivalent model where I can use variable percentage
 instead of positive & total?

  glm(percentage ~ treatment, weights = total, data = test, family = binomial)

 is equivalent to the model you fitted above.
 
 Finally, what is the best way to analyze this kind of dataset 
 where it's almost the same as ANOVA except that the response
variable
  is a proportion (or success and failure)?

  Don't quite know what you mean here.  How is the situation "almost
the same as ANOVA" different from the situation you described above?
Do you mean when there are multiple factors? or ???


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression or not?

2010-12-21 Thread peter dalgaard

On Dec 21, 2010, at 14:22 , S Ellison wrote:

 A possible caveat here.
 
 Traditionally, logistic regression was performed on the
 logit-transformed proportions, with the standard errors based on the
 residuals for the resulting linear fit. This accommodates overdispersion
 naturally, but without telling you that you have any.
 
 glm with a binomial family does not allow for overdispersion unless
 you use the quasibinomial family. If you have overdispersion, standard
 errors from glm will be unrealistically small. Make sure your model fits
 in glm before you believe the standard errors, or use the quasibinomial
 family.

...and before you believe in overdispersion, make sure you have a credible 
explanation for it. All too often, what you really have is a model that doesn't 
fit your data properly.

 
 Steve Ellison
 LGC
 
 
 
 Ben Bolker bbol...@gmail.com 21/12/2010 13:08:34 
 array chip arrayprofile at yahoo.com writes:
 
 [snip]
 
  I can think of analyzing this data using glm() with the attached dataset:

  test <- read.table('test.txt', sep = '\t')
  fit <- glm(cbind(positive, total - positive) ~ treatment, test, family = binomial)
  summary(fit)
  anova(fit, test = 'Chisq')
 
  First, is this still called logistic regression or something else? I thought
  with logistic regression, the response variable is a binary factor?
 
  Sometimes I've seen it called binomial regression, or just 
 a binomial generalized linear model
 
  Second, then summary(fit) and anova(fit, test='Chisq') gave me different p
  values, why is that? which one should I use?
 
  summary(fit) gives you p-values from a Wald test.
  anova() gives you tests based on the Likelihood Ratio Test.
  In general the LRT is more accurate.
 
  Third, is there an equivalent model where I can use variable percentage
  instead of positive & total?
 
   glm(percentage ~ treatment, weights = total, data = test, family = binomial)
 
 is equivalent to the model you fitted above.
 
 Finally, what is the best way to analyze this kind of dataset 
 where it's almost the same as ANOVA except that the response
 variable
 is a proportion (or success and failure)?
 
   Don't quite know what you mean here.  How is the situation "almost
 the same as ANOVA" different from the situation you described above?
 Do you mean when there are multiple factors? or ???
 
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression or not?

2010-12-21 Thread S Ellison

...and before you believe in overdispersion, make sure you have a
credible explanation for it. All too often, what you really have 
is a model that doesn't fit your data properly.

Well put.

A possible fortune?

S Ellison




__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression or not?

2010-12-21 Thread array chip
Thank you Ben, Steve and Peter. 

Ben, my last question was to see if there are other ways of analyzing this type 
of data where the response variable is a proportion, in addition to binomial 
regression. 


BTW, I also found that the following is an equivalent model directly using
percentage:

glm(log(percentage / (1 - percentage)) ~ treatment, data = test)

Thanks

John

 




From: Ben Bolker bbol...@gmail.com
To: r-h...@stat.math.ethz.ch
Sent: Tue, December 21, 2010 5:08:34 AM
Subject: Re: [R] logistic regression or not?

array chip arrayprofile at yahoo.com writes:

[snip]

 I can think of analyzing this data using glm() with the attached dataset:
 
 test <- read.table('test.txt', sep = '\t')
 fit <- glm(cbind(positive, total - positive) ~ treatment, test, family = binomial)
 summary(fit)
 anova(fit, test = 'Chisq')

 First, is this still called logistic regression or something else? I thought 
 with logistic regression, the response variable is a binary factor?

  Sometimes I've seen it called binomial regression, or just 
a binomial generalized linear model

 Second, then summary(fit) and anova(fit, test='Chisq') gave me different p 
 values, why is that? which one should I use?

  summary(fit) gives you p-values from a Wald test.
  anova() gives you tests based on the Likelihood Ratio Test.
  In general the LRT is more accurate.

Third, is there an equivalent model where I can use variable percentage
instead of positive & total?

  glm(percentage ~ treatment, weights = total, data = test, family = binomial)

is equivalent to the model you fitted above.
 
 Finally, what is the best way to analyze this kind of dataset 
 where it's almost the same as ANOVA except that the response variable
  is a proportion (or success and failure)?

  Don't quite know what you mean here.  How is the situation "almost
the same as ANOVA" different from the situation you described above?
Do you mean when there are multiple factors? or ???
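
[Editor's note: to make the equivalence and the Wald-vs-LRT contrast above concrete, here is a minimal sketch on simulated data; the data frame name 'test' mirrors the poster's code, but all numbers are invented:]

set.seed(1)
test <- data.frame(treatment = gl(2, 10, labels = c("ctrl", "trt")),
                   total     = 50)
test$positive   <- rbinom(20, size = test$total,
                          prob = ifelse(test$treatment == "trt", 0.6, 0.4))
test$percentage <- test$positive / test$total

# counts form: cbind(successes, failures)
fit1 <- glm(cbind(positive, total - positive) ~ treatment,
            data = test, family = binomial)
# proportion form: response in [0, 1] with the binomial totals as weights
fit2 <- glm(percentage ~ treatment, weights = total,
            data = test, family = binomial)
all.equal(coef(fit1), coef(fit2))   # TRUE: same fit, two parameterizations

summary(fit1)$coefficients          # Wald z-tests
anova(fit1, test = "Chisq")         # likelihood ratio test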



Re: [R] logistic regression or not?

2010-12-21 Thread Ben Bolker
On 10-12-21 12:20 PM, array chip wrote:
 Thank you Ben, Steve and Peter.
  
 Ben, my last question was to see if there are other ways of analyzing
 this type of data where the response variable is a proportion, in
 addition to binomial regression.
  
 BTW, I also found that the following is an equivalent model directly
 using percentage:
  
 glm(log(percentage/(1-percentage))~treatment,data=test)
  
 Thanks
  
 John
 

  Yes, but this is a different model.

  The model you have here uses Gaussian errors (it is in fact an
identical model, although not necessarily quite an identical algorithm
(?), to just using lm()).  It will fail if you have any percentages that
are exactly 0 or 1.  See Stuart's comment about how things were done in
the old days.

  Beta regression (see e.g. the betareg package) is another way of
handling analysis of proportions.

 
 
[snip]
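
[Editor's note: a sketch contrasting the two routes mentioned above - Gaussian errors on the logit scale versus beta regression - on simulated proportions strictly inside (0, 1); the betareg package is assumed to be installed:]

library(betareg)
set.seed(3)
test <- data.frame(treatment  = gl(2, 25),
                   percentage = c(rbeta(25, 8, 4), rbeta(25, 4, 8)))

# Gaussian errors on the logit scale: the same fit as lm() on qlogis(p);
# breaks down if any percentage is exactly 0 or 1
fit_gauss <- glm(log(percentage / (1 - percentage)) ~ treatment, data = test)

# beta regression models the proportion directly (logit link by default)
fit_beta <- betareg(percentage ~ treatment, data = test)
summary(fit_beta)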




Re: [R] logistic regression or not?

2010-12-21 Thread array chip
Ben, thanks again.

John





[snip]


Re: [R] Logistic regression with factorial effect

2010-11-18 Thread Bert Gunter
You would be better off posting to R-sig-mixed-models or R-sig-ecology

-- Bert

On Thu, Nov 18, 2010 at 9:32 AM, Billy.Requena billy.requ...@gmail.com wrote:

 Hello,

 I’d like to evaluate the temporal effect on the relationship between a
 continuous variable (e.g. size) and the probability of mate success.
 Initially I was trying to do a logistic regression model incorporating the
 temporal effect, but I don’t know if that is the best option. I simulated
 some data and that’s the problem:


 rep(c("Jan", "Feb", "Mar", "Apr", "May"), each = 20) -> month
 month <- as.factor(month)   # assign the result; a bare as.factor(month) only prints

 rep(LETTERS[1:20], 5) -> ind

 rep(sort(rnorm(20, 5.5, 0.2)), 5) -> size
 size

 c(c(rep(0, 12), rep(1, 8)), c(rep(0, 12), rep(1, 8)),
   c(rep(c(0, 1), 10)),
   c(rep(1, 8), rep(0, 12)),
   c(rep(1, 8), rep(0, 12))) -> success1
 success1

 With the object ‘success1’, only the highest values of size are successful
 at the two first months, but only the lowest values of size are successful
 at the two last months. So, the overall effect of size on the successful
 probability should not exist, but if we consider the interaction between
 size and time, we should be able to see that effect.


 library(lme4)   # needed for glmer

 glm(success1 ~ size, family = binomial) -> test1.1
 glmer(success1 ~ size + (1 | ind), family = binomial) -> test2.1
 glmer(success1 ~ size + month + (1 | ind), family = binomial) -> test3.1
 glmer(success1 ~ size : month + (1 | ind), family = binomial) -> test4.1


 However, the expected result is not observed in the output of any of these
 models. Using a model selection approach and comparing the AIC values of all
 models, it seems that the ‘test1.1’ model is the most likely. All the deviances
 are almost at the same level, and the differences in AIC values are due to
 the new parameters added.

 Given the data was simulated to generate differences between models and
 model ‘test4.1’ is supposed to be the best one, I’m probably doing something
 wrong.
 Has anyone faced this kind of problem? Or has anyone any idea how to solve
 that?

 Thanks and Regards
 Gustavo Requena
 PhD student - Laboratory of Arthropod Behavior and Evolution
 Universidade de São Paulo
 http://ecologia.ib.usp.br/opilio/gustavo.html





-- 
Bert Gunter
Genentech Nonclinical Biostatistics
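
[Editor's note: the model the poster appears to be after is the size-by-month interaction with main effects included; 'size : month' alone omits the main effects, which changes the fit. A hedged sketch, reusing the month / ind / size / success1 objects simulated in the quoted message; 'test5.1' is a new name introduced here, and lme4 is assumed:]

library(lme4)
test5.1 <- glmer(success1 ~ size * month + (1 | ind), family = binomial)
test2.1 <- glmer(success1 ~ size + (1 | ind), family = binomial)
# likelihood ratio test: does the interaction improve on the size-only model?
anova(test2.1, test5.1)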



Re: [R] logistic regression tree

2010-08-22 Thread Kay Cichini

dear all,
thank you everyone for the profound answers and the much-needed references!

achim, thank you for the very kind offer!! sadly i'm not around vienna in
the near future, otherwise i'd be glad to come back to your invitation.

yours,
kay

-

Kay Cichini
Postgraduate student
Institute of Botany
Univ. of Innsbruck




Re: [R] logistic regression tree

2010-08-22 Thread Peter Dalgaard
On 08/22/2010 01:51 PM, Kay Cichini wrote:

 achim, thank you for the very kind offer!! sadly i'm not around vienna in
 the near future, otherwise i'd be glad to come back to your invitation.

Not that it's any of my business, but I don't think you need to go THAT
far to visit Achim these days... <grin>

-pd

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com



Re: [R] logistic regression tree

2010-08-20 Thread Kay Cichini

hello gavin  achim,

thanks for responding.

by logistic regression tree i meant a regression tree for a binary response
variable.
but as you say i could also use a classification tree - in my case with only
two outcomes.

i'm not aware whether there are substantial differences to expect between
the two approaches (logistic regression tree vs. classification tree with
two outcomes).

as i'm new to trees / boosting / etc., i might be best advised to use the
most comprehensible method - a function whose reasoning can be understood
without having to climb a steep learning ladder. at the moment i don't know
which this would be.

regarding the meaning of absences at stands: as these species are frequent
in the area and hence there is no limitation by propagules, i guess absence
is really due to unfavourable conditions.

thanks a lot,
kay

 

-

Kay Cichini
Postgraduate student
Institute of Botany
Univ. of Innsbruck


-- 
View this message in context: 
http://r.789695.n4.nabble.com/logistic-regression-tree-tp2331847p2332447.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] logistic regression tree

2010-08-20 Thread Achim Zeileis

On Fri, 20 Aug 2010, Kay Cichini wrote:



hello gavin  achim,

thanks for responding.

by logistic regression tree i meant a regression tree for a binary 
response variable. but as you say i could also use a classification tree 
- in my case with only two outcomes.


i'm not aware whether there are substantial differences to expect between
the two approaches (logistic regression tree vs. classification tree with
two outcomes).


I don't think that there is a universally accepted terminology for this.
"Classification tree" typically pertains to categorical responses
(independent of the number of categories, i.e., also for binary responses).


"Logistic regression tree" is (to the best of my knowledge) not typically
used as a term for binary classification trees.


Technical excursion:
However, "logistic regression tree" may mean a specific algorithm (LOTUS -
LOgistic regression Tree with Unbiased Splits) developed by Kin-Yee Chan
and Wei-Yin Loh. This algorithm shares various ideas with the LMT
(Logistic Model Trees) algorithm developed by Niels Landwehr with
co-authors (available in R through RWeka) and the MOB (MOdel-Based
partitioning) algorithm when employed with binary GLMs (as available in
the party package).


as i'm new to trees / boosting / etc., i might be best advised to use the
most comprehensible method - a function whose reasoning can be understood
without having to climb a steep learning ladder. at the moment i don't
know which this would be.


Trees may be a good starting point. As I wrote to you off-list: Feel free 
to drop by my office if you want to chat about this.


Best,
Z
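
[Editor's note: a hedged sketch of the MOB idea mentioned in the excursion above, here via glmtree() from partykit, the successor to the party package; the data are simulated and purely illustrative:]

library(partykit)
set.seed(4)
d <- data.frame(x = rnorm(200), z1 = gl(2, 100), z2 = runif(200))
# the effect of x on y flips sign between the two z1 groups
d$y <- rbinom(200, 1, plogis(ifelse(d$z1 == "1", 2, -2) * d$x))
# logistic regression of y on x, partitioned over z1 and z2
tr <- glmtree(y ~ x | z1 + z2, data = d, family = binomial)
plot(tr)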



Re: [R] logistic regression tree

2010-08-20 Thread Frank Harrell


It would be good to tell us of the frequency of observations in each 
category of Y, and the number of continuous X's.  Recursive 
partitioning will require perhaps 50,000 observations in the less 
frequent Y category for its structure and predicted values to 
validate, depending on X and the signal:noise ratio.  Hence the use of 
combinations of trees nowadays as opposed to single trees.  Or 
logistic regression.


Frank

Frank E Harrell Jr   Professor and Chairman   School of Medicine
                     Department of Biostatistics   Vanderbilt University

On Fri, 20 Aug 2010, Kay Cichini wrote:



hello gavin  achim,

thanks for responding.

by logistic regression tree i meant a regression tree for a binary response
variable.
but as you say i could also use a classification tree - in my case with only
two outcomes.

i'm not aware if there are substantial differences to expect for the two
approaches (logistic regression tree vs. classification tree with two
outcomes).

as i'm new to trees / boosting / etc. i also might be advised to use the
more comprehensible method / a function which argumentation is understood
without having to climb a steep learning ledder, respectively. at the moment
i don't know which this would be.

regarding the meaning of absences at stands: as these species are frequent
in the area and hence there is no limitation by propagules i guess absence
is really due to unfavourable conditions.

thanks a lot,
kay



-

Kay Cichini
Postgraduate student
Institute of Botany
Univ. of Innsbruck


--
View this message in context: 
http://r.789695.n4.nabble.com/logistic-regression-tree-tp2331847p2332447.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





Re: [R] logistic regression tree

2010-08-20 Thread Kay Cichini

hello,

my data-collection is not yet finished, but i have nonetheless started
investigating possible analysis methods.

below i give a very close simulation of my future data-set; however, there
might be more nominal explanatory variables - there will be no continuous
ones at all (maybe some ordered nominal..).

i tried several packages today, but the one i fancied most was ctree of the
party package.
i can't see why the given no. of datapoints (n=100) might pose a problem
here - but please teach me better, as i might be naive..

i'd be very glad about comments on the use of ctree on such a dataset and
whether i'm overlooking possible pitfalls.

thank you all,
kay

##
# an example with 3 nominal explanatory variables:
# Y is presence of a certain invasive plant species
# introduced effect for fac1 and fac3, fac2 without effect:
# presence with prob. 0.75 in the factor combination fac1 = "I" (say fac1 is
# geogr. region) and fac3 = "a"|"b"|"c" (say all richer substrates);
# presence is not influenced by fac2, which might be vegetation type.
##
library(party)
dat <- expand.grid(fac1 = c("I", "II"),
                   fac2 = LETTERS[1:5],
                   fac3 = letters[1:10])

print(dat <- dat[order(dat$fac1, dat$fac2, dat$fac3), ])

dat$fac13 <- paste(dat$fac1, dat$fac3, sep = "")
for (i in 1:nrow(dat)) {
  if (dat$fac13[i] %in% c("Ia", "Ib", "Ic")) {
    dat$Y[i] <- rbinom(1, 1, 0.75)   # presence likely in region I, substrates a-c
  } else {
    dat$Y[i] <- rbinom(1, 1, 0)      # absent everywhere else
  }
}
dat$Y <- as.factor(dat$Y)

tr <- ctree(Y ~ fac1 + fac2 + fac3, data = dat)
plot(tr)
##


-

Kay Cichini
Postgraduate student
Institute of Botany
Univ. of Innsbruck




Re: [R] logistic regression tree

2010-08-20 Thread Gavin Simpson
On Fri, 2010-08-20 at 14:46 -0700, Kay Cichini wrote:
 hello,
 
 my data-collection is not yet finished, but i have nonetheless started
 investigating possible analysis methods.
 
 below i give a very close simulation of my future data-set; however, there
 might be more nominal explanatory variables - there will be no continuous
 ones at all (maybe some ordered nominal..).
 
 i tried several packages today, but the one i fancied most was ctree of the
 party package.
 i can't see why the given no. of datapoints (n=100) might pose a problem
 here - but please teach me better, as i might be naive..

I'm no expert, but single trees are unstable predictors; change your
data slightly and you might get a totally different model/tree. I hope
that worries you?

Frank's comment was that depending upon the signal-to-noise ratio in
your sample of data, you might need a very large data set indeed, much
larger than your 100 data points/samples, to have any confidence in the
single fitted tree.

For this reason, ensemble or committee methods have been developed that
combine the predictions from many trees fitted to perturbed versions of
the training data. Such methods include boosting and randomForests.

We are venturing into territory not suited to email list format;
statistical consultancy. As Achim is local to you and has kindly offered
to meet you, I would strongly suggest you take up his offer.

In the meantime, here are a couple of references to look at if you
aren't familiar with these statistical machine learning techniques.

Cutler et al (2007) Random forests for classification in ecology.
Ecology 88(11), 2783-2792.

Elith, J., Leathwick, J.R., and Hastie, T. (2008) A working guide to
boosted regression trees. Journal of Animal Ecology, 77, 802-813.

Also, don't dismiss the logistic regression model. Modern techniques
like the lasso and elastic net are available for GLMs such as this and
include model selection as part of their fitting. These are underused by
ecologists (IMHO), who seem to like (abuse?) the information-theoretic
approaches and step-wise selection procedures... (apologies to
ecologists here [I am one too] for being general!) See:

Dahlgren, J.P. (2010) Alternative regression methods are not considered
in Murtaugh (2009) or by ecologists in general. Ecology Letters 13(5),
E7-E9.

HTH

G

[snip]

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London  [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
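
[Editor's note: a hedged sketch of the penalized logistic regression Gavin mentions, via the glmnet package on simulated data; alpha = 1 gives the lasso and 0 < alpha < 1 the elastic net:]

library(glmnet)
set.seed(5)
X <- matrix(rnorm(200 * 10), 200, 10)        # 10 candidate predictors
y <- rbinom(200, 1, plogis(X[, 1] - X[, 2])) # only the first two matter
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)
coef(cvfit, s = "lambda.min")   # coefficients at the cross-validated lambda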



Re: [R] logistic regression tree

2010-08-20 Thread Frank Harrell



On Fri, 20 Aug 2010, Kay Cichini wrote:



[snip]


See

http://biostat.mc.vanderbilt.edu/wiki/Main/ComplexDataJournalClub#Sebastiani_et_al_Nature_Genetics

The recursive partitioning simulation there will give you an idea - 
you can modify the R code to simulate a situation more like yours. 
When you simulate the true patterns and see how far the tree is from 
discovering the true patterns, you'll be surprised.


Frank

 

[snip]





Re: [R] logistic regression tree

2010-08-19 Thread Gavin Simpson
On Thu, 2010-08-19 at 13:42 -0700, Kay Cichini wrote:
 hello everyone,
 
 i sampled 100 stands at 20 restoration sites and recorded presence of 3
 different invasive plant species.
 i came across logistic regression trees and wonder if this is suited for my
 purpose - predicting presence of these problematic invasive plant species
 (one by one) from a set of recorded ecological / geographical parameters.
 i'd be glad if someone would comment on applying this method to such data -
 maybe someone could point me to useful references.
 also, i was not able to find out if there is a package implementing logistic
 regression trees?

Not sure what a "logistic regression tree" is, but a classification tree
would be useful here: treat each species as present (== 1) or absent (== 0)
and try to fit a tree consisting of a set of splits in X covariates that
minimise a suitable deviance criterion.

If you want to fit all three species at once, try multivariate trees,
but IIRC, they (in package mvpart at least) expect a count-based data
set, i.e. the deviance criterion they used (sum of squares) is probably
not suited to binary type data.

The one problem I foresee is that you only have 100 data points, and even
that number is pseudo-replicated, as you have multiple samples from just
20 sites. Trees are unstable at the best of times and work best when
given a lot of data. Boosting, bagging and randomForests can help, but
they again work best with large data sets. I suppose "large" will be
relative to the signal-to-noise ratio in your data.

Ecologically, one needs to consider what a 0 value means (an absence):
was the invasive not present due to the environment being bad or just
because it hasn't got there yet despite environment being good? How you
deal with that is anybody's guess.

Try the R-SIG-Ecology list for further help.

G

 
 thanks in advance,
 kay
 
 -
 
 Kay Cichini
 Postgraduate student
 Institute of Botany
 Univ. of Innsbruck
 
 

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London  [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
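
[Editor's note: a minimal sketch of the classification-tree approach described above, using rpart on simulated presence/absence data; variable names and numbers are invented for illustration:]

library(rpart)
set.seed(6)
d <- data.frame(elev = runif(100, 500, 2000), soil = gl(4, 25))
p <- plogis((d$elev - 1200) / 200)   # presence probability rises with elevation
d$present <- factor(ifelse(rbinom(100, 1, p) == 1, "present", "absent"))
tr <- rpart(present ~ elev + soil, data = d, method = "class")
printcp(tr)   # complexity-parameter table, as a guide for pruning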



  1   2   >