Re: [scikit-learn] Using logistic regression with count proportions data

2016-10-10 Thread Michael Eickenberg
Here is a possibly useful comment by larsmans on Stack Overflow about
exactly this procedure:

http://stackoverflow.com/questions/26604175/how-to-predict-a-continuous-dependent-variable-that-expresses-target-class-proba/26614131#comment41846816_26614131
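
In case it is useful, here is a minimal, untested sketch of the
two-rows-plus-sample_weight approach discussed below (expand each observed
proportion into one "positive" and one "negative" row and pass the counts via
sample_weight). The arrays X, counts_pos and counts_neg are made up for
illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[0.0, 1.0],
                  [1.0, 0.0]])       # one row per feature configuration
    counts_pos = np.array([3, 120])  # positives observed per configuration
    counts_neg = np.array([7, 880])  # negatives observed per configuration

    # Duplicate every row: once with label 1, once with label 0, and weight
    # each copy by the corresponding count.
    X_expanded = np.vstack([X, X])
    y_expanded = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
    weights = np.concatenate([counts_pos, counts_neg])

    clf = LogisticRegression()
    clf.fit(X_expanded, y_expanded, sample_weight=weights)

    # predict_proba()[:, 1] is then the modelled proportion.
    print(clf.predict_proba(X)[:, 1])

Note that LogisticRegression is L2-penalized by default (C=1.0), so for
something closer to an unpenalized glm fit you would want to use a large C.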


On Mon, Oct 10, 2016 at 4:04 PM, Sean Violante 
wrote:

> Sorry, yes, there was a misunderstanding:
>
> I meant that for each feature configuration you should pass in two rows
> (one for the positive cases and one for the negative), with the sample
> weight being the corresponding count for that configuration and class.
>
> And I am saying that the total count is important because you could have
> a situation where one feature combination occurs 10 times and another
> feature combination occurs 1000 times.
>
>
>
>
>
> On Mon, Oct 10, 2016 at 3:48 PM, Raphael C  wrote:
>
>> On 10 October 2016 at 12:22, Sean Violante 
>> wrote:
>> > No (but please check!)
>> >
>> > Sample weights should be the counts for the respective label (0/1).
>> >
>> > [I am actually puzzled by the glm help file - the proportions form loses
>> > how often an input data 'row' was present relative to the others -
>> > though you could recover this by repeating the row 'n' times.]
>>
>> I think we might be talking at cross purposes.
>>
>> I have a matrix X where each row is a feature vector. I also have an
>> array y where y[i] is a real number between 0 and 1. I would like to
>> build a regression model that predicts the y values given the X rows.
>>
>> Now each y[i] value in fact comes from simply counting the number of
>> positively labelled elements in a particular set (set i) and dividing by
>> the number of elements in that set.  So I can easily fit this into the
>> model given by the R package glm by replacing each y[i] value with a
>> pair of "number of positives" and "number of negatives" (this is case
>> 2 in the docs I quoted), or by using case 3, which asks for y[i] plus
>> the total number of elements in set i.
>>
>> I don't see how a single integer for sample_weight[i] would cover this
>> information, but I am sure I must have misunderstood.  At best it seems
>> it could cover the number of positive values, but that misses half
>> the information.
>>
>> Raphael
>>
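
Regarding the "missing half the information" worry above: nothing is lost,
because each set i ends up contributing two rows and therefore two sample
weights. A rough, untested sketch of the conversion, where y_prop and n_total
are hypothetical arrays holding the proportion and the size of each set:

    import numpy as np

    y_prop = np.array([0.3, 0.12])   # observed proportion for set i
    n_total = np.array([10, 1000])   # number of elements in set i

    counts_pos = np.rint(y_prop * n_total)  # 'successes', e.g. 3 and 120
    counts_neg = n_total - counts_pos       # 'failures',  e.g. 7 and 880

    # Set i appears once with label 1 (weight counts_pos[i]) and once with
    # label 0 (weight counts_neg[i]), so both the proportion and the total
    # count are encoded.
    sample_weight = np.concatenate([counts_pos, counts_neg])
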
>> >
>> > On Mon, Oct 10, 2016 at 1:15 PM, Raphael C  wrote:
>> >>
>> >> How do I use sample_weight for my use case?
>> >>
>> >> In my case, is "y" an array of 0s and 1s, and sample_weight then an
>> >> array of real numbers between 0 and 1, where I should make sure to set
>> >> sample_weight[i] = 0 when y[i] = 0?
>> >>
>> >> Raphael
>> >>
>> >> On 10 October 2016 at 12:08, Sean Violante 
>> >> wrote:
>> >> > It should be the sample_weight parameter of fit:
>> >> >
>> >> >
>> >> > http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
>> >> >
>> >> > On Mon, Oct 10, 2016 at 1:03 PM, Raphael C  wrote:
>> >> >>
>> >> >> I just noticed this about the glm package in R.
>> >> >> http://stats.stackexchange.com/a/26779/53128
>> >> >>
>> >> >> "
>> >> >> The glm function in R allows 3 ways to specify the formula for a
>> >> >> logistic regression model.
>> >> >>
>> >> >> The most common is that each row of the data frame represents a
>> >> >> single observation and the response variable is either 0 or 1 (or a
>> >> >> factor with 2 levels, or another variable with only 2 unique values).
>> >> >>
>> >> >> Another option is to use a 2 column matrix as the response variable
>> >> >> with the first column being the counts of 'successes' and the second
>> >> >> column being the counts of 'failures'.
>> >> >>
>> >> >> You can also specify the response as a proportion between 0 and 1,
>> >> >> then specify another column as the 'weight' that gives the total
>> >> >> number that the proportion is from (so a response of 0.3 and a
>> >> >> weight of 10 is the same as 3 'successes' and 7 'failures')."
>> >> >>
>> >> >> Either of the last two options would do for me.  Does scikit-learn
>> >> >> support either of them?
>> >> >>
>> >> >> Raphael
>> >> >>
>> >> >> On 10 October 2016 at 11:55, Raphael C  wrote:
>> >> >> > I am trying to perform regression where my dependent variable is
>> >> >> > constrained to be between 0 and 1. This constraint comes from the
>> >> >> > fact that it represents a count proportion, that is, counts in
>> >> >> > some category divided by a total count.
>> >> >> >
>> >> >> > In the literature it seems that one common way to tackle this is
>> >> >> > to use logistic regression. However, it appears that in
>> >> >> > scikit-learn logistic regression is only available as a classifier
>> >> >> > (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
>> >> >> > Is that right?
>> >> >> >
>> >> >> > Is there another way to perform regression using scikit-learn
>> >> >> > where the dependent variable is a count proportion?
>> >> >> >
>> >> >> > Thanks for any help.
>> >> >> >
>> >> >> > Raphael
