Re: [scikit-learn] Using logistic regression with count proportions data
Here is a possibly useful comment by larsmans on Stack Overflow about exactly this procedure:

http://stackoverflow.com/questions/26604175/how-to-predict-a-continuous-dependent-variable-that-expresses-target-class-proba/26614131#comment41846816_26614131

On Mon, Oct 10, 2016 at 4:04 PM, Sean Violante wrote:

> sorry yes there was a misunderstanding:
>
> I meant for each feature configuration you should pass in two rows (one
> for the positive cases and one for the negative)
> and the sample weight being the corresponding count for that configuration
> and class
>
> and I am saying that the total count is important because you could have
> a situation where
> one feature combination occurs 10 times and another feature combination
> 1000 times
Re: [scikit-learn] Using logistic regression with count proportions data
Sorry, yes, there was a misunderstanding.

I meant that for each feature configuration you should pass in two rows (one for the positive cases and one for the negative cases), with the sample weight being the corresponding count for that configuration and class.

I am also saying that the total count is important, because you could have a situation where one feature combination occurs 10 times and another feature combination occurs 1000 times.

On Mon, Oct 10, 2016 at 3:48 PM, Raphael C wrote:

> I think we might be talking at cross purposes.
>
> I have a matrix X where each row is a feature vector. I also have an
> array y where y[i] is a real number between 0 and 1. I would like to
> build a regression model that predicts the y values given the X rows.
>
> Now each y[i] value in fact comes from simply counting the number of
> positive labelled elements in a particular set (set i) and dividing by
> the number of elements in that set. So I can easily fit this into the
> model given by the R package glm by replacing each y[i] value by a
> pair of "Number of positives" and "Number of negatives" (this is case
> 2 in the docs I quoted) or using case 3 which asks for the y[i] plus
> the total number of elements in set i.
>
> I don't see how a single integer for sample_weight[i] would cover this
> information but I am sure I must have misunderstood. At best it seems
> it could cover the number of positive values but this is missing half
> the information.
>
> Raphael
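A minimal sketch of the two-rows-per-configuration scheme from this message, for illustration only. The counts (3 positives out of 10 for one configuration, 800 out of 1000 for another) and the single feature are made up. `LogisticRegression.fit` does accept `sample_weight`; the large `C` and small `tol` are assumptions of this sketch, chosen so that the default L2 penalty is negligible and the fit is close to an unpenalised one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one row per feature configuration, together with the
# number of positive and negative cases observed for that configuration.
X_config = np.array([[0.0], [1.0]])
n_pos = np.array([3, 800])
n_neg = np.array([7, 200])

# Two rows per configuration: the first copy is labelled 1, the second 0.
X = np.vstack([X_config, X_config])
y = np.concatenate([np.ones(len(X_config)), np.zeros(len(X_config))])
# Each row is weighted by the count for that configuration and class.
w = np.concatenate([n_pos, n_neg]).astype(float)

# Large C makes the default L2 regularisation negligible (an assumption of
# this sketch, not something prescribed in the thread).
model = LogisticRegression(C=1e6, tol=1e-8, max_iter=1000)
model.fit(X, y, sample_weight=w)

# predict_proba gives the estimated proportion of positives per
# configuration, which is the "regression" output being asked for.
predicted = model.predict_proba(X_config)[:, 1]
observed = n_pos / (n_pos + n_neg)
print(predicted, observed)
```

Because the weights are raw counts rather than proportions, the configuration seen 1000 times influences the fit far more than the one seen 10 times, which is the reason given above for keeping the total counts.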
Re: [scikit-learn] Using logistic regression with count proportions data
On 10 October 2016 at 12:22, Sean Violante wrote:
> no ( but please check !)
>
> sample weights should be the counts for the respective label (0/1)
>
> [ I am actually puzzled about the glm help file - proportions loses how
> often an input data 'row' was present relative to the other - though you
> could do this by repeating the row 'n' times]

I think we might be talking at cross purposes.

I have a matrix X where each row is a feature vector. I also have an array y where y[i] is a real number between 0 and 1. I would like to build a regression model that predicts the y values given the X rows.

Now each y[i] value in fact comes from simply counting the number of positively labelled elements in a particular set (set i) and dividing by the number of elements in that set. So I can easily fit this into the model given by the R function glm by replacing each y[i] value by a pair of "Number of positives" and "Number of negatives" (this is case 2 in the docs I quoted), or by using case 3, which asks for the y[i] plus the total number of elements in set i.

I don't see how a single integer for sample_weight[i] would cover this information, but I am sure I must have misunderstood. At best it seems it could cover the number of positive values, but this is missing half the information.

Raphael
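One way to see how the information fits: a single weight per original row would indeed lose half of it, but if each set i is turned into two rows (one per class), each set carries two weights. The sketch below shows that bookkeeping in plain Python. The function name `expand_proportions`, the list `n` of set sizes and the data are all hypothetical, introduced only for illustration.

```python
def expand_proportions(X, y, n):
    """Turn (features, proportion, total) triples into weighted 0/1 rows.

    Each set i yields two rows with the same features: one labelled 1 whose
    weight is the number of positives, and one labelled 0 whose weight is
    the number of negatives.
    """
    rows, labels, weights = [], [], []
    for features, proportion, total in zip(X, y, n):
        positives = round(proportion * total)  # recover the integer count
        negatives = total - positives
        rows.append(list(features))
        labels.append(1)
        weights.append(positives)
        rows.append(list(features))
        labels.append(0)
        weights.append(negatives)
    return rows, labels, weights


# Hypothetical data: two sets with proportions 0.3 (of 10) and 0.8 (of 1000).
X = [[0.5, 1.0], [2.0, 0.0]]
y = [0.3, 0.8]
n = [10, 1000]

rows, labels, weights = expand_proportions(X, y, n)
print(labels)   # [1, 0, 1, 0]
print(weights)  # [3, 7, 800, 200]
```

The resulting `rows`, `labels` and `weights` are in the shape expected for the `X`, `y` and `sample_weight` arguments of `LogisticRegression.fit`.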
Re: [scikit-learn] Using logistic regression with count proportions data
How do I use sample_weight for my use case?

In my case, is "y" an array of 0s and 1s, and sample_weight then an array of real numbers between 0 and 1, where I should make sure to set sample_weight[i] = 0 when y[i] = 0?

Raphael

On 10 October 2016 at 12:08, Sean Violante wrote:
> should be the sample weight function in fit
>
> http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
>
> On Mon, Oct 10, 2016 at 1:03 PM, Raphael C wrote:
>>
>> I just noticed this about the glm package in R.
>> http://stats.stackexchange.com/a/26779/53128
>>
>> "
>> The glm function in R allows 3 ways to specify the formula for a
>> logistic regression model.
>>
>> The most common is that each row of the data frame represents a single
>> observation and the response variable is either 0 or 1 (or a factor
>> with 2 levels, or other variable with only 2 unique values).
>>
>> Another option is to use a 2 column matrix as the response variable
>> with the first column being the counts of 'successes' and the second
>> column being the counts of 'failures'.
>>
>> You can also specify the response as a proportion between 0 and 1,
>> then specify another column as the 'weight' that gives the total
>> number that the proportion is from (so a response of 0.3 and a weight
>> of 10 is the same as 3 'successes' and 7 'failures')."
>>
>> Either of the last two options would do for me. Does scikit-learn
>> support either of these last two options?
>>
>> Raphael
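The statement quoted above, that a response of 0.3 with a weight of 10 is the same as 3 successes and 7 failures, can be checked numerically. The sketch below uses only the Python standard library. The function names are illustrative and do not come from R or scikit-learn. It compares the binomial log-likelihood of the aggregated observation with the log-likelihood of two weighted 0/1 rows.

```python
import math


def weighted_bernoulli_loglik(p, successes, failures):
    """Log-likelihood of a 1-row with weight `successes` and a 0-row with
    weight `failures`, under success probability p."""
    return successes * math.log(p) + failures * math.log(1.0 - p)


def binomial_loglik(p, proportion, total):
    """Log of the binomial probability of `proportion * total` successes
    in `total` trials, under success probability p."""
    k = round(proportion * total)
    pmf = math.comb(total, k) * p**k * (1.0 - p) ** (total - k)
    return math.log(pmf)


# The gap between the two is log(C(10, 3)) = log(120) for every p, so both
# formulations are maximised by the same p.
for p in (0.2, 0.3, 0.5):
    gap = binomial_loglik(p, 0.3, 10) - weighted_bernoulli_loglik(p, 3, 7)
    print(p, gap, math.log(120))
```

Since the two log-likelihoods differ only by a constant that does not depend on p, fitting weighted 0/1 rows gives the same parameter estimates as fitting the proportions with their totals.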
[scikit-learn] Using logistic regression with count proportions data
I am trying to perform regression where my dependent variable is constrained to be between 0 and 1. This constraint comes from the fact that it represents a count proportion, that is, counts in some category divided by a total count.

In the literature it seems that one common way to tackle this is to use logistic regression. However, it appears that in scikit-learn logistic regression is only available as a classifier (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is that right?

Is there another way to perform regression using scikit-learn where the dependent variable is a count proportion?

Thanks for any help.

Raphael

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn