Re: Effects problems in logistic regression
Sounds great.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Mon, Dec 22, 2014 at 5:27 AM, Franco Barrientos <franco.barrien...@exalitica.com> wrote:

> Thanks again DB Tsai, LogisticRegressionWithLBFGS works for me!
RE: Effects problems in logistic regression
Thanks again DB Tsai, LogisticRegressionWithLBFGS works for me!

From: Franco Barrientos [mailto:franco.barrien...@exalitica.com]
Sent: Thursday, December 18, 2014 16:42
To: 'DB Tsai'
CC: 'Sean Owen'; user@spark.apache.org
Subject: RE: Effects problems in logistic regression

Thanks, I will try.
RE: Effects problems in logistic regression
Thanks, I will try.

From: DB Tsai [mailto:dbt...@dbtsai.com]
Sent: Thursday, December 18, 2014 16:24
To: Franco Barrientos
CC: Sean Owen; user@spark.apache.org
Subject: Re: Effects problems in logistic regression

Can you try LogisticRegressionWithLBFGS?
Re: Effects problems in logistic regression
Can you try LogisticRegressionWithLBFGS? I verified that it converges to the same result as R's glmnet package when no regularization is used. The problem with LogisticRegressionWithSGD is that it converges very slowly, and it is often very sensitive to the step size, which can lead to a wrong answer.

The regularization logic in MLlib is not entirely correct: it penalizes the intercept. In general, with really high regularization, all the coefficients will be zero except the intercept. In logistic regression, the non-zero intercept can be understood as the prior probability of each class, and in linear regression it is the mean of the response. I'll have a PR to fix this issue.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Thu, Dec 18, 2014 at 10:50 AM, Franco Barrientos <franco.barrien...@exalitica.com> wrote:

> Yes, without the “amounts” variables the results are similar. When I put in
> other variables it's fine.
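The point about the intercept can be illustrated outside Spark. What follows is only a toy sketch in plain Python, not MLlib's implementation: the data are made up, the `fit` helper is hypothetical, and the penalty is assumed to be (lam/2) times the squared coefficients. With a heavy penalty on the weight alone, the intercept settles near the log-odds of the class prior; when the intercept is penalized as well, it is pulled toward zero and the prior is lost.

```python
import math

def fit(xs, ys, lam, penalize_intercept, lr=0.05, steps=5000):
    """Gradient descent on mean log-loss plus an assumed (lam/2) * L2 penalty.

    Returns (intercept, weight). Hypothetical helper, not a Spark API.
    """
    b, w = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_b = grad_w = 0.0
        for x, y in zip(xs, ys):
            err = 1.0 / (1.0 + math.exp(-(b + w * x))) - y
            grad_b += err / n
            grad_w += err * x / n
        grad_w += lam * w              # the weight is always penalized
        if penalize_intercept:
            grad_b += lam * b          # the behaviour described as incorrect above
        b -= lr * grad_b
        w -= lr * grad_w
    return b, w

# Made-up data: 2 positives out of 10, so the class prior is 0.2.
xs = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -2.0, 0.2]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

prior = sum(ys) / len(ys)
prior_log_odds = math.log(prior / (1.0 - prior))

b_free, w_free = fit(xs, ys, lam=10.0, penalize_intercept=False)
b_pen, w_pen = fit(xs, ys, lam=10.0, penalize_intercept=True)

print("log-odds of prior:      ", prior_log_odds)
print("intercept not penalized:", b_free, w_free)
print("intercept penalized:    ", b_pen, w_pen)
```

Strictly speaking, the intercept matches the log-odds of the prior rather than the prior itself; passing it through the logistic function recovers the class frequency.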
RE: Effects problems in logistic regression
Yes, without the “amounts” variables the results are similar. When I put in other variables it's fine.

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, December 18, 2014 14:22
To: Franco Barrientos
CC: user@spark.apache.org
Subject: Re: Effects problems in logistic regression

Are you sure this is an apples-to-apples comparison? For example, does your SAS process normalize or otherwise transform the data first?

Is the optimization configured similarly in both cases -- same regularization, etc.?

Are you sure you are pulling out the intercept correctly? It is a separate value from the logistic regression model in Spark.
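One plausible reading of this observation is that the scale of the amount variable is what upsets gradient descent. The sketch below is plain Python with made-up data and a hypothetical `gradient_descent` helper; it uses full-batch gradient descent with a constant step size of 1.0, which is an assumption for illustration and not MLlib's actual optimizer or defaults. On the raw amounts a single step already throws the weight into the thousands, while the same step size on a rescaled feature gives a weight that is tiny once converted back to raw units.

```python
import math

def sigmoid(z):
    """Overflow-safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gradient_descent(xs, ys, step_size, steps):
    """Full-batch gradient descent on mean log-loss; returns (intercept, weight)."""
    b = w = 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [sigmoid(b + w * x) - y for x, y in zip(xs, ys)]
        b -= step_size * sum(errs) / n
        w -= step_size * sum(e * x for e, x in zip(errs, xs)) / n
    return b, w

# Made-up data: ten amounts on roughly the scale described in the thread.
amounts = [0.0, 50.0, 120.0, 300.0, 800.0, 1500.0, 2500.0, 6000.0, 20000.0, 59102.0]
labels = [0, 0, 0, 1, 0, 0, 0, 1, 0, 1]

# Raw feature: one step is enough to produce a huge weight.
_, w_raw = gradient_descent(amounts, labels, step_size=1.0, steps=1)

# Rescaled feature (divide by the maximum), weight converted back to raw units.
scale = max(amounts)
scaled = [x / scale for x in amounts]
b_scaled, w_scaled = gradient_descent(scaled, labels, step_size=1.0, steps=20000)
w_back = w_scaled / scale

print("raw feature, after 1 step:", w_raw)
print("rescaled feature, weight in raw units:", b_scaled, w_back)
```

The gradient with respect to the weight grows with the magnitude of the feature, so a step size that suits the intercept is far too large for a feature ranging up to 59102. Normalizing the feature first, as asked about above, or using an optimizer that does not depend on a hand-picked step size, avoids the problem.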
Re: Effects problems in logistic regression
Are you sure this is an apples-to-apples comparison? For example, does your SAS process normalize or otherwise transform the data first?

Is the optimization configured similarly in both cases -- same regularization, etc.?

Are you sure you are pulling out the intercept correctly? It is a separate value from the logistic regression model in Spark.

On Thu, Dec 18, 2014 at 4:34 PM, Franco Barrientos <franco.barrien...@exalitica.com> wrote:

> Hi all!
>
> I have a problem with LogisticRegressionWithSGD. When I train on a data set
> with one variable (which is the amount of an item) and an intercept, I get
> weights of (-0.4021, -207.1749) for the intercept and the amount,
> respectively. This doesn't make sense to me, because when I run a logistic
> regression on the same data in SAS I get the weights (-2.6604, 0.000245).
>
> The range of this variable is from 0 to 59102, with a mean of 1158.
>
> The problem shows up when I want to calculate the probability for each user
> in the data set: the probability is zero or close to zero in many cases,
> because when Spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) the result
> is a big number, in fact infinity for Spark.
>
> How can I treat this variable? Or why did this happen?
>
> Thanks,
>
> Franco Barrientos
> Data Scientist
>
> Málaga #115, Of. 1003, Las Condes.
> Santiago, Chile.
> (+562)-29699649
> (+569)-76347893
>
> franco.barrien...@exalitica.com
>
> www.exalitica.com
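The arithmetic in the original post can be checked directly. This is a small sketch in plain Python rather than Spark code; the function names are made up for illustration, and the weights are the ones quoted in the post. With the SGD weights the argument of exp passes the limit of a double (about 709.78) once the amount exceeds roughly 3.4, so nearly every user ends up with a probability of zero, whereas the SAS weights give ordinary probabilities over the whole range.

```python
import math

def naive_probability(intercept, weight, amount):
    # Same formula as in the post. Python's math.exp raises OverflowError
    # for large arguments; the JVM's Math.exp returns Infinity instead,
    # which turns the probability into exactly 0.
    return 1.0 / (1.0 + math.exp(-1 * (intercept + weight * amount)))

def stable_sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)) evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) lies in [0, 1) and may underflow to 0.0
    return ez / (1.0 + ez)

def probability(intercept, weight, amount):
    return stable_sigmoid(intercept + weight * amount)

SGD_WEIGHTS = (-0.4021, -207.1749)   # (intercept, amount) from LogisticRegressionWithSGD
SAS_WEIGHTS = (-2.6604, 0.000245)    # (intercept, amount) from SAS

# Minimum, mean and maximum of the amount variable as given in the post.
for amount in (0.0, 1158.0, 59102.0):
    p_sgd = probability(*SGD_WEIGHTS, amount)
    p_sas = probability(*SAS_WEIGHTS, amount)
    print(amount, p_sgd, p_sas)
```

The overflow is a symptom rather than the cause: evaluating the logistic function stably removes the infinity, but a weight of -207.1749 on a variable of this scale still yields probabilities of zero for any non-trivial amount, so the fitted weights themselves are what need fixing.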