Or I guess GLM implies one can feed a constant as one of the inputs (x_0) to get an intercept, or something.
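For instance, here is a quick numpy check of that idea (my own toy sketch, nothing to do with Mahout's API): appending a constant-1 column to the design matrix makes the corresponding coefficient the intercept, so an all-zeros input no longer forces a zero prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3 + 2*x + noise, so the true intercept is 3.
x = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 + 2.0 * x[:, 0] + rng.normal(0, 0.1, size=100)

# Without a bias column the fit is forced through the origin:
# y_hat(x = 0) is always 0, regardless of the data.
beta_no_icept, *_ = np.linalg.lstsq(x, y, rcond=None)

# With a constant-1 column playing the role of "x_0",
# beta[0] becomes the intercept.
x_aug = np.hstack([np.ones((100, 1)), x])
beta, *_ = np.linalg.lstsq(x_aug, y, rcond=None)

print(beta_no_icept)  # a single slope, no intercept
print(beta)           # intercept ~3.0 and slope ~2.0 recovered
```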
People seem to describe the same stuff from so many angles and in so many notations. Bishop doesn't even mention the term 'generalized linear model' but just talks about 'common practices' in MLE... sorry for the stupid questions.

On Sat, Sep 4, 2010 at 1:28 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Thank you, Ted, this is very instructive.
>
> There's something I don't understand about your derivation.
>
> I think Bishop generally suggests that in linear regression
> y = beta_0 + <beta, x> (so there's an intercept),
> and I think he uses a similar approach when fitting to the logistic
> function, where I think he suggests using P([mu + <beta, x>]/s),
> which of course can again be thought of as P(beta_0 + <beta, x>).
>
> But if there's no intercept beta_0, then y(x = (0,...,0)^T | beta) is
> always 0, which of course is not true in most situations. Does your
> method imply that a trivial input (all 0s) would produce a 0 estimate?
>
> Second question: are the betas allowed to go negative?
>
> Thank you, sir.
>
> -Dmitriy
>
> On Sat, Jul 10, 2010 at 10:36 AM, Ted Dunning <[email protected]> wrote:
>> On Sat, Jul 10, 2010 at 5:26 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>>> But what if the variable being regressed has significant variance?
>>
>> For generalized regression in general, the regressor converges to the
>> expectation, pretty much as you suggest:
>>
>> E[y] = g^{-1}(\beta x)
>>
>> The link function g determines whether the regression is logistic
>> regression or least squares or Poisson regression.
>>
>>> Say, in such a heartbreaking example where people are coming into a
>>> store: some of them end up buying something for some $$, but most don't
>>> buy anything (sale $ = 0). Suppose I use SGD regression to regress the
>>> sale $ using a bunch of individual sale regressors (such as the
>>> person's profile, store theme/focus, etc.)
>>
>> This kind of mixed discrete/continuous problem is often best attacked by
>> factoring it.
>> First model p($ > 0) using logistic regression (or whatever
>> binary regression technique is fashionable/effective). Then model the
>> (nearly) continuous distribution p($ | $ > 0).
>>
>> The rationale here is that you often get a better result from this
>> composite model than from a model that does both steps at once. For
>> instance, in one case I have seen, p($ | $ > 0) was essentially trivial
>> because there was very good knowledge about what the person was likely
>> to buy based on which ad they clicked. Combining the models, however,
>> increased the dimensionality enough to make the p($) model significantly
>> harder to learn than p($ | $ > 0) or p($ > 0).
>>
>>> Obviously this regressand has a very high variance... But if I can hope
>>> to converge on the expected value of the sale, then I would be able to
>>> predict, say, daily sales for individual stores based on the number of
>>> people who visited per day -- or, for that matter, over whatever
>>> interval, as long as we know how many people were there (which
>>> basically makes my manager happy for the moment). Another thing is that
>>> I want to try to come up with E(sale) for every new person coming into
>>> a store, before he or she makes any deals, based on various regressors
>>> such as person profile, store focus, etc.
>>
>> This sounds like you are 90% there.
>>
>>> So intuitively I feel that SGD must converge on E(regressand) even in
>>> cases where variance(regressand) is quite high, since SGD basically
>>> minimizes RMSE (which is essentially the same as the variance). Is that
>>> correct? But I am not quite sure whether that is backed by the math of
>>> stochastic gradient descent.
>>
>> Yes. For convex loss functions, SGD converges toward the MLE estimate.
>>
>> This isn't quite the same as minimum squared error, but your intuitions
>> are going in the right direction.
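The factored approach above can be sketched end to end like this. This is my own numpy toy (the logistic SGD and the spend model are stand-ins, not Mahout's API): stage 1 fits p($ > 0) by SGD on the logistic log-likelihood, stage 2 estimates E[$ | $ > 0], and the product gives E[$] per visitor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy visitor features and sales: most visitors spend $0.
n, d = 5000, 3
X = rng.normal(size=(n, d))
buy_prob = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -0.5, 0.3]))))
bought = rng.uniform(size=n) < buy_prob
# Buyers' spend is log-normally distributed; non-buyers spend 0.
spend = np.where(bought, np.exp(rng.normal(3.0, 0.5, size=n)), 0.0)

# Stage 1: model p($ > 0) with logistic regression trained by SGD
# (the update below is the gradient of the log-likelihood).
w = np.zeros(d)
lr = 0.05
for epoch in range(10):
    for i in rng.permutation(n):
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w += lr * (bought[i] - p) * X[i]

# Stage 2: model E[$ | $ > 0]; here simply the conditional mean
# among buyers (a regression could be used instead).
mean_spend_given_buy = spend[bought].mean()

# Combined estimate: E[$] = p($ > 0) * E[$ | $ > 0]
p_buy = 1.0 / (1.0 + np.exp(-X @ w))
expected_sale = p_buy * mean_spend_given_buy

# Sanity check: predicted total revenue vs. actual total revenue.
print(expected_sale.sum(), spend.sum())
```

Summing `expected_sale` over all visitors in a day gives the per-store daily prediction asked about above.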
>>> Another question: would there be a difference between SGD+MLE
>>> and SGD+least squares for high-variance regressands?
>>
>> Yes, there is a difference, but in practice it isn't that big a deal.
>> Vowpal Wabbit uses RMSE as its loss function by default and simply
>> limits the output value to the [0, 1] range. This works quite well.
>> Mahout's SGD uses the MLE of logistic regression. That also works well.
>> I will be posting an updated patch today that does confidence-weighted
>> learning, which considerably improves convergence time.
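To see how little the choice of loss matters here, a toy comparison (my own sketch, not what Vowpal Wabbit or Mahout actually do internally) of SGD on the logistic log-loss versus SGD on squared error with the prediction clamped to [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy binary labels drawn from a true logistic model.
n, d = 5000, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -1.0, 0.5])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w_mle = np.zeros(d)  # SGD on log-loss (the MLE route)
w_sq = np.zeros(d)   # SGD on squared error, output clamped to [0, 1]
lr = 0.05
for epoch in range(5):
    for i in rng.permutation(n):
        # Log-loss gradient step.
        p = sigmoid(X[i] @ w_mle)
        w_mle += lr * (y[i] - p) * X[i]
        # Squared-error gradient step on the clamped linear prediction.
        pred = np.clip(X[i] @ w_sq, 0.0, 1.0)
        w_sq += lr * (y[i] - pred) * X[i]

# Both recover roughly the same direction as the true weights;
# the scales differ because the losses live on different scales.
print(w_mle / np.linalg.norm(w_mle))
print(w_sq / np.linalg.norm(w_sq))
print(true_w / np.linalg.norm(true_w))
```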
