On Sat, Jul 10, 2010 at 5:26 AM, Dmitriy Lyubimov <[email protected]> wrote:
> But what if the variable being regressed has significant variance?
>
For generalized linear regression, the predicted value converges to the
conditional expectation, pretty much as you suggest:
E[y | x] = g^{-1}(\beta x)
The link function g determines whether the regression is logistic,
least squares, or Poisson.
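To make the link-function point concrete, here is a minimal sketch (all names are illustrative, not from any particular library) showing that only the inverse link g^{-1} changes across the three model families:

```python
import math

# In a GLM the prediction is E[y | x] = g^{-1}(beta . x); the linear
# part is shared and only the inverse link differs per family.
def dot(beta, x):
    return sum(b * xi for b, xi in zip(beta, x))

inverse_links = {
    "least_squares": lambda z: z,                           # identity link
    "logistic":      lambda z: 1.0 / (1.0 + math.exp(-z)),  # logit link
    "poisson":       lambda z: math.exp(z),                 # log link
}

beta, x = [0.5, -1.0], [2.0, 1.0]
z = dot(beta, x)  # here z = 0.0
predictions = {name: g_inv(z) for name, g_inv in inverse_links.items()}
# identity -> 0.0, logistic -> 0.5, poisson -> 1.0
```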
>
> Say in such a heartbreaking example where people are coming into a store,
> some
> of them end up buying something for some $$ but most don't buy anything
> (sale $=0). Suppose I use SGD regression to regress the sale $ using a bunch
> of individual sale regressors (such as the person's profile, store
> theme/focus, etc.)
>
This kind of mixed discrete/continuous problem is often best attacked by
factoring it. First model p($>0) using logistic regression (or whatever
binary regression technique is fashionable/effective). Then model the
(nearly) continuous distribution p($ | $ > 0).
The rationale here is that you often get a better result from this composite
model than from a single model that handles both steps at once. For instance, in
one case I have seen, p($ | $ > 0) was essentially trivial because there was
very good knowledge about what the person was likely to be based on which ad
they clicked. Combining the models, however, increased the dimensionality
enough to make the p($) model significantly harder to learn than p($ | $>0)
or p($>0).
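The factored model described above can be sketched as follows. This is a hypothetical illustration (the stub models stand in for a trained logistic regression and a trained conditional regression), combining the two pieces by total expectation:

```python
# E[$ | x] = p($ > 0 | x) * E[$ | $ > 0, x]
# `p_buy_model` and `amount_model` are illustrative stubs for the two
# separately trained models.

def expected_sale(x, p_buy_model, amount_model):
    p_buy = p_buy_model(x)    # p($ > 0 | x), from the binary model
    amount = amount_model(x)  # E[$ | $ > 0, x], from the conditional model
    return p_buy * amount     # combined by the law of total expectation

# e.g. a 30% chance of buying and an average ticket of $50 given a purchase:
e = expected_sale(x=None,
                  p_buy_model=lambda x: 0.3,
                  amount_model=lambda x: 50.0)
# e == 15.0
```

In practice you would train the first model on all visitors and the second only on the visitors who bought something.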
>
> Obviously this regressand has a very high variance... But... If I can hope
> to converge on the mathematical expectation of the sale, then I would be
> able to
> predict say daily sales for individual stores based on the number of people
> who visited per day --or for that matter whatever interval, as long as we
> know
> how many people were there (which basically makes my manager happy for the
> moment). Another thing is that I want to try to come up with E(sale) for
> every new person coming into a store before he or she makes any deals, based
> on various regressors such as person profile, store focus, etc.
>
This sounds like you are 90% there.
>
>
> So intuitively I feel that SGD must converge on the E(regressand) in
> cases where variance(regressand) is quite high, as SGD basically minimizes
> RMSE (which is essentially the same as the variance). Is that correct? But I
> am
> not quite sure if that is backed by the math of stochastic gradient
> descent.
>
Yes. For convex loss functions, SGD converges toward the maximum likelihood
estimate.
This isn't quite the same as minimizing squared error, but your intuitions are
going the right direction.
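As a toy check of that intuition, here is a minimal sketch (plain SGD on squared loss with a 1/t learning rate, an assumed setup rather than anything from Mahout) showing the estimate converging toward the expectation of a high-variance target:

```python
import random

# High-variance target: y ~ N(10, 5^2), so E[y] = 10.
random.seed(0)
true_mean = 10.0
data = [true_mean + random.gauss(0, 5.0) for _ in range(5000)]

w = 0.0                   # single intercept parameter; prediction is just w
for t, y in enumerate(data, start=1):
    lr = 1.0 / t          # decaying learning rate
    w -= lr * (w - y)     # gradient of the squared loss (w - y)^2 / 2

# With lr = 1/t this update tracks the running sample mean exactly,
# so w ends at the sample mean, which converges to E[y].
```

Even though each individual y is far from 10, the SGD iterate settles near the expectation rather than near any particular observation.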
> Another question is would there be a difference between the cases of SGD+MLE
> vs. SGD+least squares methods for high-variance regressands?
>
Yes. There is a difference, but in practice it isn't that big a deal.
Vowpal Wabbit uses RMSE as a loss function by default and simply limits the
output value to the [0, 1] range. This works quite well. Mahout's SGD uses
the MLE of logistic regression. That also works well. I will be posting an
updated patch today that does confidence-weighted learning, which
considerably improves convergence time.
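For what it's worth, the two loss choices lead to very similar update rules. This is an assumed sketch of the two updates (not VW's or Mahout's actual code): squared loss with the prediction clamped to [0, 1], versus the logistic-loss (MLE) update.

```python
import math

def sgd_step_squared_clamped(w, x, y, lr=0.1):
    # Squared loss with the prediction clamped to [0, 1].
    p = max(0.0, min(1.0, sum(wi * xi for wi, xi in zip(w, x))))
    return [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]

def sgd_step_logistic(w, x, y, lr=0.1):
    # Logistic (MLE) loss: prediction passed through the sigmoid.
    p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    return [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
```

Note that both updates have the same (p - y) * x form; only how the prediction p is produced differs, which is one reason the practical difference is small.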