On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer <b...@gmx.net> wrote:

> Hi,
>
> Thanks for your comments.
>
> I modified the examples from the Mahout in Action book, therefore I used
> the hashed approach and that's why I used 100 features. I'll adjust the
> number.
>

Makes sense.  But the book was doing sparse features.



> You say that I'm using the same CVE for all features, so you mean I
> should create 12 separate CVEs for adding features to the vector like this?
>

Yes.  Otherwise you don't get different hashes.  With a CVE, the hashing
pattern is generated from the name of the variable.  For a word encoder,
the hashing pattern is generated from the name of the variable (specified at
construction of the encoder) and the word itself (specified at encode
time).  Text is just repeated words, except that the weights aren't
necessarily linear in the number of times a word appears.
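
To make that concrete, here is a toy Python sketch of the idea (not
Mahout's actual hashing code, just an md5-based stand-in): the probe
location depends on the encoder's name, so one shared name gives one
pattern while twelve distinct names spread the values out.

```python
import hashlib

NUM_FEATURES = 20

def hashed_index(name, word="", num_features=NUM_FEATURES):
    # Toy stand-in for a hashed encoder's probe location: it is a
    # function of the encoder's name (and, for word encoders, the word).
    h = hashlib.md5(("%s:%s" % (name, word)).encode()).hexdigest()
    return int(h, 16) % num_features

# One CVE named "features" used for all 12 measurements: every value
# is added at the same probe location, so the measurements collide.
same = {hashed_index("features") for _ in range(12)}

# Twelve separately named encoders spread the values across the vector.
distinct = {hashed_index("feature-%d" % i) for i in range(12)}

print(len(same), len(distinct))  # the single name yields a single slot
```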

In your case, you could have used a goofy trick with a word encoder where
the "word" is the variable name and the value of the variable is passed as
the weight of the word.

But all of this hashing is really just extra work for you.  Easier to just
pack your data into a dense vector.
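
For illustration, a minimal sketch of dense packing (plain Python with
made-up feature names; in Mahout you would wrap the same array in a
DenseVector and hand that to train):

```python
# One fixed slot per variable replaces hashing entirely.
FEATURES = ["f%d" % i for i in range(12)]  # illustrative names

def to_dense(record):
    # Missing measurements default to 0.0; the resulting array is
    # exactly the dense vector the learner sees.
    return [float(record.get(name, 0.0)) for name in FEATURES]

v = to_dense({"f0": 7.5, "f3": 1.0})
print(v[:4])  # [7.5, 0.0, 0.0, 1.0]
```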


> Finally, I thought online logistic regression meant that it is an online
> algorithm, so it's fine to train only once. Does it mean I should invoke
> the train method over and over again with the same training sample until
> the next one arrives, or how else should I make the model converge (or at
> least try to, with the few samples)?
>

What online really implies is that training data is measured in terms of
number of input records instead of in terms of passes through the data.  To
converge, you have to see enough data.  If that means you need to pass
through the data several times to fool the learner ... well, it means you
have to pass through the data several times.
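
For example, here is a toy version of that (plain Python SGD, not
Mahout's OnlineLogisticRegression): each record costs one fixed-size
update, and on a tiny data set convergence comes from repeated passes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_online(samples, passes, rate=0.5):
    # One fixed-cost SGD step per record: a weight and a bias only,
    # so memory does not grow as more records arrive.
    w, b = 0.0, 0.0
    for _ in range(passes):
        for x, y in samples:
            p = sigmoid(w * x + b)
            w -= rate * (p - y) * x
            b -= rate * (p - y)
    return w, b

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w1, b1 = train_online(data, passes=1)
w50, b50 = train_online(data, passes=50)
# Repeating the same four samples sharpens the fit: the weight keeps
# growing toward the separating boundary at x = 0.
```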

Some online learners are exact in that they always have the exact result at
hand for all the data they have seen.  Welford's algorithm for computing
sample mean and variance is like that. Others approximate an answer.  Most
systems which are estimating some property of a distribution are
necessarily approximate.  In fact, even Welford's method for means is
really only approximating the mean of the distribution based on what it has
seen so far.  It happens that it gives you the best possible estimate so
far, but that is just because computing a mean is simple enough.  With
regularized logistic regression, the estimation is trickier and you can
only say that the algorithm will converge to the correct result eventually
rather than say that the answer is always as good as it can be.
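
Welford's method fits in a few lines; this is the standard update:

```python
class Welford:
    # Exact running mean and variance of everything seen so far,
    # in constant memory and constant time per sample.
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

acc = Welford()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    acc.add(x)
print(acc.mean, acc.variance())  # 5.0 and 32/7, about 4.571
```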

Another way to say it is that the key property of online learning is that
the learning takes a fixed amount of time and no additional memory for
each input example.


> What would you suggest to use for incremental training instead of OLR?  Is
> Mahout perhaps the wrong library?
>

Well, for thousands of examples, anything at all will work quite well, even
R.  Just keep all the data around and refit whenever requested.
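
As a sketch of that keep-everything-and-refit approach (toy batch
gradient descent in plain Python, not glmnet):

```python
import math

class RefitOnRequest:
    # At this scale, keep every sample and do a full batch refit on
    # demand; no online machinery is needed.
    def __init__(self):
        self.data = []

    def observe(self, x, y):
        self.data.append((x, y))

    def fit(self, steps=200, rate=0.1, l2=0.01):
        # Plain batch gradient descent with a touch of L2 regularization.
        w, b = 0.0, 0.0
        for _ in range(steps):
            gw, gb = l2 * w, 0.0
            for x, y in self.data:
                p = 1.0 / (1.0 + math.exp(-(w * x + b)))
                gw += (p - y) * x
                gb += p - y
            w -= rate * gw / len(self.data)
            b -= rate * gb / len(self.data)
        return w, b

m = RefitOnRequest()
for x, y in [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]:
    m.observe(x, y)
wt, bias = m.fit()
```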

Take a look at glmnet for a very nicely done in-memory L1/L2 regularized
learner.  A quick experiment indicates that it will handle 200K samples of
the sort you are looking at in about a second, with multiple levels of
lambda thrown into the bargain.  Versions are available in R, Matlab and
Fortran (at least).

http://www-stat.stanford.edu/~tibs/glmnet-matlab/

This kind of in-memory, single machine problem is just not what Mahout is
intended to solve.
