On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer <b...@gmx.net> wrote:

> Hi,
>
> Thanks for your comments.
>
> I modified the examples from the Mahout in Action book, therefore I used
> the hashed approach and that's why I used 100 features. I'll adjust the
> number.
Makes sense. But the book was doing sparse features.

> You say that I'm using the same CVE for all features, so you mean I
> should create 12 separate CVEs for adding features to the vector like
> this?

Yes. Otherwise you don't get different hashes.

With a CVE, the hashing pattern is generated from the name of the
variable. For a word encoder, the hashing pattern is generated from the
name of the variable (specified at construction of the encoder) and the
word itself (specified at encode time). Text is just repeated words,
except that the weights aren't necessarily linear in the number of times
a word appears.

In your case, you could have used a goofy trick with a word encoder
where the "word" is the variable name and the value of the variable is
passed as the weight of the word. But all of this hashing is really just
extra work for you. Easier to just pack your data into a dense vector.

> Finally, I thought online logistic regression meant that it is an
> online algorithm, so it's fine to train only once. Does that mean I
> should invoke the train method over and over again with the same
> training sample until the next one arrives, or how should I make the
> model converge (or at least try to with the few samples)?

What "online" really implies is that training data is measured in terms
of number of input records instead of in terms of passes through the
data. To converge, you have to see enough data. If that means you need
to pass through the data several times to fool the learner ... well, it
means you have to pass through the data several times.

Some online learners are exact in that they always have the exact result
at hand for all the data they have seen. Welford's algorithm for
computing sample mean and variance is like that. Others approximate an
answer. Most systems that estimate some property of a distribution are
necessarily approximate.
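For concreteness, here is a minimal sketch of Welford's algorithm in plain Java (class and method names are mine, not from any library). Each update costs O(1) time and no extra memory, and the mean and variance it holds are always exact for the data seen so far:

```java
// Welford's online algorithm for mean and variance: one O(1) update per
// sample, no data retained, exact for everything seen so far.
public class Welford {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;   // running sum of squared deviations from the mean

    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);   // note: uses the *updated* mean
    }

    public double mean() { return mean; }

    // Unbiased sample variance (Bessel's correction).
    public double variance() { return n > 1 ? m2 / (n - 1) : 0.0; }

    public static void main(String[] args) {
        Welford w = new Welford();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            w.add(x);
        }
        System.out.println(w.mean());      // ~5.0
        System.out.println(w.variance());  // ~32/7
    }
}
```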
In fact, even Welford's method for means is really only approximating
the mean of the distribution based on what it has seen so far. It
happens to give you the best possible estimate so far, but that is just
because computing a mean is simple enough. With regularized logistic
regression, the estimation is trickier, and you can only say that the
algorithm will converge to the correct result eventually, rather than
that the answer is always as good as it can be.

Another way to say it is that the key property of on-line learning is
that the learning takes a fixed amount of time and no additional memory
for each input example.

> What would you suggest to use for incremental training instead of OLR?
> Is Mahout perhaps the wrong library?

Well, for thousands of examples, anything at all will work quite well,
even R. Just keep all the data around and refit whenever requested.

Take a look at glmnet for a very nicely done in-memory L1/L2 regularized
learner. A quick experiment indicates that it will handle 200K samples
of the sort you are looking at in about a second, with multiple levels
of lambda thrown into the bargain. Versions are available in R, Matlab
and Fortran (at least).

http://www-stat.stanford.edu/~tibs/glmnet-matlab/

This kind of in-memory, single-machine problem is just not what Mahout
is intended to solve.
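To make the "fixed time, no additional memory per example" point concrete, here is a sketch of an SGD update for L2-regularized logistic regression in plain Java. This is for illustration only (not Mahout's API, and the class, learning rate, and data below are my own invention): each call to train() touches the weight array once and allocates nothing, so the cost per record is constant no matter how many records have gone by, and "several passes" just means replaying the same records:

```java
// Minimal online (SGD) logistic regression with L2 regularization.
// train() is O(#features) in time and allocates no memory, which is the
// defining property of an on-line learner.
public class OnlineLogReg {
    private final double[] w;
    private final double rate;     // learning rate (hypothetical value below)
    private final double lambda;   // L2 regularization strength

    public OnlineLogReg(int numFeatures, double rate, double lambda) {
        this.w = new double[numFeatures];
        this.rate = rate;
        this.lambda = lambda;
    }

    public double predict(double[] x) {
        double z = 0.0;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));   // sigmoid
    }

    // One gradient step toward the 0/1 label; fixed cost per example.
    public void train(double[] x, int label) {
        double error = label - predict(x);
        for (int i = 0; i < w.length; i++) {
            w[i] += rate * (error * x[i] - lambda * w[i]);
        }
    }

    public static void main(String[] args) {
        OnlineLogReg lr = new OnlineLogReg(2, 0.5, 0.001);
        double[][] xs = {{0, 1}, {1, 0}, {0, 0.8}, {0.9, 0.1}};
        int[] ys = {0, 1, 0, 1};
        // With few samples, "seeing enough data" means replaying them:
        for (int pass = 0; pass < 200; pass++) {
            for (int i = 0; i < xs.length; i++) {
                lr.train(xs[i], ys[i]);
            }
        }
        System.out.println(lr.predict(new double[] {1, 0}));   // near 1
        System.out.println(lr.predict(new double[] {0, 1}));   // near 0
    }
}
```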