Yeah... some of his changes are actually tracking some of our stuff (not by looking at our stuff, but by looking at the common upstream idea source). The active learning stuff is a different case and would be excellent to have. Essentially it involves sieving out the training examples that "don't matter", which, in my experience, is typically 99+% of the data. In practice, down-sampling is the poor man's active learning, but real active learning would be much, much better.
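To make the contrast concrete, here is a toy Java sketch (not actual Mahout sgd or VW code; the class name and thresholds are made up for illustration) of the difference between blind down-sampling and sieving on model uncertainty: down-sampling drops a fixed fraction of examples at random, while an active-learning-style filter only spends labels and updates on examples the current model is still unsure about.

    import java.util.Random;

    /**
     * Toy illustration only: contrast blind down-sampling with an
     * uncertainty-based sieve that keeps only the examples the current
     * model cannot yet classify confidently.
     */
    public class ExampleSieve {
      private final double[] weights;            // current linear model
      private final Random rnd = new Random(42);

      public ExampleSieve(int numFeatures) {
        this.weights = new double[numFeatures];
      }

      /** Down-sampling: keep a fixed fraction of examples, chosen at random. */
      public boolean keepByDownSampling(double keepFraction) {
        return rnd.nextDouble() < keepFraction;
      }

      /**
       * Active-learning-style sieve: keep only examples whose predicted
       * probability is close to 0.5. Everything the model already gets
       * confidently right "doesn't matter" for further training.
       */
      public boolean keepByUncertainty(double[] features, double margin) {
        double p = predict(features);
        return Math.abs(p - 0.5) < margin;
      }

      private double predict(double[] features) {
        double dot = 0.0;
        for (int i = 0; i < weights.length; i++) {
          dot += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));     // logistic link
      }
    }

Real active learning (e.g. the IWAL approach in the quoted post below) also importance-weights the examples it keeps so the learned model stays unbiased; a hard uncertainty cutoff like this is just the intuition.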
On Thu, Dec 23, 2010 at 10:46 AM, Robin Anil <[email protected]> wrote:

> Worth a look, maybe some new ideas for sgd package
>
> Robin
>
> http://hunch.net/?p=1594
>
> Vowpal Wabbit, version 5.0, and the second heresy <http://hunch.net/?p=1594>
> jl <http://hunch.net/~jl> @ 9:52 pm
>
> I've released version 5.0 <http://hunch.net/~vw/vowpal_wabbit_v5.0.tar.gz>
> of the Vowpal Wabbit <http://hunch.net/~jl> online learning software. The
> major number has changed since the last release <http://hunch.net/?p=1068>
> because I regard all earlier versions as obsolete -- there are several new
> algorithms and features, including substantial changes and upgrades to the
> default learning algorithm.
>
> The biggest changes are new algorithms:
>
> 1. Nikos <http://www.cs.cornell.edu/~nk/> and I improved the default
>    algorithm. The basic update rule still uses gradient descent, but the
>    size of the update is carefully controlled so that it's impossible to
>    overrun the label. In addition, the normalization has changed.
>    Computationally, these changes are virtually free and yield better
>    results, sometimes much better. Less careful updates can be re-enabled
>    with --loss_function classic, although results are still not identical
>    to previous versions due to the normalization changes.
> 2. Nikos also implemented per-feature learning rates as per these two
>    papers <http://www.colt2010.org/papers/104mcmahan.pdf>
>    <http://www.colt2010.org/papers/023Duchi.pdf>. Often, this works better
>    than the default algorithm. It isn't the default because it isn't (yet)
>    as adaptable in terms of learning rate decay. This is enabled with
>    --adaptive, and learned regressors are compatible with the default.
>    Computationally, you might see a factor of 4 slowdown if using '-q'.
>    Nikos noticed that the phenomenal Quake inverse square root hack
>    <http://en.wikipedia.org/wiki/Fast_inverse_square_root> applies, making
>    this substantially faster than a naive implementation.
> 3. Nikos and Daniel <http://cseweb.ucsd.edu/~djhsu/> also implemented
>    active learning derived from this paper
>    <http://cseweb.ucsd.edu/~djhsu/papers/iwal-cal.pdf>, usable via
>    --active_simulation (to test parameters on an existing supervised
>    dataset) or --active_learning (to do the real thing). This runs at full
>    speed, which is much faster than is reasonable in any active learning
>    scenario. We see this approach dominating supervised learning on all
>    classification datasets so far, often with far fewer labeled examples
>    required, as the theory predicts. The learned predictor is compatible
>    with the default.
> 4. Olivier <http://research.yahoo.com/Olivier_Chapelle> helped me implement
>    preconditioned conjugate gradient based on Jonathan Shewchuk's
>    <http://www.cs.berkeley.edu/~jrs/> tutorial
>    <http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf>.
>    This is a *batch* algorithm and hence requires multiple passes over any
>    dataset to do something useful. Each step of conjugate gradient requires
>    2 passes. The advantage of cg is that it converges relatively quickly
>    via the use of second derivative information. This can be particularly
>    helpful if your features are of widely differing scales. The use of
>    --regularization 0.001 (or smaller) is almost required with
>    --conjugate_gradient as it will otherwise overfit hard. This
>    implementation has two advantages over the basic approach: it implicitly
>    computes a Hessian in *O(n)* time, where *n* is the number of features,
>    and it operates out of core, hence making it applicable to datasets that
>    don't conveniently fit in RAM. The learned predictor is compatible with
>    the default, although you'll notice that a factor of 8 more RAM is
>    required when learning.
> 5. Matt Hoffman <http://www.cs.princeton.edu/~mdhoffma/> and I implemented
>    Online Latent Dirichlet Allocation
>    <http://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf>.
>    This code is still experimental and likely to change over the next week.
>    It really does a minibatch update under the hood. The code appears to be
>    substantially faster than Matt's earlier Python implementation, making
>    this probably the most efficient LDA anywhere. LDA is still much slower
>    than online linear learning as it is quite computationally heavy in
>    comparison -- perhaps a good candidate for GPU optimization.
> 6. Nikos, Daniel, and I have been experimenting with more online cluster
>    parallel learning algorithms (--corrective, --backprop,
>    --delayed_global). We aren't yet satisfied with these although they are
>    improving. Details are at the LCCC workshop
>    <http://lccc.eecs.berkeley.edu/>.
>
> In addition, Ariel <http://www.yendor.com/> added a test suite, Shravan
> <http://research.yahoo.com/Shravan_Narayanamurthy> helped with ngrams, and
> there are several other minor new features and bug fixes, including a very
> subtle one caught by Vaclav <http://www.occamslab.com/petricek/>.
>
> The documentation on the website hasn't kept up with the code. I'm planning
> to rectify that over the next week, and have a new tutorial starting at 2pm
> in the LCCC <http://lccc.eecs.berkeley.edu/schedule.html> room for those
> interested. Yes, I'll not be skiing :)
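Regarding items 1 and 2 in the quoted post: here is a minimal Java sketch of per-feature (AdaGrad-style) learning rates, plus a crude cap so a single squared-loss update cannot step the prediction past the label. This is only my reading of the idea, not VW's actual update rule, and all names here are made up.

    /**
     * Sketch of per-feature adaptive learning rates (the idea behind
     * --adaptive, per the McMahan and Duchi et al. papers cited above),
     * with a crude cap on the step size in the spirit of "never overrun
     * the label". Not VW's actual code.
     */
    public class AdaptiveSgd {
      private final double[] weights;
      private final double[] sumSqGrad;   // per-feature sum of squared gradients
      private final double eta;           // base learning rate

      public AdaptiveSgd(int numFeatures, double eta) {
        this.weights = new double[numFeatures];
        this.sumSqGrad = new double[numFeatures];
        this.eta = eta;
      }

      /** One squared-loss SGD update on a dense example (x, y). */
      public void update(double[] x, double y) {
        double prediction = dot(weights, x);
        double error = y - prediction;

        // With a uniform step s, the prediction moves by s * error * ||x||^2,
        // so keeping s * ||x||^2 <= 1 prevents stepping past the label.
        double normSq = dot(x, x);
        for (int i = 0; i < weights.length; i++) {
          double g = -error * x[i];        // gradient of 0.5 * error^2
          sumSqGrad[i] += g * g;
          if (sumSqGrad[i] == 0.0) {
            continue;                      // zero gradient, nothing to do
          }
          // VW reportedly speeds up the 1/sqrt with the Quake fast inverse
          // square root hack; plain Math.sqrt keeps the sketch simple.
          double perFeatureRate = eta / Math.sqrt(sumSqGrad[i]);
          double step = Math.min(perFeatureRate, 1.0 / Math.max(normSq, 1e-12));
          weights[i] -= step * g;
        }
      }

      private static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
          s += a[i] * b[i];
        }
        return s;
      }
    }

The per-feature denominator is what makes differently scaled features behave; the cap is just an approximation of the "careful" update the post describes.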
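And for item 4, a bare-bones, unpreconditioned conjugate gradient for ridge-regularized least squares, solving (X'X + lambda*I) w = X'y. The quoted implementation is preconditioned and out of core; this in-memory toy only shows why each CG step costs a pass over the data (one product with X'X) and why a little regularization is needed to keep it from overfitting hard.

    /**
     * Bare-bones conjugate gradient for ridge-regularized least squares.
     * Illustration only; keeps the data in memory, no preconditioning.
     */
    public class ConjugateGradientRidge {

      public static double[] solve(double[][] xRows, double[] y,
                                   double lambda, int maxIters, double tol) {
        int n = xRows[0].length;
        double[] w = new double[n];

        // b = X'y
        double[] b = new double[n];
        for (int row = 0; row < xRows.length; row++) {
          for (int j = 0; j < n; j++) {
            b[j] += xRows[row][j] * y[row];
          }
        }

        // Since w starts at zero, the initial residual r = b - A*w is just b.
        double[] r = b.clone();
        double[] p = r.clone();
        double rsOld = dot(r, r);

        for (int iter = 0; iter < maxIters && Math.sqrt(rsOld) > tol; iter++) {
          double[] ap = applyA(xRows, lambda, p);   // one pass over the data
          double alpha = rsOld / dot(p, ap);
          for (int j = 0; j < n; j++) {
            w[j] += alpha * p[j];
            r[j] -= alpha * ap[j];
          }
          double rsNew = dot(r, r);
          for (int j = 0; j < n; j++) {
            p[j] = r[j] + (rsNew / rsOld) * p[j];
          }
          rsOld = rsNew;
        }
        return w;
      }

      /** Computes (X'X + lambda*I) * p by streaming over the rows of X. */
      private static double[] applyA(double[][] xRows, double lambda, double[] p) {
        int n = p.length;
        double[] out = new double[n];
        for (double[] x : xRows) {
          double t = dot(x, p);
          for (int j = 0; j < n; j++) {
            out[j] += t * x[j];
          }
        }
        for (int j = 0; j < n; j++) {
          out[j] += lambda * p[j];
        }
        return out;
      }

      private static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
          s += a[i] * b[i];
        }
        return s;
      }
    }

Note that the O(n)-per-row matrix-vector product is why CG never needs the Hessian explicitly, which is the "implicit Hessian in O(n) time" advantage the post mentions.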
