Yeah... some of his changes are actually tracking some of our stuff (not by looking at our stuff, but by looking at the common upstream idea source). The active learning stuff is a different case and would be excellent to have. Essentially it involves sieving out the training examples that "don't matter", which, in my experience, is typically 99+% of the data. In practice, down-sampling is the poor man's active learning, but real active learning would be much, much better.
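To make the contrast concrete, here is a toy Java sketch (not actual Mahout sgd or VW code; the class name and thresholds are made up for illustration) of the difference between blind down-sampling and sieving on model uncertainty: down-sampling drops a fixed fraction of examples at random, while an active-learning-style filter only spends labels and updates on examples the current model is still unsure about.

    import java.util.Random;

    /**
     * Toy illustration only: contrast blind down-sampling with an
     * uncertainty-based sieve that keeps only the examples the current
     * model cannot yet classify confidently.
     */
    public class ExampleSieve {
      private final double[] weights;            // current linear model
      private final Random rnd = new Random(42);

      public ExampleSieve(int numFeatures) {
        this.weights = new double[numFeatures];
      }

      /** Down-sampling: keep a fixed fraction of examples, chosen at random. */
      public boolean keepByDownSampling(double keepFraction) {
        return rnd.nextDouble() < keepFraction;
      }

      /**
       * Active-learning-style sieve: keep only examples whose predicted
       * probability is close to 0.5. Everything the model already gets
       * confidently right "doesn't matter" for further training.
       */
      public boolean keepByUncertainty(double[] features, double margin) {
        double p = predict(features);
        return Math.abs(p - 0.5) < margin;
      }

      private double predict(double[] features) {
        double dot = 0.0;
        for (int i = 0; i < weights.length; i++) {
          dot += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));     // logistic link
      }
    }

Real active learning (e.g. the IWAL approach in the quoted post below) also importance-weights the examples it keeps so the learned model stays unbiased; a hard uncertainty cutoff like this is just the intuition.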
On Thu, Dec 23, 2010 at 10:46 AM, Robin Anil <[email protected]> wrote:

> Worth a look, maybe some new ideas for sgd package
>
> Robin
>
> http://hunch.net/?p=1594
>
> Vowpal Wabbit, version 5.0, and the second heresy <http://hunch.net/?p=1594>
> jl <http://hunch.net/~jl> @ 9:52 pm
>
> I've released version 5.0 <http://hunch.net/~vw/vowpal_wabbit_v5.0.tar.gz>
> of the Vowpal Wabbit <http://hunch.net/~jl> online learning software. The
> major number has changed since the last release <http://hunch.net/?p=1068>
> because I regard all earlier versions as obsolete -- there are several new
> algorithms and features, including substantial changes and upgrades to the
> default learning algorithm.
>
> The biggest changes are new algorithms:
>
> 1. Nikos <http://www.cs.cornell.edu/~nk/> and I improved the default
>    algorithm. The basic update rule still uses gradient descent, but the
>    size of the update is carefully controlled so that it's impossible to
>    overrun the label. In addition, the normalization has changed.
>    Computationally, these changes are virtually free and yield better
>    results, sometimes much better. Less careful updates can be re-enabled
>    with --loss_function classic, although results are still not identical
>    to previous versions due to the normalization changes.
> 2. Nikos also implemented per-feature learning rates as per these two
>    papers <http://www.colt2010.org/papers/104mcmahan.pdf>
>    <http://www.colt2010.org/papers/023Duchi.pdf>. Often, this works better
>    than the default algorithm. It isn't the default because it isn't (yet)
>    as adaptable in terms of learning rate decay. This is enabled with
>    --adaptive, and learned regressors are compatible with the default.
>    Computationally, you might see a factor of 4 slowdown if using '-q'.
>    Nikos noticed that the phenomenal Quake inverse square root hack
>    <http://en.wikipedia.org/wiki/Fast_inverse_square_root> applies, making
>    this substantially faster than a naive implementation.
> 3. Nikos and Daniel <http://cseweb.ucsd.edu/~djhsu/> also implemented
>    active learning derived from this paper
>    <http://cseweb.ucsd.edu/~djhsu/papers/iwal-cal.pdf>, usable via
>    --active_simulation (to test parameters on an existing supervised
>    dataset) or --active_learning (to do the real thing). This runs at full
>    speed, which is much faster than is reasonable in any active learning
>    scenario. We see this approach dominating supervised learning on all
>    classification datasets so far, often with far fewer labeled examples
>    required, as the theory predicts. The learned predictor is compatible
>    with the default.
> 4. Olivier <http://research.yahoo.com/Olivier_Chapelle> helped me implement
>    preconditioned conjugate gradient based on Jonathan Shewchuk's
>    <http://www.cs.berkeley.edu/~jrs/> tutorial
>    <http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf>.
>    This is a *batch* algorithm and hence requires multiple passes over any
>    dataset to do something useful. Each step of conjugate gradient requires
>    2 passes. The advantage of cg is that it converges relatively quickly
>    via the use of second derivative information. This can be particularly
>    helpful if your features are of widely differing scales. The use of
>    --regularization 0.001 (or smaller) is almost required with
>    --conjugate_gradient as it will otherwise overfit hard. This
>    implementation has two advantages over the basic approach: it implicitly
>    computes a Hessian in *O(n)* time, where *n* is the number of features,
>    and it operates out of core, hence making it applicable to datasets that
>    don't conveniently fit in RAM. The learned predictor is compatible with
>    the default, although you'll notice that a factor of 8 more RAM is
>    required when learning.
> 5. Matt Hoffman <http://www.cs.princeton.edu/~mdhoffma/> and I implemented
>    Online Latent Dirichlet Allocation
>    <http://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf>.
>    This code is still experimental and likely to change over the next week.
>    It really does a minibatch update under the hood. The code appears to be
>    substantially faster than Matt's earlier Python implementation, making
>    this probably the most efficient LDA anywhere. LDA is still much slower
>    than online linear learning as it is quite computationally heavy in
>    comparison -- perhaps a good candidate for GPU optimization.
> 6. Nikos, Daniel, and I have been experimenting with more online cluster
>    parallel learning algorithms (--corrective, --backprop,
>    --delayed_global). We aren't yet satisfied with these although they are
>    improving. Details are at the LCCC workshop
>    <http://lccc.eecs.berkeley.edu/>.
>
> In addition, Ariel <http://www.yendor.com/> added a test suite, Shravan
> <http://research.yahoo.com/Shravan_Narayanamurthy> helped with ngrams, and
> there are several other minor new features and bug fixes, including a very
> subtle one caught by Vaclav <http://www.occamslab.com/petricek/>.
>
> The documentation on the website hasn't kept up with the code. I'm planning
> to rectify that over the next week, and have a new tutorial starting at 2pm
> in the LCCC <http://lccc.eecs.berkeley.edu/schedule.html> room for those
> interested. Yes, I'll not be skiing :)
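Regarding items 1 and 2 in the quoted post: here is a minimal Java sketch of per-feature (AdaGrad-style) learning rates, plus a crude cap so a single squared-loss update cannot step the prediction past the label. This is only my reading of the idea, not VW's actual update rule, and all names here are made up.

    /**
     * Sketch of per-feature adaptive learning rates (the idea behind
     * --adaptive, per the McMahan and Duchi et al. papers cited above),
     * with a crude cap on the step size in the spirit of "never overrun
     * the label". Not VW's actual code.
     */
    public class AdaptiveSgd {
      private final double[] weights;
      private final double[] sumSqGrad;   // per-feature sum of squared gradients
      private final double eta;           // base learning rate

      public AdaptiveSgd(int numFeatures, double eta) {
        this.weights = new double[numFeatures];
        this.sumSqGrad = new double[numFeatures];
        this.eta = eta;
      }

      /** One squared-loss SGD update on a dense example (x, y). */
      public void update(double[] x, double y) {
        double prediction = dot(weights, x);
        double error = y - prediction;

        // With a uniform step s, the prediction moves by s * error * ||x||^2,
        // so keeping s * ||x||^2 <= 1 prevents stepping past the label.
        double normSq = dot(x, x);
        for (int i = 0; i < weights.length; i++) {
          double g = -error * x[i];        // gradient of 0.5 * error^2
          sumSqGrad[i] += g * g;
          if (sumSqGrad[i] == 0.0) {
            continue;                      // zero gradient, nothing to do
          }
          // VW reportedly speeds up the 1/sqrt with the Quake fast inverse
          // square root hack; plain Math.sqrt keeps the sketch simple.
          double perFeatureRate = eta / Math.sqrt(sumSqGrad[i]);
          double step = Math.min(perFeatureRate, 1.0 / Math.max(normSq, 1e-12));
          weights[i] -= step * g;
        }
      }

      private static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
          s += a[i] * b[i];
        }
        return s;
      }
    }

The per-feature denominator is what makes differently scaled features behave; the cap is just an approximation of the "careful" update the post describes.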
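And for item 4, a bare-bones, unpreconditioned conjugate gradient for ridge-regularized least squares, solving (X'X + lambda*I) w = X'y. The quoted implementation is preconditioned and out of core; this in-memory toy only shows why each CG step costs a pass over the data (one product with X'X) and why a little regularization is needed to keep it from overfitting hard.

    /**
     * Bare-bones conjugate gradient for ridge-regularized least squares.
     * Illustration only; keeps the data in memory, no preconditioning.
     */
    public class ConjugateGradientRidge {

      public static double[] solve(double[][] xRows, double[] y,
                                   double lambda, int maxIters, double tol) {
        int n = xRows[0].length;
        double[] w = new double[n];

        // b = X'y
        double[] b = new double[n];
        for (int row = 0; row < xRows.length; row++) {
          for (int j = 0; j < n; j++) {
            b[j] += xRows[row][j] * y[row];
          }
        }

        // Since w starts at zero, the initial residual r = b - A*w is just b.
        double[] r = b.clone();
        double[] p = r.clone();
        double rsOld = dot(r, r);

        for (int iter = 0; iter < maxIters && Math.sqrt(rsOld) > tol; iter++) {
          double[] ap = applyA(xRows, lambda, p);   // one pass over the data
          double alpha = rsOld / dot(p, ap);
          for (int j = 0; j < n; j++) {
            w[j] += alpha * p[j];
            r[j] -= alpha * ap[j];
          }
          double rsNew = dot(r, r);
          for (int j = 0; j < n; j++) {
            p[j] = r[j] + (rsNew / rsOld) * p[j];
          }
          rsOld = rsNew;
        }
        return w;
      }

      /** Computes (X'X + lambda*I) * p by streaming over the rows of X. */
      private static double[] applyA(double[][] xRows, double lambda, double[] p) {
        int n = p.length;
        double[] out = new double[n];
        for (double[] x : xRows) {
          double t = dot(x, p);
          for (int j = 0; j < n; j++) {
            out[j] += t * x[j];
          }
        }
        for (int j = 0; j < n; j++) {
          out[j] += lambda * p[j];
        }
        return out;
      }

      private static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
          s += a[i] * b[i];
        }
        return s;
      }
    }

Note that the O(n)-per-row matrix-vector product is why CG never needs the Hessian explicitly, which is the "implicit Hessian in O(n) time" advantage the post mentions.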
