For the Averaged Perceptron you would usually do the following (assuming you know the total number of iterations, i.e. the number of training examples times the number of epochs, in advance):
when you update the j-th weight in the t-th iteration, w_j^{(t)}, you know
that this update will *remain in the weight vector for each of the next
T - t iterations* (where T is the total number of iterations to perform).
So you can keep track of the averaged weight vector \bar{w} as you go:
each time you update w_j by some value v, you also update \bar{w}_j by
(T - t) * v. Finally, you divide each non-zero entry of \bar{w} by T. So
for the Averaged Perceptron the updates of the averaged weight vector are
still sparse (see the sketch at the end of this mail).

The major difference between the Averaged Perceptron and ASGD is that the
latter has to deal with regularization, which boils down to scaling the
weight vector at each iteration by some constant. That may be the reason
why a straightforward approach such as the one above breaks down... I have
to look into that more carefully...

2011/10/12 Alexandre Passos <alexandre...@gmail.com>:
> On Wed, Oct 12, 2011 at 02:56, Mathieu Blondel <math...@mblondel.org> wrote:
>> On Wed, Oct 12, 2011 at 2:52 PM, Peter Prettenhofer
>> <peter.prettenho...@gmail.com> wrote:
>>
>>> The results in [Xu 2011] are pretty impressive given the simplicity of
>>> the algorithm - we should definitely give it a try. Unfortunately, the
>>> algorithm shares some of the undesirable properties of SGD: you need a
>>> number of heuristics to make it work (e.g. learning rate schedule,
>>> averaging start point t_0).
>>
>> Indeed, averaging has been used for ages in the Perceptron community.
>> CRFsuite has been supporting averaging for quite some time too, I
>> think. ASGD's results look indeed impressive, though.
>
> Does anyone know how to implement parameter averaging without touching
> every feature at every iteration? With things like CRFs you easily
> have millions of features, only a few hundred active per example, so
> it's a pain to touch everything all the time. On the page he mentions
> that
>
>     Both the stochastic gradient weights and the averaged weights are
>     represented using a linear transformation that yields efficiency
>     gains for sparse training data.
>
> Does anyone know what format this is?
>
> --
> - Alexandre

--
Peter Prettenhofer
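
PS: here is a minimal NumPy sketch of the bookkeeping described above (the
function name and signature are mine, not existing scikit-learn code;
labels are assumed to be in {-1, +1}). With dense arrays the "w += v" of
course touches every coordinate, but with sparse rows only the features
active in the current example would need to be touched:

import numpy as np

def averaged_perceptron(X, y, n_epochs=5):
    # The average is taken over the predictors w^(0), ..., w^(T-1), so an
    # update made at (1-indexed) iteration t is contained in the last
    # T - t of them -- hence the (T - t) * v contribution below.
    n_samples, n_features = X.shape
    T = n_epochs * n_samples          # total number of iterations, known in advance
    w = np.zeros(n_features)          # current weight vector
    w_bar = np.zeros(n_features)      # unscaled running sum for the average
    t = 0
    for _ in range(n_epochs):
        for i in range(n_samples):
            t += 1
            if y[i] * w.dot(X[i]) <= 0:      # mistake -> perceptron update
                v = y[i] * X[i]              # only the non-zeros of X[i] matter
                w += v
                w_bar += (T - t) * v         # lazy update of the average
    return w_bar / T, w                      # averaged weights, final weights

What the sketch does not handle is exactly the regularization issue above:
with an L2 penalty every iteration rescales w, and the usual trick of
carrying that scale as a separate scalar would also have to be folded into
the (T - t) bookkeeping somehow.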