For the Averaged Perceptron you would usually do the following (assuming you know the total number of iterations, i.e. the number of training examples times the number of epochs, in advance):
when you update the j-th weight in the t-th iteration, w_j^{(t)}, you know
that this update will *remain in the weight vector for each of the next
T - t iterations* (where T is the total number of iterations to perform).
So you can keep track of the averaged weight vector \bar{w} as you go:
each time you update w_j by some value v, you also update \bar{w}_j by
(T - t) * v. Finally, you divide each non-zero entry of \bar{w} by T. So
for the Averaged Perceptron the updates of the averaged weight vector are
still sparse (see the sketch at the end of this mail).

The major difference between the Averaged Perceptron and ASGD is that the
latter has to deal with regularization, which boils down to scaling the
weight vector at each iteration by some constant. That may be the reason
why a straightforward approach such as the one above breaks down... I have
to look into that more carefully...

2011/10/12 Alexandre Passos <alexandre...@gmail.com>:
> On Wed, Oct 12, 2011 at 02:56, Mathieu Blondel <math...@mblondel.org> wrote:
>> On Wed, Oct 12, 2011 at 2:52 PM, Peter Prettenhofer
>> <peter.prettenho...@gmail.com> wrote:
>>
>>> The results in [Xu 2011] are pretty impressive given the simplicity of
>>> the algorithm - we should definitely give it a try. Unfortunately, the
>>> algorithm shares some of the undesirable properties of SGD: you need a
>>> number of heuristics to make it work (e.g. learning rate schedule,
>>> averaging start point t_0).
>>
>> Indeed, averaging has been used for ages in the Perceptron community.
>> CRFsuite has been supporting averaging for quite some time too, I
>> think. ASGD's results look indeed impressive, though.
>
> Does anyone know how to implement parameter averaging without touching
> every feature at every iteration? With things like CRFs you easily
> have millions of features, only a few hundred active per example, so
> it's a pain to touch everything all the time. On the page he mentions
> that
>
>     Both the stochastic gradient weights and the averaged weights are
>     represented using a linear transformation that yields efficiency
>     gains for sparse training data.
>
> Does anyone know what format this is?
>
> --
> - Alexandre

--
Peter Prettenhofer
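
PS: here is a minimal NumPy sketch of the bookkeeping described above (the
function name and signature are mine, not existing scikit-learn code;
labels are assumed to be in {-1, +1}). With dense arrays the "w += v" of
course touches every coordinate, but with sparse rows only the features
active in the current example would need to be touched:

import numpy as np

def averaged_perceptron(X, y, n_epochs=5):
    # The average is taken over the predictors w^(0), ..., w^(T-1), so an
    # update made at (1-indexed) iteration t is contained in the last
    # T - t of them -- hence the (T - t) * v contribution below.
    n_samples, n_features = X.shape
    T = n_epochs * n_samples          # total number of iterations, known in advance
    w = np.zeros(n_features)          # current weight vector
    w_bar = np.zeros(n_features)      # unscaled running sum for the average
    t = 0
    for _ in range(n_epochs):
        for i in range(n_samples):
            t += 1
            if y[i] * w.dot(X[i]) <= 0:      # mistake -> perceptron update
                v = y[i] * X[i]              # only the non-zeros of X[i] matter
                w += v
                w_bar += (T - t) * v         # lazy update of the average
    return w_bar / T, w                      # averaged weights, final weights

What the sketch does not handle is exactly the regularization issue above:
with an L2 penalty every iteration rescales w, and the usual trick of
carrying that scale as a separate scalar would also have to be folded into
the (T - t) bookkeeping somehow.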