@Robert sorry for the delay in responding, I was away on vacation.

Here's a link to a gist of a very simple implementation of parallelized SGD
using Spark (https://gist.github.com/4707012). It basically replicates the
existing Spark logistic regression example, but using sklearn's
linear_model module. However, the approach used is iterative parameter
mixtures, or IPM (where the local weight vectors are averaged and the
resulting weight vector rebroadcast), as opposed to distributed gradient
descent (where the local gradients are aggregated, a gradient step taken on the
master and the weight vector rebroadcast) - see
http://faculty.utpa.edu/reillycf/courses/CSCI6175-F11/papers/nips2010mannetal.pdf
for some details.
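In case the gist link goes stale, here's a rough sketch of the IPM idea using
sklearn alone (no Spark here; the per-shard training function stands in for
what Spark would run on each partition, and the names train_local /
average_params are mine, not necessarily what's in the gist):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def train_local(shard, classes):
    # One local model per data partition; with Spark this would run
    # on each partition via a map, and the parameters get collected.
    X, y = shard
    clf = SGDClassifier()
    clf.partial_fit(X, y, classes=classes)
    return clf.coef_, clf.intercept_

def average_params(params):
    # The IPM merge: average the local weight vectors into a single
    # global weight vector, which would then be rebroadcast.
    coefs, intercepts = zip(*params)
    return np.mean(coefs, axis=0), np.mean(intercepts, axis=0)
```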

This is partly because sklearn doesn't give access to the gradients in any
case (as far as I can tell), though it does give access to the parameters
(.coef_ and .intercept_), and partly because in the paper above IPM appears
superior in wall-clock speed, with equivalent accuracy, at least for SGD.

(As an aside, interestingly Vowpal Wabbit's standard approach is one pass of
SGD with (weighted) averaging, followed by distributed gradients but using
L-BFGS on each node: SGD gives quick convergence to a good solution, and
L-BFGS then gets the rest of the way to the "best" solution.)

As you can see, this simple version of cluster-distributed SGD with sklearn
and Spark inherits a lot of sklearn's power (e.g. we can set learning
rates, loss types for classification / regression, etc), with the only
additional code needed being a training function and one for merging models.
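The merge function is just parameter averaging over the fitted models --
something along these lines (a sketch; merge_models is my name for it, and
mutating the first model in place is one choice among several):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def merge_models(models):
    # Merging step: average coef_ and intercept_ across the locally
    # trained models; the result is the model that gets rebroadcast
    # to the workers for the next iteration.
    merged = models[0]
    merged.coef_ = np.mean([m.coef_ for m in models], axis=0)
    merged.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
    return merged
```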

Nick


On Sun, Jan 27, 2013 at 8:01 PM, Robert Kern <robert.k...@gmail.com> wrote:

> On Thu, Jan 24, 2013 at 10:06 AM, Nick Pentreath
> <nick.pentre...@gmail.com> wrote:
> > May I suggest you look at Spark (http://spark-project.org/ and
> > https://github.com/mesos/spark).
> >
> > It is written in Scala, has a Java API and the current master branch has
> the
> > new Python API (0.7.0 release when it happens). I've been doing some
> > testing, including using sklearn together with Spark, and so far it looks
> > good. The bonus is no Hadoop MapReduce (but fully HDFS compatible if you
> > need the filesystem), and you can write all your code directly in Python.
>
> I've been keeping an interested eye on the Spark project for a while
> now. Can you share any sklearn+Spark examples that you've worked up so
> far?
>
> --
> Robert Kern
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>