Hi

There are many variants of gradient descent, differing mostly in the step 
size and in how it is adjusted as the algorithm proceeds. If the 
implementation is stochastic (as opposed to batch descent) there are further 
variations, such as how samples or mini-batches are drawn on each update. I 
don’t know off-hand what MLlib’s detailed implementation is, but no doubt 
there are differences between the two - perhaps someone with more knowledge 
of the internals could comment.

As you can tell from playing around with the parameters, step size is vitally 
important to the performance of the algorithm.
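
To make this concrete, here is a minimal sketch in plain NumPy - my own toy 
loop, not MLlib's or scikit-learn's actual implementation - of SGD on the 
z = x + y data from the thread below, showing how much the final MSE depends 
on the step size alone:

    import numpy as np

    rng = np.random.default_rng(0)

    # 1000 points of the toy problem from the original post: z = x + y
    X = rng.uniform(0.0, 1.0, size=(1000, 2))
    z = X[:, 0] + X[:, 1]

    def sgd_mse(step, iterations):
        # Plain SGD for least squares: one random sample per update.
        w = np.zeros(2)
        for _ in range(iterations):
            i = rng.integers(len(X))
            grad = 2.0 * (X[i] @ w - z[i]) * X[i]  # gradient of squared error
            w -= step * grad
        return np.mean((X @ w - z) ** 2)

    # Same iteration budget, very different errors depending on the step size
    for step in (0.001, 0.1, 0.5):
        print(step, sgd_mse(step, iterations=1000))

With a tiny step the weights barely move away from zero in 1000 updates, 
while a well-chosen step drives them close to the true (1, 1). An 
implementation that also decays the step over time behaves differently again, 
which is one reason two libraries given 'the same' parameters can land on 
very different errors.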


On 22 Jan 2015, at 06:44, Jacques Heunis <jaaksem...@gmail.com> wrote:

> Ah I see, thanks!
> I was just confused because, given the same configuration, I would have 
> thought that Spark and scikit-learn would give more similar results, but I 
> guess this is simply not the case (as in your example, to get Spark to give 
> an MSE sufficiently close to scikit-learn's you have to give it a 
> significantly larger step size and iteration count).
> 
> Would that then be a result of MLlib and scikit-learn differing slightly in 
> their exact implementations of the optimizer? Or rather a case of (as you 
> say) scikit-learn being a far more mature system (and therefore MLlib would 
> 'get better' over time)? Surely it is far from ideal that getting the same 
> results takes more iterations (i.e. more computation), or do you think that 
> is simply coincidence and that on a different model/dataset it might be the 
> other way around?
> 
> I ask because I encountered this situation on other, larger datasets, so 
> this is not an isolated case (though as this is the simplest example I 
> could think of, I'd imagine it's somewhat indicative of the general 
> behaviour).
> 
> On Thu, Jan 22, 2015 at 1:57 AM, Robin East <robin.e...@xense.co.uk> wrote:
> I don’t get those results. I get (MSE):
> 
> spark           0.14
> scikit-learn    0.85
> 
> The scikit-learn MSE is high because of the very low eta0 setting. Tweak 
> that to 0.1 and push the iterations to 400 and you get an MSE ~= 0, with 
> both coefficients ~1 and the intercept ~0. Similarly, if you change the 
> MLlib step size to 0.5 and the number of iterations to 1200 you again get a 
> very low MSE. One of the issues with SGD is that you have to tweak these 
> parameters to tune the algorithm.
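> 
> For reference, those two configurations look roughly like this - a sketch 
> only, assuming a NumPy feature matrix X, a target vector z and an existing 
> SparkContext sc, rather than the exact scripts from the pastebins:
> 
>     # scikit-learn: constant learning rate of 0.1, 400 passes over the data
>     # (n_iter in the scikit-learn of the time; newer versions use max_iter)
>     from sklearn.linear_model import SGDRegressor
>     clf = SGDRegressor(learning_rate='constant', eta0=0.1, n_iter=400)
>     clf.fit(X, z)
> 
>     # MLlib: step size 0.5, 1200 iterations
>     from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
>     points = sc.parallelize([LabeledPoint(t, f) for f, t in zip(X, z)])
>     model = LinearRegressionWithSGD.train(points, iterations=1200, step=0.5)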
> 
> FWIW I wouldn’t see Spark MLlib as a replacement for scikit-learn; MLlib is 
> nowhere near as mature as scikit-learn. However, if you have large datasets 
> that won’t sensibly fit scikit-learn's in-core model, MLlib is one of the 
> top choices. Similarly, if you are running proofs of concept that you will 
> eventually scale up to production environments, there is a definite 
> argument for using MLlib at both the PoC and production stages.
> 
> 
> On 21 Jan 2015, at 20:39, JacquesH <jaaksem...@gmail.com> wrote:
> 
> > I've recently been trying to get to know Apache Spark as a replacement
> > for scikit-learn, but it seems to me that even in simple cases
> > scikit-learn converges to an accurate model far faster than Spark does.
> > For example I generated 1000 data points for a very simple linear function
> > (z=x+y) with the following script:
> >
> > http://pastebin.com/ceRkh3nb
> >
> > I then ran the following scikit-learn script:
> >
> > http://pastebin.com/1aECPfvq
> >
> > And then this Spark script: (with spark-submit <filename>, no other
> > arguments)
> >
> > http://pastebin.com/s281cuTL
> >
> > Strangely though, the error given by Spark is roughly four times larger
> > than that given by scikit-learn (0.185 and 0.045 respectively), despite
> > the two models having a nearly identical setup (as far as I can tell).
> > I understand that this is using SGD with very few iterations, so the
> > results may differ, but I wouldn't have expected anywhere near such a
> > large difference or such a large error, especially given the
> > exceptionally simple data.
> >
> > Is there something I'm misunderstanding in Spark? Is it not correctly
> > configured? Surely I should be getting a smaller error than that?
> >
> 
> 
