Re: Has there been any explanation on the performance degradation between spark.ml and Mllib?

Weichen Xu Mon, 22 Jan 2018 11:14:09 -0800

Hi Stephen,

I agree with what Nick said: the ML vs MLlib comparison test seems to be flawed.


LR in Spark MLlib uses SGD. In each training iteration, SGD samples only a
small fraction of the data to compute the gradient, whereas L-BFGS has to
aggregate over the whole input dataset. So each L-BFGS iteration takes
longer when the dataset is large.
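
To make the per-iteration cost difference concrete, here is a rough sketch in
plain Python (not Spark code). The toy dataset, the helper names and the 1%
mini-batch fraction are all assumptions chosen for illustration. It only counts
how many rows one gradient evaluation touches in each case.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logistic_gradient(w, rows):
    """Average logistic-loss gradient over rows; also returns rows touched."""
    grad = [0.0] * len(w)
    for x, y in rows:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return [g / len(rows) for g in grad], len(rows)

def make_data(n, seed=0):
    # Hypothetical toy data: a bias term plus one noisy feature.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [1.0, rng.gauss(0.0, 1.0)]
        y = 1.0 if x[1] + rng.gauss(0.0, 0.5) > 0 else 0.0
        data.append((x, y))
    return data

data = make_data(10_000)
w = [0.0, 0.0]

# Mini-batch step (SGD style): the gradient uses a sampled fraction of rows.
fraction = 0.01  # assumed value, for illustration only
batch = random.Random(1).sample(data, int(len(data) * fraction))
_, sgd_rows = logistic_gradient(w, batch)

# Full-batch step: a batch solver such as L-BFGS needs the gradient
# aggregated over every row in each iteration.
_, full_rows = logistic_gradient(w, data)

print(sgd_rows, full_rows)  # 100 10000
```

With these assumed numbers, one full-batch gradient does 100 times the work of
one mini-batch gradient, which is why wall-clock time per iteration alone is
not a fair comparison.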

But L-BFGS is a quasi-Newton method, so it converges faster (superlinearly
near the optimum under suitable conditions), whereas SGD converges at best
linearly, and we need to tune the step size for SGD, otherwise convergence
can be very slow.
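
As a small illustration of the step-size sensitivity, here is a sketch in plain
Python (again not Spark code) that runs single-sample SGD on a toy 1-D
least-squares problem. The function name, the step sizes, the noise level and
the iteration count are all assumptions for illustration.

```python
import random

def sgd_final_error(step_size, n_steps=200, true_w=3.0, seed=0):
    """Run single-sample SGD on 1-D least squares; return |w - true_w|."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(n_steps):
        x = rng.gauss(0.0, 1.0)
        y = true_w * x + rng.gauss(0.0, 0.1)  # noisy observation
        grad = (w * x - y) * x                # gradient of 0.5 * (w*x - y)^2
        w -= step_size * grad
    return abs(w - true_w)

err_small = sgd_final_error(0.001)  # step size too small: still far away
err_tuned = sgd_final_error(0.1)    # reasonable step size for this toy problem
print(err_small, err_tuned)
```

With the same number of iterations, the run with the too-small step size ends
much further from the true weight than the tuned one, so a timing comparison
that ignores solution quality can be misleading.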

On Sun, Jan 21, 2018 at 11:31 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> At least one of their comparisons is flawed.
>
> The Spark ML version of linear regression (*note* they use linear
> regression and not logistic regression, it is not clear why) uses L-BFGS as
> the solver, not SGD (as MLLIB uses). Hence it is typically going to be
> slower. However, it should in most cases converge to a better solution.
> MLLIB doesn't offer an L-BFGS version for linear regression, but it does
> for logistic regression.
>
> In my view a more sensible comparison would be LogReg with L-BFGS in ML
> versus MLLIB. These should be close to identical since the MLLIB version
> now actually wraps the ML version.
>
> They also don't show any results for algorithm performance (accuracy, AUC
> etc). The better comparison to make is the run-time to achieve the same AUC
> (for example). SGD may be fast, but it may result in a significantly poorer
> solution relative to say L-BFGS.
>
> Note that the "withSGD" algorithms are deprecated in MLLIB partly to move
> users to ML, but also partly because their performance in terms of accuracy
> is relatively poor and the amount of tuning required (e.g. learning rates)
> is high.
>
> They say:
>
> The time difference between Spark MLlib and Spark ML can be explained by
> internally transforming the dataset from DataFrame to RDD in order to use
> the same implementation of the algorithm present in MLlib.
>
> but this is not true for the LR example.
>
> For the feature selection example, it is probably mostly due to the
> conversion, but even then the difference seems larger than what I would
> expect. It would be worth investigating their implementation to see if
> there are other potential underlying causes.
>
>
> On Sun, 21 Jan 2018 at 23:49 Stephen Boesch <java...@gmail.com> wrote:
>
>> While MLLib performed favorably vs Flink, it *also* performed favorably
>> vs spark.ml, and by an *order of magnitude*. The following is one of
>> the tables; it is for Logistic Regression. At that time spark.ml did not
>> yet support SVM.
>>
>> From: https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2
>>
>>
>>
>> Table 3: LR learning time in seconds
>>
>> Dataset       Spark MLlib   Spark ML   Flink
>> ECBDL14-10              3         26     181
>> ECBDL14-30              5         63     815
>> ECBDL14-50              6        173    1314
>> ECBDL14-75              8        260    1878
>> ECBDL14-100            12        415    2566
>>
>>
>> The DataFrame-based API (spark.ml) is even slower vs the RDD one (mllib)
>> than had been anticipated, yet the latter has been shut down for several
>> versions of Spark already. What is the thought process behind that
>> decision: *performance matters!* Is there visibility into a meaningful
>> narrowing of that gap?
>>
>
