Hi Stephen,

Agree with Nick said, the ML vs MLLib comparison test seems to be flawed.

LR in Spark MLLib use SGD, in each iteration during training, SGD only
sample a small fraction of data and do gradient computation, but in each
iteration LBFGS need to aggregate over the whole input dataset. So in each
iteration LBFGS will take a longer time, if dataset is large.

But LBFGS is a kind of quasi-Newton methods so that it converges faster
(nearly converges quadratically), but SGD method is linear convergence, and
we need to tune the step-size for SGD otherwise we may get very slow
convergence speed.

On Sun, Jan 21, 2018 at 11:31 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> At least one of their comparisons is flawed.
>
> The Spark ML version of linear regression (*note* they use linear
> regression and not logistic regression, it is not clear why) uses L-BFGS as
> the solver, not SGD (as MLLIB uses). Hence it is typically going to be
> slower. However, it should in most cases converge to a better solution.
> MLLIB doesn't offer an L-BFGS version for linear regression, but it does
> for logistic regression.
>
> In my view a more sensible comparison would be between LogReg with L-BFGS
> between ML and MLLIB. These should be close to identical since now the
> MLLIB version actually wraps the ML version.
>
> They also don't show any results for algorithm performance (accuracy, AUC
> etc). The better comparison to make is the run-time to achieve the same AUC
> (for example). SGD may be fast, but it may result in a significantly poorer
> solution relative to say L-BFGS.
>
> Note that the "withSGD" algorithms are deprecated in MLLIB partly to move
> users to ML, but also partly because their performance in terms of accuracy
> is relatively poor and the amount of tuning required (e.g. learning rates)
> is high.
>
> They say:
>
> The time difference between Spark MLlib and Spark ML can be explained by
> internally transforming the dataset from DataFrame to RDD in order to use
> the same implementation of the algorithm present in MLlib.
>
> but this is not true for the LR example.
>
> For the feature selection example, it is probably mostly due to the
> conversion, but even then the difference seems larger than what I would
> expect. It would be worth investigating their implementation to see if
> there are other potential underlying causes.
>
>
> On Sun, 21 Jan 2018 at 23:49 Stephen Boesch <java...@gmail.com> wrote:
>
>> While MLLib performed favorably vs Flink it *also *performed favorably
>> vs spark.ml ..  and by an *order of magnitude*.  The following is one of
>> the tables - it is for Logistic Regression.  At that time spark.ML did not
>> yet support SVM
>>
>> From: https://bdataanalytics.biomedcentral.com/articles/10.1186/
>> s41044-016-0020-2
>>
>>
>>
>> Table 3
>>
>> LR learning time in seconds
>>
>> Dataset
>>
>> Spark MLlib
>>
>> Spark ML
>>
>> Flink
>>
>> ECBDL14-10
>>
>> 3
>>
>> 26
>>
>> 181
>>
>> ECBDL14-30
>>
>> 5
>>
>> 63
>>
>> 815
>>
>> ECBDL14-50
>>
>> 6
>>
>> 173
>>
>> 1314
>>
>> ECBDL14-75
>>
>> 8
>>
>> 260
>>
>> 1878
>>
>> ECBDL14-100
>>
>> 12
>>
>> 415
>>
>> 2566
>>
>> The DataFrame based API (spark.ml) is even slower vs the RDD (mllib)
>> than had been anticipated - yet the latter has been shutdown for several
>> versions of Spark already.  What is the thought process behind that
>> decision : *performance matters! *Is there visibility into a meaningful
>> narrowing of that gap?
>>
>

Reply via email to