At least one of their comparisons is flawed.

The Spark ML version of linear regression (note: they use linear
regression, not logistic regression; it is not clear why) uses L-BFGS as
the solver, not SGD (as the MLlib version does). Hence it is typically
going to be slower per iteration, but in most cases it converges to a
better solution. MLlib doesn't offer an L-BFGS solver for linear
regression, though it does for logistic regression.

In my view a more sensible comparison would be LogReg with L-BFGS in ML
vs MLlib. These should be close to identical, since the MLlib version
now actually wraps the ML implementation.

They also don't show any results for algorithm performance (accuracy, AUC,
etc.). The better comparison to make is run-time to achieve the same AUC
(for example). SGD may be fast per iteration, but it may converge to a
significantly poorer solution than, say, L-BFGS.
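Measuring that is straightforward with the spark.ml evaluator. A rough sketch (untested; assumes a DataFrame `df` with "label" and "features" columns):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Hold out a test set so AUC is measured on unseen data
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)

// Time the training step
val start = System.nanoTime()
val model = new LogisticRegression().setMaxIter(100).fit(train)
val seconds = (System.nanoTime() - start) / 1e9

// Score the held-out set and compute area under the ROC curve
val auc = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
  .evaluate(model.transform(test))

println(f"trained in $seconds%.1f s, test AUC = $auc%.4f")
```

Reporting (run-time, AUC) pairs for each library would make the trade-off visible, rather than run-time alone.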

Note that the "withSGD" algorithms are deprecated in MLlib, partly to move
users to ML, but also partly because their accuracy is relatively poor and
the amount of tuning required (e.g. learning rates) is high.
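The tuning burden is visible in the API itself. A sketch (untested; assumes an `RDD[LabeledPoint]` named `trainingRdd`; the step size and iteration count are illustrative values, not recommendations):

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}

// Deprecated SGD variant: step size and mini-batch fraction usually need
// dataset-specific tuning before it converges to anything reasonable
val sgdModel = LogisticRegressionWithSGD.train(
  trainingRdd,
  200,  // numIterations
  0.1,  // stepSize (learning rate) -- illustrative
  1.0)  // miniBatchFraction

// L-BFGS variant: no learning rate to tune at all
val lbfgsModel = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(trainingRdd)
```

A poorly chosen step size can leave SGD far from the optimum even after many iterations, which is part of why the fast SGD timings in the paper may be misleading.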

They say:

> The time difference between Spark MLlib and Spark ML can be explained by
> internally transforming the dataset from DataFrame to RDD in order to use
> the same implementation of the algorithm present in MLlib.

but this is not true for the LR example.

For the feature selection example, the gap is probably mostly due to the
DataFrame-to-RDD conversion, but even then the difference seems larger than
what I would expect. It would be worth investigating their implementation
to see if there are other potential underlying causes.


On Sun, 21 Jan 2018 at 23:49 Stephen Boesch <java...@gmail.com> wrote:

> While MLLib performed favorably vs Flink it *also* performed favorably vs
> spark.ml .. and by an *order of magnitude*. The following is one of the
> tables - it is for Logistic Regression. At that time spark.ML did not yet
> support SVM
>
> From:
> https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2
>
>
>
> Table 3. LR learning time in seconds
>
> Dataset        Spark MLlib   Spark ML   Flink
> ECBDL14-10      3             26         181
> ECBDL14-30      5             63         815
> ECBDL14-50      6            173        1314
> ECBDL14-75      8            260        1878
> ECBDL14-100    12            415        2566
>
> The DataFrame-based API (spark.ml) is even slower vs the RDD API (mllib)
> than had been anticipated - yet the latter has been shut down for several
> versions of Spark already.  What is the thought process behind that
> decision: *performance matters!* Is there visibility into a meaningful
> narrowing of that gap?
>
