At least one of their comparisons is flawed. The Spark ML version of linear regression (*note* that they use linear regression, not logistic regression; it is not clear why) uses L-BFGS as the solver, not SGD (as MLLIB uses). Hence it is typically going to be slower per run, but in most cases it should converge to a better solution. MLLIB doesn't offer an L-BFGS version for linear regression, but it does for logistic regression.
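To make the mismatch concrete, here is a sketch (not taken from the paper) of the two entry points being compared, assuming a `SparkSession` named `spark` and a LIBSVM-format file at a hypothetical `path`:

```scala
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.regression.LinearRegression

// RDD-based API (spark.mllib): optimizes with plain SGD,
// so step size and iteration count need careful tuning.
val rdd = MLUtils.loadLibSVMFile(spark.sparkContext, path)
val mllibModel = LinearRegressionWithSGD.train(rdd, 100 /* numIterations */, 0.01 /* stepSize */)

// DataFrame-based API (spark.ml): solver defaults to "auto"
// (normal equations or L-BFGS), never SGD.
val df = spark.read.format("libsvm").load(path)
val mlModel = new LinearRegression().setMaxIter(100).fit(df)
```

So the two timings reflect different optimizers, not just different APIs.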
In my view a more sensible comparison would be LogReg with L-BFGS between ML and MLLIB. These should be close to identical, since the MLLIB version now actually wraps the ML version.

They also don't show any results for algorithm performance (accuracy, AUC, etc.). The better comparison to make is the run-time required to achieve the same AUC (for example). SGD may be fast, but it may result in a significantly poorer solution relative to, say, L-BFGS. Note that the "withSGD" algorithms are deprecated in MLLIB partly to move users to ML, but also partly because their performance in terms of accuracy is relatively poor and the amount of tuning required (e.g. learning rates) is high.

They say:

"The time difference between Spark MLlib and Spark ML can be explained by internally transforming the dataset from DataFrame to RDD in order to use the same implementation of the algorithm present in MLlib."

but this is not true for the LR example. For the feature selection example it is probably mostly due to the conversion, but even then the difference seems larger than I would expect. It would be worth investigating their implementation to see whether there are other underlying causes.

On Sun, 21 Jan 2018 at 23:49 Stephen Boesch <java...@gmail.com> wrote:

> While MLLib performed favorably vs Flink it *also* performed favorably vs
> spark.ml .. and by an *order of magnitude*. The following is one of the
> tables - it is for Logistic Regression.
> At that time spark.ML did not yet support SVM
>
> From:
> https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2
>
> Table 3: LR learning time in seconds
>
> Dataset        Spark MLlib   Spark ML   Flink
> ECBDL14-10      3             26         181
> ECBDL14-30      5             63         815
> ECBDL14-50      6            173        1314
> ECBDL14-75      8            260        1878
> ECBDL14-100    12            415        2566
>
> The DataFrame based API (spark.ml) is even slower vs the RDD (mllib) than
> had been anticipated - yet the latter has been shutdown for several
> versions of Spark already. What is the thought process behind that
> decision: *performance matters!* Is there visibility into a meaningful
> narrowing of that gap?