Has there been any explanation on the performance degradation between spark.ml and MLlib?

Stephen Boesch Sun, 21 Jan 2018 13:50:20 -0800

While MLlib performed favorably vs Flink, it *also* performed favorably vs
spark.ml, and by an *order of magnitude*.  The following is one of the
tables; it is for Logistic Regression.  At that time spark.ml did not yet
support SVM.


From: https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2



Table 3: LR learning time in seconds

Dataset        Spark MLlib    Spark ML    Flink
ECBDL14-10               3          26      181
ECBDL14-30               5          63      815
ECBDL14-50               6         173     1314
ECBDL14-75               8         260     1878
ECBDL14-100             12         415     2566
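To put a number on "order of magnitude", here is a minimal sketch that divides each Spark ML and Flink time by the corresponding Spark MLlib time. The figures are copied from Table 3 above; the names `times` and `slowdown` are only illustrative and do not come from the paper.

```python
# LR learning times in seconds, copied from Table 3:
# dataset -> (Spark MLlib, Spark ML, Flink)
times = {
    "ECBDL14-10": (3, 26, 181),
    "ECBDL14-30": (5, 63, 815),
    "ECBDL14-50": (6, 173, 1314),
    "ECBDL14-75": (8, 260, 1878),
    "ECBDL14-100": (12, 415, 2566),
}


def slowdown(table):
    """Return dataset -> (Spark ML / MLlib, Flink / MLlib) time ratios."""
    return {
        name: (ml / mllib, flink / mllib)
        for name, (mllib, ml, flink) in table.items()
    }


ratios = slowdown(times)
for name, (ml_ratio, flink_ratio) in ratios.items():
    print(f"{name:<12} spark.ml {ml_ratio:5.1f}x   Flink {flink_ratio:6.1f}x")
```

By this measure the spark.ml slowdown relative to MLlib grows from roughly 9x on the smallest subset to roughly 35x on the full dataset, so the gap widens as the data grows.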

The DataFrame-based API (spark.ml) is even slower relative to the RDD-based
API (MLlib) than had been anticipated, yet the latter has been shut down for
several versions of Spark already.  What is the thought process behind that
decision?  *Performance matters!*  Is there visibility into a meaningful
narrowing of that gap?
