While MLlib performed favorably vs Flink, it *also* performed favorably vs spark.ml, and by an *order of magnitude*. The following is one of the tables from the paper; it is for Logistic Regression (LR). At that time spark.ml did not yet support SVM.
From: https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2

Table 3: LR learning time in seconds

Dataset        Spark MLlib   Spark ML   Flink
ECBDL14-10               3         26     181
ECBDL14-30               5         63     815
ECBDL14-50               6        173    1314
ECBDL14-75               8        260    1878
ECBDL14-100             12        415    2566

The DataFrame-based API (spark.ml) is even slower relative to the RDD-based API (mllib) than had been anticipated, yet the latter has been in maintenance mode (closed to new features) for several versions of Spark already. What is the thought process behind that decision? *Performance matters!* Is there visibility into a meaningful narrowing of that gap?
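To put a number on "order of magnitude", here is a small sketch that computes the slowdown relative to MLlib from the Table 3 figures quoted above. The dictionary layout and the `slowdown` helper are my own illustration and are not taken from the paper.

```python
# Learning times in seconds, copied from Table 3 of the cited paper.
lr_seconds = {
    "ECBDL14-10":  {"mllib": 3,  "ml": 26,  "flink": 181},
    "ECBDL14-30":  {"mllib": 5,  "ml": 63,  "flink": 815},
    "ECBDL14-50":  {"mllib": 6,  "ml": 173, "flink": 1314},
    "ECBDL14-75":  {"mllib": 8,  "ml": 260, "flink": 1878},
    "ECBDL14-100": {"mllib": 12, "ml": 415, "flink": 2566},
}


def slowdown(times, baseline="mllib"):
    """Return each system's time divided by the baseline's time (hypothetical helper)."""
    base = times[baseline]
    return {name: t / base for name, t in times.items() if name != baseline}


# Slowdown factors per dataset, relative to Spark MLlib.
ratios = {dataset: slowdown(times) for dataset, times in lr_seconds.items()}

for dataset, r in ratios.items():
    print(f"{dataset}: spark.ml {r['ml']:.1f}x, Flink {r['flink']:.1f}x slower than MLlib")
```

On these figures spark.ml is roughly 9x to 35x slower than MLlib, and the gap widens as the dataset grows.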