Hi Stephen,
I agree with what Nick said: the ML vs MLlib comparison test seems to be flawed.
LR in Spark MLlib uses SGD. In each training iteration, SGD samples only
a small fraction of the data to compute the gradient, whereas each
L-BFGS iteration needs to aggregate over the whole input dataset. So
At least one of their comparisons is flawed.
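
To make the per-iteration cost difference concrete, here is a rough sketch in plain Python (not Spark code). Everything in it is an assumption for illustration: the toy data, the function names, the learning rate, and the 1% sampling fraction are all made up. The full-batch step is plain gradient descent standing in for L-BFGS; the only point is that it, like L-BFGS, has to read every row to compute one gradient.

```python
import random

def gradient(w, rows):
    """Mean squared-error gradient for a 1-D linear model y ~ w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in rows) / len(rows)

def sgd_step(w, data, fraction, lr, rng):
    """One mini-batch step: only a sampled fraction of the rows is read."""
    batch = rng.sample(data, max(1, int(len(data) * fraction)))
    return w - lr * gradient(w, batch), len(batch)

def full_batch_step(w, data, lr):
    """One full-batch step (stand-in for L-BFGS): every row is read."""
    return w - lr * gradient(w, data), len(data)

# Hypothetical toy data following y = 3x, 1000 rows.
data = [(x / 100.0, 3.0 * x / 100.0) for x in range(1, 1001)]
rng = random.Random(42)

w_sgd, sgd_rows = sgd_step(0.0, data, fraction=0.01, lr=0.01, rng=rng)
w_full, full_rows = full_batch_step(0.0, data, lr=0.01)

# Rows read in a single iteration by each approach.
print(sgd_rows, full_rows)
```

With these made-up numbers one SGD iteration reads 10 rows and one full-batch iteration reads all 1000, so comparing wall-clock time per iteration (or for a fixed number of iterations) across the two solvers is not an apples-to-apples comparison.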
The Spark ML version of linear regression (*note* that they use linear
regression and not logistic regression; it is not clear why) uses L-BFGS as
the solver, not SGD (as MLlib uses). Hence it is typically going to be
slower. However, it should in most
While MLlib performed favorably vs Flink, it *also* performed favorably vs
spark.ml, and by an *order of magnitude*. The following is one of the
tables; it is for Logistic Regression. At that time spark.ml did not yet
support SVM.
From: https://bdataanalytics.biomedcentral.com/articles/10.