Hi Stephen,

I agree with what Nick said: the ML vs MLlib comparison test seems to be flawed.
LR in Spark MLlib uses SGD, whereas the spark.ml version uses L-BFGS. In each training iteration, SGD samples only a small fraction of the data to compute the gradient, while L-BFGS has to aggregate over the whole input dataset. So each L-BFGS iteration takes longer when the dataset is large. On the other hand, L-BFGS is a quasi-Newton method, so it usually needs far fewer iterations (quasi-Newton methods can approach superlinear convergence near the optimum), while SGD converges only sublinearly in general, and its step size has to be tuned or convergence can be very slow.

On Sun, Jan 21, 2018 at 11:31 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> At least one of their comparisons is flawed.
>
> The Spark ML version of linear regression (*note* they use linear
> regression and not logistic regression, it is not clear why) uses L-BFGS as
> the solver, not SGD (as MLLIB uses). Hence it is typically going to be
> slower. However, it should in most cases converge to a better solution.
> MLLIB doesn't offer an L-BFGS version for linear regression, but it does
> for logistic regression.
>
> In my view a more sensible comparison would be between LogReg with L-BFGS
> between ML and MLLIB. These should be close to identical since now the
> MLLIB version actually wraps the ML version.
>
> They also don't show any results for algorithm performance (accuracy, AUC
> etc). The better comparison to make is the run-time to achieve the same AUC
> (for example). SGD may be fast, but it may result in a significantly poorer
> solution relative to say L-BFGS.
>
> Note that the "withSGD" algorithms are deprecated in MLLIB partly to move
> users to ML, but also partly because their performance in terms of accuracy
> is relatively poor and the amount of tuning required (e.g. learning rates)
> is high.
>
> They say:
>
> The time difference between Spark MLlib and Spark ML can be explained by
> internally transforming the dataset from DataFrame to RDD in order to use
> the same implementation of the algorithm present in MLlib.
>
> but this is not true for the LR example.
>
> For the feature selection example, it is probably mostly due to the
> conversion, but even then the difference seems larger than what I would
> expect. It would be worth investigating their implementation to see if
> there are other potential underlying causes.
>
>
> On Sun, 21 Jan 2018 at 23:49 Stephen Boesch <java...@gmail.com> wrote:
>
>> While MLLib performed favorably vs Flink it *also* performed favorably
>> vs spark.ml .. and by an *order of magnitude*. The following is one of
>> the tables - it is for Logistic Regression. At that time spark.ML did not
>> yet support SVM
>>
>> From: https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2
>>
>> Table 3: LR learning time in seconds
>>
>> Dataset        Spark MLlib   Spark ML   Flink
>> ECBDL14-10               3         26     181
>> ECBDL14-30               5         63     815
>> ECBDL14-50               6        173    1314
>> ECBDL14-75               8        260    1878
>> ECBDL14-100             12        415    2566
>>
>> The DataFrame based API (spark.ml) is even slower vs the RDD (mllib)
>> than had been anticipated - yet the latter has been shutdown for several
>> versions of Spark already. What is the thought process behind that
>> decision : *performance matters!* Is there visibility into a meaningful
>> narrowing of that gap?
>>
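
P.S. To make the per-iteration trade-off between SGD and L-BFGS concrete, here is a minimal sketch in plain Python. It is not Spark code and not how MLlib or spark.ml implement their solvers. Everything in it is an assumption made for illustration: the synthetic data, the made-up weights, the helper names (`sgd`, `lbfgs`, `run_comparison`), the step size and batch size, and the simple backtracking line search. It runs the same number of iterations of mini-batch SGD and of L-BFGS on a small logistic regression problem and reports the final loss together with how many rows each solver had to scan.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss_and_grad(w, rows):
    """Mean logistic loss and its gradient over `rows` of (features, label)."""
    loss, grad = 0.0, [0.0] * len(w)
    for x, y in rows:
        z = dot(w, x)
        loss += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
        err = sigmoid(z) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    n = len(rows)
    return loss / n, [g / n for g in grad]


def sgd(data, dim, step, batch_size, iters, rng):
    """Mini-batch SGD: each iteration touches only `batch_size` rows."""
    w, touched = [0.0] * dim, 0
    for _ in range(iters):
        _, g = loss_and_grad(w, rng.sample(data, batch_size))
        touched += batch_size
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w, touched


def lbfgs(data, dim, iters, memory=5):
    """L-BFGS with backtracking line search: every evaluation is a full pass."""
    w = [0.0] * dim
    f, g = loss_and_grad(w, data)
    touched, history = len(data), []  # history holds (s, y) curvature pairs
    for _ in range(iters):
        # Two-loop recursion: q approximates (inverse Hessian) * gradient.
        q, alphas = list(g), []
        for s, y in reversed(history):
            a = dot(s, q) / dot(y, s)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if history:
            s, y = history[-1]
            scale = dot(s, y) / dot(y, y)
            q = [scale * qi for qi in q]
        for (s, y), a in zip(history, reversed(alphas)):
            b = dot(y, q) / dot(y, s)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        d = [-qi for qi in q]
        slope, t = dot(g, d), 1.0
        while True:  # Armijo backtracking; each trial scans the whole dataset
            w_new = [wi + t * di for wi, di in zip(w, d)]
            f_new, g_new = loss_and_grad(w_new, data)
            touched += len(data)
            if f_new <= f + 1e-4 * t * slope or t < 1e-10:
                break
            t *= 0.5
        if f_new > f:  # line search failed; stop rather than accept a worse point
            break
        s = [a - b for a, b in zip(w_new, w)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-12:  # keep only pairs with positive curvature
            history = (history + [(s, y)])[-memory:]
        w, f, g = w_new, f_new, g_new
    return w, touched


def run_comparison(n=2000, iters=20, batch_size=50, step=0.1, seed=0):
    rng = random.Random(seed)
    true_w = [2.0, -1.0, 0.5]  # made-up weights for the synthetic data
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        data.append((x, 1.0 if rng.random() < sigmoid(dot(true_w, x)) else 0.0))
    w_sgd, rows_sgd = sgd(data, len(true_w), step, batch_size, iters, rng)
    w_lbfgs, rows_lbfgs = lbfgs(data, len(true_w), iters)
    return {
        "sgd": (loss_and_grad(w_sgd, data)[0], rows_sgd),
        "lbfgs": (loss_and_grad(w_lbfgs, data)[0], rows_lbfgs),
    }


if __name__ == "__main__":
    for name, (loss, rows) in run_comparison().items():
        print(f"{name}: final loss {loss:.4f}, rows touched {rows}")
```

For a benchmark along the lines Nick suggests, one would fix a target loss or AUC and record the rows scanned (or wall-clock time) each solver needs to reach it, rather than fixing the iteration count as this sketch does.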