Re: Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-22 Thread Weichen Xu
Hi Stephen, I agree with what Nick said: the ML vs. MLlib comparison test seems to be flawed. LR in Spark MLlib uses SGD. In each training iteration, SGD samples only a small fraction of the data to compute the gradient, whereas L-BFGS needs to aggregate over the whole input dataset in each iteration. So
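
To make the per-iteration cost difference concrete, here is a minimal sketch in plain Python. It is not Spark's implementation; the helper names, the synthetic data, and the `fraction` parameter are all assumptions made for illustration. It computes one logistic-loss gradient on a sampled mini-batch (as an SGD step would) and one on the full dataset (as a full-batch solver such as L-BFGS needs at least once per iteration), and reports how many rows each pass touched.

```python
import math
import random


def make_data(n_rows, n_features=3, seed=42):
    """Build a small synthetic dataset of (features, label) pairs (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_rows):
        features = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
        label = 1.0 if sum(features) > 0 else 0.0
        data.append((features, label))
    return data


def logistic_gradient(weights, rows):
    """Average logistic-loss gradient over `rows`."""
    grad = [0.0] * len(weights)
    for features, label in rows:
        margin = sum(w * x for w, x in zip(weights, features))
        prediction = 1.0 / (1.0 + math.exp(-margin))
        for j, x in enumerate(features):
            grad[j] += (prediction - label) * x
    return [g / len(rows) for g in grad]


def sample_minibatch(data, fraction, rng):
    """Keep each row with probability `fraction`; never return an empty batch."""
    batch = [row for row in data if rng.random() < fraction]
    return batch or [rng.choice(data)]


def rows_touched_per_iteration(data, fraction, seed=0):
    """Return (rows used by one mini-batch gradient, rows used by one full gradient)."""
    rng = random.Random(seed)
    weights = [0.0] * len(data[0][0])
    batch = sample_minibatch(data, fraction, rng)
    logistic_gradient(weights, batch)  # SGD-style step: sampled rows only
    logistic_gradient(weights, data)   # full-batch step: every row
    return len(batch), len(data)


if __name__ == "__main__":
    sgd_rows, full_rows = rows_touched_per_iteration(make_data(1000), fraction=0.1)
    print(f"mini-batch gradient touched {sgd_rows} rows, full gradient touched {full_rows}")
```

The sketch only counts rows per gradient evaluation. It says nothing about how many iterations each method needs to converge, so it illustrates why per-iteration timings of the two solvers are not directly comparable rather than which one is faster overall.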

Re: Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Nick Pentreath
At least one of their comparisons is flawed. The Spark ML version of linear regression (*note* they use linear regression and not logistic regression, it is not clear why) uses L-BFGS as the solver, not SGD (as MLlib uses). Hence it is typically going to be slower. However, it should in most

Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Stephen Boesch
While MLlib performed favorably vs. Flink, it *also* performed favorably vs. spark.ml, and by an *order of magnitude*. The following is one of the tables - it is for Logistic Regression. At that time spark.ml did not yet support SVM. From: https://bdataanalytics.biomedcentral.com/articles/10.