zhengruifeng commented on pull request #28349: URL: https://github.com/apache/spark/pull/28349#issuecomment-619692445
I also tested on a sparse dataset:
```scala
import org.apache.spark.ml.classification._
import org.apache.spark.storage.StorageLevel

val df = spark.read
  .option("numFeatures", "8289919")
  .format("libsvm")
  .load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
  .withColumn("label", (col("label") + 1) / 2)

val svc = new LinearSVC().setMaxIter(10)
svc.fit(df)

val start = System.currentTimeMillis; val model1 = svc.setMaxIter(30).fit(df); val end = System.currentTimeMillis; end - start
```

Results with this PR:
```scala
scala> val start = System.currentTimeMillis; val model1 = svc.setMaxIter(30).fit(df); val end = System.currentTimeMillis; end - start
start: Long = 1587957534286
model1: org.apache.spark.ml.classification.LinearSVCModel = LinearSVCModel: uid=linearsvc_2fcd0abbb2d7, numClasses=2, numFeatures=8289919
end: Long = 1587957684508
res1: Long = 150222
```

Master:
```scala
scala> val start = System.currentTimeMillis; val model1 = svc.setMaxIter(30).fit(df); val end = System.currentTimeMillis; end - start
start: Long = 1587957959670
model1: org.apache.spark.ml.classification.LinearSVCModel = LinearSVCModel: uid=linearsvc_269e4f373d2c, numClasses=2, numFeatures=8289919
end: Long = 1587958111562
res1: Long = 151892
```

If we keep `blockSize=1`, there is no performance regression on the sparse dataset.
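As an aside, the `val start = ...; ...; end - start` pattern above can be wrapped in a reusable helper so repeated benchmark runs stay consistent. A minimal sketch (the helper name `timeMillis` and the `Thread.sleep` stand-in workload are illustrative, not from the PR; the real measurement wraps `svc.fit(df)`):

```scala
// Minimal wall-clock timing helper: runs the given block once and
// returns its result together with the elapsed milliseconds.
def timeMillis[T](body: => T): (T, Long) = {
  val start = System.currentTimeMillis
  val result = body
  (result, System.currentTimeMillis - start)
}

// Usage with a hypothetical workload standing in for svc.fit(df):
val (_, elapsed) = timeMillis { Thread.sleep(50) }
println(s"elapsed = $elapsed ms")
```

For single-run comparisons like the one above (~150 s per fit), JVM warm-up noise is small relative to the total, but the helper makes it easy to repeat the measurement a few times and compare medians.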