zhengruifeng commented on pull request #28349: URL: https://github.com/apache/spark/pull/28349#issuecomment-619692445
I also tested on a sparse dataset:
```scala
import org.apache.spark.ml.classification._
import org.apache.spark.storage.StorageLevel

val df = spark.read
  .option("numFeatures", "8289919")
  .format("libsvm")
  .load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
  .withColumn("label", (col("label") + 1) / 2)

val svc = new LinearSVC().setMaxIter(10)
svc.fit(df)

val start = System.currentTimeMillis; val model1 = svc.setMaxIter(30).fit(df); val end = System.currentTimeMillis; end - start
```

Results with this PR:
```scala
scala> val start = System.currentTimeMillis; val model1 = svc.setMaxIter(30).fit(df); val end = System.currentTimeMillis; end - start
start: Long = 1587957534286
model1: org.apache.spark.ml.classification.LinearSVCModel = LinearSVCModel: uid=linearsvc_2fcd0abbb2d7, numClasses=2, numFeatures=8289919
end: Long = 1587957684508
res1: Long = 150222
```

Master:
```scala
scala> val start = System.currentTimeMillis; val model1 = svc.setMaxIter(30).fit(df); val end = System.currentTimeMillis; end - start
start: Long = 1587957959670
model1: org.apache.spark.ml.classification.LinearSVCModel = LinearSVCModel: uid=linearsvc_269e4f373d2c, numClasses=2, numFeatures=8289919
end: Long = 1587958111562
res1: Long = 151892
```

If we keep `blockSize=1`, there is no performance regression on the sparse dataset.
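As an aside, the `val start = ...; ...; end - start` pattern above can be wrapped in a reusable helper so repeated benchmark runs stay consistent. A minimal sketch (the helper name `timeMillis` and the `Thread.sleep` stand-in workload are illustrative, not from the PR; the real measurement wraps `svc.fit(df)`):

```scala
// Minimal wall-clock timing helper: runs the given block once and
// returns its result together with the elapsed milliseconds.
def timeMillis[T](body: => T): (T, Long) = {
  val start = System.currentTimeMillis
  val result = body
  (result, System.currentTimeMillis - start)
}

// Usage with a hypothetical workload standing in for svc.fit(df):
val (_, elapsed) = timeMillis { Thread.sleep(50) }
println(s"elapsed = $elapsed ms")
```

For single-run comparisons like the one above (~150 s per fit), JVM warm-up noise is small relative to the total, but the helper makes it easy to repeat the measurement a few times and compare medians.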