zhengruifeng commented on issue #27261: [SPARK-30503][ML] OnlineLDAOptimizer does not handle persistance correctly
URL: https://github.com/apache/spark/pull/27261#issuecomment-575565245
 
 
   testCode:
   ```scala
   import org.apache.spark.ml.clustering.LDA
   
   val dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")
   
   val lda = new LDA().setK(10).setMaxIter(100).setOptimizer("em")
   
   // RDDs persisted before fitting (baseline)
   sc.getPersistentRDDs
   
   // Time a single fit; the last expression is the elapsed time in milliseconds
   val start = System.currentTimeMillis; val model = lda.fit(dataset); val end = System.currentTimeMillis; end - start
   
   // RDDs still persisted after fitting
   sc.getPersistentRDDs
   
   sc.getPersistentRDDs.size
   
   sc.getPersistentRDDs.foreach(println)
   ```
   
   this PR:
   ```scala
   start: Long = 1579250257523
   model: org.apache.spark.ml.clustering.LDAModel = DistributedLDAModel: uid=lda_2a48ae87b788, k=10, numFeatures=11
   end: Long = 1579250268529
   res1: Long = 11006
   
   
   scala> sc.getPersistentRDDs.foreach(println)
   (2441,EdgeRDD MapPartitionsRDD[2441] at mapPartitions at EdgeRDDImpl.scala:119)
   (2438,VertexRDD, VertexRDD ZippedPartitionsRDD2[2438] at zipPartitions at VertexRDD.scala:322)
   (29,VertexRDD, VertexRDD ZippedPartitionsRDD2[29] at zipPartitions at VertexRDD.scala:322)
   (32,EdgeRDD MapPartitionsRDD[32] at mapPartitions at EdgeRDDImpl.scala:119)
   ```
   
   master:
   ```scala
   scala> val start = System.currentTimeMillis; val model = lda.fit(dataset); val end = System.currentTimeMillis; end - start
   start: Long = 1579255989886
   model: org.apache.spark.ml.clustering.LDAModel = DistributedLDAModel: uid=lda_f600c29d8e0a, k=10, numFeatures=11
   end: Long = 1579256001181
   res1: Long = 11295
   
   scala> sc.getPersistentRDDs.size
   res2: Int = 106
   ```
   
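   There seems to be no performance regression: `fit` took 11006 ms with this PR and 11295 ms on master. After fitting, the listing shows 4 RDDs still persisted with this PR, compared with 106 on master.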
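   
   The manual steps in the test code (timing one `fit` call and comparing `sc.getPersistentRDDs` before and after) could be wrapped in a small helper. This is only a sketch: `timedWithLeakCheck` and its `persistedIds` parameter are hypothetical names, not part of Spark, and the helper takes a plain function so it can be tried without a cluster.
   
   ```scala
   // Hypothetical helper (not a Spark API): runs `body` once and returns
   //   - the result of `body`,
   //   - the elapsed wall-clock time in milliseconds,
   //   - the IDs that are persisted after the run but were not persisted before it.
   // `persistedIds` is any function returning the IDs of currently persisted RDDs.
   def timedWithLeakCheck[T](persistedIds: () => Set[Int])(body: => T): (T, Long, Set[Int]) = {
     val before = persistedIds()
     val start = System.nanoTime()
     val result = body
     val elapsedMs = (System.nanoTime() - start) / 1000000L
     val leaked = persistedIds() -- before
     (result, elapsedMs, leaked)
   }
   ```
   
   In `spark-shell`, assuming the `sc`, `lda` and `dataset` values from the test code, it might be called as `timedWithLeakCheck(() => sc.getPersistentRDDs.keySet.toSet)(lda.fit(dataset))`. An empty third element would mean that the fit left no newly persisted RDDs behind.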
