zhengruifeng commented on issue #26803: [SPARK-30178][ML] RobustScaler support large numFeatures
URL: https://github.com/apache/spark/pull/26803#issuecomment-566414715

test code:

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.feature._
import org.apache.spark.storage.StorageLevel

val rdd = sc.range(0, 10000000, 1, 100)
val df = rdd.map(i => Tuple1.apply(Vectors.dense((i % 1000).toDouble / 1000))).toDF("features")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

val scaler = new RobustScaler().setInputCol("features")
val start = System.currentTimeMillis
Seq.range(0, 100).foreach { _ => val model = scaler.fit(df) }
val end = System.currentTimeMillis
end - start
```

Results (total time for 100 fits, ms):

- Master: 243493
- This PR: 285341

I tested an edge case with only `numFeatures = 1`, and the existing implementation is about 17% faster than this PR. That is to say, this PR supports medium/large (>1000) `numFeatures` at the cost of some performance regression in low-dimensional cases. Or should we check `numFeatures` first and decide which method to use?
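A minimal sketch of the dispatch idea suggested above, assuming we pick a strategy once the input dimensionality is known. The threshold value and the strategy names here are hypothetical illustrations, not code from this PR:

```scala
// Illustrative sketch only: dispatch between two fitting strategies based on
// numFeatures. The threshold (1000) and the strategy labels are assumptions
// for demonstration; the real choice would come from benchmarking.
object RobustScalerDispatch {
  // Hypothetical cutoff, roughly matching the "medium/large (>1000)" wording.
  val NumFeaturesThreshold: Int = 1000

  def chooseStrategy(numFeatures: Int): String = {
    require(numFeatures > 0, s"numFeatures must be positive, got $numFeatures")
    if (numFeatures <= NumFeaturesThreshold) {
      // Low-dimensional input: keep the existing implementation, which the
      // benchmark above shows is ~17% faster for numFeatures = 1.
      "existing-impl"
    } else {
      // High-dimensional input: use the new implementation from this PR,
      // which is designed to scale to large numFeatures.
      "new-impl"
    }
  }
}
```

With this kind of guard, `fit` could route `numFeatures = 1` through the current code path and only pay the new implementation's overhead when the dimensionality actually demands it.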