zhengruifeng commented on issue #26803: [SPARK-30178][ML] RobustScaler support large numFeatures
URL: https://github.com/apache/spark/pull/26803#issuecomment-566414715

test code:

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.feature._
import org.apache.spark.storage.StorageLevel

val rdd = sc.range(0, 10000000, 1, 100)
val df = rdd.map(i => Tuple1.apply(Vectors.dense((i % 1000).toDouble / 1000))).toDF("features")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

val scaler = new RobustScaler().setInputCol("features")
val start = System.currentTimeMillis
Seq.range(0, 100).foreach { _ => val model = scaler.fit(df) }
val end = System.currentTimeMillis
end - start
```

Results (total time for 100 fits, ms):

- Master: 243493
- This PR: 285341

I tested an edge case with only `numFeatures = 1`, and the existing implementation is about 17% faster than this PR. That is to say, this PR supports medium/large (>1000) `numFeatures` at the cost of some performance regression in low-dimensional cases. Or should we check `numFeatures` first and decide which method to use?
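A minimal sketch of the dispatch idea suggested above, assuming we pick a strategy once the input dimensionality is known. The threshold value and the strategy names here are hypothetical illustrations, not code from this PR:

```scala
// Illustrative sketch only: dispatch between two fitting strategies based on
// numFeatures. The threshold (1000) and the strategy labels are assumptions
// for demonstration; the real choice would come from benchmarking.
object RobustScalerDispatch {
  // Hypothetical cutoff, roughly matching the "medium/large (>1000)" wording.
  val NumFeaturesThreshold: Int = 1000

  def chooseStrategy(numFeatures: Int): String = {
    require(numFeatures > 0, s"numFeatures must be positive, got $numFeatures")
    if (numFeatures <= NumFeaturesThreshold) {
      // Low-dimensional input: keep the existing implementation, which the
      // benchmark above shows is ~17% faster for numFeatures = 1.
      "existing-impl"
    } else {
      // High-dimensional input: use the new implementation from this PR,
      // which is designed to scale to large numFeatures.
      "new-impl"
    }
  }
}
```

With this kind of guard, `fit` could route `numFeatures = 1` through the current code path and only pay the new implementation's overhead when the dimensionality actually demands it.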