Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893

**End-to-end test**

The test was run with the code below:

```scala
test("kmeanproblem") {
  val rdd = sc
    .textFile("/Users/liliang/Desktop/inputdata.txt")
    .map(f => f.split(",").map(f => f.toDouble))
  val vectorRdd = rdd.map(f => Vectors.dense(f))
  val startTime = System.currentTimeMillis()
  for (i <- 0 until 20) {
    val model = new KMeans()
      .setK(8)
      .setMaxIterations(100)
      .setInitializationMode(K_MEANS_PARALLEL)
      .run(vectorRdd)
  }
  val endTime = System.currentTimeMillis()
  // scalastyle:off println
  println("cost time: " + (endTime - startTime))
  // scalastyle:on println
}
```

Input data: 57216 items extracted from the EEG Steady-State Visual Evoked Potential Signals data set (http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals) form the test input.

Test result:
- Before the patch: 297686 ms (worst observed case)
- After the patch: 180544 ms (worst observed case)

**Function test**

Only the function fastSquaredDistance is exercised: it is called 100000000 times before and after the patch, respectively.

Input data:

```
1 2 3 4 3 4 5 6 7 8 9 0 1 3 4 6 7 4 2 2 5 7 8 9 3 2 3 5 7 8 9 3 3 2 1 1 2 2 9 3 3 4 5 4 5 2 1 5 6 3 2 1 3 4 6 7 8 9 0 3 2 1 2 3 4 5 6 7 8 5 3 2 1 4 5 6 7 8 4 3 2 4 6 7 8 9
```

Test result:
- Before the patch: 8395 ms
- After the patch: 5448 ms

Based on the tests above, the patch gives better performance for fastSquaredDistance in Spark's k-means implementation. (Furthermore, `sqDist = Vectors.sqdist(v1, v2)` performs better than `sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)`.)
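To make the last comparison concrete, here is a minimal, self-contained sketch (not the patch itself) of the two ways of computing the squared distance. The object name, the sample vectors, the iteration count, and the manual `dot` helper (standing in for MLlib's private BLAS dot) are assumptions for illustration only:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object SqDistSketch {
  // Plain dot product over dense vectors (assumption: dense input, equal lengths).
  private def dot(v1: Vector, v2: Vector): Double = {
    val a = v1.toArray
    val b = v2.toArray
    var i = 0
    var sum = 0.0
    while (i < a.length) {
      sum += a(i) * b(i)
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    val v1 = Vectors.dense(1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
    val v2 = Vectors.dense(9.0, 0.0, 1.0, 3.0, 4.0, 6.0, 7.0, 4.0, 2.0, 2.0)
    val n = 10000000 // assumed iteration count for this sketch

    // Variant 1: direct element-wise squared distance.
    var t0 = System.currentTimeMillis()
    var s = 0.0
    for (_ <- 0 until n) s = Vectors.sqdist(v1, v2)
    println(s"sqdist:         ${System.currentTimeMillis() - t0} ms (result $s)")

    // Variant 2: expansion via precomputed squared norms,
    // sqDist = ||v1||^2 + ||v2||^2 - 2 * (v1 . v2).
    val norm1 = Vectors.norm(v1, 2.0)
    val norm2 = Vectors.norm(v2, 2.0)
    val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
    t0 = System.currentTimeMillis()
    for (_ <- 0 until n) s = sumSquaredNorm - 2.0 * dot(v1, v2)
    println(s"norm expansion: ${System.currentTimeMillis() - t0} ms (result $s)")
  }
}
```

Both variants return the same value up to floating-point error; the timing difference between them is what the function test above measures.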