[GitHub] [spark] zhengruifeng commented on issue #25178: [SPARK-28421][ML] SparseVector.apply performance optimization
URL: https://github.com/apache/spark/pull/25178#issuecomment-514449691

@srowen How about backporting it to 2.X?
[GitHub] [spark] zhengruifeng commented on issue #25178: [SPARK-28421][ML] SparseVector.apply performance optimization
URL: https://github.com/apache/spark/pull/25178#issuecomment-514041986

The expected cost without the range check is `E(cost(apply2)) = log(NNZ)`, while the cost with the range check is `E(cost(apply3)) = 2 + P(key in range) * log(NNZ)`. The difference is `E(cost(apply3) - cost(apply2)) = 2 - P(key out of range) * log(NNZ)`, so the benefit of the optimization depends strongly on the key distribution and on `NNZ`. The benchmark suite in my earlier comment assumes the input keys are drawn from a uniform distribution, and shows that if `NNZ` is small, the range check adds about 10% overhead; otherwise, it saves about 50% of the cost.
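For illustration, a minimal sketch of the cost model above (mine, not from the PR; `CostModel` and the sample `nnz`/`pInRange` values are hypothetical), assuming uniformly distributed keys:

```scala
// A minimal sketch of the cost model above; not Spark code.
object CostModel {
  // Expected comparisons for the direct binary search (apply2).
  def costApply2(nnz: Int): Double =
    math.log(nnz.toDouble) / math.log(2)

  // Expected cost with the extra range check (apply3): two comparisons up
  // front, plus a binary search only when the key falls inside the range
  // [indices.head, indices.last].
  def costApply3(nnz: Int, pInRange: Double): Double =
    2 + pInRange * costApply2(nnz)

  def main(args: Array[String]): Unit = {
    // Hypothetical parameter grid; diff > 0 means the range check loses.
    for (nnz <- Seq(8, 64, 1024); p <- Seq(0.1, 0.5, 0.9)) {
      val diff = costApply3(nnz, p) - costApply2(nnz) // = 2 - (1 - p) * log2(nnz)
      println(f"nnz=$nnz%4d pInRange=$p%.1f diff=$diff%+.2f comparisons")
    }
  }
}
```

The sign of `diff` flips once `P(key out of range) * log(NNZ)` exceeds the two extra comparisons, which matches the observation that the check only pays off for larger `NNZ` or skewed keys.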
[GitHub] [spark] zhengruifeng commented on issue #25178: [SPARK-28421][ML] SparseVector.apply performance optimization
URL: https://github.com/apache/spark/pull/25178#issuecomment-514035619

I tested the performance of the current implementation (`apply`), a direct binary search (`apply2`), and a binary search with an extra range check (`apply3`):

```scala
import java.util

// Candidate implementations, added to SparseVector for this benchmark;
// `size`, `indices` and `values` are the fields of the vector.
def apply2(i: Int): Double = {
  if (i < 0 || i >= size) {
    throw new IndexOutOfBoundsException(s"Index $i out of bounds [0, $size)")
  }
  val j = util.Arrays.binarySearch(indices, i)
  if (j < 0) 0.0 else values(j)
}

def apply3(i: Int): Double = {
  if (i < 0 || i >= size) {
    throw new IndexOutOfBoundsException(s"Index $i out of bounds [0, $size)")
  }
  // Range check first: keys outside [indices.head, indices.last] are
  // zeros, so the binary search can be skipped entirely.
  if (indices.isEmpty || i < indices(0) || i > indices(indices.length - 1)) {
    0.0
  } else {
    val j = util.Arrays.binarySearch(indices, i)
    if (j < 0) 0.0 else values(j)
  }
}
```

The test suite is similar to the one above:

```scala
import scala.util.Random
import org.apache.spark.ml.linalg._

val size = 1000
for (nnz <- Seq(100, 1, 100)) {
  val rng = new Random(123)
  val indices = Array.fill(nnz + nnz)(rng.nextInt.abs % size).distinct.sorted.take(nnz)
  val values = Array.fill(nnz)(rng.nextDouble)
  val vec = Vectors.sparse(size, indices, values).toSparse

  val tic1 = System.currentTimeMillis
  (0 until 100).foreach { round =>
    var i = 0; var sum = 0.0
    while (i < size) { sum += vec(i); i += 1 }
  }
  val toc1 = System.currentTimeMillis

  val tic2 = System.currentTimeMillis
  (0 until 100).foreach { round =>
    var i = 0; var sum = 0.0
    while (i < size) { sum += vec.apply2(i); i += 1 }
  }
  val toc2 = System.currentTimeMillis

  val tic3 = System.currentTimeMillis
  (0 until 100).foreach { round =>
    var i = 0; var sum = 0.0
    while (i < size) { sum += vec.apply3(i); i += 1 }
  }
  val toc3 = System.currentTimeMillis

  println((size, nnz, toc1 - tic1, toc2 - tic2, toc3 - tic3))
}
```

Results (durations in ms):

| size | nnz | apply (old) | apply2 | apply3 |
|------|-----|-------------|--------|--------|
| 1000 | 100 | 75294 | 13744 | 15318 |
| 1000 | 1   | 75616 | 33543 | 21112 |
| 1000 | 100 | 92949 | 52483 | 34215 |

`apply2` is faster only in the first `nnz`=100 case of the suite; I guess that is because the search cost there is really small, so the two extra range checks are not negligible. But when `nnz` is relatively large (though still small compared to `size`), `apply3` is faster than `apply2`. I guess that is because it avoids calling `Arrays.binarySearch`: `binarySearch` does not check the range internally, so it still performs `log(nnz)` comparisons even when the input key is out of range.

In sum, I tend to keep the current PR with the extra range check.
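To illustrate the last point, here is a small sketch (mine, not Spark or JDK code; `search` mirrors the algorithm of `java.util.Arrays.binarySearch(int[], int)` with a probe counter added) showing that an out-of-range key still costs about `log2(nnz)` probes when there is no up-front range check:

```scala
// Mirrors java.util.Arrays.binarySearch(int[], int) and counts probes;
// a sketch for illustration, not Spark or JDK code.
object BinarySearchProbes {
  def search(a: Array[Int], key: Int): (Int, Int) = {
    var lo = 0
    var hi = a.length - 1
    var probes = 0
    while (lo <= hi) {
      probes += 1
      val mid = (lo + hi) >>> 1
      if (a(mid) < key) lo = mid + 1
      else if (a(mid) > key) hi = mid - 1
      else return (mid, probes)
    }
    (-(lo + 1), probes) // insertion-point encoding, as in the JDK
  }

  def main(args: Array[String]): Unit = {
    val indices = Array.tabulate(1024)(_ * 2) // 1024 sorted even ints: 0, 2, ..., 2046
    println(search(indices, 4096)._2) // key above indices.last: ~11 probes
    println(search(indices, -1)._2)   // key below indices.head: ~10 probes
  }
}
```

A cheap `i < indices(0) || i > indices(indices.length - 1)` guard, as in `apply3`, turns both of these misses into two comparisons.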
[GitHub] [spark] zhengruifeng commented on issue #25178: [SPARK-28421][ML] SparseVector.apply performance optimization
URL: https://github.com/apache/spark/pull/25178#issuecomment-514027613

I added the extra range check just because the `findOffset` function in `breeze.collection.mutable.SparseArray` does the same thing, with the comment "special case for end of list - this is a big win for growing sparse arrays". @srowen @dongjoon-hyun @kiszk I will do another simple test to find out whether the extra range check helps.