[GitHub] [spark] zhengruifeng commented on issue #25178: [SPARK-28421][ML] SparseVector.apply performance optimization

2019-07-23 Thread GitBox
zhengruifeng commented on issue #25178: [SPARK-28421][ML] SparseVector.apply performance optimization
URL: https://github.com/apache/spark/pull/25178#issuecomment-514449691
 
 
   @srowen  How about backporting it to 2.X?  





[GitHub] [spark] zhengruifeng commented on issue #25178: [SPARK-28421][ML] SparseVector.apply performance optimization

2019-07-22 Thread GitBox
zhengruifeng commented on issue #25178: [SPARK-28421][ML] SparseVector.apply performance optimization
URL: https://github.com/apache/spark/pull/25178#issuecomment-514041986
 
 
   The expected cost without the range check is `E(cost(apply2)) = log(NNZ)`, while the cost with the range check is `E(cost(apply3)) = 2 + P(key in range) * log(NNZ)`.
   The difference is `E(cost(apply3) - cost(apply2)) = 2 - P(key out of range) * log(NNZ)`, so how much the optimization helps depends strongly on the key distribution and on `NNZ`.
   The suite above assumes the input keys are drawn from a uniform distribution. It shows that when `NNZ` is small, the range check adds about 10% overhead; otherwise, it saves about 50% of the cost.
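
   For intuition, here is a minimal standalone sketch (my addition, not from the PR) that plugs numbers into the model above; the range check pays off once `2 < P(key out of range) * log(NNZ)`:
   ```
// Sketch: evaluating the expected-cost model above (comparisons, not ms).
object RangeCheckModel {
  // apply2: plain binary search over the nnz stored indices.
  def costApply2(nnz: Int): Double = math.log(nnz) / math.log(2)

  // apply3: two endpoint comparisons, then a search only for in-range keys.
  def costApply3(nnz: Int, pIn: Double): Double = 2 + pIn * costApply2(nnz)

  def main(args: Array[String]): Unit = {
    for (nnz <- Seq(4, 64, 1024); pIn <- Seq(0.9, 0.5, 0.1)) {
      val diff = costApply3(nnz, pIn) - costApply2(nnz)
      println(f"nnz=$nnz%5d  P(in range)=$pIn%.1f  apply3 - apply2 = $diff%+.2f")
    }
  }
}
   ```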
   





[GitHub] [spark] zhengruifeng commented on issue #25178: [SPARK-28421][ML] SparseVector.apply performance optimization

2019-07-22 Thread GitBox
zhengruifeng commented on issue #25178: [SPARK-28421][ML] SparseVector.apply performance optimization
URL: https://github.com/apache/spark/pull/25178#issuecomment-514035619
 
 
   I tested the performance of the current implementation (`apply`), a direct binary search (`apply2`), and a binary search with an extra range check (`apply3`):
   ```
def apply2(i: Int): Double = {
  if (i < 0 || i >= size) {
    throw new IndexOutOfBoundsException(s"Index $i out of bounds [0, $size)")
  }

  // Always binary-search the stored indices.
  val j = util.Arrays.binarySearch(indices, i)
  if (j < 0) 0.0 else values(j)
}

def apply3(i: Int): Double = {
  if (i < 0 || i >= size) {
    throw new IndexOutOfBoundsException(s"Index $i out of bounds [0, $size)")
  }

  // Short-circuit keys outside [indices.head, indices.last] before searching.
  if (indices.isEmpty || i < indices(0) || i > indices(indices.length - 1)) {
    0.0
  } else {
    val j = util.Arrays.binarySearch(indices, i)
    if (j < 0) 0.0 else values(j)
  }
}
   ```
   
   The test suite is similar to the one above:
   ```
import scala.util.Random
import org.apache.spark.ml.linalg._

// Note: assumes apply2/apply3 above have been patched into SparseVector.
val size = 1000
for (nnz <- Seq(100, 1, 100)) {
  val rng = new Random(123)
  // Draw 2 * nnz candidates, keep up to nnz distinct sorted indices.
  // nextInt(size) avoids the Int.MinValue corner case of nextInt.abs % size.
  val indices = Array.fill(nnz + nnz)(rng.nextInt(size)).distinct.sorted.take(nnz)
  val values = Array.fill(nnz)(rng.nextDouble)
  val vec = Vectors.sparse(size, indices, values).toSparse

  val tic1 = System.currentTimeMillis
  (0 until 100).foreach { _ => var i = 0; var sum = 0.0; while (i < size) { sum += vec(i); i += 1 } }
  val toc1 = System.currentTimeMillis

  val tic2 = System.currentTimeMillis
  (0 until 100).foreach { _ => var i = 0; var sum = 0.0; while (i < size) { sum += vec.apply2(i); i += 1 } }
  val toc2 = System.currentTimeMillis

  val tic3 = System.currentTimeMillis
  (0 until 100).foreach { _ => var i = 0; var sum = 0.0; while (i < size) { sum += vec.apply3(i); i += 1 } }
  val toc3 = System.currentTimeMillis

  println((size, nnz, toc1 - tic1, toc2 - tic2, toc3 - tic3))
}
   ```
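
   Wall-clock timing like this also includes JIT warm-up; a hedged variant of one measurement loop (my addition, not part of the PR) that runs a warm-up pass first and times with `System.nanoTime`:
   ```
// Sketch: time a block after one warm-up pass, reporting milliseconds.
// `body` stands in for one of the measurement loops above; `vec` and
// `size` are assumed to be in scope from the suite.
def timed(label: String)(body: => Unit): Unit = {
  body                                   // warm-up pass, lets the JIT compile
  val tic = System.nanoTime
  body                                   // measured pass
  println(s"$label: ${(System.nanoTime - tic) / 1e6} ms")
}

timed("apply") {
  (0 until 100).foreach { _ => var i = 0; var sum = 0.0; while (i < size) { sum += vec(i); i += 1 } }
}
   ```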
   
   
   
   | size | nnz | apply (old), ms | apply2, ms | apply3, ms |
   |------|-----|-----------------|------------|------------|
   | 1000 | 100 | 75294 | 13744 | 15318 |
   | 1000 | 1 | 75616 | 33543 | 21112 |
   | 1000 | 100 | 92949 | 52483 | 34215 |
   
   The `apply2` here is faster than `apply3` only when `nnz` is 100 in this suite; I guess that is because the search complexity is really small there, so the cost of the two extra range checks cannot be ignored.
   But when `nnz` is relatively larger (though still small), `apply3` is faster than `apply2`. I guess that is because it avoids calling `Arrays.binarySearch`: `Arrays.binarySearch` does not check the range internally, so it still performs about `log(nnz)` comparisons even when the input key is out of range.
   In sum, I tend to keep the current PR with the extra range check.
   





[GitHub] [spark] zhengruifeng commented on issue #25178: [SPARK-28421][ML] SparseVector.apply performance optimization

2019-07-22 Thread GitBox
zhengruifeng commented on issue #25178: [SPARK-28421][ML] SparseVector.apply performance optimization
URL: https://github.com/apache/spark/pull/25178#issuecomment-514027613
 
 
   I added the extra range check just because `breeze.collection.mutable.SparseArray` does the same in its `findOffset` function, with the comment `special case for end of list - this is a big win for growing sparse arrays`.

   @srowen @dongjoon-hyun @kiszk I will do another simple test to find out whether the extra range check helps.
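
   For reference, the endpoint special-casing looks roughly like this (a paraphrase of the pattern, not breeze's actual code); the `~insertionPoint` convention of `binarySearch` is reused for misses:
   ```
import java.util.Arrays

// Sketch: answer keys at or past the ends without a full binary search,
// mirroring the "special case for end of list" idea from findOffset.
def findOffsetSketch(indices: Array[Int], i: Int): Int = {
  if (indices.isEmpty || i > indices(indices.length - 1)) {
    ~indices.length    // past the end: the common case when growing
  } else if (i < indices(0)) {
    ~0                 // before the first stored index
  } else {
    Arrays.binarySearch(indices, i)
  }
}
   ```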

