[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

KyleLi1985 Thu, 01 Nov 2018 04:52:50 -0700

Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM).
    > 
    > Are all the vectors dense? I suppose I'm still surprised if sqdist is faster than dot here as it ought to be a little more math. The sparse-dense case might come out differently, note.
    > 
    > And I suppose I have a hard time believing that the sparse-sparse case is faster after this change (when the precision bound is met) because now it's handled in the sparse-sparse if case in this code, which definitely does a dot plus more work.
    > 
    > (If you did remove this check you could remove some other values that get computed to check this bound, like precision1)
    
    We use only dense vectors. Here is the test file:
    
[SparkMLlibTest.txt](https://github.com/apache/spark/files/2538356/SparkMLlibTest.txt)
    I extracted the relevant part of the code and compared the performance. The result shows that in the dense-vector case sqdist is faster.
    For the end-to-end test, I considered the worst case: the input vectors are all dense and the precision bound is not met.
    
    `
    
        if (precisionBound1 < precision && !v1.isInstanceOf[DenseVector]
          && !v2.isInstanceOf[DenseVector]) {
          sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
        } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
          val dotValue = dot(v1, v2)
          sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
          val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
            (sqDist + EPSILON)
          if (precisionBound2 > precision) {
            sqDist = Vectors.sqdist(v1, v2)
          }
        } else {
          sqDist = Vectors.sqdist(v1, v2)
        }
    `
    
    With the logic shown above, dense inputs use only sqdist to calculate the distance.
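
To make the trade-off concrete, here is a minimal, Spark-free sketch of the two ways the snippet computes a squared distance: the expanded form `sumSquaredNorm - 2.0 * dot(v1, v2)` and the direct `sqdist`, with the `precisionBound2` check deciding whether to fall back. The names `DenseDistanceSketch`, `Epsilon`, and `sqDistWithFallback` are hypothetical, plain `Array[Double]` stands in for dense vectors, and both `math.ulp(1.0)` as the value of `EPSILON` and the default `precision` of `1e-6` are assumptions rather than values stated in this thread.

```scala
object DenseDistanceSketch {
  // Assumption: machine epsilon for Double stands in for the EPSILON constant.
  val Epsilon: Double = math.ulp(1.0)

  // Plain dot product of two equally sized dense arrays.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    var sum = 0.0
    var i = 0
    while (i < a.length) { sum += a(i) * b(i); i += 1 }
    sum
  }

  // Direct squared Euclidean distance: sum of squared element differences.
  def sqdist(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    var sum = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); sum += d * d; i += 1 }
    sum
  }

  // Same expression as precisionBound2 in the snippet above.
  def precisionBound2(sumSquaredNorm: Double, dotValue: Double, sqDist: Double): Double =
    Epsilon * (sumSquaredNorm + 2.0 * math.abs(dotValue)) / (sqDist + Epsilon)

  // Try the expanded form first; fall back to sqdist when the bound is exceeded.
  // The default precision of 1e-6 is an assumption for this sketch.
  def sqDistWithFallback(a: Array[Double], b: Array[Double], precision: Double = 1e-6): Double = {
    val sumSquaredNorm = dot(a, a) + dot(b, b)
    val dotValue = dot(a, b)
    val expanded = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
    if (precisionBound2(sumSquaredNorm, dotValue, expanded) > precision) sqdist(a, b)
    else expanded
  }
}
```

For well-separated vectors such as `Array(1.0, 2.0, 3.0)` and `Array(4.0, 6.0, 8.0)` the expanded form is accepted as is. For nearly identical vectors with large norms, such as `Array(1e8, 0.0)` and `Array(1e8, 1e-3)`, cancellation drives the expanded form to zero, the bound is exceeded, and the sketch ends up paying for both the dot product and `sqdist`. That double cost is the worst case the proposed dense-only branch avoids.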


