Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > Hm, actually that's the best case. You're exercising the case where the 
code path you prefer is fast. And the case where the precision bound applies is 
exactly the case where the branch you deleted helps. As I say, you'd have to 
show this is not impacting other cases significantly, and I think it should. 
Consider the sparse-sparse case.
    
    Here are the test results:
    
    First, we define the branch-path logic for the sparse-sparse and sparse-dense cases:
    if precisionBound1 is met, we call it LOGIC1
    if neither precisionBound1 nor precisionBound2 is met, we call it LOGIC2
    if precisionBound1 is not met but precisionBound2 is, we call it LOGIC3
    (There is a trick: you can manually change the precision value to force each of the above situations)
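    The branch classification above can be sketched as a small standalone helper (illustrative names of my own, not Spark code; `EPSILON` stands in for Spark's `MLUtils.EPSILON`, i.e. machine epsilon):

```scala
// Sketch of the LOGIC1/LOGIC2/LOGIC3 classification described above,
// mirroring the precision-bound conditions in fastSquaredDistance.
object BranchLogic {
  // Double-precision machine epsilon, standing in for MLUtils.EPSILON.
  val EPSILON: Double = 2.220446049250313e-16

  // Returns which branch a pair of vectors would take, given their norms,
  // inner product, and the requested precision. "Meeting" precisionBound2
  // here means precisionBound2 > precision, which triggers the sqdist fallback.
  def classify(norm1: Double, norm2: Double, dotValue: Double,
               precision: Double = 1e-6): String = {
    val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
    val normDiff = norm1 - norm2
    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
    if (precisionBound1 < precision) {
      "LOGIC1"  // fast path: distance from norms and one dot product
    } else {
      val sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
      val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
        (sqDist + EPSILON)
      if (precisionBound2 > precision) "LOGIC3" else "LOGIC2"
    }
  }
}
```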
    
    **sparse-sparse case time cost (milliseconds)**
    
    LOGIC1
    Before patch: 7786, 7970, 8086
    After patch: 7729, 7653, 7903
    LOGIC2
    Before patch: 8412, 9029, 8606
    After patch: 8603, 8724, 9024
    LOGIC3
    Before patch: 19365, 19146, 19351
    After patch: 18917, 19007, 19074
    
    **sparse-dense case time cost (milliseconds)**
    
    LOGIC1
    Before patch: 4195, 4014, 4409
    After patch: 4081, 3971, 4151
    LOGIC2
    Before patch: 4968, 5579, 5080
    After patch: 4980, 5472, 5148
    LOGIC3
    Before patch: 11848, 12077, 12168
    After patch: 11718, 11874, 11743
    
    And for the dense-dense case, as already discussed in the comments, only sqdist is used to calculate the distance.
    
    **dense-dense case time cost (milliseconds)**
    
    Before patch: 7340, 7816, 7672
    After patch: 5752, 5800, 5753
    
    The above results are based on fastSquaredDistance, which is shown below:
    
    ```scala
    private[mllib] def fastSquaredDistance(
        v1: Vector,
        norm1: Double,
        v2: Vector,
        norm2: Double,
        precision: Double = 1e-6): Double = {
      val n = v1.size
      require(v2.size == n)
      require(norm1 >= 0.0 && norm2 >= 0.0)
      val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
      val normDiff = norm1 - norm2
      var sqDist = 0.0
      /*
       * The relative error is
       * <pre>
       * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 + 2 |a^T b|) / ( \|a - b\|_2^2 ),
       * </pre>
       * which is bounded by
       * <pre>
       * 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
       * </pre>
       * The bound doesn't need the inner product, so we can use it as a
       * sufficient condition to check quickly whether the inner product
       * approach is accurate.
       */
      val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)

      if (precisionBound1 < precision &&
          (!v1.isInstanceOf[DenseVector] || !v2.isInstanceOf[DenseVector])) {
        sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
      } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
        val dotValue = dot(v1, v2)
        sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
        val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
          (sqDist + EPSILON)
        if (precisionBound2 > precision) {
          sqDist = Vectors.sqdist(v1, v2)
        }
      } else {
        sqDist = Vectors.sqdist(v1, v2)
      }

      sqDist
    }
    ```
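    As a side note, the identity the fast path relies on, ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a^T b, can be checked with a standalone sketch on plain arrays (illustrative names of my own, not Spark code):

```scala
object SqDistDemo {
  // Direct squared distance: sum of squared differences
  // (what the Vectors.sqdist fallback computes).
  def sqdistDirect(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Fast path: ||a||^2 + ||b||^2 - 2 a.b, which lets precomputed
  // norms be reused so only the dot product is paid per pair.
  def sqdistViaNorms(a: Array[Double], b: Array[Double]): Double = {
    val norm1Sq = a.map(x => x * x).sum
    val norm2Sq = b.map(x => x * x).sum
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    norm1Sq + norm2Sq - 2.0 * dot
  }
}
```

    The two agree for well-separated vectors, but for nearly identical vectors the norm-based form suffers catastrophic cancellation, which is exactly what the precision bounds in fastSquaredDistance guard against.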

