Github user srowen commented on the issue: https://github.com/apache/spark/pull/22893

So the pull request right now doesn't reflect what you tested; you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case, and that sqdist is faster than a dot product. That doesn't make sense mathematically, since it should involve more arithmetic, but stranger things have happened.

Still, I don't follow your test code here. You parallelize one vector, map it, and collect it: why use Spark at all? And it's the same vector over and over, and it's not a big vector. Your sparse vectors aren't very sparse. How about more representative input: larger vectors (hundreds of elements, probably), sparser sparse vectors, and a large set of different inputs? I also don't see where the precision bound is changed here.

This may be a good change, but I'm just not yet convinced by the test methodology, and the result still doesn't make much intuitive sense.
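For illustration, a benchmark along the lines suggested above could be sketched in plain Scala without Spark, comparing the direct sum-of-squared-differences distance against the dot-product identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b) over many distinct random pairs. This is only a sketch of the methodology being asked for; the object and method names (`SqDistBench`, `sqdistDirect`, `sqdistViaDot`) are hypothetical and not from the PR:

```scala
import scala.util.Random

object SqDistBench {
  // Direct squared Euclidean distance: sum of squared element-wise differences.
  def sqdistDirect(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  // Plain dot product of two equal-length dense vectors.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // Squared distance via the identity ||a||^2 + ||b||^2 - 2(a . b).
  def sqdistViaDot(a: Array[Double], b: Array[Double]): Double =
    dot(a, a) + dot(b, b) - 2.0 * dot(a, b)

  def main(args: Array[String]): Unit = {
    val rnd    = new Random(42)
    val n      = 500  // "hundreds of elements", not a tiny vector
    val trials = 1000 // many distinct input pairs, not one vector repeated
    var maxErr = 0.0
    for (_ <- 0 until trials) {
      val a = Array.fill(n)(rnd.nextDouble())
      val b = Array.fill(n)(rnd.nextDouble())
      maxErr = math.max(maxErr, math.abs(sqdistDirect(a, b) - sqdistViaDot(a, b)))
    }
    println(f"max abs difference over $trials pairs: $maxErr%.3e")
  }
}
```

Timing each variant over fresh, representative inputs like these (and separately over genuinely sparse vectors) would give a more convincing comparison than collecting the same small vector through Spark.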