[GitHub] spark issue #14937: [WIP] [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KM...

srowen Sun, 04 Sep 2016 04:32:41 -0700

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/14937
  
    @yanboliang here are a few other changes I made in my PR that accidentally 
duplicated some of this work. Refer to 
https://github.com/apache/spark/pull/14948 for details. For your consideration:
    
    I think getRuns/setRuns should be formally deprecated and the runs param to 
the constructor removed (it's private). 
    
    There are some mentions of 'runs' in the docs that should be removed too at 
this point.
    
    mergeContribs and the "type WeightedPoint" don't really serve a purpose at 
this point and can be 'inlined' IMHO.
    
    Minor: the "contribs.iterator" can really be an iterator only over triples 
with non-zero counts, which eliminates the filtering by 0 counts
    
    The "run finished" log message is obsolete now.
    
    Minor, but in k-means|| the sample of 1 element is very slightly better if 
it's without replacement. Won't matter much but otherwise you might sample a 
couple elements.
    
    pointsWithCosts.flatMap might be a little faster as filter + map instead 
because virtually every element is filtered out.
    
    mergeNewCenters() is pretty superfluous, because it's simpler to compute 
newCenters, then add it to centers, in the same loop. No clear() or multiple 
calls to update this.
    
    weightMap can be computed with countByValue directly



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14937: [WIP] [SPARK-8519][SPARK-11560] [ML] [MLlib] Optimize KM...

Reply via email to