[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

KyleLi1985 Fri, 02 Nov 2018 23:01:59 -0700

Github user KyleLi1985 commented on the issue:

    https://github.com/apache/spark/pull/22893
  
    > So the pull request right now doesn't reflect what you tested, but you 
tested the version pasted above. You're saying that the optimization just never 
helps the dense-dense case, and sqdist is faster than a dot product. This 
doesn't make sense mathematically as it should be more math, but stranger 
things have happened.
    > 
    > Still, I don't follow your test code here. You parallelize one vector, 
map it, collect it: why use Spark? and it's the same vector over and over, and 
it's not a big vector. Your sparse vectors aren't very sparse.
    > 
    > How about more representative input -- larger vectors (100s of elements, 
probably), more sparse sparse vectors, and a large set of different inputs. I 
also don't see where the precision bound is changed here?
    > 
    > This may be a good change but I'm just not yet convinced by the test 
methodology, and the result still doesn't make much intuitive sense.
    
    1) Why use Spark? No special reason; it is simply the tool I normally 
use.
    
    2) About the vectors: I ran a test with more representative input. The 
results are shown below.
    
    3) About the precision: it is a trick. You can force the calculation into 
whichever branch you want by changing the precision manually. As I said in my 
last comment, taking LOGIC2 as an example, you can manually change precision 
to -10000 in (precisionbound1 < precision) and to 10000 in 
(precisionbound2 > precision), so that the calculation goes into the LOGIC2 
branch. It is like a code-coverage exercise. In any case, our goal is to show 
that, for the same calculation logic, performance does not change before and 
after the enhancement for the sparse-sparse and sparse-dense cases.
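The branch-forcing trick in point 3 can be illustrated with a minimal Python sketch. This is not the Spark source: the function name `fast_squared_distance`, the separate `precision1` / `precision2` parameters, the `any_sparse` flag and the branch labels are all hypothetical, and the two bounds are only assumed to resemble the ones named in the comment. The LOGIC labels are not defined in this message, so the sketch uses its own branch names.

```python
import math
import sys

# Assumption: EPSILON stands for float64 machine epsilon.
EPSILON = sys.float_info.epsilon


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def sqdist(a, b):
    """Exact squared Euclidean distance, computed element by element."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fast_squared_distance(v1, v2, any_sparse, precision1=1e-6, precision2=1e-6):
    """Return (squared distance, name of the branch that produced it).

    precision1 and precision2 are hypothetical knobs that mimic editing the
    precision by hand inside each comparison.
    """
    norm1 = math.sqrt(dot(v1, v1))
    norm2 = math.sqrt(dot(v2, v2))
    sum_sq_norm = norm1 * norm1 + norm2 * norm2
    norm_diff = norm1 - norm2
    bound1 = 2.0 * EPSILON * sum_sq_norm / (norm_diff * norm_diff + EPSILON)
    if bound1 < precision1:
        # Norm-based shortcut: ||a||^2 + ||b||^2 - 2 a.b
        return sum_sq_norm - 2.0 * dot(v1, v2), "norm-based"
    if any_sparse:
        d = dot(v1, v2)
        sq = max(sum_sq_norm - 2.0 * d, 0.0)
        bound2 = EPSILON * (sum_sq_norm + 2.0 * abs(d)) / (sq + EPSILON)
        if bound2 > precision2:
            # Shortcut judged too imprecise: fall back to the exact distance.
            return sqdist(v1, v2), "sparse-exact"
        return sq, "sparse-norm-based"
    return sqdist(v1, v2), "dense-exact"


a, b = [1.0, 0.0, 2.0], [0.0, 3.0, 2.0]
# bound1 is never negative, so precision1=-10000 always skips the first branch;
# bound2 is tiny here, so precision2=10000 keeps the sparse norm-based result.
forced = fast_squared_distance(a, b, any_sparse=True,
                               precision1=-10000, precision2=10000)
```

Because both bounds are non-negative, an extreme value on either side of a comparison fixes its outcome regardless of the input, which is why the timing of one branch can be measured in isolation.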
    
    Here is my test file:
    
[SparkMLlibTest.txt](https://github.com/apache/spark/files/2544667/SparkMLlibTest.txt)
    
    Here is my test data.
    I use the dataset at
    
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
    and extract the files PS1, PS2, PS3, PS4, PS5 and PS6 to form the test data.
    
    Total instances: 13230
    Attributes per line: 6000
    
    **Time cost for the sparse-sparse case (milliseconds)**
    Before enhancement: 7670, 7704, 7652
    After enhancement: 7634, 7729, 7645
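
As a rough check on the timings above, the following Python sketch compares the mean of each set of three runs with the run-to-run spread. The variable names are illustrative only, and three runs per configuration is too few for a real significance test.

```python
from statistics import mean

# Timings in milliseconds, copied from the results above.
before = [7670, 7704, 7652]
after = [7634, 7729, 7645]

mean_before = mean(before)
mean_after = mean(after)
mean_diff = mean_before - mean_after

# Spread (max - min) within each configuration, as a crude noise estimate.
spread_before = max(before) - min(before)
spread_after = max(after) - min(after)

print(f"mean before: {mean_before:.1f} ms, mean after: {mean_after:.1f} ms")
print(f"difference of means: {mean_diff:.1f} ms")
print(f"spread before: {spread_before} ms, spread after: {spread_after} ms")
```

The difference between the two means is smaller than the spread within either configuration, which is consistent with the claim that performance is unchanged for this case.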




