[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323412#comment-14323412 ]

Travis Galoppo commented on SPARK-5016:
---------------------------------------

Realistically, I think it will be very difficult to realize any performance 
increase from this modification.  In particular, the algorithm simply will not 
work well in high enough dimension to make it worthwhile (from the numFeatures 
perspective, anyway).  Consider that the density of a multivariate Gaussian 
underflows EPSILON *at the mean* when numFeatures > -2 * log(EPSILON) / 
log(2*pi); this means 40 features will underflow 2.2204e-16 (eps in Octave on 
my laptop), and 131 features will underflow 1e-52.  As the pdf approaches eps, 
points are assigned nearly uniformly to all clusters, so the algorithm breaks 
down.  These are not particularly large matrices, and I suspect the SVD time 
is too small to justify the extra communication.  At a minimum, I would 
suggest some solid benchmarking to make sure this is a real improvement.
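The underflow arithmetic above can be checked directly.  A quick sketch (my 
wording, assuming an identity covariance so |Sigma| = 1, where the density at 
the mean is simply (2*pi)^(-d/2)):

```python
import math

def density_at_mean(d):
    # Density of a standard multivariate Gaussian at its mean, assuming
    # identity covariance: (2*pi)^(-d/2).
    return (2 * math.pi) ** (-d / 2)

# Double-precision machine epsilon (the 2.2204e-16 quoted above).
eps = 2.220446049250313e-16

# The density drops below eps once d > -2 * log(eps) / log(2*pi).
threshold = -2 * math.log(eps) / math.log(2 * math.pi)
print(threshold)                     # ~39.2, so 40 features underflow eps
print(density_at_mean(40) < eps)     # True
print(density_at_mean(131) < 1e-52)  # True
```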


> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-5016
>                 URL: https://issues.apache.org/jira/browse/SPARK-5016
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse 
> computation for Gaussian initialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
