[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323412#comment-14323412 ]
Travis Galoppo commented on SPARK-5016:
---------------------------------------

Realistically, I think it will be very difficult to realize any performance increase from this modification. In particular, the algorithm simply will not work well in high enough dimension to make it worthwhile (from the numFeatures perspective, anyway). Consider that the density of a multivariate Gaussian will underflow EPSILON *at the mean* when numFeatures > -2 * log(EPSILON) / log(2*pi); this means 40 features will underflow 2.2204e-16 (eps in Octave on my laptop), and 131 features would underflow 1e-52. As the pdf approaches EPS, it will assign points uniformly to all clusters... so it breaks.

These are not particularly large matrices... I'm guessing the SVD time is too small to make the extra communication worthwhile. At a minimum, I would suggest some solid benchmarking to make sure this is a real improvement.

> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-5016
>                 URL: https://issues.apache.org/jira/browse/SPARK-5016
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse
> computation for Gaussian initialization.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
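A quick numeric sanity check of the underflow claim above (a sketch in Python rather than MLlib's Scala; the threshold formula and the eps value are taken directly from the comment). For a standard multivariate Gaussian, the density at the mean is (2*pi)**(-d/2), which drops below machine epsilon once d > -2*log(eps)/log(2*pi):

```python
import math

# Double-precision machine epsilon (the 2.2204e-16 value quoted above).
eps = 2.220446049250313e-16

# Dimension at which the peak density (2*pi)**(-d/2) of a standard
# multivariate Gaussian falls below eps, per the comment's formula.
threshold = -2 * math.log(eps) / math.log(2 * math.pi)
print(threshold)  # ~39.2, so 40 features already push the peak below eps

# Peak density at d = 40 features: already smaller than eps.
peak_40 = (2 * math.pi) ** (-40 / 2)
print(peak_40 < eps)  # True

# And at d = 131 features the peak is below 1e-52, matching the comment.
peak_131 = (2 * math.pi) ** (-131 / 2)
print(peak_131 < 1e-52)  # True
```

Once every cluster's pdf evaluates to (effectively) zero at every point, the E-step's responsibilities degenerate to uniform weights, which is the failure mode described above.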