[ https://issues.apache.org/jira/browse/SPARK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15696030#comment-15696030 ]
Sean Owen commented on SPARK-18581:
-----------------------------------

The PDF is largest at the mean, and it can be > 1 if the determinant of the covariance matrix is sufficiently small. This is like the univariate case where the variance is small: the distribution is very "peaked" at the mean and the PDF gets arbitrarily high.

See for example https://www.wolframalpha.com/input/?i=Plot+N(0,1e-10)

Can you compare this to results you might get with R or something similar to see if the numbers match? Numerical accuracy does become an issue as the matrix gets near-singular, but that's what this cutoff is supposed to help address. We might be able to rearrange some of the math for better accuracy too. But let's first verify there's an issue.

> MultivariateGaussian does not check if covariance matrix is invertible
> ----------------------------------------------------------------------
>
>                 Key: SPARK-18581
>                 URL: https://issues.apache.org/jira/browse/SPARK-18581
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.2, 2.0.2
>            Reporter: Hao Ren
>
> When training GaussianMixtureModel, I found some probabilities much larger than
> 1. That leads me to the fact that the value returned by
> MultivariateGaussian.pdf can be 10^5, etc.
> After reviewing the code, I found that the problem lies in the computation of
> the determinant of the covariance matrix.
> The computation is simplified by using the pseudo-determinant of a positive
> definite matrix.
> In my case, I have a feature = 0 for all data points.
> As a result, the covariance matrix is not invertible <=> det(covariance matrix) =
> 0 => the pseudo-determinant will be very close to zero.
> Thus, log(pseudo-determinant) will be a large negative number, which finally
> makes logpdf very big; pdf will be even bigger, > 1.
> As said in the comments of MultivariateGaussian.scala,
> """
> Singular values are considered to be non-zero only if they exceed a tolerance
> based on machine precision.
> """
> But if a singular value is considered to be zero, it means the covariance matrix
> is non-invertible, which contradicts the assumption that it should
> be invertible.
> So we should check whether any singular value is smaller than the tolerance
> before computing the pseudo-determinant.
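
For anyone wanting to sanity-check the point in the reply above without Spark, here is a minimal Python sketch. It is not the MultivariateGaussian.scala code. The function names are made up for illustration, the covariance is assumed to be diagonal (so its eigenvalues are just the per-feature variances), and the tolerance formula (machine epsilon * largest eigenvalue * dimension) and the use of the rank in the normalising constant are assumptions, not something confirmed from the Spark source.

```python
import math
import sys


def univariate_pdf_at_mean(variance):
    """Density of N(mu, variance) evaluated at x = mu."""
    return 1.0 / math.sqrt(2.0 * math.pi * variance)


def log_pseudo_det(eigenvalues, tol):
    """Log pseudo-determinant: sum of log(v) over eigenvalues v above tol."""
    return sum(math.log(v) for v in eigenvalues if v > tol)


def diag_gaussian_logpdf_at_mean(eigenvalues):
    """Log-density at the mean for a Gaussian with diagonal covariance.

    Assumed (hypothetical) tolerance: eps * max eigenvalue * dimension.
    Eigenvalues at or below the tolerance are treated as zero and dropped,
    and the normalising constant uses the number of retained eigenvalues.
    """
    tol = sys.float_info.epsilon * max(eigenvalues) * len(eigenvalues)
    rank = sum(1 for v in eigenvalues if v > tol)
    return -0.5 * (rank * math.log(2.0 * math.pi)
                   + log_pseudo_det(eigenvalues, tol))


if __name__ == "__main__":
    # Univariate case from the reply: variance 1e-10 gives a density far above 1.
    print(univariate_pdf_at_mean(1.0))
    print(univariate_pdf_at_mean(1e-10))

    # Tiny but non-zero variance in one feature: retained, so the pdf is huge.
    print(math.exp(diag_gaussian_logpdf_at_mean([1.0, 1e-10])))

    # Exactly zero variance in one feature: dropped by the tolerance cutoff.
    print(math.exp(diag_gaussian_logpdf_at_mean([1.0, 0.0])))
```

Under these assumptions, a density above 1 at the mean is expected whenever a retained variance is small enough, and it is not by itself evidence of a bug. A variance of exactly zero is removed by the cutoff rather than blowing up the result. Whether Spark's numbers match a reference implementation such as R is a separate check.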