It's a good question, and I had a similar requirement in my work. I'm currently copying the implementation from mllib to ml and exposing the maximum log likelihood as part of that. I will send the PR soon.
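For reference, the log likelihood of a Gaussian mixture is the sum, over all data points, of the log of the weighted sum of component densities. Below is a minimal Spark-free sketch in Scala, assuming 1-D data. The names `MixtureLogLikelihood`, `normalPdf` and `logLikelihood` are hypothetical helpers for illustration, not Spark API, and this is not the code from the planned PR.

```scala
// Hypothetical helper, not part of Spark: log likelihood of 1-D data
// under a Gaussian mixture with the given weights, means and variances.
object MixtureLogLikelihood {

  // Density of a univariate normal distribution at x.
  def normalPdf(x: Double, mean: Double, variance: Double): Double = {
    val diff = x - mean
    math.exp(-diff * diff / (2.0 * variance)) / math.sqrt(2.0 * math.Pi * variance)
  }

  // Sum over points of log(sum_i weights(i) * pdf_i(x)).
  def logLikelihood(
      data: Seq[Double],
      weights: Seq[Double],
      means: Seq[Double],
      variances: Seq[Double]): Double = {
    require(weights.length == means.length && means.length == variances.length,
      "weights, means and variances must have the same length")
    data.map { x =>
      val p = weights.indices.map { i =>
        weights(i) * normalPdf(x, means(i), variances(i))
      }.sum
      math.log(p)
    }.sum
  }
}
```

To compare different numbers of clusters, one would fit a model for each candidate k, evaluate this sum with each model's parameters, and compare the resulting values.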

Thanks.
Yanbo

On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) <wangleikidd...@didichuxing.com> wrote:
>
> Hi,
>
> Do you guys sometimes need to get the log likelihood of EM algorithm in
> MLLIB?
>
> I mean the value in this line
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228
>
> Now copying the code here:
>
> val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
>
> // Create new distributions based on the partial assignments
> // (often referred to as the "M" step in literature)
> val sumWeights = sums.weights.sum
>
> if (shouldDistributeGaussians) {
>   val numPartitions = math.min(k, 1024)
>   val tuples =
>     Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
>   val (ws, gs) = sc.parallelize(tuples, numPartitions).map { case (mean, sigma, weight) =>
>     updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
>   }.collect().unzip
>   Array.copy(ws.toArray, 0, weights, 0, ws.length)
>   Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
> } else {
>   var i = 0
>   while (i < k) {
>     val (weight, gaussian) =
>       updateWeightsAndGaussians(sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
>     weights(i) = weight
>     gaussians(i) = gaussian
>     i = i + 1
>   }
> }
> llhp = llh // current becomes previous
> llh = sums.logLikelihood // this is the freshly computed log-likelihood
> iter += 1
> compute.destroy(blocking = false)
>
> In my application, I need to know log likelihood to compare effect for
> different number of clusters.
> And then I use the cluster number with the maximum log likelihood.
>
> Is it a good idea to expose this value?