Let's move the discussion to JIRA. Thanks!

On Fri, Oct 7, 2016 at 8:43 PM, 王磊(安全部) <wangleikidd...@didichuxing.com>
wrote:

> https://issues.apache.org/jira/browse/SPARK-17825
>
> Actually, I had already created a JIRA. Could you let me know your
> progress so we can avoid duplicated work?
>
> Thanks!
>
> From: didi <wangleikidd...@didichuxing.com>
> Date: Saturday, October 8, 2016, 12:21 AM
> To: Yanbo Liang <yblia...@gmail.com>
>
> Cc: "dev@spark.apache.org" <dev@spark.apache.org>, "u...@spark.apache.org"
> <u...@spark.apache.org>
> Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?
>
> Thanks for replying.
> When could you send out the PR?
>
> From: Yanbo Liang <yblia...@gmail.com>
> Date: Friday, October 7, 2016, 11:35 PM
> To: didi <wangleikidd...@didichuxing.com>
> Cc: "dev@spark.apache.org" <dev@spark.apache.org>, "u...@spark.apache.org"
> <u...@spark.apache.org>
> Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB?
>
> It's a good question, and I had a similar requirement in my work. I'm
> currently porting the implementation from mllib to ml, and will then
> expose the maximum log likelihood. I will send the PR soon.
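>
> To make this concrete, a rough sketch of how the exposed value might look
> on the ml side (the summary and field names here are my assumption, not
> the final API):
>
> import org.apache.spark.ml.clustering.GaussianMixture
>
> // dataset: a DataFrame with a "features" vector column.
> // "summary.logLikelihood" is a hypothetical name for the value the PR
> // would expose, not a confirmed API.
> val model = new GaussianMixture().setK(3).fit(dataset)
> val llh = model.summary.logLikelihood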
>
> Thanks.
> Yanbo
>
> On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) <wangleikidd...@didichuxing.com>
> wrote:
>
>>
>> Hi,
>>
>> Do you guys sometimes need to get the log likelihood of the EM algorithm
>> in MLLIB?
>>
>> I mean the value in this line:
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala#L228
>>
>> Now copying the code here:
>>
>>
>> val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(
>>   compute.value, _ += _)
>>
>> // Create new distributions based on the partial assignments
>> // (often referred to as the "M" step in literature)
>> val sumWeights = sums.weights.sum
>>
>> if (shouldDistributeGaussians) {
>>   val numPartitions = math.min(k, 1024)
>>   val tuples =
>>     Seq.tabulate(k)(i => (sums.means(i), sums.sigmas(i), sums.weights(i)))
>>   val (ws, gs) = sc.parallelize(tuples, numPartitions).map {
>>     case (mean, sigma, weight) =>
>>       updateWeightsAndGaussians(mean, sigma, weight, sumWeights)
>>   }.collect().unzip
>>   Array.copy(ws.toArray, 0, weights, 0, ws.length)
>>   Array.copy(gs.toArray, 0, gaussians, 0, gs.length)
>> } else {
>>   var i = 0
>>   while (i < k) {
>>     val (weight, gaussian) = updateWeightsAndGaussians(
>>       sums.means(i), sums.sigmas(i), sums.weights(i), sumWeights)
>>     weights(i) = weight
>>     gaussians(i) = gaussian
>>     i = i + 1
>>   }
>> }
>> llhp = llh // current becomes previous
>> llh = sums.logLikelihood // this is the freshly computed log-likelihood
>> iter += 1
>> compute.destroy(blocking = false)
>>
>> In my application, I need the log likelihood to compare results for
>> different numbers of clusters, and then I pick the cluster count with the
>> maximum log likelihood.
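>>
>> For example, a minimal sketch of that selection loop, assuming a
>> hypothetical model.logLikelihood field were exposed on the mllib model:
>>
>> import org.apache.spark.mllib.clustering.GaussianMixture
>> import org.apache.spark.mllib.linalg.Vector
>> import org.apache.spark.rdd.RDD
>>
>> // Fit one model per candidate k and keep the k with the highest
>> // log likelihood ("model.logLikelihood" is the proposed, not yet
>> // existing, accessor).
>> def selectK(data: RDD[Vector], ks: Seq[Int]): Int =
>>   ks.maxBy { k =>
>>     val model = new GaussianMixture().setK(k).run(data)
>>     model.logLikelihood
>>   }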
>>
>> Is it a good idea to expose this value?
>>
>
