Re: [Scikit-learn-general] Ordering of probabilities returned by DPGMM.predict_proba and .eval

Kasper Thofte Fri, 27 Jul 2012 07:10:16 -0700

No, I still can't figure it out =)

What I would like to do after fitting is simply to run through the
components and, for each component, find the vector "closest" to the
mean. This can be done either by some metric (e.g. Euclidean) or by
selecting the datum with the highest probability or responsibility for that
component.


How might this be done in scikit-learn?
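
For the metric-based option, a minimal sketch with plain NumPy follows. The function name nearest_to_means and the toy arrays are hypothetical; with a fitted model the component means would be read from the estimator, and the attribute that holds them depends on the scikit-learn version.

```python
import numpy as np

def nearest_to_means(data, means):
    """Return, for each row of `means`, the index of the closest row of `data`."""
    # Pairwise squared Euclidean distances, shape (n_samples, n_components).
    diffs = data[:, np.newaxis, :] - means[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # argmin down each column gives one sample index per component.
    return sq_dists.argmin(axis=0)

# Toy stand-ins (assumptions, not values from the thread).
data = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
means = np.array([[0.9, 0.8], [5.6, 5.1]])

idx = nearest_to_means(data, means)  # sample index per component
closest = data[idx]                  # the corresponding data rows
```

The broadcasted difference builds an (n_samples, n_components, n_features) array, which is fine for small data but may need chunking for large inputs.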

My own code is here:

dpgmm = mixture.DPGMM(n_components=NCOMP, alpha=4.)
dpgmm.fit(data)

probs = dpgmm.predict_proba(data)

local_best = []
for i in range(NCOMP):
    local_best.append(data[list(probs_i).index(max(probs[:][i]))])

where local_best[i] should then be the datum lying closest to the
component with index i. But the entries are all the same.

What am I doing wrong?

Kasper
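
One possible reading of the loop above, offered as a sketch and not a diagnosis, since probs_i is not defined in the snippet: probs[:] is the whole array, so probs[:][i] selects row i (one sample) and not column i (one component). Assuming predict_proba returns an array of shape (n_samples, n_components), the per-component selection would be a column-wise argmax. The array values below are made up for illustration.

```python
import numpy as np

# Hypothetical responsibilities: 4 samples (rows) x 3 components (columns).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.1, 0.4],
    [0.8, 0.1, 0.1],
])
data = np.arange(8).reshape(4, 2)  # toy data, one row per sample

# probs[:][i] is the same as probs[i]: a row, i.e. one sample.
assert np.array_equal(probs[:][1], probs[1])

# Position of the largest entry within each row: a component index per sample.
# When one component dominates, this is the same value for every row.
per_sample = probs.argmax(axis=1)

# Column-wise argmax: for each component, the sample with the highest value.
best_rows = probs.argmax(axis=0)
local_best = data[best_rows]
```

Here per_sample is constant because the first component dominates every row, which would match the "all the same" symptom, while best_rows picks a different sample for each component.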

On Wed, Jul 25, 2012 at 4:58 PM, Kasper Thofte <[email protected]> wrote:

> Hey
>
> Thanks Alex and Olivier, I figured that might be the "issue".
>
>
> On Wed, Jul 25, 2012 at 4:46 PM, Alexandre Passos 
> <[email protected]> wrote:
>
>> On Wed, Jul 25, 2012 at 7:43 AM, Olivier Grisel
>> <[email protected]> wrote:
>> > 2012/7/25 Alexandre Passos <[email protected]>:
>> >> On Wed, Jul 25, 2012 at 2:53 AM, Olivier Grisel
>> >> <[email protected]> wrote:
>> >>> Hi Alex,
>> >>>
>> >>> I am forwarding you this question as I am not sure you are following
>> >>> the mailing list.
>> >>
>> >> You're right, thanks.
>> >>
>> >>>
>> >>> 2012/7/25 Kasper Thofte <[email protected]>:
>> >>>> Hi
>> >>>>
>> >>>> I am using the DPGMM for clustering short sequences of integers.
>> >>>>
>> >>>> In my application, I need the datapoint that is in some sense
>> >>>> closest to the cluster mean, for each cluster.
>> >>>>
>> >>>> Conforming to the interface of scikit-learn, I opted to use the
>> >>>> predict_proba(X), where X is the data, then selecting for each
>> >>>> component, the datum with highest probability.
>> >>>>
>> >>>> However, it seems that predict_proba (and apparently also eval(X))
>> >>>> returns the arrays of probabilities in decreasing order instead of
>> >>>> corresponding to the order of the components? Is this really the
>> >>>> order of the components?
>> >>>>
>> >>>> I am a little confused by this. Can someone clear this issue up?
>> >>
>> >> In the Dirichlet process prior there is this phenomenon called
>> >> "rich-get-richer", which means points tend to be assigned more often
>> >> to the "bigger" clusters, all else being equal. This is the thing that
>> >> makes it efficient to deal with an unbounded number of clusters: most
>> >> points will go to the bigger groups anyway, so it doesn't really
>> >> matter how many small ones are there.
>> >>
>> >> In the scikit implementation the bigger clusters are always in the
>> >> first positions of the array, but the array returned is really sorted
>> >> by cluster index and not by other things.
>> >>
>> >> If you feel like everything is falling into the big cluster try
>> >> changing the "alpha" concentration hyperparameter to a value where
>> >> things are more evenly spread out.
>> >
>> > Thanks Alex,
>> >
>> > It would be great to add those considerations to the "Practical tips"
>> > section of the narrative documentation:
>> >
>> >
>> > http://scikit-learn.org/dev/modules/mixture.html#dpgmm-classifier-infinite-gaussian-mixtures
>> >
>> > It's already partially implied but could be made more explicit.
>>
>> Ok. I'll leave this message starred and get to it when I have more time.
>>
>> --
>>  - Alexandre
>>
>>
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
