Hi list,

Many clustering algorithms can be initialized randomly, and thus give
different results on the same data because of the non-convexity of the
criterion being optimized.

One trivial source of non-reproducibility is the fact that labels can be
permuted: even if the algorithm finds the same clusters, it may assign
them different labels. This makes testing and exploration harder, but
it's easy to fix.

Indeed, if we adopt the convention that, considering training samples in
the order in which they are given, cluster labels are assigned in order
of first appearance, all we need to do is add the following line at the
end of fit:

    labels = np.unique(labels, return_index=True)[1].argsort().argsort()[labels]

provided that the label ids are not used elsewhere, of course.
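For concreteness, here is a small self-contained sketch of such a
canonical relabeling (the function name and toy label arrays are made up
for illustration): clusters are ranked by the position of their first
occurrence, so two runs that found the same partition under permuted
label ids map to the same labeling, and the result stays in
range(n_clusters):

```python
import numpy as np

# Hypothetical example: two runs of a clustering algorithm that found
# the same partition of the data, but with permuted label ids.
run_a = np.array([1, 1, 0, 2, 0, 2])
run_b = np.array([0, 0, 2, 1, 2, 1])

def canonicalize(labels):
    """Relabel clusters in order of first appearance in the sample ordering.

    Assumes ``labels`` contains integers in range(n_clusters).
    """
    # Index of the first sample carrying each (sorted) unique label.
    first_occurrence = np.unique(labels, return_index=True)[1]
    # Rank each cluster by the position of its first occurrence, then
    # remap every sample's label to that rank.
    rank = first_occurrence.argsort().argsort()
    return rank[labels]

print(canonicalize(run_a))  # [0 0 1 2 1 2]
print(canonicalize(run_b))  # [0 0 1 2 1 2]
```

After this remapping, both runs agree, which is what makes tests on the
labels deterministic.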

I'd like to do this for kmeans, and maybe a few other algorithms where it
is really easy to do. What do people think?

Gael

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
