I agree with Fabian that this might be a bug, as these models are still relatively untested, and I'd love to see a testing script showing the errors.
On Sat, Oct 8, 2011 at 09:21, Martin Fergie <[email protected]> wrote:
> Hi,
>
> I've been experimenting with the variational clustering methods introduced
> in the latest version of scikits-learn. I'm having trouble getting these
> models to fit properly. I've been experimenting with two small data sets:
> one is the 'old faithful' data set [1], and the other is a 4-component
> data set from [2]. I'm using the script given on the scikits website [3],
> but have replaced the example data with the data sets above.
>
> Clustering using EM (with mixture.GMM) seems to give reasonably reliable
> results on both data sets. However, when I use DPGMM and VBGMM the
> clusters are heavily biased towards 0 and often over-generalise. What is
> more concerning, the component weights don't appear to change during
> training. For example, a 2-component DPGMM/VBGMM will have
> weights = [0.5, 0.5], whereas the GMM will have weights = [0.64, 0.36].
> Both models behave like this with default initialisation parameters, and
> I have tried a range of alphas.
>
> I have a Matlab implementation of variational Bayes EM (non-Dirichlet
> process) which is able to cluster this data effectively.
>
> Does anyone have any experience with these models and may be able to shed
> some light on the problems I am having? I can send a tar of the code/data
> I'm using to anyone who is interested.
>
> Thanks for such a useful toolkit!
>
> Martin
>
> [1] Old faithful dataset:

Glancing at these data, it seems that the variables live on completely different orders of magnitude. Have you tried scaling/centering the data? Because of the way the priors are used, the DPGMM and VBGMM models are unfortunately biased towards zero, as you noticed, and this might be part of the reason why bad things are happening. Also, have you tried using more components to see what happens?
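To make the scaling/centering suggestion concrete, here is a minimal numpy sketch. The data below are synthetic stand-ins with roughly Old-Faithful-like scales, not the actual data set, so treat it as an illustration of the preprocessing step only:

```python
import numpy as np

# Toy data whose two columns live on very different scales, loosely
# mimicking the Old Faithful eruption-duration / waiting-time variables.
rng = np.random.RandomState(0)
X = np.column_stack([rng.normal(3.5, 1.0, 200),     # ~minutes
                     rng.normal(70.0, 13.0, 200)])  # ~minutes, much larger spread

# Center each variable and scale it to unit variance before fitting, so a
# zero-centered prior on the means is a less drastic mismatch for the data.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # close to [0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```

After fitting on `X_scaled` you can always map the estimated component means back to the original units by undoing the transform.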
> http://research.microsoft.com/en-us/um/people/cmbishop/prml/webdatasets/datasets.htm
> [2] Figueiredo and Jain, Unsupervised Learning of Finite Mixture Models,
> PAMI 2002
> [3] http://scikit-learn.sourceforge.net/stable/auto_examples/mixture/plot_gmm.html#example-mixture-plot-gmm-py
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

--
- Alexandre
