Nothing works better than asking for help and then finding the answer by myself...

Page 47 of the technical report (tr504.pdf) deals exactly with the problem of big datasets.
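If I understand the report correctly, the idea is to run the expensive hierarchical initialization only on a random subset and then let EM use all the data. A minimal, untested sketch of what I plan to try, assuming the initialization = list(subset = ...) argument of Mclust and a vector x holding my values:

  library(mclust)

  set.seed(1)
  sub <- sample(seq_along(x), 10000)                  # indices of a random subset for initialization
  fit <- Mclust(x, initialization = list(subset = sub))
  summary(fit, parameters = TRUE)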

Also, I found that mclust is overkill for my problem: the optimal number of Gaussians it suggests is way too high. For example, for one dataset (downsampled to 1/10) it suggests 9 Gaussians, but the central 7 sum, to a good approximation, to a single Gaussian, so the dataset is better decomposed into only 3 Gaussians.
I admit I'm not rigorous at all...
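To be a bit more concrete, this is roughly what I mean by forcing fewer components (a rough sketch; x stands for the downsampled data):

  library(mclust)

  fit <- Mclust(x, G = 1:5)         # consider only 1 to 5 components instead of the default up to 9
  plot(fit, what = "BIC")           # BIC curve over the candidate numbers of components
  summary(fit, parameters = TRUE)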

Bye!
                  mario

Mario Valle wrote:
Hi all!

I have an ordered vector of values. The distribution of these values can be modeled as a mixture of Gaussians, so I'm using the package 'mclust' to get the Gaussians' parameters for this 1D distribution. It works very well, but for input sizes above 100,000 values it starts taking practically forever. Unfortunately my dataset has around 4.6M values...
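In case it matters, this is essentially all I'm doing (a simplified sketch; x is my vector of values, about 4.6M entries in the real case):

  library(mclust)

  fit <- Mclust(x)                  # 1D Gaussian mixture, number of components chosen by BIC
  summary(fit, parameters = TRUE)   # means, variances and mixing proportions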

My question: is it correct to subsample my dataset, taking every Nth value, to make mclust happy? Or do I have no alternative but to use the complete dataset?
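By "every Nth value" I mean something like the following (just to be explicit; N is a hypothetical subsampling step):

  N <- 10
  sub_regular <- x[seq(1, length(x), by = N)]       # every Nth value of the ordered vector
  sub_random  <- sort(sample(x, length(x) %/% N))   # or a random sample of the same size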

Excuse my profound ignorance and thanks for your help!
mario



--
Ing. Mario Valle
Data Analysis and Visualization Group            | http://www.cscs.ch/~mvalle
Swiss National Supercomputing Centre (CSCS)      | Tel:  +41 (91) 610.82.60
v. Cantonale Galleria 2, 6928 Manno, Switzerland | Fax:  +41 (91) 610.82.82

