Hi all,

On Feb 7, 2012, at 8:47 AM, Olivier Grisel wrote:
> 2012/2/6 Shishir Pandey <shishir...@gmail.com>:
>>
>> I am working with a dataset which is too big to fit in memory. Is there
>> a way in scikits-learn to sub-sample the existing dataset while
>> maintaining its properties, so that I can load it into my RAM?
>
> We don't have any "smart" subsampler in scikit-learn (like a GMM core
> set extractor, for instance). Do you have any specific algorithm in
> mind?

I was thinking it would be a good idea to include such a mechanism in
gmm.py. One solution would be to load the files one at a time (with the
features stored in npz files, for instance) and "accumulate" the
sufficient statistics across them. As a matter of fact, hmm.py already
contains code that would make this very easy to implement: instead of
looping over the sequences in obs, one could loop over the files in a
directory (see the sketch below).

A further improvement would be to add some supervision and train
specific components by loading only the data with the matching label
(in an HTK fashion).

Not sure when I can find time to do anything like that, though... It
also means quite a bit of refactoring in gmm.py, but I think that's
worth it!

Best regards,

Jean-Louis
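P.S. To make the idea a bit more concrete, below is a rough, untested
sketch of the kind of accumulation loop I have in mind, for a
diagonal-covariance GMM trained with plain EM. The file layout (one npz
file per chunk, features stored under the key "X") and all the names
are just assumptions; it does not reuse anything from the actual
gmm.py/hmm.py code.

import glob
import numpy as np

def log_gaussian_diag(X, means, covars):
    # Log-density of each row of X under each diagonal-covariance Gaussian.
    n_components = means.shape[0]
    lpr = np.empty((X.shape[0], n_components))
    for k in range(n_components):
        diff = X - means[k]
        lpr[:, k] = -0.5 * (np.sum(np.log(2 * np.pi * covars[k]))
                            + np.sum(diff ** 2 / covars[k], axis=1))
    return lpr

def em_from_files(filenames, means, covars, weights, n_iter=10):
    # One full EM iteration per pass over the files; only the sufficient
    # statistics (never the whole dataset) are kept in memory.
    for _ in range(n_iter):
        norm = np.zeros(len(weights))     # sum_n p(k|x_n)
        sum_x = np.zeros_like(means)      # sum_n p(k|x_n) * x_n
        sum_xx = np.zeros_like(covars)    # sum_n p(k|x_n) * x_n**2
        for fname in filenames:
            X = np.load(fname)["X"]       # assumes features stored under "X"
            # E-step on this chunk: responsibilities p(k|x_n).
            lpr = log_gaussian_diag(X, means, covars) + np.log(weights)
            lpr -= lpr.max(axis=1)[:, np.newaxis]   # log-sum-exp stability
            resp = np.exp(lpr)
            resp /= resp.sum(axis=1)[:, np.newaxis]
            # Accumulate the sufficient statistics.
            norm += resp.sum(axis=0)
            sum_x += np.dot(resp.T, X)
            sum_xx += np.dot(resp.T, X ** 2)
        # M-step from the accumulated statistics.
        weights = norm / norm.sum()
        means = sum_x / norm[:, np.newaxis]
        covars = sum_xx / norm[:, np.newaxis] - means ** 2 + 1e-6  # floor
    return means, covars, weights

if __name__ == "__main__":
    files = sorted(glob.glob("features/*.npz"))
    # Crude initialization from the first file only.
    X0 = np.load(files[0])["X"]
    n_components = 8
    rng = np.random.RandomState(0)
    means = X0[rng.choice(X0.shape[0], n_components, replace=False)]
    covars = np.tile(X0.var(axis=0) + 1e-6, (n_components, 1))
    weights = np.ones(n_components) / n_components
    means, covars, weights = em_from_files(files, means, covars, weights)

The supervised variant would be the same loop, just restricted, for
each component (or sub-mixture), to the files carrying the matching
label.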