Hi all,

On Feb 7, 2012, at 8:47 AM, Olivier Grisel wrote:

> 2012/2/6 Shishir Pandey <shishir...@gmail.com>:
>> 
>> I am working with a dataset which is too big to fit in memory. Is there a
>> way in scikits-learn to subsample the existing dataset while maintaining its
>> properties, so that I can load it into RAM?
> 
> We don't have any "smart" subsampler in scikit-learn (like a GMM core
> set extractor for instance). Do you have any specific algorithm in
> mind?

I was thinking it would be a good idea to include such a mechanism in gmm.py. 
One solution would be to load the data file by file (with features stored in 
npz files, for instance) and "accumulate" the sufficient statistics across 
files. As a matter of fact, hmm.py already includes code that would make this 
easy to implement: instead of looping over the sequences in obs, one could 
loop over the files in a directory. A minimal sketch of the idea follows. 
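
To make the idea concrete, here is a minimal sketch in plain NumPy (not the 
actual gmm.py code; the npz key "X" and the diagonal-covariance layout are my 
assumptions). Each pass over the files accumulates the three sufficient 
statistics of a diagonal-covariance GMM, and the M-step then works on those 
accumulators alone:

    import numpy as np

    def log_gauss_diag(X, means, covars):
        """Per-component log density of X under diagonal Gaussians.

        X: (n_samples, n_features); means, covars: (n_components, n_features).
        Returns an array of shape (n_samples, n_components).
        """
        n_features = X.shape[1]
        return -0.5 * (n_features * np.log(2 * np.pi)
                       + np.sum(np.log(covars), axis=1)
                       + np.sum((X[:, None, :] - means) ** 2 / covars, axis=2))

    def accumulate_stats(filenames, weights, means, covars):
        """One out-of-core E-step: accumulate sufficient statistics file by file."""
        n_components, n_features = means.shape
        Nk = np.zeros(n_components)                # sum_n r_nk
        Sx = np.zeros((n_components, n_features))  # sum_n r_nk * x_n
        Sxx = np.zeros((n_components, n_features)) # sum_n r_nk * x_n**2
        for fname in filenames:
            X = np.load(fname)["X"]  # features under key "X" (my assumption)
            log_r = np.log(weights) + log_gauss_diag(X, means, covars)
            log_r -= log_r.max(axis=1)[:, None]    # stabilize before exponentiating
            r = np.exp(log_r)
            r /= r.sum(axis=1)[:, None]            # responsibilities, (n_samples, n_components)
            Nk += r.sum(axis=0)
            Sx += np.dot(r.T, X)
            Sxx += np.dot(r.T, X ** 2)
        return Nk, Sx, Sxx

    def m_step(Nk, Sx, Sxx, min_covar=1e-7):
        """Closed-form M-step from the accumulated statistics."""
        weights = Nk / Nk.sum()
        means = Sx / Nk[:, None]
        covars = Sxx / Nk[:, None] - means ** 2 + min_covar
        return weights, means, covars

Alternating accumulate_stats and m_step then gives a standard EM loop, with 
the memory footprint bounded by the largest single file rather than by the 
whole dataset. 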

A further improvement would be to include some supervision, and train specific 
components by loading only the data carrying the correct label (HTK-style); 
see the small filter sketched below. 
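
For that supervised variant, only the loading step changes. A hedged sketch, 
assuming each npz file also stores a per-frame label array under a key "y" 
(a hypothetical layout, aligned row-for-row with the features in "X"):

    def load_class_frames(fname, target_label):
        """Load one npz file, keeping only the frames for one class."""
        data = np.load(fname)
        X, y = data["X"], data["y"]  # "y": per-frame labels (my assumption)
        return X[y == target_label]

accumulate_stats above would then call load_class_frames(fname, target_label) 
instead of loading the whole file, so each class-specific model only ever sees 
its own frames. 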

Not sure when I can find time to do anything like that, though... It would 
also mean quite a bit of refactoring in gmm.py, but I think it would be worth it! 

Best regards,
Jean-Louis