Re: [Scikit-learn-general] Machine Learning on Large Data Sets

Olivier Grisel Sat, 17 Nov 2012 08:39:17 -0800

No:

- either you use models that can stream over the data without loading
everything in memory at once, by picking models that support the
`partial_fit` API as explained above (which is not the case for
tree-based models but would work for Perceptron, SGDClassifier or
PassiveAggressiveClassifier)


or alternatively:

- you learn several random forest models on subsets of your data that
fit in memory and then assemble them into one.

We don't have any example of the latter in scikit-learn, so you would
have to read the source code of the RandomForestClassifier or
ExtraTreesClassifier to understand how they assemble the
sub-estimators to make this strategy work.
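
As a rough illustration of the second option, here is a minimal sketch that
trains one forest per in-memory subset and combines them by averaging their
predicted class probabilities. This is not scikit-learn's own assembling
code. The synthetic data, the three-way split and the `predict_combined`
helper are assumptions made for the example, and the sketch assumes that
every subset contains every class so that the probability columns line up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a data set that would not fit in memory at once.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train = X[:450], y[:450]
X_test, y_test = X[450:], y[450:]

# Fit one forest per subset; each subset is assumed to fit in memory.
forests = []
for X_part, y_part in zip(np.array_split(X_train, 3),
                          np.array_split(y_train, 3)):
    forest = RandomForestClassifier(n_estimators=20, random_state=0)
    forests.append(forest.fit(X_part, y_part))

# Averaging probabilities is only meaningful if all forests saw the
# same classes in the same order.
classes = forests[0].classes_
assert all(np.array_equal(f.classes_, classes) for f in forests)


def predict_combined(forests, X):
    """Hypothetical helper: average predict_proba over all forests."""
    proba = np.mean([f.predict_proba(X) for f in forests], axis=0)
    return classes[np.argmax(proba, axis=1)]


accuracy = np.mean(predict_combined(forests, X_test) == y_test)
print("combined accuracy: %.3f" % accuracy)
```

Since each forest here has the same number of trees, the average over
forests weights every tree equally.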

For the `partial_fit` strategy to work you would also need a feature
extractor that can work in a streaming manner. If your input data is
text, be aware that this is not the case for the CountVectorizer available
in scikit-learn, as I explained in the previously linked Stack Overflow
answer.
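
A minimal sketch of the `partial_fit` strategy on text follows. It swaps in
HashingVectorizer, a stateless extractor found in later scikit-learn
releases, as one possible streaming-friendly alternative; it is not what the
linked answer describes. The `iter_text_batches` generator and its toy
documents are hypothetical stand-ins for reading batches from disk.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron


def iter_text_batches(n_batches=10):
    """Hypothetical stand-in for reading labelled text batches from disk."""
    spam = ["cheap pills buy now", "win money now", "buy cheap watches"]
    ham = ["meeting at noon tomorrow", "see you at lunch",
           "project meeting notes"]
    for _ in range(n_batches):
        yield spam + ham, [1] * len(spam) + [0] * len(ham)


# HashingVectorizer keeps no vocabulary, so it needs no fit pass over
# the whole corpus: each batch can be transformed independently.
vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = Perceptron(random_state=0)

# The full set of classes must be given up front, because a single
# batch is not guaranteed to contain all of them.
all_classes = np.array([0, 1])

for texts, labels in iter_text_batches():
    X_batch = vectorizer.transform(texts)  # sparse matrix, one row per text
    clf.partial_fit(X_batch, labels, classes=all_classes)

new_docs = ["buy cheap pills", "lunch meeting tomorrow"]
predictions = clf.predict(vectorizer.transform(new_docs))
print(predictions)
```

Only one batch of raw text and its sparse feature matrix are held in memory
at any time; the model state is the single weight vector of the classifier.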

