The problem is likely the `vocabulary_` python dict of the
CountVectorizer. It is serialized with the default python pickler,
which is probably not very efficient for a large dict.

Anyway, for large text data a hashing vectorizer would be a much
better solution: it is stateless, so there is no vocabulary to pickle
at all.
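For reference, the hashing approach eventually landed in scikit-learn as `HashingVectorizer`; a minimal sketch of how it avoids the vocabulary problem (the `n_features` and `alternate_sign` values here are just illustrative defaults):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the cat sat on the mat", "the dog barked"]

# Stateless: no fit step builds a vocabulary_ dict, so there is
# nothing large to pickle when persisting the vectorizer.
vec = HashingVectorizer(n_features=2 ** 10, alternate_sign=False)
X = vec.transform(docs)

print(X.shape)  # one row per document, n_features columns
```

The trade-off is that the mapping is one-way: you cannot recover feature names from column indices.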

You can follow progress on this branch, which should soon be merged
into master: https://github.com/scikit-learn/scikit-learn/pull/909

Later we might also add a HashingTextVectorizer that takes text data
directly as input and applies tokenization + token / char n-gram
vectorization, using a hash function instead of a python dict to
handle the feature name to feature index mapping.
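The core idea can be sketched in a few lines of pure python; the function name and parameters below are hypothetical, just to show how a hash replaces the dict lookup:

```python
import hashlib

def hashed_ngram_counts(text, n=3, n_features=2 ** 8):
    """Hypothetical sketch of the hashing trick on char n-grams.

    Instead of looking each n-gram up in a vocabulary dict, hash it
    and take the result modulo n_features to get a column index.
    Collisions are possible but rare enough for large n_features.
    """
    counts = [0] * n_features
    for i in range(len(text) - n + 1):
        gram = text[i:i + n].encode("utf-8")
        idx = int(hashlib.md5(gram).hexdigest(), 16) % n_features
        counts[idx] += 1
    return counts
```

No state is built up while vectorizing, so memory use is bounded by `n_features` rather than by the corpus vocabulary size.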

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general