The problem is likely the `vocabulary_` Python dict of the CountVectorizer. It is pickled using the default Python pickler, which is probably not very efficient.
In any case, for large text data a hashing vectorizer would be a much better solution. You can follow progress on this branch, which should soon be merged into master: https://github.com/scikit-learn/scikit-learn/pull/909

Later there may also be a HashingTextVectorizer that takes text data directly as input and applies tokenization plus token / char n-gram vectorization, using a hash function instead of a Python dict to handle the feature name to feature index mapping.

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
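For intuition, the hashing trick can be sketched in a few lines of plain Python. This is only a simplified illustration of the idea, not the scikit-learn implementation from that branch: the function name `hashing_vectorize` and the choice of `zlib.crc32` as the hash are mine.

```python
import zlib


def hashing_vectorize(tokens, n_features=2 ** 20):
    """Map a token sequence to sparse feature counts without a vocabulary dict.

    Each token is hashed to a fixed-size index space, so there is no
    token -> index mapping to build, store, or pickle.
    """
    counts = {}
    for tok in tokens:
        # A stable hash modulo the table size gives the feature index.
        # Distinct tokens may collide, which is the usual trade-off of
        # the hashing trick; a larger n_features makes collisions rarer.
        idx = zlib.crc32(tok.encode("utf-8")) % n_features
        counts[idx] = counts.get(idx, 0) + 1
    return counts


vec = hashing_vectorize("the quick brown fox jumps over the lazy dog".split())
```

The model object then only needs to remember `n_features` and the hash function, so serialization is trivially small no matter how large the training corpus was.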
