> The problem is likely the `vocabulary_` python dict of the
> CountVectorizer. It's pickled using the default python pickler, which
> is probably not very efficient.
>
> Anyway, for large text data, using a hashing vectorizer would be a much
> better solution.
>
> You can follow progress on this branch that should soon be merged into
> master: https://github.com/scikit-learn/scikit-learn/pull/909
>
> And maybe later a HashingTextVectorizer that will directly take text
> data as input and apply tokenization + token / char n-gram
> vectorization using a hash function instead of a python dict to handle
> the feature name to feature index mapping.
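For illustration, here is a minimal sketch of the hashing trick the quoted message describes. The helper name, the whitespace tokenizer, and the n_features value are all hypothetical, not the actual API of the linked PR:

    import zlib
    from scipy.sparse import csr_matrix

    def hashing_vectorize(docs, n_features=2 ** 20):
        # A hash of the token replaces the vocabulary_[token] dict lookup,
        # so no feature-name-to-index mapping needs to be stored or pickled.
        rows, cols, data = [], [], []
        for i, doc in enumerate(docs):
            counts = {}
            for token in doc.lower().split():
                j = zlib.crc32(token.encode()) % n_features
                counts[j] = counts.get(j, 0) + 1
            for j, c in counts.items():
                rows.append(i)
                cols.append(j)
                data.append(c)
        return csr_matrix((data, (rows, cols)),
                          shape=(len(docs), n_features))

    X = hashing_vectorize(["the cat sat", "the dog sat"])

The trade-off is that the mapping is one-way: you can no longer recover feature names from column indices, and distinct tokens may collide into the same column.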
Sounds like an efficient way to vectorize the input. However, I hit the memory error when dumping the classifier object with compression on. (I already dump the vectorizer and target array with joblib.dump(compress=9), and that seems to go fine.)
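For context, the dump calls look roughly like this; the objects below are small stand-ins (in the real script they are the fitted vectorizer, the target array, and the fitted classifier, and only the last dump raises MemoryError):

    import numpy as np
    import joblib

    vectorizer = {'vocabulary_': {'cat': 0, 'dog': 1}}  # stand-in objects
    y = np.array([0, 1])
    clf = {'coef_': np.zeros((1, 2))}

    joblib.dump(vectorizer, 'vectorizer.joblib', compress=9)  # fine
    joblib.dump(y, 'target.joblib', compress=9)               # fine
    joblib.dump(clf, 'classifier.joblib', compress=9)         # MemoryError with the real clf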
