> 
> The problem is likely the `vocabulary_` python dict of the
> CountVectorizer. It's pickled using the default python pickler which
> is probably not very efficient.
> 
> Anyway for large text data, using a hashing vectorizer would be a much
> better solution.
> 
> You can follow progress on this branch that should be soon merged in
> master: https://github.com/scikit-learn/scikit-learn/pull/909
> 
> And maybe later a HashingTextVectorizer that will directly take text
> data as input and apply tokenization + token / char n-gram
> vectorization using a hash function instead of a python dict to handle
> the feature name to feature index mapping.
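
For reference, here is a minimal sketch of the hashing approach described above, using the `HashingVectorizer` that eventually shipped in `sklearn.feature_extraction.text` (the API may differ slightly from the branch linked in the quote): it is stateless, so there is no `vocabulary_` dict to pickle, and `transform` can be called without `fit`.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# hypothetical toy corpus for illustration
docs = ["the cat sat", "the dog barked"]

# fixed-size output space; no vocabulary_ dict is built or stored
vec = HashingVectorizer(n_features=2**18)

# stateless: no fit needed, tokens are hashed directly to column indices
X = vec.transform(docs)
print(X.shape)  # (2, 262144)
```

Because the vectorizer holds no fitted state, pickling it is essentially free, which sidesteps the slow default-pickler behavior on large `vocabulary_` dicts.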

Sounds like an efficient way to vectorize the input. However, I hit the memory
error when dumping the classifier object with compression enabled. (I already
dump the vectorizer and target array with joblib.dump(compress=9), and that
seems to go fine.)
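
For concreteness, the dump/load round-trip being described looks roughly like this (file names are hypothetical). If the MemoryError occurs during compression itself, a lower compress level, or no compression at all, may be worth trying, since higher levels need more working memory:

```python
import os
import tempfile

import joblib
import numpy as np

# hypothetical target array standing in for the real data
y = np.arange(1_000_000)

path = os.path.join(tempfile.mkdtemp(), "target.joblib")
joblib.dump(y, path, compress=9)  # compress=9: smallest file, most memory/CPU
y_back = joblib.load(path)
assert np.array_equal(y, y_back)
```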


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
