[scikit-learn] Memory efficient TfidfVectorizer

2020-01-27 Thread Peng Yu
Hi, To use TfidfVectorizer, the whole corpus must be loaded into memory. This can be a problem on machines without much memory. Is there a way to use only a small amount of memory by saving most intermediate results to disk? Thanks. -- Regards, Peng
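One possible way to reduce memory use, not something proposed in this thread, is to stream documents in batches through the stateless HashingVectorizer and apply TfidfTransformer to the stacked sparse counts. The sketch below assumes the documents can be read one at a time; `stream_documents` and `iter_batches` are hypothetical helpers, and the in-line sample strings stand in for text read from disk. The raw text never has to be held in memory all at once, but the sparse count matrix still does.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer


def stream_documents():
    # Hypothetical stand-in for reading one document at a time from disk,
    # e.g. one line per document from a large text file.
    yield "the cat sat on the mat"
    yield "the dog chased the cat"
    yield "dogs and cats are pets"
    yield "memory is limited on this machine"


def iter_batches(docs, batch_size):
    # Group an iterable of documents into lists of at most batch_size items.
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# HashingVectorizer keeps no vocabulary, so each batch can be transformed
# independently. norm=None and alternate_sign=False give raw term counts.
hasher = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)
count_blocks = [hasher.transform(batch)
                for batch in iter_batches(stream_documents(), batch_size=2)]

# Only the sparse counts are held in memory, not the raw text.
counts = sp.vstack(count_blocks).tocsr()

# Learn IDF weights from the counts and apply them.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)
```

The trade-off is that hashed features cannot be mapped back to the original tokens, and hash collisions are possible; a larger `n_features` makes collisions less likely.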

Re: [scikit-learn] What are the stopwords used by CountVectorizer?

2020-01-27 Thread Joel Nothman
See also https://www.aclweb.org/anthology/W18-2502/ for a critique of this and other stop word lists. ___ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn

Re: [scikit-learn] What are the stopwords used by CountVectorizer?

2020-01-27 Thread Sebastian Raschka
Hi Peng, check out https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py Best, Sebastian > On Jan 27, 2020, at 2:30 PM, Peng Yu wrote: > > Hi, > > I don't see what stopwords are used by CountVectorizer with > stop_words='english'. > > htt

Re: [scikit-learn] What are the stopwords used by CountVectorizer?

2020-01-27 Thread Christian Braune
Hi, https://github.com/scikit-learn/scikit-learn/blob/b194674c42d54b26137a456c510c5fdba1ba23e0/sklearn/feature_extraction/_stop_words.py Regards Christian On Mon, Jan 27, 2020, 21:31, Peng Yu wrote: > Hi, > > I don't see what stopwords are used by CountVectorizer with > stop_words =

Re: [scikit-learn] What are the stopwords used by CountVectorizer?

2020-01-27 Thread Jonathan Cusick
Hi Peng, I believe the set of English stop words used across all token vectorizers can be found in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py. Cheers, Jon On Mon, Jan 27, 2020 at 3:33 PM Peng Yu wrote: > Hi, > > I don't see what stopword
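As a quick check of the above, the stop word set can also be inspected from Python without opening the source file. This is a minimal sketch, assuming a scikit-learn version that exposes `ENGLISH_STOP_WORDS` in `sklearn.feature_extraction.text` and the `get_stop_words()` method on the vectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Ask the vectorizer which stop words it will actually use.
vectorizer = CountVectorizer(stop_words="english")
stop_words = vectorizer.get_stop_words()

print(len(stop_words))
print(sorted(stop_words)[:10])

# The built-in 'english' list is the same set that the module exports.
print(stop_words == ENGLISH_STOP_WORDS)
```

`get_stop_words()` returns `None` when no stop word list is configured, so it also works as a way to confirm what a given vectorizer instance is doing.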

[scikit-learn] What are the stopwords used by CountVectorizer?

2020-01-27 Thread Peng Yu
Hi, I don't see which stop words CountVectorizer uses with stop_words='english'. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html Is there a way to find out? Thanks. -- Regards, Peng