Hi,
To use TfidfVectorizer, the whole corpus must be loaded into memory.
This can be a problem for machines without a lot of memory. Is there a
way to use only a small amount of memory by saving most intermediate
results to disk? Thanks.
--
Regards,
Peng
___
See also https://www.aclweb.org/anthology/W18-2502/ for a critique of this
and other stop word lists.
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
Hi Peng,
check out
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py
Best,
Sebastian
> On Jan 27, 2020, at 2:30 PM, Peng Yu wrote:
>
> Hi,
>
> I don't see which stop words are used by CountVectorizer with
> stop_words='english'.
>
> htt
Hi,
https://github.com/scikit-learn/scikit-learn/blob/b194674c42d54b26137a456c510c5fdba1ba23e0/sklearn/feature_extraction/_stop_words.py
Regards
Christian
Peng Yu wrote on Mon, Jan 27, 2020, 21:31:
> Hi,
>
> I don't see what stopwords are used by CountVectorizer with
> stop_wordsstring =
Hi Peng,
I believe the set of English stop words used across all token vectorizers
can be found in
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/_stop_words.py.
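The same list can also be inspected from Python without opening the source file. The snippet below is a minimal sketch, assuming a reasonably recent scikit-learn in which ENGLISH_STOP_WORDS is importable from sklearn.feature_extraction.text and the vectorizers provide get_stop_words(); the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

# ENGLISH_STOP_WORDS is the frozenset defined in _stop_words.py.
print(len(ENGLISH_STOP_WORDS))
print(sorted(ENGLISH_STOP_WORDS)[:10])

# get_stop_words() returns the effective stop word list of a vectorizer,
# so it shows exactly what stop_words="english" resolves to.
vectorizer = CountVectorizer(stop_words="english")
stop_words = vectorizer.get_stop_words()
print(stop_words == ENGLISH_STOP_WORDS)
```

Calling get_stop_words() on the vectorizer itself has the advantage that it also works when a custom list was passed instead of the string 'english'.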
Cheers,
Jon
On Mon, Jan 27, 2020 at 3:33 PM Peng Yu wrote:
> Hi,
>
> I don't see what stopword
Hi,
I don't see which stop words are used by CountVectorizer with
stop_words='english'.
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
Is there a way to figure it out? Thanks.
--
Regards,
Peng
___