Re: [scikit-learn] Memory efficient TfidfVectorizer

2020-01-28 Thread Peng Yu
> Are you concerned about storing the whole corpus text in memory, or the
> whole corpus' statistics? If the text, use input='file' or input='filename'
> (or a generator of texts).

I am not really sure which stage takes the most memory, as my program is killed when it hits the memory limit. But I sus

Re: [scikit-learn] Memory efficient TfidfVectorizer

2020-01-28 Thread Joel Nothman
Are you concerned about storing the whole corpus text in memory, or the whole corpus' statistics? If the text, use input='file' or input='filename' (or a generator of texts).

On Tue, 28 Jan 2020 at 18:01, Peng Yu wrote:
> Hi,
>
> To use TfidfVectorizer, the whole corpus must be loaded into memory.
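
The two options mentioned in this reply (input='filename' and a generator of texts) could look roughly like the sketch below. The toy corpus, the temporary directory, and the helper name iter_texts are made up for illustration and are not from the thread. Note that this only avoids holding all raw texts at once; the vocabulary and the resulting sparse matrix still live in memory.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer


def iter_texts(paths):
    """Hypothetical helper: yield one document at a time from disk."""
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            yield fh.read()


# Toy corpus written to a temporary directory (illustrative assumption only).
tmpdir = tempfile.mkdtemp()
docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
paths = []
for i, text in enumerate(docs):
    path = os.path.join(tmpdir, f"doc{i}.txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    paths.append(path)

# Option 1: pass file names; the vectorizer opens and reads each file itself.
vec_files = TfidfVectorizer(input="filename")
X_files = vec_files.fit_transform(paths)

# Option 2: pass a generator of texts (default input="content").
vec_gen = TfidfVectorizer()
X_gen = vec_gen.fit_transform(iter_texts(paths))

# Both routes should produce the same sparse TF-IDF matrix.
print(X_files.shape, X_gen.shape)
```

If the memory pressure comes from the corpus statistics rather than the raw text, this sketch will not help on its own, which is why the reply asks which of the two is the concern.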

[scikit-learn] Memory efficient TfidfVectorizer

2020-01-27 Thread Peng Yu
Hi,

To use TfidfVectorizer, the whole corpus must be loaded into memory. This can be a problem for machines without a lot of memory. Is there a way to use only a small amount of memory by saving most intermediate results on disk?

Thanks.

--
Regards,
Peng