Re: [Scikit-learn-general] Overflow when vectorizing large corpus

2013-08-29 Thread Lars Buitinck
2013/8/29 Olivier Grisel :
> 2013/8/28 Lars Buitinck :
>> This is a limit in scipy.sparse, which uses signed int for all its
>> indices. Effectively, the number of rows, columns and non-zeros are
>> each limited to 2^31-1. There was a pull request for 64-bit indices a
>> few months ago, but I don't
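
Lars's limit is easy to verify directly. A minimal sketch (the toy matrix and printed values are illustrative only, not taken from this thread):

    import numpy as np
    import scipy.sparse as sp

    # CSR matrices store their indices/indptr arrays as 32-bit signed ints,
    # so rows, columns and stored non-zeros are each capped at 2**31 - 1.
    X = sp.rand(10, 10, density=0.5, format="csr")
    print(X.indices.dtype, X.indptr.dtype)   # typically int32
    print(np.iinfo(np.int32).max)            # 2147483647 == 2**31 - 1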

Re: [Scikit-learn-general] Overflow when vectorizing large corpus

2013-08-29 Thread Vlad Niculae
In the meantime I think I can use gensim for this kind of data by doing 2
passes. It's a pity as I suspect I could fit it in memory, but I wonder
whether even unsigned int64 would be enough. I'll do the math when I see
the final size of the matrix.

Thanks!
Vlad

On Thu, Aug 29, 2013 at 12:11 P
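
For that back-of-the-envelope math, the relevant ceilings are just the integer limits (a sketch; the non-zero estimate below is hypothetical, not Vlad's actual figure):

    import numpy as np

    nnz_estimate = 2500000000                       # hypothetical count of token occurrences
    print(np.iinfo(np.int32).max)                   # 2147483647, the current scipy.sparse ceiling
    print(np.iinfo(np.int64).max)                   # 9223372036854775807
    print(nnz_estimate <= np.iinfo(np.int64).max)   # True: 64-bit indices would be ample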

Re: [Scikit-learn-general] Overflow when vectorizing large corpus

2013-08-29 Thread Olivier Grisel
2013/8/28 Lars Buitinck :
> 2013/8/28 Vlad Niculae :
>> Do the indices/indptr arrays need to be int32 or is this a limitation of the
>> implementation?
>
> This is a limit in scipy.sparse, which uses signed int for all its
> indices. Effectively, the number of rows, columns and non-zeros are
> each

Re: [Scikit-learn-general] Overflow when vectorizing large corpus

2013-08-28 Thread Lars Buitinck
2013/8/28 Vlad Niculae :
> Do the indices/indptr arrays need to be int32 or is this a limitation of the
> implementation?

This is a limit in scipy.sparse, which uses signed int for all its
indices. Effectively, the number of rows, columns and non-zeros are each
limited to 2^31-1. There was a pull

Re: [Scikit-learn-general] Overflow when vectorizing large corpus

2013-08-28 Thread Vlad Niculae
After doing it again with pdb I figured out that it has nothing to do with
vocabulary size, which is decent; the list of indices simply grows too big.

Vlad

On Wed, Aug 28, 2013 at 11:01 PM, Vlad Niculae wrote:
> Hi all,
>
> I got an unexpected error with current master, when trying to run
> Tf
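
What Vlad describes can be seen with a toy version of the counting loop: the vocabulary stays small while the index list gains one entry per token occurrence. A simplified sketch (tiny made-up corpus; not the actual _count_vocab code):

    from collections import defaultdict

    docs = ["a a b", "b c a a"]            # stand-in corpus
    vocabulary = defaultdict(None)
    vocabulary.default_factory = vocabulary.__len__

    j_indices = []                          # one entry per token occurrence
    for doc in docs:
        for token in doc.split():
            j_indices.append(vocabulary[token])

    print(len(vocabulary))   # 3 -- vocabulary size stays modest
    print(len(j_indices))    # 7 -- grows with the total number of tokens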

[Scikit-learn-general] Overflow when vectorizing large corpus

2013-08-28 Thread Vlad Niculae
Hi all, I got an unexpected error with current master, when trying to run
TfidfVectorizer on a 2 billion token corpus.

/home/vniculae/envs/sklearn/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    728 #
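
The shape of the failing call is roughly the following (a sketch; corpus.txt and iter_docs are hypothetical stand-ins for the real 2-billion-token corpus):

    from sklearn.feature_extraction.text import TfidfVectorizer

    def iter_docs(path):            # hypothetical streaming reader, one document per line
        with open(path) as f:
            for line in f:
                yield line

    vect = TfidfVectorizer()
    # With billions of token occurrences, the index array built in _count_vocab
    # exceeds what a signed 32-bit integer can represent and overflows.
    X = vect.fit_transform(iter_docs("corpus.txt"))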