2013/8/29 Olivier Grisel :
> 2013/8/28 Lars Buitinck :
>> This is a limit in scipy.sparse, which uses signed int for all its
>> indices. Effectively, the number of rows, columns and non-zeros are
>> each limited to 2^31-1. There was a pull request for 64-bit indices a
>> few months ago, but I don't know its current status.
In the meantime I think I can use gensim for this kind of data by doing two
passes. It's a pity, as I suspect I could fit it in memory, but I wonder
whether even unsigned int64 would be enough. I'll do the math when I see
the final size of the matrix.
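Concretely, the arithmetic would look something like this (the dimensions
below are placeholders, since I don't have the final numbers yet):

import numpy as np

# Placeholder dimensions -- not the real corpus, just the shape of the math.
n_docs = 10**7          # rows
n_terms = 10**6         # columns (vocabulary size)
nnz = 25 * 10**8        # stored (doc, term) pairs

# Index width: int32 caps rows, columns and nnz at 2**31 - 1 each.
print(nnz > np.iinfo(np.int32).max)    # True -> needs 64-bit indices
print(nnz > np.iinfo(np.uint64).max)   # False -> uint64 has plenty of headroom

# Memory for a CSR matrix with float64 data and int64 indices:
# data and indices take one entry per non-zero, indptr one per row (+1).
n_bytes = nnz * 8 + nnz * 8 + (n_docs + 1) * 8
print(n_bytes / 2**30, "GiB")          # ~37 GiB for these placeholders

So 64-bit indices would be more than enough; the real question is whether
the tens of GiB of RAM are there.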
Thanks!
Vlad
2013/8/28 Lars Buitinck :
> 2013/8/28 Vlad Niculae :
>> Do the indices/indptr arrays need to be int32 or is this a limitation of the
>> implementation?
>
> This is a limit in scipy.sparse, which uses signed int for all its
> indices. Effectively, the number of rows, columns and non-zeros are
> each limited to 2^31-1.
2013/8/28 Vlad Niculae :
> Do the indices/indptr arrays need to be int32 or is this a limitation of the
> implementation?
This is a limit in scipy.sparse, which uses signed int for all its
indices. Effectively, the number of rows, columns and non-zeros are
each limited to 2^31-1. There was a pull request for 64-bit indices a
few months ago, but I don't know its current status.
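For illustration, a minimal check (nothing scikit-learn-specific) shows the
index dtype directly:

import numpy as np
import scipy.sparse as sp

# Any CSR matrix exposes its index arrays; scipy currently stores them
# as int32, so rows, columns and stored entries each top out at 2**31 - 1.
X = sp.csr_matrix(np.eye(3))
print(X.indices.dtype, X.indptr.dtype)   # int32 int32
print(np.iinfo(np.int32).max)            # 2147483647 == 2**31 - 1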
After running it again under pdb, I figured out that it has nothing to do with
the vocabulary, which is reasonably sized; the list of indices simply grows too big.
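Roughly the pattern I saw under pdb (a simplified sketch, not the actual
_count_vocab code): one index is appended per token occurrence, so the list
scales with the corpus size rather than the vocabulary size.

# Simplified sketch of the counting loop (not the real implementation):
# j_indices gains one entry per (document, term) occurrence, so a
# 2-billion-token corpus pushes it to billions of elements even when
# the vocabulary itself stays small.
def count_pattern(raw_documents, vocabulary, analyze):
    j_indices = []    # column index of every term occurrence
    indptr = [0]      # row boundaries: one entry per processed document
    for doc in raw_documents:
        for token in analyze(doc):
            if token in vocabulary:
                j_indices.append(vocabulary[token])
        indptr.append(len(j_indices))
    return j_indices, indptr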
Vlad
On Wed, Aug 28, 2013 at 11:01 PM, Vlad Niculae wrote:
> Hi all,
>
> I got an unexpected error with current master, when trying to run
> TfidfVectorizer on a 2 billion token corpus.
Hi all,
I got an unexpected error with current master, when trying to run
TfidfVectorizer on a 2 billion token corpus.
/home/vniculae/envs/sklearn/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    728 #