Hi all,

I got an unexpected error with current master, when trying to run
TfidfVectorizer on a 2 billion token corpus.

/home/vniculae/envs/sklearn/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    728                     # Ignore out-of-vocabulary items for fixed_vocab=True
    729                     continue
--> 730             indptr.append(len(j_indices))
    731
    732         if not fixed_vocab:

OverflowError: signed integer is greater than maximum

If I'm reading this correctly, I'm getting too many features.

Do the indices/indptr arrays have to be int32, or is this a limitation of
the implementation?
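For what it's worth, a signed 32-bit index tops out just above 2 billion, which lines up suspiciously well with the corpus size. A quick sanity check (plain NumPy, nothing sklearn-specific):

```python
import numpy as np

# Maximum value a signed 32-bit integer can hold; once len(j_indices)
# exceeds this, appending it to an int32 array raises the overflow.
print(np.iinfo(np.int32).max)  # 2147483647, i.e. ~2.1 billion
```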
I think a quick workaround would be to limit the number of features,
maybe through some preprocessing. Unfortunately, I can't use the
HashingVectorizer because I'm interested in the columns.
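For the preprocessing route, one thing I might try is passing a pre-pruned vocabulary explicitly, since _count_vocab skips out-of-vocabulary tokens when fixed_vocab=True, so j_indices only grows with in-vocabulary hits. A toy sketch (the corpus and vocabulary here are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real 2-billion-token one.
docs = ["the cat sat on the mat", "the dog sat on the log"]

# With an explicit vocabulary, out-of-vocabulary tokens are dropped
# during counting instead of being appended to j_indices.
vec = TfidfVectorizer(vocabulary=["cat", "dog", "sat"])
X = vec.fit_transform(docs)
print(X.shape)  # (2, 3): 2 documents, 3 features
```

No idea yet whether that shrinks things enough for 2 billion tokens, since the arrays still grow with in-vocabulary occurrences, but it would at least cut the total.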

Since this happens in _count_vocab, it seems pointless to try
increasing min_df.

Cheers,
Vlad
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
