After stepping through it again with pdb, I figured out that it has nothing
to do with the vocabulary size, which is reasonable; the list of indices
simply grows too big.
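For the archives, here is a minimal sketch of the failure mode. It assumes the counting code accumulates indices in a C-int `array.array` (which matches the "signed integer is greater than maximum" message); appending a value past the 32-bit signed maximum reproduces the same OverflowError as `indptr.append(len(j_indices))` in the traceback:

```python
from array import array

# Sketch: indptr is an array.array("i"), i.e. C signed ints (32-bit on
# common platforms).  Once len(j_indices) passes 2**31 - 1, appending it
# raises the same OverflowError as in the quoted traceback.
indptr = array("i")
try:
    indptr.append(2**31)  # 2147483648, one past the int32 maximum
except OverflowError as exc:
    print("OverflowError:", exc)
```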
Vlad
On Wed, Aug 28, 2013 at 11:01 PM, Vlad Niculae <[email protected]> wrote:
> Hi all,
>
> I got an unexpected error with current master, when trying to run
> TfidfVectorizer on a 2 billion token corpus.
>
> /home/vniculae/envs/sklearn/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc
> in _count_vocab(self, raw_documents, fixed_vocab)
>
>     728                 # Ignore out-of-vocabulary items for fixed_vocab=True
>     729                 continue
> --> 730         indptr.append(len(j_indices))
>     731
>     732     if not fixed_vocab:
>
> OverflowError: signed integer is greater than maximum
>
> If I'm reading this correctly, I'm getting too many features.
>
> Do the indices/indptr arrays need to be int32 or is this a limitation of
> the implementation?
> I think a quick workaround would be to limit the number of features,
> perhaps through some preprocessing. Unfortunately I can't use the
> HashingVectorizer because I'm interested in the columns.
>
> As this happens in _count_vocab, it seems like it would be pointless to
> try to increase min_df.
>
> Cheers,
> Vlad
>
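The feature-limiting workaround quoted above can be sketched with the vectorizer's `max_features` parameter. A caveat, hedged per the follow-up diagnosis: pruning happens only after counting, so this caps the number of columns but would not by itself shrink the intermediate index list that overflows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: cap the vocabulary at max_features terms.  Pruning is applied
# after counting, so the output has at most 5 columns, but the in-memory
# occurrence list built during counting is not reduced.
docs = ["the quick brown fox", "jumps over the lazy dog", "the fox again"]
vec = TfidfVectorizer(max_features=5)
X = vec.fit_transform(docs)
print(X.shape)  # (n_docs, at most 5)
```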
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general