After stepping through it again with pdb, I figured out that it has nothing
to do with the vocabulary size, which is reasonable; the list of indices
simply grows too big.
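For the archives, here is a minimal sketch of the failure mode. It assumes the counting code accumulates indices in a C-int `array.array` (which matches the "signed integer is greater than maximum" message); appending a value past the 32-bit signed maximum reproduces the same OverflowError as `indptr.append(len(j_indices))` in the traceback:

```python
from array import array

# Sketch: indptr is an array.array("i"), i.e. C signed ints (32-bit on
# common platforms).  Once len(j_indices) passes 2**31 - 1, appending it
# raises the same OverflowError as in the quoted traceback.
indptr = array("i")
try:
    indptr.append(2**31)  # 2147483648, one past the int32 maximum
except OverflowError as exc:
    print("OverflowError:", exc)
```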
Vlad
On Wed, Aug 28, 2013 at 11:01 PM, Vlad Niculae <[email protected]> wrote:
> Hi all,
>
> I got an unexpected error with current master, when trying to run
> TfidfVectorizer on a 2 billion token corpus.
>
> /home/vniculae/envs/sklearn/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc
> in _count_vocab(self, raw_documents, fixed_vocab)
>
>     728                 # Ignore out-of-vocabulary items for fixed_vocab=True
>     729                 continue
> --> 730         indptr.append(len(j_indices))
>     731
>     732     if not fixed_vocab:
>
> OverflowError: signed integer is greater than maximum
>
> If I'm reading this correctly, I'm getting too many features.
>
> Do the indices/indptr arrays need to be int32 or is this a limitation of
> the implementation?
> I think a quick workaround would be to limit the number of features,
> perhaps through some preprocessing. Unfortunately I can't use the
> HashingVectorizer because I'm interested in the columns.
>
> As this happens in _count_vocab, it seems like it would be pointless to
> try to increase min_df.
>
> Cheers,
> Vlad
>
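The feature-limiting workaround quoted above can be sketched with the vectorizer's `max_features` parameter. A caveat, hedged per the follow-up diagnosis: pruning happens only after counting, so this caps the number of columns but would not by itself shrink the intermediate index list that overflows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: cap the vocabulary at max_features terms.  Pruning is applied
# after counting, so the output has at most 5 columns, but the in-memory
# occurrence list built during counting is not reduced.
docs = ["the quick brown fox", "jumps over the lazy dog", "the fox again"]
vec = TfidfVectorizer(max_features=5)
X = vec.fit_transform(docs)
print(X.shape)  # (n_docs, at most 5)
```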
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general