Hi all,
I got an unexpected error with current master, when trying to run
TfidfVectorizer on a 2 billion token corpus.
/home/vniculae/envs/sklearn/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    728                 # Ignore out-of-vocabulary items for fixed_vocab=True
    729                 continue
--> 730         indptr.append(len(j_indices))
    731
    732     if not fixed_vocab:

OverflowError: signed integer is greater than maximum
If I'm reading this correctly, I'm generating too many features.
Do the indices/indptr arrays need to be int32, or is this a limitation of
the implementation?
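For what it's worth, the error can be reproduced in isolation with Python's stdlib array type: if _count_vocab builds indptr with array('i') (as the traceback suggests), appending anything past the int32 maximum raises exactly this OverflowError. A minimal sketch:

```python
from array import array

# The 'i' typecode is a C signed int, 32 bits on common platforms.
indptr = array('i')
indptr.append(2**31 - 1)   # the int32 maximum still fits

try:
    indptr.append(2**31)   # one past the maximum
except OverflowError as exc:
    print(exc)             # e.g. "signed integer is greater than maximum"
```

So the ceiling here is 2**31 - 1 on the length of j_indices (i.e. the number of stored token occurrences), independent of how much RAM is available.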
I think a quick workaround would be to limit the number of features,
maybe through some preprocessing. Unfortunately I can't use the
HashingVectorizer because I'm interested in the columns.
As this happens in _count_vocab, before any frequency-based pruning,
it seems like it would be pointless to try to increase min_df.
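One preprocessing idea along those lines (a sketch, with an illustrative min_count threshold and toy docs): build a frequency table in a cheap streaming pass first, and keep only the tokens above the threshold.

```python
from collections import Counter

# Toy corpus; in practice this would be a streaming pass over the real one.
docs = ["the cat sat", "the dog sat", "the cat ran"]
min_count = 2  # illustrative threshold

# One pass to count raw token frequencies.
counts = Counter(tok for doc in docs for tok in doc.split())

# Keep only tokens seen at least min_count times.
vocab = sorted(tok for tok, c in counts.items() if c >= min_count)
print(vocab)  # ['cat', 'sat', 'the']
```

The pruned list could then be passed as TfidfVectorizer(vocabulary=vocab), which takes the fixed_vocab=True branch in the traceback and skips out-of-vocabulary tokens, so j_indices should grow more slowly. Whether that keeps it under 2**31 - 1 stored occurrences obviously depends on the corpus.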
Cheers,
Vlad
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general