2013/1/17 Andreas Mueller <[email protected]>:
> On 01/17/2013 07:02 PM, Olivier Grisel wrote:
>> This is a bug.
>>
>> Could you run the profiler (cProfile or line_profiler) on
>> TfidfVectorizer on a slice of your data and post the output?
>>
>> http://scikit-learn.org/dev/developers/performance.html#profiling-python-code
>>
> Do you think this is specific to the input?
> Or could we just do benchmarking on 20news?
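
For reference, a profiling run of the kind requested above could look
roughly like the sketch below. It uses only the standard library
(cProfile and pstats). The names profile_call and count_tokens are
hypothetical helpers made up for this illustration, and count_tokens is
only a stand-in workload, not anything from scikit-learn:

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is the text of the `top`
    most expensive entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def count_tokens(docs):
    # Hypothetical stand-in workload; replace with the real call.
    counts = {}
    for doc in docs:
        for token in doc.lower().split():
            counts[token] = counts.get(token, 0) + 1
    return counts


# Toy corpus; profile only a slice of it, as suggested in the thread.
docs = ["the quick brown fox", "the lazy dog"] * 1000
counts, report = profile_call(count_tokens, docs[:500])
print(report)
```

To profile the vectorizer itself, one would presumably pass
TfidfVectorizer().fit_transform and a slice of the real documents to
profile_call in place of count_tokens and the toy corpus.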
I cannot reproduce on 20 newsgroups. The number of tokens is comparable,
but the number of samples is not:

>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.feature_extraction.text import TfidfTransformer

>>> %time X = CountVectorizer().fit_transform(twenty.data)
CPU times: user 10.67 s, sys: 0.25 s, total: 10.92 s
Wall time: 10.92 s

>>> print(repr(X))
<11314x56436 sparse matrix of type '<type 'numpy.int64'>'
        with 1713894 stored elements in COOrdinate format>

>>> %time X_tfidf = TfidfTransformer().fit_transform(X)
CPU times: user 0.92 s, sys: 0.05 s, total: 0.97 s
Wall time: 0.97 s

>>> print(repr(X_tfidf))
<11314x56436 sparse matrix of type '<type 'numpy.float64'>'
        with 1713894 stored elements in Compressed Sparse Row format>

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
