2013/1/17 Andreas Mueller <[email protected]>:
> On 01/17/2013 07:02 PM, Olivier Grisel wrote:
>> This is a bug.
>>
>> Could you run the profiler (cProfile or line_profiler) on
>> TfidfVectorizer on a slice of your data and post the output?
>>
>> http://scikit-learn.org/dev/developers/performance.html#profiling-python-code
>>
> Do you think this is specific to the input?
> Or could we just do benchmarking on 20news?

I cannot reproduce this on 20 newsgroups. The number of tokens is
comparable, but the number of samples is not:

>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.feature_extraction.text import TfidfTransformer

>>> %time X = CountVectorizer().fit_transform(twenty.data)
CPU times: user 10.67 s, sys: 0.25 s, total: 10.92 s
Wall time: 10.92 s

>>> print(repr(X))
<11314x56436 sparse matrix of type '<type 'numpy.int64'>'
with 1713894 stored elements in COOrdinate format>

>>> %time X_tfidf = TfidfTransformer().fit_transform(X)
CPU times: user 0.92 s, sys: 0.05 s, total: 0.97 s
Wall time: 0.97 s

>>> print(repr(X_tfidf))
<11314x56436 sparse matrix of type '<type 'numpy.float64'>'
with 1713894 stored elements in Compressed Sparse Row format>
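
For the profiling run requested above, here is a minimal sketch using
cProfile and pstats from the standard library. The helper name
profile_call and the stand-in function count_tokens are hypothetical,
introduced only for illustration; count_tokens takes the place of the
real vectorizer so the snippet runs without scikit-learn:

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns (result, report), where report is the text of the `top`
    entries sorted by cumulative time. Hypothetical helper, not part
    of scikit-learn.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


def count_tokens(docs):
    """Stand-in workload: whitespace-tokenize and count tokens."""
    counts = {}
    for doc in docs:
        for token in doc.split():
            counts[token] = counts.get(token, 0) + 1
    return counts


# Profile the stand-in on a tiny sample and keep the text report.
result, report = profile_call(count_tokens, ["a b", "b c"])
```

With scikit-learn installed, the same helper could presumably be pointed
at the real code path, e.g.
profile_call(TfidfVectorizer().fit_transform, docs[:1000]), where docs
is assumed to be your list of raw documents. Printing the report and
posting it to the list should show where the time goes.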


--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
