Hi!

> I found that about 75% of
> the time was spent in MiniBatchKMeans.fit, while the rest of it was
> spent inside nltk.word_tokenize (!)
>

I'm not sure how you are using it, but something to take into account is
that the default NLTK tokenizer is meant to be used on sentences, not on
whole paragraphs or documents, so it should operate on the output of a
sentence tokenizer, not on the raw text. Also, the input should be either
pure ASCII or unicode, not encoded byte strings.
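
A minimal sketch of that two-step pipeline, in case it helps. It assumes
NLTK is installed. To stay self-contained it uses an untrained
PunktSentenceTokenizer and TreebankWordTokenizer directly, rather than the
pretrained model that nltk.sent_tokenize loads, so the sentence splits may
differ from what you get with the default helpers. The function name
tokenize_document is just for illustration.

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer


def tokenize_document(text):
    """Split text into sentences first, then tokenize each sentence."""
    # Assumption: an untrained Punkt splitter (no abbreviation list);
    # nltk.sent_tokenize would load a pretrained model instead.
    sentence_splitter = PunktSentenceTokenizer()
    word_tokenizer = TreebankWordTokenizer()
    return [word_tokenizer.tokenize(sentence)
            for sentence in sentence_splitter.tokenize(text)]


# Unicode input, not an encoded byte string.
doc = u"The cat sat on the mat. It didn't move."
tokens = tokenize_document(doc)

# For comparison: the word tokenizer applied to the raw text.
raw_tokens = TreebankWordTokenizer().tokenize(doc)

for sentence_tokens in tokens:
    print(sentence_tokens)
print(raw_tokens)
```

As far as I know, the Treebank tokenizer only separates a period at the
very end of its input, so on the raw text the period after "mat" should
stay glued to the word, which is one reason to split sentences first.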

Another consideration is that by default it outputs tokens in Penn Treebank
compatible format, which might be overkill for your use case. NLTK also
provides simpler and faster tokenizers, in case you want something more
than splitting on whitespace/punctuation but don't want to sacrifice a lot
of performance for it.

http://nltk.org/api/nltk.tokenize.html
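
A rough sketch of a few of those simpler tokenizers, assuming NLTK is
installed. None of them need extra data downloads. The sample sentence and
the \w+ pattern are only illustrative choices; which one is fast enough and
good enough depends on your data, so it is worth timing them on your own
corpus.

```python
from nltk.tokenize import (RegexpTokenizer, WhitespaceTokenizer,
                           WordPunctTokenizer)

text = u"Clustering short-form text isn't easy, is it?"

# Splits on whitespace only; punctuation stays attached to words.
whitespace_tokens = WhitespaceTokenizer().tokenize(text)

# Splits into runs of word characters and runs of punctuation.
wordpunct_tokens = WordPunctTokenizer().tokenize(text)

# Keeps only runs of word characters and drops punctuation entirely.
word_only_tokens = RegexpTokenizer(r"\w+").tokenize(text)

print(whitespace_tokens)
print(wordpunct_tokens)
print(word_only_tokens)
```

Note that the last two break contractions and hyphenated words apart, which
may or may not matter for clustering.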

Hope this helps!


On Fri, Jul 12, 2013 at 10:12 AM, Fred Mailhot <[email protected]> wrote:

> On 12 July 2013 09:48, Lars Buitinck <[email protected]> wrote:
>
>> 2013/7/11 Tom Fawcett <[email protected]>:
>> [...]
>>
>> I guess because it's terribly slow. I recently tried to cluster a
>> sample of Wikipedia text at the word level.
>>
>
> What kind of results did you get? I did some work recently clustering
> short-form text and was generally unimpressed with the results.
>
>> I found that about 75% of
>> the time was spent in MiniBatchKMeans.fit, while the rest of it was
>> spent inside nltk.word_tokenize (!)
>>
>
> How does that compare to naively using Python's split()?
>
>
>>
>> --
>> Lars Buitinck
>> Scientific programmer, ILPS
>> University of Amsterdam
>>
>>
>> ------------------------------------------------------------------------------
>> See everything from the browser to the database with AppDynamics
>> Get end-to-end visibility with application monitoring from AppDynamics
>> Isolate bottlenecks and diagnose root cause in seconds.
>> Start your free trial of AppDynamics Pro today!
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>