Yeah, it's definitely not built with speed as its design goal. Good patch!
On Fri, Jul 12, 2013 at 1:45 PM, Lars Buitinck <[email protected]> wrote:
> 2013/7/12 Antonio Manuel Macías Ojeda <[email protected]>:
> > I'm not sure how you are using it, but something to take into account
> > is that the default NLTK tokenizer is meant to be used on sentences,
> > not on whole paragraphs or documents, so it should operate on the
> > output of a sentence tokenizer, not on the raw text. Also, the input
> > should be pure ASCII or Unicode, not encoded byte strings.
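The intended pipeline (sentence tokenizer first, word tokenizer on each sentence) can be sketched with the stdlib alone; the naive regex splitter below is only a stand-in for a real sentence tokenizer such as NLTK's Punkt:

```python
import re

def naive_sent_tokenize(text):
    # Naive stand-in for a real sentence tokenizer (e.g. NLTK's Punkt):
    # split after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Tokenizers work per sentence. Feed them one at a time!"
# Word-tokenize each sentence separately, not the whole text at once.
tokens = [s.split() for s in naive_sent_tokenize(text)]
```

In real use you would replace both the sentence splitter and the `str.split` call with the tokenizers of your choice; the point is only the per-sentence structure.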
> >
> > Another consideration is that by default it outputs tokens in Penn
> > Treebank-compatible format, which might be overkill for your use case.
> > NLTK provides simpler/faster tokenizers too, in case you want something
> > more than splitting on whitespace/punctuation but don't want to
> > sacrifice a lot of performance for it.
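For reference, one of those simpler tokenizers, NLTK's `wordpunct_tokenize`, is essentially a single regex; a roughly equivalent stdlib sketch (not NLTK's actual code) looks like:

```python
import re

# Roughly the split NLTK's wordpunct_tokenize performs:
# runs of word characters, or runs of non-space punctuation.
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

def fast_tokenize(sentence):
    return WORDPUNCT.findall(sentence)
```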
>
> I know what the tokenizer does; I was in fact feeding it sentences
> because I was doing clustering at the word level and I needed precise
> tokenization. I also submitted a patch that made it twice as fast,
> which was pulled yesterday (https://github.com/nltk/nltk/pull/434).
>
> My general point is that NLTK is not written for speed. It's nice for
> learning, but its algorithms are rarely fast enough to be used online,
> and even in batch settings I tend to use it only for prototyping.
>
> >> On 12 July 2013 09:48, Lars Buitinck <[email protected]> wrote:
> >>>
> >>> 2013/7/11 Tom Fawcett <[email protected]>:
> >>> [...]
> >>>
> >>> I guess because it's terribly slow. I recently tried to cluster a
> >>> sample of Wikipedia text at the word level.
> >>
> >> What kind of results did you get? I did some work recently clustering
> >> short-form text and was generally unimpressed with the results.
>
> Pretty good results, actually. I was clustering these words to get
> extra features for an NER tagger, which immediately got a boost in F1
> score.
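The idea of clustering words to produce extra tagger features can be sketched with scikit-learn; the thread doesn't say how the words were vectorized, so the character n-gram representation, word list, and parameters below are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import MiniBatchKMeans

words = ["London", "Paris", "Berlin", "Amsterdam",
         "ran", "runs", "running", "walked", "quickly", "slowly"]

# Represent each word by its character n-gram counts (illustrative choice;
# the original setup may have used distributional context vectors instead).
vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3))
X = vec.fit_transform(words)

km = MiniBatchKMeans(n_clusters=3, random_state=0, n_init=3).fit(X)

# word -> cluster id, usable as an extra feature column for an NER tagger
word2cluster = dict(zip(words, km.labels_))
```

Each token's cluster id then becomes one additional categorical feature alongside the usual word shape and context features.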
>
> >>> I found that about 75% of
> >>> the time was spent in MiniBatchKMeans.fit, while the rest of it was
> >>> spent inside nltk.word_tokenize (!)
> >>
> >> How does that compare to naively using Python's split()?
>
> Here's some profiling using the old word_tokenize; sentences.txt
> contains a million sentences sampled from Wikipedia.
>
> >>> from itertools import islice
> >>> lines = list(islice(open("sentences.txt"), 40000))
> >>> %timeit map(str.split, lines)
> 10 loops, best of 3: 81.5 ms per loop
> >>> %timeit map(word_tokenize, lines)
> 1 loops, best of 3: 9.37 s per loop
>
> So it's 100 times as slow as str.split. I could understand a ten-fold
> difference, but this is really, really slow for something that nearly
> every NLP program has to do.
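The %timeit comparison above can be reproduced with the stdlib `timeit` module; since sentences.txt isn't available here, synthetic lines stand in for the Wikipedia sample, and a wordpunct-style regex stands in for the (much heavier) Treebank tokenizer:

```python
import re
import timeit

# Synthetic stand-in for the sampled Wikipedia sentences.
lines = ["The quick brown fox jumps over the lazy dog."] * 1000

wordpunct = re.compile(r"\w+|[^\w\s]+")

t_split = timeit.timeit(lambda: [l.split() for l in lines], number=10)
t_regex = timeit.timeit(lambda: [wordpunct.findall(l) for l in lines],
                        number=10)

print("str.split: %.4fs  regex: %.4fs" % (t_split, t_regex))
```

The regex tokenizer is slower than `str.split` but nowhere near the 100x gap of the full Treebank tokenizer, which is the trade-off the thread is discussing.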
>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam
>
>
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general