2013/7/12 Antonio Manuel Macías Ojeda <[email protected]>:
> I'm not sure how you are using it, but something to take into account is
> that the default NLTK tokenizer is meant to be used on sentences, not on
> whole paragraphs or documents, so it should operate on the output of a
> sentence tokenizer, not on the raw text. Also, it should be given either
> pure ASCII or Unicode, not encoded strings.
>
> Another consideration is that by default it outputs tokens in Penn
> Treebank-compatible format, which might be overkill for your use case.
> NLTK provides simpler/faster tokenizers too, in case you want something
> more than splitting on whitespace/punctuation but don't want to
> sacrifice a lot of performance for it.
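For readers who haven't seen these, the kind of simpler regexp tokenizer Antonio means can be sketched in a few lines. This mirrors (as far as I recall) the `\w+|[^\w\s]+` pattern behind NLTK's WordPunctTokenizer; the function name below is just for illustration:

```python
import re

# Split a sentence into runs of word characters or runs of
# punctuation; whitespace is dropped. Nothing fancier than that.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+", re.UNICODE)

def wordpunct_tokenize(text):
    """Tokenize one sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(wordpunct_tokenize("Mr. O'Neill doesn't like it."))
# ['Mr', '.', 'O', "'", 'Neill', 'doesn', "'", 't', 'like', 'it', '.']
```

It's cruder than the Treebank tokenizer (no special handling of contractions or abbreviations), which is exactly why it is much faster.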
I know what the tokenizer does; I was in fact feeding it sentences,
because I was doing clustering at the word level and needed precise
tokenization. I also submitted a patch that made it twice as fast, which
was pulled yesterday (https://github.com/nltk/nltk/pull/434). My general
point is that NLTK is not written for speed. It's nice for learning, but
its algorithms are rarely fast enough to be used online, and even in
batch settings I tend to use it only for prototyping.

>> On 12 July 2013 09:48, Lars Buitinck <[email protected]> wrote:
>>>
>>> 2013/7/11 Tom Fawcett <[email protected]>:
>>> [...]
>>>
>>> I guess because it's terribly slow. I recently tried to cluster a
>>> sample of Wikipedia text at the word level.
>>
>> What kind of results did you get? I did some work recently clustering
>> short-form text and was generally unimpressed with the results.

Pretty good results, actually. I was clustering these words to get extra
features for an NER tagger, which immediately got a boost in F1 score.

>>> I found that about 75% of the time was spent in MiniBatchKMeans.fit,
>>> while the rest of it was spent inside nltk.word_tokenize (!)
>>
>> How does that compare to naively using Python's split()?

Here's some profiling using the old word_tokenize; sentences.txt
contains a million sentences sampled from Wikipedia.

>>> from itertools import islice
>>> lines = list(islice(open("sentences.txt"), 40000))
>>> %timeit map(str.split, lines)
10 loops, best of 3: 81.5 ms per loop
>>> %timeit map(word_tokenize, lines)
1 loops, best of 3: 9.37 s per loop

So word_tokenize is over 100 times as slow as str.split. I could
understand a ten-fold difference, but this is really, really slow for
something that nearly every NLP program has to do.
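For anyone who wants to run a comparison like this outside IPython, the stdlib timeit module does the job. A minimal sketch (the sentence data here is made up, and a regexp tokenizer stands in for word_tokenize, so the absolute numbers will not match the figures above):

```python
import re
import timeit

# Stand-in for a simple regexp tokenizer; in the thread the slow side
# was nltk.word_tokenize run over 40,000 Wikipedia sentences.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def regexp_tokenize(line):
    return TOKEN_RE.findall(line)

# Hypothetical data; replace with lines read from sentences.txt.
lines = ["The quick brown fox jumps over the lazy dog ."] * 1000

t_split = timeit.timeit(lambda: [l.split() for l in lines], number=10)
t_regex = timeit.timeit(lambda: [regexp_tokenize(l) for l in lines],
                        number=10)

print("str.split: %.3fs  regexp: %.3fs  ratio: %.1fx"
      % (t_split, t_regex, t_regex / t_split))
```

Even a plain regexp tokenizer is noticeably slower than str.split, but nowhere near the two orders of magnitude measured for the Treebank tokenizer.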
-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
