2013/7/12 Antonio Manuel Macías Ojeda <[email protected]>:
> I'm not sure how you are using it, but something to take into account is
> that the default NLTK tokenizer is meant to be used on sentences, not on
> whole paragraphs or documents, so it should operate on the output of a
> sentence tokenizer, not on the raw text. Also, it should be given either
> pure ASCII or Unicode, not encoded strings.
>
> Another consideration is that by default it outputs tokens in Penn
> Treebank-compatible format, which might be overkill for your use case.
> NLTK provides simpler/faster tokenizers too, in case you want something
> more than splitting on whitespace/punctuation but don't want to
> sacrifice a lot of performance for it.
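For readers who haven't seen these, the kind of simpler regexp tokenizer Antonio means can be sketched in a few lines. This mirrors (as far as I recall) the `\w+|[^\w\s]+` pattern behind NLTK's WordPunctTokenizer; the function name below is just for illustration:

```python
import re

# Split a sentence into runs of word characters or runs of
# punctuation; whitespace is dropped. Nothing fancier than that.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+", re.UNICODE)

def wordpunct_tokenize(text):
    """Tokenize one sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(wordpunct_tokenize("Mr. O'Neill doesn't like it."))
# ['Mr', '.', 'O', "'", 'Neill', 'doesn', "'", 't', 'like', 'it', '.']
```

It's cruder than the Treebank tokenizer (no special handling of contractions or abbreviations), which is exactly why it is much faster.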
I know what the tokenizer does; I was in fact feeding it sentences,
because I was doing clustering at the word level and needed precise
tokenization. I also submitted a patch that made it twice as fast, which
was pulled yesterday (https://github.com/nltk/nltk/pull/434). My general
point is that NLTK is not written for speed. It's nice for learning, but
its algorithms are rarely fast enough to be used online, and even in
batch settings I tend to use it only for prototyping.

>> On 12 July 2013 09:48, Lars Buitinck <[email protected]> wrote:
>>>
>>> 2013/7/11 Tom Fawcett <[email protected]>:
>>> [...]
>>>
>>> I guess because it's terribly slow. I recently tried to cluster a
>>> sample of Wikipedia text at the word level.
>>
>> What kind of results did you get? I did some work recently clustering
>> short-form text and was generally unimpressed with the results.

Pretty good results, actually. I was clustering these words to get extra
features for an NER tagger, which immediately got a boost in F1 score.

>>> I found that about 75% of the time was spent in MiniBatchKMeans.fit,
>>> while the rest of it was spent inside nltk.word_tokenize (!)
>>
>> How does that compare to naively using Python's split()?

Here's some profiling using the old word_tokenize; sentences.txt
contains a million sentences sampled from Wikipedia.

>>> from itertools import islice
>>> lines = list(islice(open("sentences.txt"), 40000))
>>> %timeit map(str.split, lines)
10 loops, best of 3: 81.5 ms per loop
>>> %timeit map(word_tokenize, lines)
1 loops, best of 3: 9.37 s per loop

So word_tokenize is over 100 times as slow as str.split. I could
understand a ten-fold difference, but this is really, really slow for
something that nearly every NLP program has to do.
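For anyone who wants to run a comparison like this outside IPython, the stdlib timeit module does the job. A minimal sketch (the sentence data here is made up, and a regexp tokenizer stands in for word_tokenize, so the absolute numbers will not match the figures above):

```python
import re
import timeit

# Stand-in for a simple regexp tokenizer; in the thread the slow side
# was nltk.word_tokenize run over 40,000 Wikipedia sentences.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def regexp_tokenize(line):
    return TOKEN_RE.findall(line)

# Hypothetical data; replace with lines read from sentences.txt.
lines = ["The quick brown fox jumps over the lazy dog ."] * 1000

t_split = timeit.timeit(lambda: [l.split() for l in lines], number=10)
t_regex = timeit.timeit(lambda: [regexp_tokenize(l) for l in lines],
                        number=10)

print("str.split: %.3fs  regexp: %.3fs  ratio: %.1fx"
      % (t_split, t_regex, t_regex / t_split))
```

Even a plain regexp tokenizer is noticeably slower than str.split, but nowhere near the two orders of magnitude measured for the Treebank tokenizer.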
-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
