On Sep 19, 2006, at 9:21 AM, Otis Gospodnetic wrote:
> How do people typically analyze/tokenize text with compounds (e.g.
> German)? I took a look at GermanAnalyzer hoping to see how one can
> deal with that, but it turns out GermanAnalyzer doesn't treat
> compounds in any special way at all.
> One way to go about this is to have a word dictionary and a
> tokenizer that processes input one character at a time, looking for
> a word match in the dictionary after each processed character.
> Then, CompoundWordLikeThis could be broken down into multiple
> tokens/words and returned as a set of tokens at the same position.
> However, somehow this doesn't strike me as a very smart and fast
> approach.
This came up on the KinoSearch list a few weeks ago, and the best
solution I could think of used essentially the same algorithm you
describe.
During the discussion, we found this:
http://www.glue.umd.edu/~oard/courses/708a/fall01/838/P2/
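
For concreteness, here is a minimal sketch of that dictionary-scan
approach in Java. It's standalone rather than a real Lucene
TokenFilter, and the class name, the toy dictionary, and the greedy
shortest-match policy are all my own illustration, not anything that
ships with Lucene or KinoSearch:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy dictionary-based compound splitter -- an illustration only,
// not part of Lucene or KinoSearch.
public class CompoundSplitter {

    private final Set<String> dictionary;

    public CompoundSplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Scan one character at a time; whenever the span seen so far is
    // a dictionary word, emit it and keep scanning from that point.
    // (Greedy shortest match -- it will mis-split whenever a short
    // dictionary word is a prefix of a longer one.)
    public List<String> split(String compound) {
        List<String> parts = new ArrayList<String>();
        int start = 0;
        for (int end = 1; end <= compound.length(); end++) {
            String candidate = compound.substring(start, end);
            if (dictionary.contains(candidate)) {
                parts.add(candidate);
                start = end;
            }
        }
        // If the scan didn't consume the whole string, keep the
        // original token instead of emitting a partial split.
        if (start != compound.length()) {
            parts.clear();
            parts.add(compound);
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>();
        dict.add("compound");
        dict.add("word");
        dict.add("like");
        dict.add("this");
        System.out.println(new CompoundSplitter(dict)
            .split("compoundwordlikethis"));
        // prints: [compound, word, like, this]
    }
}

In a real Lucene analyzer you'd presumably wrap something like this
in a TokenFilter and give the subword tokens a position increment of
0 so they stack on the compound's position, as you describe. The
greedy shortest match shown here mis-splits easily, which is part of
why the naive approach feels neither smart nor fast.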
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/