Hi Otis,

I am completing my bachelor's thesis at the Free University of Bolzano (www.unibz.it). My project is exactly about what you need: a word splitter for German compound words. Raffaella Bernardi, who is in CC, is my supervisor. As someone on the Lucene mailing list has already suggested, I used the lexicon of German nouns extracted from Morphy (http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). For the splitting algorithm, I used the one Maarten de Rijke and Christof Monz published in /Shallow Morphological Analysis in Monolingual Information Retrieval for Dutch, German and Italian/ (website here <http://www.dcs.qmul.ac.uk/%7Echristof/>, document here <http://www.dcs.qmul.ac.uk/%7Echristof/publications/clef-2001-post.pdf>). I did some testing and made minor improvements to it (I needed to "adjust" it for the solution I was working on), and I could send you my thesis paper (still in draft state), which contains statistical data on correctness.
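
To give a rough idea of what lexicon-driven splitting looks like, here is a minimal sketch in Python. It is not the published algorithm and not the thesis code: the tiny LEXICON, the MIN_PART length limit, the handling of only the linking "s", and the function name split_compound are all assumptions made for illustration. A real lexicon would be loaded from a noun list such as the one extracted from Morphy.

```python
# Toy lexicon (assumption); a real one would hold many thousands of nouns.
LEXICON = {"bahn", "hof", "arbeit", "amt", "haus", "tür", "schlüssel"}

# Assumption: only an empty linker and the linking "s" are tried.
LINKERS = ("", "s")

# Assumption: parts shorter than this are never accepted.
MIN_PART = 3


def split_compound(word, lexicon=LEXICON):
    """Return a list of lexicon entries that cover `word`, or None.

    The word is scanned left to right. Whenever the prefix is a lexicon
    entry, the remainder (optionally minus a linking "s") must itself be
    a lexicon entry or be splittable recursively.
    """
    word = word.lower()
    for i in range(MIN_PART, len(word) - MIN_PART + 1):
        head = word[:i]
        if head not in lexicon:
            continue
        rest = word[i:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = rest[len(linker):]
            if len(tail) < MIN_PART:
                continue
            if tail in lexicon:
                return [head, tail]
            deeper = split_compound(tail, lexicon)
            if deeper is not None:
                return [head] + deeper
    return None


print(split_compound("Arbeitsamt"))
print(split_compound("Haustürschlüssel"))
print(split_compound("Tisch"))
```

This sketch returns the first split it finds, trying shorter heads first. Choosing between several possible splits is exactly where testing against real data matters.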

Let me know
Pasquale

Otis Gospodnetic wrote:
Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)?  I 
took a look at GermanAnalyzer hoping to see how one can deal with that, but it 
turns out GermanAnalyzer doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer that 
processes input one character at a time, looking for a word match in the 
dictionary after each processed character.  Then, CompoundWordLikeThis could 
be broken down into multiple tokens/words and returned as a set of tokens at 
the same position.  However, somehow this doesn't strike me as a very smart and 
fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love to 
hear about it.
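
The character-at-a-time dictionary lookup described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name compound_tokens, the (term, position_increment) pairs, and the sample dictionaries are hypothetical, and nothing here is Lucene API. A position increment of 0 stands for "same position as the previous token".

```python
def compound_tokens(word, dictionary):
    """Return (term, position_increment) pairs for one input word.

    The original word is emitted at a new position (increment 1).
    The input is then scanned one character at a time; as soon as the
    characters collected since the last match form a dictionary word,
    that word is emitted at the same position (increment 0).
    """
    tokens = [(word, 1)]
    lower = word.lower()
    start = 0
    for end in range(1, len(lower) + 1):
        candidate = lower[start:end]
        if candidate in dictionary:
            tokens.append((candidate, 0))
            start = end  # continue scanning after the matched part
    return tokens


# Hypothetical dictionary for the example word from the text.
words = {"compound", "word", "like", "this"}
print(compound_tokens("CompoundWordLikeThis", words))

# A short entry such as "comp" derails the greedy scan.
print(compound_tokens("CompoundWord", {"comp", "compound", "word"}))
```

The second call shows the weakness: the scan commits to the shortest matching prefix ("comp") and then finds nothing in the remainder, so only that one part is emitted. Avoiding this needs backtracking or a longest-match strategy, which adds to the cost.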

Thanks,
Otis



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
"As far as the laws of mathematics refer to reality, they are not certain, as far as 
they are certain, they do not refer to reality."

(Albert Einstein)

