Otis,

I forgot to mention that I make use of Lucene for noun retrieval from the lexicon.

Pasquale

Pasquale Imbemba wrote:
Hi Otis,

I am completing my bachelor thesis at the Free University of Bolzano (www.unibz.it). My project is about exactly what you need: a word splitter for German compound words. Raffaella Bernardi, who is reading in CC, is my supervisor. As someone on the Lucene mailing list has already suggested, I have used the lexicon of German nouns extracted from Morphy (http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). As for the splitting algorithm, I have used the one Maarten de Rijke and Christof Monz published in /Shallow Morphological Analysis in Monolingual Information Retrieval for Dutch, German and Italian/ (website here <http://www.dcs.qmul.ac.uk/%7Echristof/>, document here <http://www.dcs.qmul.ac.uk/%7Echristof/publications/clef-2001-post.pdf>). I did some testing and made minor improvements to it (as I needed to "adjust" it for the solution I was working on) and could send you my thesis paper (actually still in draft state), which contains statistical data on correctness.

Let me know
Pasquale
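
For readers who want a feel for lexicon-driven splitting as described in the message above, here is a minimal sketch in Python. It is not the algorithm from the cited paper and not the thesis code: the function name split_compound, the tiny LEXICON set, the min_len cutoff and the handling of a linking "s" are all assumptions made for illustration. The idea is only that a word is a compound if a prefix is in the noun lexicon and the remainder can itself be split or found in the lexicon.

```python
from typing import List, Optional, Set

# Hypothetical mini-lexicon of lower-cased German nouns. A real system
# would load a full noun list (for example one extracted from Morphy).
LEXICON: Set[str] = {"arbeit", "amt", "haus", "baum", "bau"}


def split_compound(word: str, lexicon: Set[str], min_len: int = 3) -> Optional[List[str]]:
    """Return the lexicon parts of `word`, or None if it cannot be analysed.

    Assumption: a split is preferred over keeping the whole word, and a
    single linking "s" between two parts may be dropped.
    """
    word = word.lower()
    # Try every split point that leaves at least min_len characters on each side.
    for i in range(min_len, len(word) - min_len + 1):
        head, rest = word[:i], word[i:]
        if head not in lexicon:
            continue
        tail = split_compound(rest, lexicon, min_len)
        if tail is None and rest.startswith("s"):
            # Retry with the leading "s" treated as a linking element.
            tail = split_compound(rest[1:], lexicon, min_len)
        if tail is not None:
            return [head] + tail
    # No split worked: accept the word only if it is itself a known noun.
    return [word] if word in lexicon else None


if __name__ == "__main__":
    for w in ("Arbeitsamt", "Baumhaus", "Haus", "Xyz"):
        print(w, "->", split_compound(w, LEXICON))
```

Because the recursion backtracks, "Baumhaus" is still split correctly even though "bau" is also in the lexicon: the attempt bau + mhaus fails and the next split point is tried. The cost is that the number of lexicon lookups grows with word length, so a fast lookup structure matters for a large lexicon.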

Otis Gospodnetic wrote:
Hi,

How do people typically analyze/tokenize text with compounds (e.g. German)? I took a look at GermanAnalyzer hoping to see how one can deal with that, but it turns out GermanAnalyzer doesn't treat compounds in any special way at all.

One way to go about this is to have a word dictionary and a tokenizer that processes input one character at a time, looking for a word match in the dictionary after each processed character. Then, CompoundWordLikeThis could be broken down into multiple tokens/words and returned as a set of tokens at the same position. However, somehow this doesn't strike me as a very smart and fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love to hear about it.

Thanks,
Otis
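
The character-at-a-time dictionary scan described in the message above could be sketched roughly as follows. This is an illustration only, not Lucene code: the names scan_compound and tokens_at_same_position, the toy DICTIONARY and the min_len cutoff are assumptions, and the (term, position) pairs merely stand in for tokens that share a position in the token stream.

```python
from typing import List, Set, Tuple

# Hypothetical toy dictionary; a real tokenizer would use a full word list.
DICTIONARY: Set[str] = {"compound", "word", "like", "this"}


def scan_compound(token: str, dictionary: Set[str], min_len: int = 3) -> List[str]:
    """Grow a candidate one character at a time and cut at the first match.

    Returns the parts only if the whole token was consumed and at least
    two parts were found; otherwise returns an empty list.
    """
    lowered = token.lower()
    parts: List[str] = []
    start = 0
    for end in range(1, len(lowered) + 1):
        candidate = lowered[start:end]
        if len(candidate) >= min_len and candidate in dictionary:
            parts.append(candidate)
            start = end  # begin a new candidate after the matched word
    if start == len(lowered) and len(parts) > 1:
        return parts
    return []


def tokens_at_same_position(token: str, position: int,
                            dictionary: Set[str]) -> List[Tuple[str, int]]:
    """Emit the original token plus its parts, all at the same position."""
    parts = scan_compound(token, dictionary)
    return [(token, position)] + [(part, position) for part in parts]


if __name__ == "__main__":
    print(tokens_at_same_position("CompoundWordLikeThis", 0, DICTIONARY))
```

The sketch also shows why the approach is not very smart: it cuts at the first dictionary hit and never backtracks, so with a dictionary containing "bau", "baum" and "haus" the word "Baumhaus" is cut after "bau" and the rest can no longer be matched.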



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




--
"As far as the laws of mathematics refer to reality, they are not certain, as far as 
they are certain, they do not refer to reality."

(Albert Einstein)


