Otis,
I forgot to mention that I make use of Lucene for noun retrieval from
the lexicon.
Pasquale
Pasquale Imbemba ha scritto:
Hi Otis,
I am completing my bachelor thesis at the Free University of Bolzano
(www.unibz.it). My project is exactly about what you need: a word
splitter for German compound words. Raffaella Bernardi who is reading
in CC is my supervisor.
As some from the lucene mailing list has already suggested, I have
used the lexicon of German nouns extracted from Morphy
(http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). As for
the splitting algorithm, I have used the one Maaten De Rijke and
Christof Monz have published in /Shallow Morphological Analysis in
Monolingual
Information Retrieval for Dutch, German and Italian /(website here
<http://www.dcs.qmul.ac.uk/%7Echristof/>, document here
<http://www.dcs.qmul.ac.uk/%7Echristof/publications/clef-2001-post.pdf>).
I did some testing and minor improvement on it (as I needed to
"adjust" it for the solution I was working on) and could send you my
thesis paper (actually still in draft state), which contains
statistical data on correctness.
Let me know
Pasquale
Otis Gospodnetic ha scritto:
Hi,
How do people typically analyze/tokenize text with compounds (e.g.
German)? I took a look at GermanAnalyzer hoping to see how one can
deal with that, but it turns out GermanAnalyzer doesn't treat
compounds in any special way at all.
One way to go about this is to have a word dictionary and a tokenizer
that processes input one character at a time, looking for a word
match in the dictionary after each processed characters. Then,
CompoundWordLikeThis could be broken down into multiple tokens/words
and returned at a set of tokens at the same position. However,
somehow this doesn't strike me as a very smart and fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd
love to hear about it.
Thanks,
Otis
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
"As far as the laws of mathematics refer to reality, they are not certain, as far as
they are certain, they do not refer to reality."
(Albert Einstein)
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]