Hi Otis,
I am completing my bachelor thesis at the Free University of Bolzano
(www.unibz.it). My project is exactly about what you need: a word
splitter for German compound words. Raffaella Bernardi who is reading in
CC is my supervisor.
As some from the lucene mailing list has already suggested, I have used
the lexicon of German nouns extracted from Morphy
(http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). As for the
splitting algorithm, I have used the one Maaten De Rijke and Christof
Monz have published in /Shallow Morphological Analysis in Monolingual
Information Retrieval for Dutch, German and Italian /(website here
<http://www.dcs.qmul.ac.uk/%7Echristof/>, document here
<http://www.dcs.qmul.ac.uk/%7Echristof/publications/clef-2001-post.pdf>).
I did some testing and minor improvement on it (as I needed to "adjust"
it for the solution I was working on) and could send you my thesis paper
(actually still in draft state), which contains statistical data on
correctness.
Let me know
Pasquale
Otis Gospodnetic ha scritto:
Hi,
How do people typically analyze/tokenize text with compounds (e.g. German)? I
took a look at GermanAnalyzer hoping to see how one can deal with that, but it
turns out GermanAnalyzer doesn't treat compounds in any special way at all.
One way to go about this is to have a word dictionary and a tokenizer that
processes input one character at a time, looking for a word match in the
dictionary after each processed characters. Then, CompoundWordLikeThis could
be broken down into multiple tokens/words and returned at a set of tokens at
the same position. However, somehow this doesn't strike me as a very smart and
fast approach.
What are some better approaches?
If anyone has implemented anything that deals with this problem, I'd love to
hear about it.
Thanks,
Otis
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
"As far as the laws of mathematics refer to reality, they are not certain, as far as
they are certain, they do not refer to reality."
(Albert Einstein)
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]