> From: Walter Underwood

> German noun decompounding is a little more complicated than it might
> seem.
> 
> There can be transformations or inflections, like the "s" in
> "Weihnachtsbaum" (Weihnachten/Baum).

I remember from my linguistics studies that the terminus technicus for
these is "Fugenmorphem" (interstitial or joint morpheme). But there
aren't many of them: phrased as a regex, it's /e?[ns]/. The
Weihnachtsbaum in the example above is built from the singular (die
Weihnacht), then "s", then Baum. Still, it's much more complex than in,
say, English or Italian.

> Internal nouns should be recapitalized, like "Baum" above.

Casing won't matter for indexing, I think. The way I would go about
obtaining stems from compound words is by using a dictionary of stems
and a regex. We'll see how far that'll take us.
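
A minimal sketch of that idea, to make it concrete. The stem list is a
tiny hypothetical one, the function name split_compound is made up for
illustration, and the linking pattern is just the /e?[ns]/ from above,
so this is an assumption-laden starting point rather than a finished
decompounder:

```python
import re

# Hypothetical stem dictionary; a real one would be far larger.
STEMS = {"weihnacht", "baum", "orgel", "musik", "netzwerk", "betreuer"}

# Linking element as phrased above: optional "e" followed by "n" or "s".
LINK = re.compile(r"e?[ns]")


def split_compound(word, stems=STEMS):
    """Return a list of stems covering `word`, or None if no split is found."""
    word = word.lower()  # casing is ignored for indexing
    if word in stems:
        return [word]
    # Try every split point, longest head first.
    for i in range(len(word) - 1, 0, -1):
        head = word[:i]
        if head not in stems:
            continue
        rest = word[i:]
        # The remainder may follow directly or after a linking element.
        candidates = [rest]
        match = LINK.match(rest)
        if match:
            candidates.append(rest[match.end():])
        for tail in candidates:
            if not tail:
                continue
            parts = split_compound(tail, stems)
            if parts:
                return [head] + parts
    return None


print(split_compound("Weihnachtsbaum"))    # ['weihnacht', 'baum']
print(split_compound("Orgelmusik"))        # ['orgel', 'musik']
print(split_compound("Netzwerkbetreuer"))  # ['netzwerk', 'betreuer']
```

Words that cannot be covered completely by known stems come back as
None, so they would simply be indexed unsplit.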

> Some compounds probably should not be decompounded, like "Fahrrad"
> (fahren/Rad). With a dictionary-based stemmer, you might decide to
> avoid decompounding for words in the dictionary.

Good point.
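
Roughly, the guard could look like the sketch below. Both word lists
and the name index_terms are hypothetical, and the splitter only
handles two-part compounds without linking elements, to keep the
example short:

```python
# Hypothetical lists, for illustration only.
KEEP_WHOLE = {"fahrrad"}                     # lexicalized compounds, never split
STEMS = {"fahr", "rad", "orgel", "musik"}    # known stems


def index_terms(word, keep_whole=KEEP_WHOLE, stems=STEMS):
    """Return index terms for `word`: the word itself, plus its parts if it splits."""
    w = word.lower()
    if w in keep_whole:
        # Known whole word: index it as-is and skip decompounding.
        return [w]
    for i in range(1, len(w)):
        if w[:i] in stems and w[i:] in stems:
            return [w, w[:i], w[i:]]
    return [w]


print(index_terms("Fahrrad"))     # ['fahrrad']
print(index_terms("Orgelmusik"))  # ['orgelmusik', 'orgel', 'musik']
```

Without the KEEP_WHOLE check, "Fahrrad" would be split into "fahr" and
"rad" here, which is exactly the case to avoid.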

> Note that highlighting gets pretty weird when you are matching only
> part of a word.

I guess it'll be weird when you get it wrong, like "Noten" in
"Notentriegelung".

> Luckily, a lot of compounds are simple, and you could well get a
> measurable improvement with a very simple algorithm. There isn't
> anything complicated about compounds like Orgelmusik or
> Netzwerkbetreuer.

Exactly.

> The Basis Technology linguistic analyzers aren't cheap or small, but
> they work well.

We will consider our needs and options. Thanks for your thoughts.

Michael
