> From: Walter Underwood
>
> German noun decompounding is a little more complicated than it might
> seem.
>
> There can be transformations or inflections, like the "s" in
> "Weihnachtsbaum" (Weihnachten/Baum).

I remember from my linguistics studies that the terminus technicus for these is "Fugenmorphem" (interstitial or joint morpheme). But there aren't many of them; phrased as a regex, it's /e?[ns]/. The Weihnachtsbaum in the example above is built from the singular (die Weihnacht), then "s", then Baum. Still, it's much more complex than in, say, English or Italian.

> Internal nouns should be recapitalized, like "Baum" above.

Casing won't matter for indexing, I think. The way I would go about obtaining stems from compound words is by using a dictionary of stems and a regex. We'll see how far that'll take us.

> Some compounds probably should not be decompounded, like "Fahrrad"
> (fahren/Rad). With a dictionary-based stemmer, you might decide to
> avoid decompounding for words in the dictionary.

Good point.

> Note that highlighting gets pretty weird when you are matching only
> part of a word.

I guess it'll be weird when you get it wrong, like "Noten" in "Notentriegelung".

> Luckily, a lot of compounds are simple, and you could well get a
> measurable improvement with a very simple algorithm. There isn't
> anything complicated about compounds like Orgelmusik or
> Netzwerkbetreuer.

Exactly.

> The Basis Technology linguistic analyzers aren't cheap or small, but
> they work well.

We will consider our needs and options. Thanks for your thoughts.

Michael
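
P.S. To make the dictionary-plus-regex idea a bit more concrete, here is a rough sketch in Python. It is only an illustration under assumptions: the stem list, the KEEP_WHOLE exception list and the function names are made up, and the joint pattern is just the regex from above, so it does not cover joints like the "e" in Hundehütte or the "er" in Kindergarten.

```python
import re

# Hypothetical toy stem dictionary (lowercase); a real one would be far larger.
STEMS = {
    "weihnacht", "baum", "orgel", "musik", "netzwerk", "betreuer",
    "not", "note", "entriegelung",
}

# Hypothetical exception list: compounds that should stay whole.
KEEP_WHOLE = {"fahrrad"}

# Joint morphemes covered by the regex from the mail: n, s, en, es.
LINKER = re.compile(r"e?[ns]")


def _split(w):
    """Split w into dictionary stems, allowing a joint after each stem.

    Returns a list of stems, or None if w cannot be fully covered.
    """
    if w in STEMS:
        return [w]
    # Try the longest candidate head first, backtracking on failure.
    for i in range(len(w) - 1, 0, -1):
        head = w[:i]
        if head not in STEMS:
            continue
        rest = w[i:]
        # The next stem starts either right away or after a joint morpheme.
        candidates = [rest]
        m = LINKER.match(rest)
        if m and m.end() < len(rest):
            candidates.append(rest[m.end():])
        for tail in candidates:
            parts = _split(tail)
            if parts:
                return [head] + parts
    return None


def decompound(word):
    """Return the stems of word, or the lowercased word itself if no split fits."""
    w = word.lower()  # casing is ignored, as it should not matter for indexing
    if w in KEEP_WHOLE:
        return [w]
    return _split(w) or [w]


if __name__ == "__main__":
    for word in ("Weihnachtsbaum", "Orgelmusik", "Netzwerkbetreuer",
                 "Fahrrad", "Notentriegelung"):
        print(word, decompound(word))
```

With this toy dictionary, "Notentriegelung" first tries the head "note", fails to cover the rest, and backtracks to "not" + "entriegelung". Whether that holds up against a real stem dictionary is exactly the part we'd have to test.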