AW: Lexical analysis tools for German language data

2012-04-13 Thread Michael Ludwig
Von: Tomas Zerolo There can be transformations or inflections, like the s in Weinachtsbaum (Weinachten/Baum). I remember from my linguistics studies that the terminus technicus for these is Fugenmorphem (interstitial or joint morpheme) [...] IANAL (I am not a linguist -- pun

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
Given an input of Windjacke (probably wind jacket in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a Jacke (jacket) so that a query for Jacke would include the Windjacke document in its result set. It appears to me that such an

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
Von: Valeriy Felberg If you want that query jacke matches a document containing the word windjacke or kinderjacke, you could use a custom update processor. This processor could search the indexed text for words matching the pattern .*jacke and inject the word jacke into an additional field

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
Von: Markus Jelsma We've done a lot of tests with the HyphenationCompoundWordTokenFilter using a from TeX generated FOP XML file for the Dutch language and have seen decent results. A bonus was that now some tokens can be stemmed properly because not all compounds are listed in the

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
Von: Walter Underwood German noun decompounding is a little more complicated than it might seem. There can be transformations or inflections, like the s in Weinachtsbaum (Weinachten/Baum). I remember from my linguistics studies that the terminus technicus for these is Fugenmorphem

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht
Le 12 avr. 2012 à 17:46, Michael Ludwig a écrit : Some compounds probably should not be decompounded, like Fahrrad (farhren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary. Good point. More or less, Fahrrad is generally abbreviated

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
On Apr 12, 2012, at 8:46 AM, Michael Ludwig wrote: I remember from my linguistics studies that the terminus technicus for these is Fugenmorphem (interstitial or joint morpheme). That is some excellent linguistic jargon. I'll file that with hapax legomenon. If you don't highlight, you can get

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Markus Jelsma
On Thursday 12 April 2012 18:00:14 Paul Libbrecht wrote: Le 12 avr. 2012 à 17:46, Michael Ludwig a écrit : Some compounds probably should not be decompounded, like Fahrrad (farhren/Rad). With a dictionary-based stemmer, you might decide to avoid decompounding for words in the dictionary.

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
On Apr 12, 2012, at 9:00 AM, Paul Libbrecht wrote: More or less, Fahrrad is generally abbreviated as Rad. (even though Rad can mean wheel and bike) A synonym could handle this, since farhren would not be a good match. It is judgement call, but this seems more like an equivalence Fahrrad = Rad