AW: Lexical analysis tools for German language data

2012-04-13 Thread Michael Ludwig
> From: Tomas Zerolo > > > There can be transformations or inflections, like the "s" in > > > "Weihnachtsbaum" (Weihnachten/Baum). > > > > I remember from my linguistics studies that the terminus technicus > > for these is "Fugenmorphem" (interstitial or joint morpheme) [...] > > IANAL (I am not a l

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
On Apr 12, 2012, at 9:00 AM, Paul Libbrecht wrote: > More or less, Fahrrad is generally abbreviated as Rad. > (even though Rad can mean wheel and bike) A synonym could handle this, since "fahren" would not be a good match. It is a judgement call, but this seems more like an equivalence "Fahrrad =

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Markus Jelsma
On Thursday 12 April 2012 18:00:14 Paul Libbrecht wrote: > On 12 Apr 2012 at 17:46, Michael Ludwig wrote: > >> Some compounds probably should not be decompounded, like "Fahrrad" > >> (fahren/Rad). With a dictionary-based stemmer, you might decide to > >> avoid decompounding for words in the dic

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
On Apr 12, 2012, at 8:46 AM, Michael Ludwig wrote: > I remember from my linguistics studies that the terminus technicus for > these is "Fugenmorphem" (interstitial or joint morpheme). That is some excellent linguistic jargon. I'll file that with "hapax legomenon". If you don't highlight, you ca

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht
On 12 Apr 2012 at 17:46, Michael Ludwig wrote: >> Some compounds probably should not be decompounded, like "Fahrrad" >> (fahren/Rad). With a dictionary-based stemmer, you might decide to >> avoid decompounding for words in the dictionary. > > Good point. More or less, Fahrrad is generally ab

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
> From: Walter Underwood > German noun decompounding is a little more complicated than it might > seem. > > There can be transformations or inflections, like the "s" in > "Weihnachtsbaum" (Weihnachten/Baum). I remember from my linguistics studies that the terminus technicus for these is "Fugenmorph
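
As an illustration of the point about linking elements, here is a minimal Python sketch of dictionary-based decompounding that tolerates a joint morpheme such as the "s" in "Weihnachtsbaum". It is not the code of any tool mentioned in this thread. The function name `decompound`, the `LINKING_ELEMENTS` list, the `protected` set (for words like "Fahrrad" that should stay whole) and the tiny lexicon are all assumptions made up for the example; a real lexicon would also have to map a stem such as "weihnacht" back to "Weihnachten".

```python
# Hypothetical, incomplete list of German linking elements (Fugenmorpheme).
LINKING_ELEMENTS = ("", "s", "es", "n", "en")


def decompound(word, lexicon, protected=frozenset(), min_part=3):
    """Split a compound into two lexicon entries, skipping a linking element.

    Returns [head, tail] for the first split found, or [word] if the word is
    protected or no split is found. Everything is lowercased first.
    """
    w = word.lower()
    if w in protected:
        # Words listed here are never split (e.g. "fahrrad").
        return [w]
    for i in range(min_part, len(w) - min_part + 1):
        head, rest = w[:i], w[i:]
        if head not in lexicon:
            continue
        for link in LINKING_ELEMENTS:
            tail = rest[len(link):]
            if rest.startswith(link) and tail in lexicon:
                return [head, tail]
    return [w]


# Toy lexicon, assumed for illustration only.
lexicon = {"weihnacht", "baum", "wind", "jacke", "fahr", "rad"}

print(decompound("Weihnachtsbaum", lexicon))
print(decompound("Windjacke", lexicon))
print(decompound("Fahrrad", lexicon, protected={"fahrrad"}))
```

The sketch stops at the first split it finds and only handles two-part compounds; longer compounds and ambiguous splits would need a recursive search and some way of ranking candidates.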

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
> From: Markus Jelsma > We've done a lot of tests with the HyphenationCompoundWordTokenFilter > using an FOP XML file generated from TeX for the Dutch language and > have seen decent results. A bonus was that now some tokens can be > stemmed properly because not all compounds are listed in the > dic

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
> From: Valeriy Felberg > If you want the query "jacke" to match a document containing the word > "windjacke" or "kinderjacke", you could use a custom update processor. > This processor could search the indexed text for words matching the > pattern ".*jacke" and inject the word "jacke" into an addi
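
The matching logic of the suggestion above can be sketched in a few lines of Python. This is not a Solr update processor and uses no Solr API; it only shows the idea of scanning the text for words that end in a known head word and collecting those head words for a separate field. The function name `inject_heads` and the list of head words are assumptions made up for the example.

```python
import re


def inject_heads(text, heads):
    """Return the head words that occur as the ending of a longer token.

    `heads` is a hypothetical, hand-maintained list such as ["jacke"].
    The result would be stored in an additional field next to the text.
    """
    tokens = re.findall(r"\w+", text.lower())
    extra = []
    for token in tokens:
        for head in heads:
            # Only longer words count; the bare head word is already indexed.
            if token != head and token.endswith(head) and head not in extra:
                extra.append(head)
    return extra


print(inject_heads("Eine Windjacke und eine Kinderjacke", ["jacke"]))
```

A plain suffix test like this matches any word that happens to end in the same letters, so it would need a curated head-word list or a dictionary check to avoid false positives.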

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
> Given an input of "Windjacke" (probably "wind jacket" in English), > I'd like the code that prepares the data for the index (tokenizer > etc) to understand that this is a "Jacke" ("jacket") so that a > query for "Jacke" would include the "Windjacke" document in its > result set. > > It appears t