Michael, I've been on this list and the Lucene list for several years and have not found this yet. It's been one of the "neglected topics", to my taste.
There is a CompoundAnalyzer, but it requires the compounds to be dictionary-based, as you indicate (a wiring sketch follows below the quoted mail). I am convinced there is a way to derive the decompounding of words efficiently from a broad corpus, but I have never seen it done (and the experts at DFKI whom I asked also told me they didn't know of one).

paul

On 12 Apr 2012, at 11:52, Michael Ludwig wrote:

> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
> like the code that prepares the data for the index (tokenizer etc.) to
> understand that this is a "Jacke" ("jacket"), so that a query for "Jacke"
> would include the "Windjacke" document in its result set.
>
> It appears to me that such an analysis requires a dictionary-backed
> approach, which doesn't have to be perfect at all; a list of the 2000
> most common words would probably do the job and fulfil a criterion of
> reasonable usefulness.
>
> Do you know of any implementation techniques or working implementations
> for this kind of lexical analysis of German-language data? (Or other
> languages, for that matter?) What are they, and where can I find them?
>
> I'm sure there is something out there (commercial or free), because I've
> seen lots of engines grokking German and the way it builds words.
>
> Failing that, what are the proper terms to refer to these techniques, so
> that one can search for them more successfully?
>
> Michael
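For reference, the dictionary-based approach discussed above ships with Lucene's
contrib analyzers as DictionaryCompoundWordTokenFilter (Solr exposes it as
DictionaryCompoundWordTokenFilterFactory). Below is a minimal sketch, assuming
the Lucene 3.x-era API that was current at the time of this thread; the toy
word list and the class name DecompoundSketch are illustrative stand-ins for
the ~2000-word dictionary Michael suggests.

import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class DecompoundSketch {
    public static void main(String[] args) throws Exception {
        // Toy stand-in for a real dictionary of common simple words.
        // All entries are lowercase so that matching is unaffected by
        // German noun capitalization (we lowercase the stream below).
        Set<String> dictionary = new HashSet<String>(
                Arrays.asList("wind", "jacke", "regen", "schirm"));

        TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_36,
                new StringReader("Windjacke Regenschirm"));
        stream = new LowerCaseFilter(Version.LUCENE_36, stream);
        // Emits each original token plus any dictionary words found
        // inside it, at the same token position, so a query for "jacke"
        // matches a document containing "Windjacke".
        stream = new DictionaryCompoundWordTokenFilter(Version.LUCENE_36,
                stream, dictionary);

        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            // Expected output: windjacke, wind, jacke,
            //                  regenschirm, regen, schirm
            System.out.println(term);
        }
        stream.end();
        stream.close();
    }
}

Because the subwords are added at the same position as the original token,
no query-side changes are needed. Lucene also provides a
HyphenationCompoundWordTokenFilter variant that combines a hyphenation
grammar with the dictionary, which tends to produce fewer spurious splits.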