Michael, I've been on this list and the Lucene list for several years and have not found this yet. It's been one of the "neglected topics", to my taste.
There is a CompoundAnalyzer, but it requires the compounds to be dictionary-based, as you indicate (a wiring sketch follows below the quoted mail). I am convinced there is a way to derive the decompounding of words efficiently from a broad corpus, but I have never seen it done (and the experts at DFKI whom I asked also told me they didn't know of one).

paul

On 12 Apr 2012, at 11:52, Michael Ludwig wrote:

> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
> like the code that prepares the data for the index (tokenizer etc.) to
> understand that this is a "Jacke" ("jacket"), so that a query for "Jacke"
> would include the "Windjacke" document in its result set.
>
> It appears to me that such an analysis requires a dictionary-backed
> approach, which doesn't have to be perfect at all; a list of the 2000
> most common words would probably do the job and fulfil a criterion of
> reasonable usefulness.
>
> Do you know of any implementation techniques or working implementations
> for this kind of lexical analysis of German-language data? (Or other
> languages, for that matter?) What are they, and where can I find them?
>
> I'm sure there is something out there (commercial or free), because I've
> seen lots of engines grokking German and the way it builds words.
>
> Failing that, what are the proper terms to refer to these techniques, so
> that one can search for them more successfully?
>
> Michael
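For reference, the dictionary-based approach discussed above ships with Lucene's
contrib analyzers as DictionaryCompoundWordTokenFilter (Solr exposes it as
DictionaryCompoundWordTokenFilterFactory). Below is a minimal sketch, assuming
the Lucene 3.x-era API that was current at the time of this thread; the toy
word list and the class name DecompoundSketch are illustrative stand-ins for
the ~2000-word dictionary Michael suggests.

import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class DecompoundSketch {
    public static void main(String[] args) throws Exception {
        // Toy stand-in for a real dictionary of common simple words.
        // All entries are lowercase so that matching is unaffected by
        // German noun capitalization (we lowercase the stream below).
        Set<String> dictionary = new HashSet<String>(
                Arrays.asList("wind", "jacke", "regen", "schirm"));

        TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_36,
                new StringReader("Windjacke Regenschirm"));
        stream = new LowerCaseFilter(Version.LUCENE_36, stream);
        // Emits each original token plus any dictionary words found
        // inside it, at the same token position, so a query for "jacke"
        // matches a document containing "Windjacke".
        stream = new DictionaryCompoundWordTokenFilter(Version.LUCENE_36,
                stream, dictionary);

        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            // Expected output: windjacke, wind, jacke,
            //                  regenschirm, regen, schirm
            System.out.println(term);
        }
        stream.end();
        stream.close();
    }
}

Because the subwords are added at the same position as the original token,
no query-side changes are needed. Lucene also provides a
HyphenationCompoundWordTokenFilter variant that combines a hyphenation
grammar with the dictionary, which tends to produce fewer spurious splits.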