If you want the query "jacke" to match a document containing the word
"windjacke" or "kinderjacke", you could use a custom update processor.
This processor could search the indexed text for words matching the
pattern ".*jacke" and inject the word "jacke" into an additional field
which you can search against. You would need a whole list of possible
suffixes, of course. It would slow down the update process, but you
wouldn't need to split words during search.
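
To make the idea concrete, here is a minimal sketch of the matching step in
plain Python rather than as an actual Solr update processor. The suffix list,
the function names and the field names ("text", "text_heads") are all
assumptions made up for illustration; a real processor would be written in
Java against Solr's update processor API.

```python
import re

# Hypothetical suffix list; a real one would be much longer.
HEAD_WORDS = ["jacke", "hose", "schuh"]

# Unicode-aware word pattern, so umlauts are treated as word characters.
WORD_RE = re.compile(r"\w+")


def extract_head_words(text, head_words=HEAD_WORDS):
    """Return each listed suffix that ends at least one word in `text`.

    This mirrors the ".*jacke" pattern: a word matches if it ends with
    the suffix, which includes the suffix standing alone.
    """
    found = []
    for token in WORD_RE.findall(text.lower()):
        for head in head_words:
            if token.endswith(head) and head not in found:
                found.append(head)
    return found


def add_head_field(doc, source_field="text", target_field="text_heads"):
    """Return a copy of `doc` with the matched suffixes in an extra field."""
    enriched = dict(doc)
    enriched[target_field] = extract_head_words(doc.get(source_field, ""))
    return enriched


if __name__ == "__main__":
    doc = {"id": "1", "text": "Rote Windjacke und Kinderjacke"}
    print(add_head_field(doc))
```

Note that plain suffix matching will also fire on words that merely happen to
end in a listed string, so the list needs some care.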

Best,
Valeriy

On Thu, Apr 12, 2012 at 12:39 PM, Paul Libbrecht <p...@hoplahup.net> wrote:
>
> Michael,
>
> I've been on this list and the Lucene list for several years and have not found
> this yet.
> It's been one of the "neglected topics", to my taste.
>
> There is a CompoundAnalyzer but it requires the compounds to be dictionary 
> based, as you indicate.
>
> I am convinced there's a way to build the de-compounding words efficiently 
> from a broad corpus but I have never seen it (and the experts at DFKI I asked
> also told me they didn't know of one).
>
> paul
>
> On 12 Apr 2012, at 11:52, Michael Ludwig wrote:
>
>> Given an input of "Windjacke" (probably "wind jacket" in English), I'd
>> like the code that prepares the data for the index (tokenizer etc) to
>> understand that this is a "Jacke" ("jacket") so that a query for "Jacke"
>> would include the "Windjacke" document in its result set.
>>
>> It appears to me that such an analysis requires a dictionary-backed
>> approach, which doesn't have to be perfect at all; a list of the most
>> common 2000 words would probably do the job and fulfil a criterion of
>> reasonable usefulness.
>>
>> Do you know of any implementation techniques or working implementations
>> to do this kind of lexical analysis for German language data? (Or other
>> languages, for that matter?) What are they, where can I find them?
>>
>> I'm sure there is something out there (commercial or free) because I've seen
>> lots of engines grokking German and the way it builds words.
>>
>> Failing that, what are the proper terms to refer to these techniques so
>> you can search more successfully?
>>
>> Michael
>
