Hi Michael,

I ran into this issue just yesterday. I have done this several times already and
have built up a good dictionary in the meantime.

I have an example for Solr and Elasticsearch using the same data. It uses the
HyphenationCompoundWordTokenFilter, but with the hyphenation ZIP file *and* a
dictionary (it's important to have both). The dictionary-only approach
(DictionaryCompoundWordTokenFilter) is just too slow and creates wrong matches, too.

The rules file is the one from the OpenOffice hyphenation files. Just take it
as is (keep in mind that you need the "old" version of the ZIP file, not the
latest one, as the XML format was changed). The dictionary is more important:
it should only contain the "single words", no compounds at all. Such a list is
hard to get, but there is a ngerman98.zip file with an ispell dictionary
available (https://www.j3e.de/ispell/igerman98/). That dictionary comes in
several variants, one of which contains only the single, non-compound words
(about 17,000 entries). This works for most cases. I converted the dictionary a
bit, merged some files, and finally lowercased it, and now I have a working solution.
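
To give you an idea of that preparation step (this is only a sketch of what I
mean, not my actual script, and the file names are placeholders): merge the
word-list variants into one file, lowercase every entry with the German locale,
and deduplicate:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Locale;
    import java.util.TreeSet;
    import java.util.stream.Stream;

    /** Sketch: merge ispell word lists into one lowercased, deduplicated dictionary. */
    public class BuildDictionary {
      public static void main(String[] args) throws IOException {
        // usage: BuildDictionary input1.txt input2.txt ... dictionary-de.txt
        TreeSet<String> words = new TreeSet<>();
        for (int i = 0; i < args.length - 1; i++) {
          try (Stream<String> lines = Files.lines(Paths.get(args[i]), StandardCharsets.UTF_8)) {
            lines.map(String::trim)
                 .filter(w -> !w.isEmpty())
                 .map(w -> w.toLowerCase(Locale.GERMAN))
                 .forEach(words::add);
          }
        }
        // one word per line, as expected by word_list_path / a Lucene CharArraySet
        Files.write(Paths.get(args[args.length - 1]), words, StandardCharsets.UTF_8);
      }
    }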

The settings for the hyphenation decompounder filter are (Elasticsearch):

            "german_decompounder": {
               "type": "hyphenation_decompounder",
               "word_list_path": "analysis/dictionary-de.txt",
               "hyphenation_patterns_path": "analysis/de_DR.xml",
               "only_longest_match": true,
               "min_subword_size": 4
            },
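
Since you asked about plain Lucene: the same chain can be wired up directly
with HyphenationCompoundWordTokenFilter from lucene-analyzers-common. The
following is only a sketch (assuming a recent Lucene version, 7.x; the analyzer
class name and file paths are placeholders), mirroring the settings above:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase;
    import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
    import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    /** Sketch of a German decompounding analyzer, mirroring the Elasticsearch settings. */
    public class GermanDecompoundAnalyzer extends Analyzer {

      private final HyphenationTree hyphenator;
      private final CharArraySet dictionary;

      public GermanDecompoundAnalyzer() throws Exception {
        // hyphenation patterns in the "old" OpenOffice XML format
        hyphenator = HyphenationCompoundWordTokenFilter.getHyphenationTree("analysis/de_DR.xml");
        // the lowercased list of single (non-compound) words, one per line
        dictionary = new CharArraySet(
            Files.readAllLines(Paths.get("analysis/dictionary-de.txt"), StandardCharsets.UTF_8), true);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        // min_subword_size = 4 and only_longest_match = true, as above
        result = new HyphenationCompoundWordTokenFilter(result, hyphenator, dictionary,
            CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
            4, // minSubwordSize
            CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE,
            true); // onlyLongestMatch
        return new TokenStreamComponents(source, result);
      }
    }

The filter keeps the original token and injects the subword tokens at the same
position, so searches match both the compound and its parts.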

The "only_longest_match" setting is important, because our dictionary is
guaranteed to contain only "single words" (plus some words that look like
compounds but aren't, because they were glued together long ago; compare
English "policeman", which is not written "police man" because it is a word of
its own). So the longest match is always safe, as we have a well-maintained dictionary.

If you are interested, I can send you a ZIP file with both files. Maybe I should
check them into GitHub, but I have to verify the licenses first.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Saturday, September 16, 2017 12:58 AM
> To: Lucene Users <java-user@lucene.apache.org>
> Subject: German decompounding/tokenization with Lucene?
> 
> Hello,
> 
> I need to index documents with German text in Lucene, and I'm wondering how
> people have done this in the past?
> 
> Lucene already has a DictionaryCompoundWordTokenFilter ... is this what
> people use?  Are there good, open-source friendly German dictionaries
> available?
> 
> Thanks,
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com


