Hi Michael,

I ran into this issue just yesterday; I have dealt with it several times already and built up a good dictionary in the meantime.
I have an example for Solr or Elasticsearch with the same data. It uses the HyphenationCompoundWordTokenFilter, but with the hyphenation rules file (from the OpenOffice ZIP) *and* a dictionary (it is important to have both). The dictionary-only approach (DictionaryCompoundWordTokenFilter) is just too slow and also creates wrong matches.

The rules file is the one from the OpenOffice hyphenation files. Just take it as is (keep in mind that you need to use the "old" version of the ZIP file, not the latest one, because the XML format was changed).

The dictionary is more important: it should only contain the "single words", no compounds at all. This is hard to get, but there is a ngerman98.zip file with an ispell dictionary available (https://www.j3e.de/ispell/igerman98/). This dictionary comes in several variants, and one of them contains only the single, non-compound words (about 17,000 items). This works for most cases. I converted the dictionary a bit, merged some files, and finally lowercased it, and now I have a working solution.

The settings for the hyphenation decompounder filter are (Elasticsearch):

  "german_decompounder": {
    "type": "hyphenation_decompounder",
    "word_list_path": "analysis/dictionary-de.txt",
    "hyphenation_patterns_path": "analysis/de_DR.xml",
    "only_longest_match": true,
    "min_subword_size": 4
  },

The important setting is "only_longest_match", because our dictionary is guaranteed to contain only "single words" (plus some words that look like compounds but aren't, because they were glued together into words of their own; compare English "policeman", which is not written "police man" because it is a word on its own). So the longest match is always safe, as long as the dictionary is well maintained.

If you are interested I can send you a ZIP file with both files. Maybe I should check them into GitHub, but I have to check the licenses first.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Saturday, September 16, 2017 12:58 AM
> To: Lucene Users <java-user@lucene.apache.org>
> Subject: German decompounding/tokenization with Lucene?
>
> Hello,
>
> I need to index documents with German text in Lucene, and I'm wondering
> how people have done this in the past?
>
> Lucene already has a DictionaryCompoundWordTokenFilter ... is this what
> people use? Are there good, open-source friendly German dictionaries
> available?
>
> Thanks,
>
> Mike McCandless
>
> http://blog.mikemccandless.com
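PS: For plain Lucene (which is what you actually asked about), the same chain can be wired up directly with HyphenationCompoundWordTokenFilter from analyzers-common. This is only a rough, untested sketch against a 2017-era Lucene (6.x/7.x); the class name and file paths are placeholders mirroring the Elasticsearch config above, not something I have in production:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase;
    import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
    import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    /**
     * Sketch of an analyzer mirroring the Elasticsearch settings above:
     * hyphenation grammar (old-format OpenOffice XML) plus a dictionary of
     * non-compound "single words", only_longest_match=true, min_subword_size=4.
     */
    public class GermanDecompoundAnalyzer extends Analyzer {

      private final HyphenationTree hyphenator;
      private final CharArraySet dictionary;

      public GermanDecompoundAnalyzer(String hyphenationXmlPath, String dictionaryPath) throws IOException {
        // e.g. "analysis/de_DR.xml" and "analysis/dictionary-de.txt" as in the config above
        this.hyphenator = HyphenationCompoundWordTokenFilter.getHyphenationTree(hyphenationXmlPath);
        // one word per line; the dictionary is already lowercased, ignoreCase as a safety net
        this.dictionary = new CharArraySet(
            Files.readAllLines(Paths.get(dictionaryPath), StandardCharsets.UTF_8), true);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new HyphenationCompoundWordTokenFilter(
            result, hyphenator, dictionary,
            CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,     // 5
            4,                                                     // minSubwordSize, as in the config
            CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE,  // 15
            true);                                                 // onlyLongestMatch
        return new TokenStreamComponents(source, result);
      }
    }

You would probably add stemming (e.g. GermanNormalizationFilter and a German stemmer) after the decompounder, just as in the full Elasticsearch analyzer chain, but that is independent of the decompounding itself.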