Send a pull request. :) Uwe
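The fix Markus describes below (duplicates, unsorted) is mechanical; a minimal sketch of the cleanup such a pull request might perform, in plain Java, assuming the one-word-per-line format of the repository's `dictionary-de.txt` (the lowercasing mirrors what Uwe says he did to the list):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

public class DedupeDictionary {

    // Sort, dedupe and lowercase a word list; TreeSet keeps
    // exactly one copy of each word, in sorted order.
    static TreeSet<String> normalize(List<String> lines) {
        TreeSet<String> words = new TreeSet<>();
        for (String line : lines) {
            String w = line.trim().toLowerCase(Locale.GERMAN);
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        Path dict = Paths.get(args.length > 0 ? args[0] : "dictionary-de.txt");
        List<String> lines = Files.readAllLines(dict, StandardCharsets.UTF_8);
        Files.write(dict, normalize(lines), StandardCharsets.UTF_8);
    }
}
```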
On 16 September 2017 12:42:30 CEST, Markus Jelsma <markus.jel...@openindex.io> wrote:
>Hello Uwe,
>
>Thanks for getting rid of the compounds. The dictionary could be smaller:
>it still has about 1,500 duplicates. It is also unsorted.
>
>Regards,
>Markus
>
>
>-----Original message-----
>> From: Uwe Schindler <u...@thetaphi.de>
>> Sent: Saturday, 16th September 2017 12:16
>> To: java-user@lucene.apache.org
>> Subject: RE: German decompounding/tokenization with Lucene?
>>
>> Hi,
>>
>> I published my work on GitHub:
>>
>> https://github.com/uschindler/german-decompounder
>>
>> Have fun. I am not yet 100% sure about the license of the data file. The
>> original author (Björn Jacke) did not publish any license, but LibreOffice
>> publishes his files under the LGPL. So to be safe, I applied the same
>> license to my own work.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Uwe Schindler [mailto:u...@thetaphi.de]
>> > Sent: Saturday, September 16, 2017 9:49 AM
>> > To: java-user@lucene.apache.org
>> > Subject: RE: German decompounding/tokenization with Lucene?
>> >
>> > Hi Michael,
>> >
>> > I had this issue just yesterday. I have done this several times and have
>> > built a good dictionary in the meantime.
>> >
>> > I have an example for Solr or Elasticsearch with the same data. It uses
>> > the HyphenationCompoundWordTokenFilter, but with the hyphenation ZIP
>> > file *and* a dictionary (it is important to have both). The
>> > dictionary-only approach is just too slow, and it creates wrong matches,
>> > too.
>> >
>> > The rules file is the one from the OpenOffice hyphenation files. Just
>> > take it as is (keep in mind that you need to use the "old" version of
>> > the ZIP file, not the latest version, as the XML format was changed).
>> > The dictionary is more important: it should only contain "single words",
>> > no compounds at all. This is hard to get, but there is a ngerman98.zip
>> > file available with an ispell dictionary
>> > (https://www.j3e.de/ispell/igerman98/). This dictionary has several
>> > variants; one of them contains only the single, non-compound words
>> > (about 17,000 items). This works for most cases. I converted the
>> > dictionary a bit, merged some files, and finally lowercased it, and now
>> > I have a working solution.
>> >
>> > The settings for the hyphenation compound filter are (Elasticsearch):
>> >
>> > "german_decompounder": {
>> >   "type": "hyphenation_decompounder",
>> >   "word_list_path": "analysis/dictionary-de.txt",
>> >   "hyphenation_patterns_path": "analysis/de_DR.xml",
>> >   "only_longest_match": true,
>> >   "min_subword_size": 4
>> > },
>> >
>> > The "only_longest_match" setting is important, because our dictionary
>> > only contains "single words" (and some words that look like compounds
>> > but aren't, because they were glued together; compare English, where
>> > "policeman" is not written "police man", because it is a word of its
>> > own). So the longest match is always safe, as we have a well-maintained
>> > dictionary.
>> >
>> > If you are interested, I can send you a ZIP file with both files. Maybe
>> > I should check them into GitHub, but I have to check the licenses first.
>> >
>> > Uwe
>> >
>> > -----
>> > Uwe Schindler
>> > Achterdiek 19, D-28357 Bremen
>> > http://www.thetaphi.de
>> > eMail: u...@thetaphi.de
>> >
>> > > -----Original Message-----
>> > > From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> > > Sent: Saturday, September 16, 2017 12:58 AM
>> > > To: Lucene Users <java-user@lucene.apache.org>
>> > > Subject: German decompounding/tokenization with Lucene?
>> > >
>> > > Hello,
>> > >
>> > > I need to index documents with German text in Lucene, and I'm
>> > > wondering how people have done this in the past?
>> > >
>> > > Lucene already has a DictionaryCompoundWordTokenFilter ... is this
>> > > what people use? Are there good, open-source friendly German
>> > > dictionaries available?
>> > >
>> > > Thanks,
>> > >
>> > > Mike McCandless
>> > >
>> > > http://blog.mikemccandless.com
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>>

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de
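The advice in the thread boils down to: decompound with hyphenation patterns plus a curated single-word dictionary, keeping only the longest match. As a rough illustration of why "only_longest_match" is safe when the dictionary contains no compounds, here is a toy greedy splitter in plain Java. It is not the actual Lucene HyphenationCompoundWordTokenFilter (which additionally restricts splits to hyphenation points); the minimum subword length is 3 here only so that short demo words like "amt" qualify, whereas the config above uses 4:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LongestMatchDemo {

    // Greedy left-to-right split of a compound against a dictionary of
    // single (non-compound) words, always preferring the longest match —
    // the idea behind only_longest_match=true.
    static List<String> split(String compound, Set<String> dict, int minSubword) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < compound.length()) {
            int best = -1;
            // try the longest candidate first, shrink until a dictionary hit
            for (int end = compound.length(); end >= pos + minSubword; end--) {
                if (dict.contains(compound.substring(pos, end))) {
                    best = end;
                    break;
                }
            }
            if (best < 0) {
                // no full decomposition found: keep the token unsplit
                return Collections.singletonList(compound);
            }
            parts.add(compound.substring(pos, best));
            pos = best;
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("finanz", "amt", "ball"));
        System.out.println(split("finanzamt", dict, 3)); // prints [finanz, amt]
    }
}
```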