Sorry, I would if I were on GitHub, but I am not. Thanks again!
Markus
-----Original message-----
> From: Uwe Schindler <u...@thetaphi.de>
> Sent: Saturday, 16 September 2017 12:45
> To: java-user@lucene.apache.org
> Subject: RE: German decompounding/tokenization with Lucene?
>
> Send a pull request. :)
>
> Uwe
>
> On 16 September 2017 at 12:42:30 CEST, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >Hello Uwe,
> >
> >Thanks for getting rid of the compounds. The dictionary can be smaller,
> >though: it still has about 1,500 duplicates. It is also unsorted.
> >
> >Regards,
> >Markus
> >
> >
> >-----Original message-----
> >> From: Uwe Schindler <u...@thetaphi.de>
> >> Sent: Saturday, 16 September 2017 12:16
> >> To: java-user@lucene.apache.org
> >> Subject: RE: German decompounding/tokenization with Lucene?
> >>
> >> Hi,
> >>
> >> I have published my work on GitHub:
> >>
> >> https://github.com/uschindler/german-decompounder
> >>
> >> Have fun. I am not yet 100% sure about the license of the data file.
> >> The original author (Björn Jacke) did not publish any license, but
> >> LibreOffice publishes his files under the LGPL. So, to be safe, I
> >> applied the same license to my own work.
> >>
> >> Uwe
> >>
> >> -----
> >> Uwe Schindler
> >> Achterdiek 19, D-28357 Bremen
> >> http://www.thetaphi.de
> >> eMail: u...@thetaphi.de
> >>
> >> > -----Original Message-----
> >> > From: Uwe Schindler [mailto:u...@thetaphi.de]
> >> > Sent: Saturday, September 16, 2017 9:49 AM
> >> > To: java-user@lucene.apache.org
> >> > Subject: RE: German decompounding/tokenization with Lucene?
> >> >
> >> > Hi Michael,
> >> >
> >> > I had this exact issue just yesterday. I have done this several times
> >> > and have built a good dictionary in the meantime.
> >> >
> >> > I have an example for Solr or Elasticsearch with the same data. It uses
> >> > the HyphenationCompoundWordTokenFilter, but with the hyphenation ZIP
> >> > file *and* a dictionary (it's important to have both). The
> >> > dictionary-only approach is just too slow, and it creates wrong
> >> > matches, too.
> >> >
> >> > The rules file is the one from the OpenOffice hyphenation files. Just
> >> > take it as is (keep in mind that you need the "old" version of the ZIP
> >> > file, not the latest one, as the XML format was changed). The
> >> > dictionary is more important: it should contain only the "single
> >> > words", no compounds at all. This is hard to get, but there is an
> >> > ngerman98.zip file available with an ispell dictionary
> >> > (https://www.j3e.de/ispell/igerman98/). This dictionary has several
> >> > variants, one of which contains only the single non-compound words
> >> > (about 17,000 items). This works for most cases. I converted the
> >> > dictionary a bit, merged some files, and finally lowercased it, and
> >> > now I have a working solution.
> >> >
> >> > The settings for the hyphenation compound filter are (Elasticsearch):
> >> >
> >> > "german_decompounder": {
> >> >   "type": "hyphenation_decompounder",
> >> >   "word_list_path": "analysis/dictionary-de.txt",
> >> >   "hyphenation_patterns_path": "analysis/de_DR.xml",
> >> >   "only_longest_match": true,
> >> >   "min_subword_size": 4
> >> > },
> >> >
> >> > The "only_longest_match" setting is important, because our dictionary
> >> > is guaranteed to contain only "single words" (including some words
> >> > that look like compounds but aren't, because they were glued together;
> >> > compare English, where "policeman" is not written "police man" because
> >> > it is a word of its own). So the longest match is always safe, as we
> >> > have a well-maintained dictionary.
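For anyone who wants the same setup in plain Lucene rather than through
Elasticsearch, here is a minimal sketch of wiring up
HyphenationCompoundWordTokenFilter directly, mirroring the settings above.
This is not Uwe's exact code: the "analysis/de_DR.xml" path is a
placeholder, and the tiny inline word list stands in for the full
igerman98-derived dictionary he describes.

// Minimal sketch, plain Lucene (analysis-common module); not Uwe's code.
// Assumes de_DR.xml is the "old"-format OpenOffice hyphenation file.
import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class GermanDecompoundDemo {
  public static void main(String[] args) throws Exception {
    // Load the hyphenation patterns from the XML rules file.
    HyphenationTree hyphenator =
        HyphenationCompoundWordTokenFilter.getHyphenationTree("analysis/de_DR.xml");

    // Placeholder for the full lowercased list of non-compound words;
    // ignoreCase=true lets the lowercased entries match mixed-case tokens.
    CharArraySet dictionary = new CharArraySet(
        Arrays.asList("fußball", "welt", "meister", "meisterschaft"), true);

    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("Fußballweltmeisterschaft"));

    // minWordSize=5, minSubwordSize=4, maxSubwordSize=15,
    // onlyLongestMatch=true -- the same values as the settings above.
    TokenStream stream = new HyphenationCompoundWordTokenFilter(
        tokenizer, hyphenator, dictionary, 5, 4, 15, true);

    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      // Emits the original token plus the dictionary subwords found in it,
      // e.g. "Fußball"; the exact subwords depend on the pattern file.
      System.out.println(term.toString());
    }
    stream.end();
    stream.close();
  }
}

The ignoreCase=true flag on the CharArraySet is what makes Uwe's
lowercased dictionary usable against case-preserving tokenizer output; in
Solr or Elasticsearch the same effect comes from placing the decompounder
after a lowercase filter in the analysis chain.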
> >> >
> >> > If you are interested, I can send you a ZIP file with both files.
> >> > Maybe I should check them into GitHub, but I have to check the
> >> > licenses first.
> >> >
> >> > Uwe
> >> >
> >> > -----
> >> > Uwe Schindler
> >> > Achterdiek 19, D-28357 Bremen
> >> > http://www.thetaphi.de
> >> > eMail: u...@thetaphi.de
> >> >
> >> > > -----Original Message-----
> >> > > From: Michael McCandless [mailto:luc...@mikemccandless.com]
> >> > > Sent: Saturday, September 16, 2017 12:58 AM
> >> > > To: Lucene Users <java-user@lucene.apache.org>
> >> > > Subject: German decompounding/tokenization with Lucene?
> >> > >
> >> > > Hello,
> >> > >
> >> > > I need to index documents with German text in Lucene, and I'm
> >> > > wondering how people have done this in the past.
> >> > >
> >> > > Lucene already has a DictionaryCompoundWordTokenFilter ... is this
> >> > > what people use? Are there good, open-source-friendly German
> >> > > dictionaries available?
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Mike McCandless
> >> > >
> >> > > http://blog.mikemccandless.com
>
> --
> Uwe Schindler
> Achterdiek 19, 28357 Bremen
> https://www.thetaphi.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org