OK, sorting and deduping should be easy with a simple command line. The reason is that it was created from two files of Björn Jacke's data. I thought that I had deduped it...
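For illustration, the command line would essentially be "sort -u" after lowercasing. A rough Java equivalent (just a sketch with placeholder file names, not the actual tooling used for the published dictionary) could look like this:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DedupWordList {
  public static void main(String[] args) throws IOException {
    // Read the word list, normalize, drop duplicates, sort, write back out.
    try (Stream<String> lines = Files.lines(Paths.get("dictionary-de.txt"))) {
      Files.write(Paths.get("dictionary-de.sorted.txt"),
          lines.map(String::trim)
               .filter(s -> !s.isEmpty())
               .map(s -> s.toLowerCase(Locale.GERMAN)) // dictionary is used lowercased
               .distinct()                             // remove duplicate entries
               .sorted()
               .collect(Collectors.toList()));
    }
  }
}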
Uwe

On 16 September 2017 at 12:46:29 CEST, Markus Jelsma <markus.jel...@openindex.io> wrote:
>Sorry, I would if I were on Github, but I am not.
>
>Thanks again!
>Markus
>
>-----Original message-----
>> From: Uwe Schindler <u...@thetaphi.de>
>> Sent: Saturday 16th September 2017 12:45
>> To: java-user@lucene.apache.org
>> Subject: RE: German decompounding/tokenization with Lucene?
>>
>> Send a pull request. :)
>>
>> Uwe
>>
>> On 16 September 2017 at 12:42:30 CEST, Markus Jelsma <markus.jel...@openindex.io> wrote:
>> >Hello Uwe,
>> >
>> >Thanks for getting rid of the compounds. The dictionary can be smaller:
>> >it still has about 1500 duplicates. It is also unsorted.
>> >
>> >Regards,
>> >Markus
>> >
>> >
>> >-----Original message-----
>> >> From: Uwe Schindler <u...@thetaphi.de>
>> >> Sent: Saturday 16th September 2017 12:16
>> >> To: java-user@lucene.apache.org
>> >> Subject: RE: German decompounding/tokenization with Lucene?
>> >>
>> >> Hi,
>> >>
>> >> I published my work on Github:
>> >>
>> >> https://github.com/uschindler/german-decompounder
>> >>
>> >> Have fun. I am not yet 100% sure about the license of the data file. The original
>> >> author (Björn Jacke) did not publish any license, but LibreOffice publishes his files
>> >> under LGPL. So to be safe, I applied the same license to my own work.
>> >>
>> >> Uwe
>> >>
>> >> -----
>> >> Uwe Schindler
>> >> Achterdiek 19, D-28357 Bremen
>> >> http://www.thetaphi.de
>> >> eMail: u...@thetaphi.de
>> >>
>> >> > -----Original Message-----
>> >> > From: Uwe Schindler [mailto:u...@thetaphi.de]
>> >> > Sent: Saturday, September 16, 2017 9:49 AM
>> >> > To: java-user@lucene.apache.org
>> >> > Subject: RE: German decompounding/tokenization with Lucene?
>> >> >
>> >> > Hi Michael,
>> >> >
>> >> > I had this issue just yesterday. I have done this several times and built a good
>> >> > dictionary in the meantime.
>> >> >
>> >> > I have an example for Solr or Elasticsearch with the same data. It uses the
>> >> > HyphenationCompoundWordTokenFilter, but with the hyphenation ZIP file *and* a
>> >> > dictionary (it is important to have both). The dictionary-only approach is just
>> >> > too slow and creates wrong matches, too.
>> >> >
>> >> > The rules file is the one from the OpenOffice hyphenation files. Just take it as
>> >> > is (keep in mind that you need to use the "old" version of the ZIP file, not the
>> >> > latest version, as the XML format was changed). The dictionary is more important:
>> >> > it should only contain the "single words", no compounds at all. This is hard to
>> >> > get, but there is an ngerman98.zip file available with an ispell dictionary
>> >> > (https://www.j3e.de/ispell/igerman98/). This dictionary has several variants,
>> >> > one of which only contains the single non-compound words (about 17,000
>> >> > items). This works for most cases. I converted the dictionary a bit, merged
>> >> > some files, and finally lowercased it, and now I have a working solution.
>> >> >
>> >> > The settings for the hyphenation compound filter are (Elasticsearch):
>> >> >
>> >> >   "german_decompounder": {
>> >> >     "type": "hyphenation_decompounder",
>> >> >     "word_list_path": "analysis/dictionary-de.txt",
>> >> >     "hyphenation_patterns_path": "analysis/de_DR.xml",
>> >> >     "only_longest_match": true,
>> >> >     "min_subword_size": 4
>> >> >   },
>> >> >
>> >> > The "only_longest_match" setting is important, because our dictionary
>> >> > only contains "single words" (and some words that look like compounds
>> >> > but aren't, because they were glued together; compare English:
>> >> > "policeman" is not written "police man", because it is a word of its
>> >> > own). So the longest match is always safe, as we have a well-maintained
>> >> > dictionary.
>> >> >
>> >> > If you are interested, I can send you a ZIP file with both files. Maybe I should
>> >> > check them into Github, but I have to check the licenses first.
>> >> >
>> >> > Uwe
>> >> >
>> >> > -----
>> >> > Uwe Schindler
>> >> > Achterdiek 19, D-28357 Bremen
>> >> > http://www.thetaphi.de
>> >> > eMail: u...@thetaphi.de
>> >> >
>> >> > > -----Original Message-----
>> >> > > From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> >> > > Sent: Saturday, September 16, 2017 12:58 AM
>> >> > > To: Lucene Users <java-user@lucene.apache.org>
>> >> > > Subject: German decompounding/tokenization with Lucene?
>> >> > >
>> >> > > Hello,
>> >> > >
>> >> > > I need to index documents with German text in Lucene, and I'm wondering
>> >> > > how people have done this in the past?
>> >> > >
>> >> > > Lucene already has a DictionaryCompoundWordTokenFilter ... is this what
>> >> > > people use? Are there good, open-source-friendly German dictionaries
>> >> > > available?
>> >> > >
>> >> > > Thanks,
>> >> > >
>> >> > > Mike McCandless
>> >> > >
>> >> > > http://blog.mikemccandless.com
>>
>> --
>> Uwe Schindler
>> Achterdiek 19, 28357 Bremen
>> https://www.thetaphi.de

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de
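For anyone wiring this up directly in Lucene instead of Elasticsearch, here is a minimal sketch of an equivalent analyzer chain. It assumes the Lucene 7.x package layout and reuses the file names from the Elasticsearch settings quoted above ("de_DR.xml", "dictionary-de.txt"); check the exact constructor signatures against the Lucene version you actually use.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WordlistLoader;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class GermanDecompoundAnalyzer extends Analyzer {
  private final HyphenationTree hyphenator;
  private final CharArraySet dictionary;

  public GermanDecompoundAnalyzer(String hyphenationXmlPath, String dictionaryPath) throws IOException {
    // Hyphenation grammar in the "old" OpenOffice/FOP XML format, e.g. de_DR.xml
    hyphenator = HyphenationCompoundWordTokenFilter.getHyphenationTree(hyphenationXmlPath);
    // Lowercased list of non-compound "single words", e.g. dictionary-de.txt
    dictionary = WordlistLoader.getWordSet(Files.newBufferedReader(Paths.get(dictionaryPath)));
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream result = new LowerCaseFilter(source); // dictionary entries are lowercase
    // minWordSize=5 (default), minSubwordSize=4, maxSubwordSize=15 (default),
    // onlyLongestMatch=true -- mirrors the Elasticsearch settings quoted above
    result = new HyphenationCompoundWordTokenFilter(result, hyphenator, dictionary, 5, 4, 15, true);
    return new TokenStreamComponents(source, result);
  }
}

At index time the filter keeps the original compound token and additionally emits the dictionary words it finds at hyphenation points, at the same position, so queries for the parts also match the compound.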