Whoa, thank you Uwe! I will have a look; too bad about the licensing, but I know dictionaries are often licensed under the LGPL.
Mike McCandless
http://blog.mikemccandless.com

On Sat, Sep 16, 2017 at 7:03 AM, Uwe Schindler <[email protected]> wrote:

Hi,

I deduped it. Thanks for the hint!

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: [email protected]

-----Original Message-----
From: Uwe Schindler [mailto:[email protected]]
Sent: Saturday, September 16, 2017 12:51 PM
To: [email protected]
Subject: RE: German decompounding/tokenization with Lucene?

OK, sorting and deduping should be easy with a simple command line. The reason is that the list was created from two files of Björn Jacke's data. I thought I had deduped it...

Uwe

On 16 September 2017 at 12:46:29 CEST, Markus Jelsma <[email protected]> wrote:

Sorry, I would if I were on GitHub, but I am not.

Thanks again!
Markus

-----Original message-----
From: Uwe Schindler <[email protected]>
Sent: Saturday 16th September 2017 12:45
To: [email protected]
Subject: RE: German decompounding/tokenization with Lucene?

Send a pull request. :)

Uwe

On 16 September 2017 at 12:42:30 CEST, Markus Jelsma <[email protected]> wrote:

Hello Uwe,

Thanks for getting rid of the compounds. The dictionary can be smaller: it still has about 1,500 duplicates. It is also unsorted.

Regards,
Markus
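As an aside, the cleanup discussed here (dropping the roughly 1,500 duplicates and sorting the word list) is the plain-text equivalent of "sort -u". A minimal Java sketch, assuming hypothetical file names for the word list:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    // One-off cleanup of a word list: trim, drop empty lines, de-duplicate, sort.
    // File names are placeholders, not the actual files from the repository.
    public class DedupWordList {
        public static void main(String[] args) throws IOException {
            Path in = Paths.get("dictionary-de.txt");
            Path out = Paths.get("dictionary-de.sorted.txt");
            try (Stream<String> lines = Files.lines(in, StandardCharsets.UTF_8)) {
                List<String> cleaned = lines.map(String::trim)
                                            .filter(s -> !s.isEmpty())
                                            .distinct()
                                            .sorted()
                                            .collect(Collectors.toList());
                Files.write(out, cleaned, StandardCharsets.UTF_8);
            }
        }
    }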
-----Original message-----
From: Uwe Schindler <[email protected]>
Sent: Saturday 16th September 2017 12:16
To: [email protected]
Subject: RE: German decompounding/tokenization with Lucene?

Hi,

I published my work on GitHub:

https://github.com/uschindler/german-decompounder

Have fun. I am not yet 100% sure about the license of the data file. The original author (Björn Jacke) did not publish any license, but LibreOffice publishes his files under the LGPL. So to be safe, I applied the same license to my own work.

Uwe

-----Original Message-----
From: Uwe Schindler [mailto:[email protected]]
Sent: Saturday, September 16, 2017 9:49 AM
To: [email protected]
Subject: RE: German decompounding/tokenization with Lucene?

Hi Michael,

I had this issue just yesterday. I have done this several times and have built a good dictionary in the meantime.

I have an example for Solr or Elasticsearch with the same data. It uses the HyphenationCompoundWordTokenFilter, but with the ZIP file *and* a dictionary (it's important to have both). The dictionary-only approach is just too slow and creates wrong matches, too.

The rules file is the one from the OpenOffice hyphenation files. Just take it as is (keep in mind that you need to use the "old"-version ZIP file, not the latest version, as the XML format was changed). The dictionary is more important: it should only contain the "single words", no compounds at all. This is hard to get, but there is an ngerman98.zip file available with an ispell dictionary (https://www.j3e.de/ispell/igerman98/). This dictionary has several variants, and one of them only contains the single, non-compound words (about 17,000 items). This works for most cases. I converted the dictionary a bit, merged some files, and finally lowercased it, and now I have a working solution.

The settings for the hyphenation-compound filter are (Elasticsearch):

    "german_decompounder": {
        "type": "hyphenation_decompounder",
        "word_list_path": "analysis/dictionary-de.txt",
        "hyphenation_patterns_path": "analysis/de_DR.xml",
        "only_longest_match": true,
        "min_subword_size": 4
    },

The "only_longest_match" setting is important, because our dictionary only contains "single words" (plus some words that look like compounds but aren't, because they were glued together; compare English, where "policeman" is not written "police man" because it is a word of its own). So the longest match is always safe, as we have a well-maintained dictionary.

If you are interested, I can send you a ZIP file with both files. Maybe I should check them into GitHub, but I have to check licenses first.

Uwe
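For readers using Lucene directly rather than Elasticsearch, the same setup can be approximated by wiring HyphenationCompoundWordTokenFilter into a custom Analyzer. This is only a sketch, assuming placeholder file paths and a simple tokenizer chain; it mirrors the "min_subword_size": 4 and "only_longest_match": true settings quoted above, while the other sizes are the Lucene defaults:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
    import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Sketch of a German decompounding analyzer; file names are placeholders.
    public class GermanDecompoundAnalyzer extends Analyzer {

        private final HyphenationTree hyphenator;
        private final CharArraySet dictionary;

        public GermanDecompoundAnalyzer(String hyphenationXml, String dictionaryFile) throws IOException {
            // Hyphenation grammar from the (old-format) OpenOffice de_DR.xml file
            this.hyphenator = HyphenationCompoundWordTokenFilter.getHyphenationTree(hyphenationXml);
            // Lowercased dictionary of non-compound "single words", one per line
            CharArraySet dict = new CharArraySet(20000, true);
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(dictionaryFile), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String word = line.trim();
                    if (!word.isEmpty()) {
                        dict.add(word);
                    }
                }
            }
            this.dictionary = dict;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream stream = new LowerCaseFilter(source);
            // minWordSize=5 and maxSubwordSize=15 are the Lucene defaults;
            // minSubwordSize=4 and onlyLongestMatch=true mirror the settings above.
            stream = new HyphenationCompoundWordTokenFilter(stream, hyphenator, dictionary, 5, 4, 15, true);
            return new TokenStreamComponents(source, stream);
        }
    }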
-----Original Message-----
From: Michael McCandless [mailto:[email protected]]
Sent: Saturday, September 16, 2017 12:58 AM
To: Lucene Users <[email protected]>
Subject: German decompounding/tokenization with Lucene?

Hello,

I need to index documents with German text in Lucene, and I'm wondering how people have done this in the past?

Lucene already has a DictionaryCompoundWordTokenFilter ... is this what people use? Are there good, open-source-friendly German dictionaries available?

Thanks,

Mike McCandless

http://blog.mikemccandless.com
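The hyphenation-based setup described in the thread can be exercised with the usual TokenStream consumption loop to inspect the original token plus the decompounded subwords. This is a small usage example, assuming the hypothetical GermanDecompoundAnalyzer sketched earlier and placeholder file paths; the exact subwords emitted depend on the dictionary and hyphenation patterns used:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Prints the tokens (original word plus decompounded subwords) for a sample compound.
    public class DecompoundDemo {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new GermanDecompoundAnalyzer("analysis/de_DR.xml", "analysis/dictionary-de.txt");
            try (TokenStream ts = analyzer.tokenStream("body", "Fußballweltmeisterschaft")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }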
