Whoa, thank you Uwe! I will have a look; too bad about the licensing, but I know dictionaries are often licensed under the LGPL.
Mike McCandless
http://blog.mikemccandless.com

On Sat, Sep 16, 2017 at 7:03 AM, Uwe Schindler <[email protected]> wrote:

Hi,

I deduped it. Thanks for the hint!

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: [email protected]

-----Original Message-----
From: Uwe Schindler [mailto:[email protected]]
Sent: Saturday, September 16, 2017 12:51 PM
To: [email protected]
Subject: RE: German decompounding/tokenization with Lucene?

OK, sorting and deduping should be easy with a simple command line. The reason is that the list was created from two files of Björn Jacke's data. I thought I had deduped it...

Uwe

On 16 September 2017 at 12:46:29 CEST, Markus Jelsma <[email protected]> wrote:

Sorry, I would if I were on GitHub, but I am not.

Thanks again!
Markus

-----Original message-----
From: Uwe Schindler <[email protected]>
Sent: Saturday 16th September 2017 12:45
To: [email protected]
Subject: RE: German decompounding/tokenization with Lucene?

Send a pull request. :)

Uwe

On 16 September 2017 at 12:42:30 CEST, Markus Jelsma <[email protected]> wrote:

Hello Uwe,

Thanks for getting rid of the compounds. The dictionary can be smaller: it still has about 1,500 duplicates. It is also unsorted.

Regards,
Markus
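As an aside, the cleanup discussed here (dropping the roughly 1,500 duplicates and sorting the word list) is the plain-text equivalent of "sort -u". A minimal Java sketch, assuming hypothetical file names for the word list:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    // One-off cleanup of a word list: trim, drop empty lines, de-duplicate, sort.
    // File names are placeholders, not the actual files from the repository.
    public class DedupWordList {
        public static void main(String[] args) throws IOException {
            Path in = Paths.get("dictionary-de.txt");
            Path out = Paths.get("dictionary-de.sorted.txt");
            try (Stream<String> lines = Files.lines(in, StandardCharsets.UTF_8)) {
                List<String> cleaned = lines.map(String::trim)
                                            .filter(s -> !s.isEmpty())
                                            .distinct()
                                            .sorted()
                                            .collect(Collectors.toList());
                Files.write(out, cleaned, StandardCharsets.UTF_8);
            }
        }
    }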
-----Original message-----
From: Uwe Schindler <[email protected]>
Sent: Saturday 16th September 2017 12:16
To: [email protected]
Subject: RE: German decompounding/tokenization with Lucene?

Hi,

I published my work on GitHub:

https://github.com/uschindler/german-decompounder

Have fun. I am not yet 100% sure about the license of the data file. The original author (Björn Jacke) did not publish any license, but LibreOffice publishes his files under the LGPL. So to be safe, I applied the same license to my own work.

Uwe

-----Original Message-----
From: Uwe Schindler [mailto:[email protected]]
Sent: Saturday, September 16, 2017 9:49 AM
To: [email protected]
Subject: RE: German decompounding/tokenization with Lucene?

Hi Michael,

I had this issue just yesterday. I have done this several times and have built a good dictionary in the meantime.

I have an example for Solr or Elasticsearch with the same data. It uses the HyphenationCompoundWordTokenFilter, but with the ZIP file *and* a dictionary (it's important to have both). The dictionary-only approach is just too slow and creates wrong matches, too.

The rules file is the one from the OpenOffice hyphenation files. Just take it as is (keep in mind that you need to use the "old"-version ZIP file, not the latest version, as the XML format was changed). The dictionary is more important: it should only contain the "single words", no compounds at all. This is hard to get, but there is an ngerman98.zip file available with an ispell dictionary (https://www.j3e.de/ispell/igerman98/). This dictionary has several variants, and one of them only contains the single, non-compound words (about 17,000 items). This works for most cases. I converted the dictionary a bit, merged some files, and finally lowercased it, and now I have a working solution.

The settings for the hyphenation-compound filter are (Elasticsearch):

    "german_decompounder": {
        "type": "hyphenation_decompounder",
        "word_list_path": "analysis/dictionary-de.txt",
        "hyphenation_patterns_path": "analysis/de_DR.xml",
        "only_longest_match": true,
        "min_subword_size": 4
    },

The "only_longest_match" setting is important, because our dictionary only contains "single words" (plus some words that look like compounds but aren't, because they were glued together; compare English, where "policeman" is not written "police man" because it is a word of its own). So the longest match is always safe, as we have a well-maintained dictionary.

If you are interested, I can send you a ZIP file with both files. Maybe I should check them into GitHub, but I have to check licenses first.

Uwe
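For readers using Lucene directly rather than Elasticsearch, the same setup can be approximated by wiring HyphenationCompoundWordTokenFilter into a custom Analyzer. This is only a sketch, assuming placeholder file paths and a simple tokenizer chain; it mirrors the "min_subword_size": 4 and "only_longest_match": true settings quoted above, while the other sizes are the Lucene defaults:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
    import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // Sketch of a German decompounding analyzer; file names are placeholders.
    public class GermanDecompoundAnalyzer extends Analyzer {

        private final HyphenationTree hyphenator;
        private final CharArraySet dictionary;

        public GermanDecompoundAnalyzer(String hyphenationXml, String dictionaryFile) throws IOException {
            // Hyphenation grammar from the (old-format) OpenOffice de_DR.xml file
            this.hyphenator = HyphenationCompoundWordTokenFilter.getHyphenationTree(hyphenationXml);
            // Lowercased dictionary of non-compound "single words", one per line
            CharArraySet dict = new CharArraySet(20000, true);
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(dictionaryFile), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String word = line.trim();
                    if (!word.isEmpty()) {
                        dict.add(word);
                    }
                }
            }
            this.dictionary = dict;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream stream = new LowerCaseFilter(source);
            // minWordSize=5 and maxSubwordSize=15 are the Lucene defaults;
            // minSubwordSize=4 and onlyLongestMatch=true mirror the settings above.
            stream = new HyphenationCompoundWordTokenFilter(stream, hyphenator, dictionary, 5, 4, 15, true);
            return new TokenStreamComponents(source, stream);
        }
    }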
-----Original Message-----
From: Michael McCandless [mailto:[email protected]]
Sent: Saturday, September 16, 2017 12:58 AM
To: Lucene Users <[email protected]>
Subject: German decompounding/tokenization with Lucene?

Hello,

I need to index documents with German text in Lucene, and I'm wondering how people have done this in the past?

Lucene already has a DictionaryCompoundWordTokenFilter ... is this what people use? Are there good, open-source-friendly German dictionaries available?

Thanks,

Mike McCandless

http://blog.mikemccandless.com
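The hyphenation-based setup described in the thread can be exercised with the usual TokenStream consumption loop to inspect the original token plus the decompounded subwords. This is a small usage example, assuming the hypothetical GermanDecompoundAnalyzer sketched earlier and placeholder file paths; the exact subwords emitted depend on the dictionary and hyphenation patterns used:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Prints the tokens (original word plus decompounded subwords) for a sample compound.
    public class DecompoundDemo {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new GermanDecompoundAnalyzer("analysis/de_DR.xml", "analysis/dictionary-de.txt");
            try (TokenStream ts = analyzer.tokenStream("body", "Fußballweltmeisterschaft")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }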
