Send a pull request. :) Uwe
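The fix Markus describes below (duplicates, unsorted) is mechanical; a minimal sketch of the cleanup such a pull request might perform, in plain Java, assuming the one-word-per-line format of the repository's `dictionary-de.txt` (the lowercasing mirrors what Uwe says he did to the list):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

public class DedupeDictionary {

    // Sort, dedupe and lowercase a word list; TreeSet keeps
    // exactly one copy of each word, in sorted order.
    static TreeSet<String> normalize(List<String> lines) {
        TreeSet<String> words = new TreeSet<>();
        for (String line : lines) {
            String w = line.trim().toLowerCase(Locale.GERMAN);
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        Path dict = Paths.get(args.length > 0 ? args[0] : "dictionary-de.txt");
        List<String> lines = Files.readAllLines(dict, StandardCharsets.UTF_8);
        Files.write(dict, normalize(lines), StandardCharsets.UTF_8);
    }
}
```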
On 16 September 2017 12:42:30 CEST, Markus Jelsma <markus.jel...@openindex.io> wrote:
>Hello Uwe,
>
>Thanks for getting rid of the compounds. The dictionary could be smaller:
>it still has about 1,500 duplicates. It is also unsorted.
>
>Regards,
>Markus
>
>
>-----Original message-----
>> From: Uwe Schindler <u...@thetaphi.de>
>> Sent: Saturday, 16th September 2017 12:16
>> To: java-user@lucene.apache.org
>> Subject: RE: German decompounding/tokenization with Lucene?
>>
>> Hi,
>>
>> I published my work on GitHub:
>>
>> https://github.com/uschindler/german-decompounder
>>
>> Have fun. I am not yet 100% sure about the license of the data file. The
>> original author (Björn Jacke) did not publish any license, but LibreOffice
>> publishes his files under the LGPL. So to be safe, I applied the same
>> license to my own work.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Uwe Schindler [mailto:u...@thetaphi.de]
>> > Sent: Saturday, September 16, 2017 9:49 AM
>> > To: java-user@lucene.apache.org
>> > Subject: RE: German decompounding/tokenization with Lucene?
>> >
>> > Hi Michael,
>> >
>> > I had this issue just yesterday. I have done this several times and have
>> > built a good dictionary in the meantime.
>> >
>> > I have an example for Solr or Elasticsearch with the same data. It uses
>> > the HyphenationCompoundWordTokenFilter, but with the hyphenation ZIP
>> > file *and* a dictionary (it is important to have both). The
>> > dictionary-only approach is just too slow, and it creates wrong matches,
>> > too.
>> >
>> > The rules file is the one from the OpenOffice hyphenation files. Just
>> > take it as is (keep in mind that you need to use the "old" version of
>> > the ZIP file, not the latest version, as the XML format was changed).
>> > The dictionary is more important: it should only contain "single words",
>> > no compounds at all. This is hard to get, but there is a ngerman98.zip
>> > file available with an ispell dictionary
>> > (https://www.j3e.de/ispell/igerman98/). This dictionary has several
>> > variants; one of them contains only the single, non-compound words
>> > (about 17,000 items). This works for most cases. I converted the
>> > dictionary a bit, merged some files, and finally lowercased it, and now
>> > I have a working solution.
>> >
>> > The settings for the hyphenation compound filter are (Elasticsearch):
>> >
>> > "german_decompounder": {
>> >   "type": "hyphenation_decompounder",
>> >   "word_list_path": "analysis/dictionary-de.txt",
>> >   "hyphenation_patterns_path": "analysis/de_DR.xml",
>> >   "only_longest_match": true,
>> >   "min_subword_size": 4
>> > },
>> >
>> > The "only_longest_match" setting is important, because our dictionary
>> > only contains "single words" (and some words that look like compounds
>> > but aren't, because they were glued together; compare English, where
>> > "policeman" is not written "police man", because it is a word of its
>> > own). So the longest match is always safe, as we have a well-maintained
>> > dictionary.
>> >
>> > If you are interested, I can send you a ZIP file with both files. Maybe
>> > I should check them into GitHub, but I have to check the licenses first.
>> >
>> > Uwe
>> >
>> > -----
>> > Uwe Schindler
>> > Achterdiek 19, D-28357 Bremen
>> > http://www.thetaphi.de
>> > eMail: u...@thetaphi.de
>> >
>> > > -----Original Message-----
>> > > From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> > > Sent: Saturday, September 16, 2017 12:58 AM
>> > > To: Lucene Users <java-user@lucene.apache.org>
>> > > Subject: German decompounding/tokenization with Lucene?
>> > >
>> > > Hello,
>> > >
>> > > I need to index documents with German text in Lucene, and I'm
>> > > wondering how people have done this in the past?
>> > >
>> > > Lucene already has a DictionaryCompoundWordTokenFilter ... is this
>> > > what people use? Are there good, open-source friendly German
>> > > dictionaries available?
>> > >
>> > > Thanks,
>> > >
>> > > Mike McCandless
>> > >
>> > > http://blog.mikemccandless.com
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>>

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de
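The advice in the thread boils down to: decompound with hyphenation patterns plus a curated single-word dictionary, keeping only the longest match. As a rough illustration of why "only_longest_match" is safe when the dictionary contains no compounds, here is a toy greedy splitter in plain Java. It is not the actual Lucene HyphenationCompoundWordTokenFilter (which additionally restricts splits to hyphenation points); the minimum subword length is 3 here only so that short demo words like "amt" qualify, whereas the config above uses 4:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LongestMatchDemo {

    // Greedy left-to-right split of a compound against a dictionary of
    // single (non-compound) words, always preferring the longest match —
    // the idea behind only_longest_match=true.
    static List<String> split(String compound, Set<String> dict, int minSubword) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < compound.length()) {
            int best = -1;
            // try the longest candidate first, shrink until a dictionary hit
            for (int end = compound.length(); end >= pos + minSubword; end--) {
                if (dict.contains(compound.substring(pos, end))) {
                    best = end;
                    break;
                }
            }
            if (best < 0) {
                // no full decomposition found: keep the token unsplit
                return Collections.singletonList(compound);
            }
            parts.add(compound.substring(pos, best));
            pos = best;
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("finanz", "amt", "ball"));
        System.out.println(split("finanzamt", dict, 3)); // prints [finanz, amt]
    }
}
```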