Hello,

A while back we started working on a Morfologik Addon.

http://svn.apache.org/viewvc/opennlp/addons/

I checked it out last week and noticed it was outdated, mainly because it
was not using the latest Morfologik version. It was also missing
documentation.

You can find more about Morfologik here:
https://github.com/morfologik/morfologik-stemming

Morfologik provides tools for finite state automata (FSA) construction and
FSA-based morphological dictionaries.

The Morfologik Addon implements some OpenNLP interfaces and extends some
classes to make it easier to use Morfologik FSA dictionaries:

   - opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory
      - Extends: opennlp.tools.postag.POSTaggerFactory
      - Helps create a POSTagger model with an embedded TagDictionary
      based on FSA
   - opennlp.morfologik.tagdict.MorfologikTagDictionary
      - Implements: opennlp.tools.postag.TagDictionary
      - A TagDictionary based on FSA is much smaller than the default
      XML-based one and consumes less memory.
   - opennlp.morfologik.lemmatizer.MorfologikLemmatizer
      - Implements: opennlp.tools.lemmatizer.DictionaryLemmatizer
      - A dictionary-based lemmatizer that uses an FSA dictionary.

It also provides a command line interface with the following tools:

   - MorfologikDictionaryBuilder
      - builds a binary POS Dictionary using Morfologik
   - XMLDictionaryToTable
      - reads an OpenNLP XML tag dictionary and outputs it as a
      tab-separated file that can be built into an FSA dictionary
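
To give a rough idea of what the XML-to-table step does, here is a minimal
sketch in plain Java. It is not the addon's code: the class name
TagDictionaryTableSketch, the toTable method, the sample entries and the
two-column "word<TAB>tag" layout are all assumptions made up for
illustration. The actual column layout written by XMLDictionaryToTable may
differ, so please check the wiki page before relying on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TagDictionaryTableSketch {

    // Flattens a word -> tags mapping into one "word<TAB>tag" line per pair.
    // ASSUMPTION: the two-column layout is illustrative only; it is not
    // taken from the addon's XMLDictionaryToTable implementation.
    static List<String> toTable(Map<String, List<String>> tagDictionary) {
        List<String> lines = new ArrayList<>();
        // Copy into a TreeMap so words come out in sorted, repeatable order.
        for (Map.Entry<String, List<String>> entry
                : new TreeMap<>(tagDictionary).entrySet()) {
            for (String tag : entry.getValue()) {
                lines.add(entry.getKey() + "\t" + tag);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical entries standing in for a parsed XML tag dictionary.
        Map<String, List<String>> dict = new TreeMap<>();
        dict.put("casa", List.of("N", "V"));
        dict.put("a", List.of("ART", "PREP"));
        toTable(dict).forEach(System.out::println);
    }
}
```

A flat table like this is the kind of input that a dictionary builder can
then compile into a binary FSA dictionary.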


It was a great help in a project I developed. The tag dictionary for the POS
tagger was huge (something like 50 MB) and required a lot of memory.
Migrating it to an FSA dictionary not only made the model smaller, but also
let me use the model without increasing the JVM memory.

More here:
https://cwiki.apache.org/confluence/display/OPENNLP/FSA+Dictionary+with+morfologik-addon

Hope it will be helpful.

William
