Re: Morfologik Addon

2016-07-15 Thread William Colen
It is not only licensing: I think we also try to keep OpenNLP free of
external dependencies, and Morfologik has some dependencies of its own.




Re: Morfologik Addon

2016-07-15 Thread Rodrigo Agerri
Great stuff, William.

I have been using Morfologik stemming for a long time and when we
included it we put it as an addon. I assume that the reason was its
license, but reading the Morfologik license it is not clear to me why it
is not Apache compatible.

If it is, it would be nice to include it directly in OpenNLP.

Can anyone shed any light on this?

Thanks,

R

On Fri, Jul 15, 2016 at 12:02 AM, William Colen wrote:
> Hello,
>
> A while back we started working on a Morfologik Addon.
>
> http://svn.apache.org/viewvc/opennlp/addons/
>
> I checked it out last week and noticed it was outdated, especially because it
> was not using the latest Morfologik version. It was also missing
> documentation.
>
> You can find more about Morfologik here:
> https://github.com/morfologik/morfologik-stemming
>
> Morfologik provides tools for finite state automata (FSA) construction and
> dictionary-based morphological dictionaries.
>
> The Morfologik Addon implements some OpenNLP interfaces and extends some
> classes to make it easier to use Morfologik FSA dictionaries:
>
> - opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory
>   - Extends: opennlp.tools.postag.POSTaggerFactory
>   - Helps create a POSTagger model with an embedded TagDictionary
>     based on FSA
> - opennlp.morfologik.tagdict.MorfologikTagDictionary
>   - Implements: opennlp.tools.postag.TagDictionary
>   - A TagDictionary based on FSA is much smaller than the default
>     XML-based one and consumes less memory.
> - opennlp.morfologik.lemmatizer.MorfologikLemmatizer
>   - Implements: opennlp.tools.lemmatizer.DictionaryLemmatizer
>   - A dictionary-based lemmatizer that uses an FSA dictionary.
>
> It also provides command line tools:
>
> - MorfologikDictionaryBuilder
>   - builds a binary POS dictionary using Morfologik
> - XMLDictionaryToTable
>   - reads an OpenNLP XML tag dictionary and outputs it as a
>     tab-separated file that can be built into an FSA dictionary
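
To make the XMLDictionaryToTable step more concrete, here is a minimal Java
sketch of that kind of conversion. It is not the addon's code: the class name
TagDictTableSketch, the method toRows, and the two-column "token TAB tag"
output are assumptions made for illustration, and the XML layout (entry
elements with a tags attribute and a token child) is assumed from how OpenNLP
serializes tag dictionaries. The real tool may emit different columns.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of OpenNLP or the Morfologik addon.
class TagDictTableSketch {

    // Emits one "token<TAB>tag" row for every tag of every entry.
    // Assumed input layout: <entry tags="NN VB"><token>run</token></entry>
    static List<String> toRows(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse dictionary XML", e);
        }
        List<String> rows = new ArrayList<>();
        NodeList entries = doc.getElementsByTagName("entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            String token = entry.getElementsByTagName("token").item(0).getTextContent().trim();
            // Tags are assumed to be separated by whitespace inside the attribute.
            for (String tag : entry.getAttribute("tags").trim().split("\\s+")) {
                if (!tag.isEmpty()) {
                    rows.add(token + "\t" + tag);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = "<dictionary case_sensitive=\"true\">"
                + "<entry tags=\"NN VB\"><token>run</token></entry>"
                + "<entry tags=\"JJ\"><token>quick</token></entry>"
                + "</dictionary>";
        toRows(xml).forEach(System.out::println);
    }
}
```

A table like this is only the intermediate step; compiling it into a binary
FSA dictionary is still the job of the Morfologik tooling.
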
>
>
> It was a great help in a project I developed. The tag dictionary for the POS
> tagger was huge (something like 50 MB) and required a lot of memory.
> Migrating it to an FSA dictionary not only gave a smaller model, it also
> let me use the model without increasing the JVM memory.
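
As a rough illustration of why an FSA dictionary can be so much smaller, the
sketch below builds a plain trie for a few words and then merges states that
accept the same set of suffixes, which is the basic idea behind a minimal
acyclic automaton. It is a toy, not Morfologik's builder or binary format;
the class FsaSizeSketch and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy comparison of trie size vs. minimized automaton size.
class FsaSizeSketch {

    static final class Node {
        final TreeMap<Character, Node> next = new TreeMap<>();
        boolean terminal;
    }

    // Builds an ordinary trie: only common prefixes are shared.
    static Node buildTrie(String... words) {
        Node root = new Node();
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.terminal = true;
        }
        return root;
    }

    static int countTrieNodes(Node node) {
        int total = 1;
        for (Node child : node.next.values()) {
            total += countTrieNodes(child);
        }
        return total;
    }

    // Assigns the same id to nodes with the same terminal flag and the same
    // outgoing (label, target id) pairs, so equal subtrees collapse into one state.
    private static int canonicalId(Node node, Map<String, Integer> registry) {
        StringBuilder signature = new StringBuilder(node.terminal ? "1" : "0");
        for (Map.Entry<Character, Node> edge : node.next.entrySet()) {
            signature.append('|').append(edge.getKey())
                     .append(':').append(canonicalId(edge.getValue(), registry));
        }
        String key = signature.toString();
        Integer id = registry.get(key);
        if (id == null) {
            id = registry.size();
            registry.put(key, id);
        }
        return id;
    }

    // Number of distinct states once equivalent subtrees are merged.
    static int countMinimizedStates(Node root) {
        Map<String, Integer> registry = new HashMap<>();
        canonicalId(root, registry);
        return registry.size();
    }

    public static void main(String[] args) {
        Node root = buildTrie("walking", "talking", "walked", "talked");
        System.out.println("trie nodes: " + countTrieNodes(root));
        System.out.println("merged states: " + countMinimizedStates(root));
    }
}
```

For the four words walking, talking, walked and talked, the trie has 19 nodes
while the merged automaton needs 9 states, since the two stems share
everything after the first letter. A real tag dictionary has far more
repeated endings, which is where the savings come from.
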
>
> More here:
> https://cwiki.apache.org/confluence/display/OPENNLP/FSA+Dictionary+with+morfologik-addon
>
> Hope it will be helpful.
>
> William