Hi,

+1 for a lemmatizer API

For my Master's project I created a lemma dictionary whose keys were
[token + POS tag] pairs and whose values were one or more lemmas.
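
A minimal sketch of that layout, assuming a plain in-memory map rather
than Morfologik's automaton encoding. The names Lemmatizer,
DictionaryLemmatizer and the tab-joined key are hypothetical, chosen only
for illustration; they are not an existing OpenNLP or Morfologik API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LemmaDictionarySketch {

    /** Hypothetical interface, so that other implementations can be plugged in. */
    interface Lemmatizer {
        /** Returns all lemmas known for (token, posTag), or an empty list. */
        List<String> lemmatize(String token, String posTag);
    }

    /** Hypothetical in-memory implementation backed by a hash map. */
    static class DictionaryLemmatizer implements Lemmatizer {
        private final Map<String, List<String>> entries = new HashMap<>();

        // Assumption: tokens and POS tags never contain a tab character,
        // so a tab is a safe separator for the combined key.
        private static String key(String token, String posTag) {
            return token + "\t" + posTag;
        }

        /** Adds one lemma for the given token and POS tag. */
        void add(String token, String posTag, String lemma) {
            entries.computeIfAbsent(key(token, posTag), k -> new ArrayList<>())
                   .add(lemma);
        }

        @Override
        public List<String> lemmatize(String token, String posTag) {
            List<String> lemmas = entries.get(key(token, posTag));
            return lemmas == null
                    ? Collections.emptyList()
                    : Collections.unmodifiableList(lemmas);
        }
    }

    public static void main(String[] args) {
        DictionaryLemmatizer dict = new DictionaryLemmatizer();
        dict.add("saw", "VBD", "see");
        dict.add("saw", "NN", "saw");
        dict.add("axes", "NNS", "axe");   // one key, two lemmas
        dict.add("axes", "NNS", "axis");

        System.out.println(dict.lemmatize("saw", "VBD"));
        System.out.println(dict.lemmatize("axes", "NNS"));
        System.out.println(dict.lemmatize("unknown", "NN"));
    }
}
```

The POS tag is part of the key because the same surface form can map to
different lemmas depending on its tag ("saw" as a verb vs. as a noun), and
the value is a list because some entries stay ambiguous even with the tag.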

To store and access the entries I used a very nice Java tool, available
under a BSD license, that is part of the Morfologik tool (
http://sourceforge.net/projects/morfologik). This tool encodes the
dictionary as a finite-state automaton, which allows very efficient access
and a compact dictionary.
The tool also provides an efficient way of encoding and accessing lexical
dictionaries.

The LanguageTool developers wrote a tutorial on how to use Morfologik for
this: http://wiki.languagetool.org/developing-a-tagger-dictionary



On Wed, Apr 10, 2013 at 9:02 AM, Rodrigo Agerri <[email protected]> wrote:

> On Wed, Apr 10, 2013 at 1:00 PM, Jörn Kottmann <[email protected]>
> wrote:
> >
> > +1, it would be nice to have control over the dictionary, maybe we can
> > come up with a format to store it in. That will allow us to easily
> > include it in our models as a resource for feature generation and
> > eliminates the dependency on external libraries.
>
> I do not know yet which dictionary format will be best, but I can try
> to come up with a proposal independent of WordNet or other third party
> resources, when I have it working, and then discuss it.
>
> >
> > +1
> >
> > We should define an interface which allows to use different
> > implementations like we did for the other components.
>
> OK.
>
> Cheers,
>
> Rodrigo
>
