Hi, +1 for a lemmatizer API
For my Master's project I created a lemma dictionary whose keys were [token + POS tag] and whose values were one or more lemmas. To store and access the entries I used a very nice Java tool, available under a BSD license, that is part of Morfologik (http://sourceforge.net/projects/morfologik). It encodes the dictionary as a finite-state automaton, which keeps the dictionary compact and makes lookups very efficient. The tool also provides an efficient way of encoding and accessing lexical dictionaries.

The LanguageTool members wrote a tutorial on how to use Morfologik for this:
http://wiki.languagetool.org/developing-a-tagger-dictionary

On Wed, Apr 10, 2013 at 9:02 AM, Rodrigo Agerri <[email protected]> wrote:

> On Wed, Apr 10, 2013 at 1:00 PM, Jörn Kottmann <[email protected]> wrote:
> >
> > +1, it would be nice to have control over the dictionary, maybe we can come
> > up with a format to store it in. That will allow us to easily include it in
> > our models as a resource for feature generation and eliminates the
> > dependency on external libraries.
>
> I do not know yet which dictionary format will be best, but I can try
> to come up with a proposal independent of WordNet or other third party
> resources, when I have it working, and then discuss it.
>
> > +1
> >
> > We should define an interface which allows to use different
> > implementations like we did for the other components.
>
> OK.
>
> Cheers,
>
> Rodrigo
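
To make the dictionary layout from this message concrete ([token + POS tag] keys mapping to one or more lemmas), together with the pluggable interface suggested in the quoted thread, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: a HashMap stands in for the finite-state automaton, no Morfologik or OpenNLP classes are used, and the names Lemmatizer, MapLemmatizer, add and lemmatize are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical interface; not an existing OpenNLP or Morfologik API. */
interface Lemmatizer {
    /** Returns all lemmas for the token with the given POS tag, or an empty list. */
    List<String> lemmatize(String token, String posTag);
}

/** In-memory stand-in for a finite-state dictionary, keyed by token + POS tag. */
class MapLemmatizer implements Lemmatizer {
    private final Map<String, List<String>> entries = new HashMap<>();

    // Assumption: a tab never occurs inside a token or a tag, so it is a safe separator.
    private static String key(String token, String posTag) {
        return token + '\t' + posTag;
    }

    /** Adds one lemma; a key may accumulate several lemmas. */
    void add(String token, String posTag, String lemma) {
        entries.computeIfAbsent(key(token, posTag), k -> new ArrayList<>()).add(lemma);
    }

    @Override
    public List<String> lemmatize(String token, String posTag) {
        List<String> lemmas = entries.get(key(token, posTag));
        return lemmas == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(lemmas);
    }
}
```

With entries such as add("saw", "VBD", "see") and add("saw", "NN", "saw"), the POS tag picks the right lemma for the same surface form. An ambiguous key such as ("leaves", "NNS") can carry both "leaf" and "leave". An automaton-backed implementation could sit behind the same interface without changing callers.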
