Yes, I see your point. In my setup the dictionary is stored as a text file, and encoding it is part of the build process, so it is easy to update. Maybe we should create an API that supports multiple implementations. A default implementation could use JWNL, which is already available, and we could create other implementations in the sandbox as optional packages.
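
To make the idea of an API with multiple implementations a bit more concrete, here is a rough sketch only. The names Lemmatizer and DictionaryLemmatizer and the tab-separated "token, POS tag, lemma" line format are assumptions made up for illustration; they are not an existing OpenNLP, JWNL or Morfologik API. The key is [token + POS tag] and the value is one or more lemmas, as in the dictionary described further down the thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical interface: any backend (JWNL, Morfologik, plain text) could implement it. */
interface Lemmatizer {
    /** Returns the lemmas for a token with the given POS tag, or an empty list if unknown. */
    List<String> lemmatize(String token, String posTag);
}

/** Hypothetical plain-text backend; assumes lines of the form token<TAB>posTag<TAB>lemma. */
final class DictionaryLemmatizer implements Lemmatizer {
    private final Map<String, List<String>> entries = new HashMap<>();

    DictionaryLemmatizer(Iterable<String> lines) {
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length != 3) {
                continue; // skip lines that do not match the assumed format
            }
            entries.computeIfAbsent(key(fields[0], fields[1]), k -> new ArrayList<>())
                   .add(fields[2]);
        }
    }

    // The lookup key combines token and POS tag, so "saw"/VBD and "saw"/NN stay distinct.
    private static String key(String token, String posTag) {
        return token + "\t" + posTag;
    }

    @Override
    public List<String> lemmatize(String token, String posTag) {
        List<String> lemmas = entries.get(key(token, posTag));
        if (lemmas == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(lemmas);
    }
}
```

Code that needs lemmas would depend only on the interface, so a JWNL-based default and optional sandbox implementations could be swapped in without touching the callers. The lines could come from something like Files.readAllLines on the text dictionary.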
On Wed, Apr 10, 2013 at 10:14 AM, Rodrigo Agerri <[email protected]> wrote:

> Hello,
>
> I used morfologik and LanguageTool for grammar correction. It can be
> tricky to create and re-create the binary dictionaries, although it is
> true that once is created the speed is very good.
>
> In any case, that would also create a dependence on morfologik for
> creating and accessing the dictionaries.
>
> Cheers,
>
> Rodrigo
>
> On Wed, Apr 10, 2013 at 2:34 PM, William Colen <[email protected]> wrote:
> > Hi,
> >
> > +1 for a lemmatizer API
> >
> > For my Master's project I created a lemma dictionary, which keys were the
> > [token + POS tag] and the value one or more lemmas.
> >
> > To store and access the entries I used a very nice Java tool available
> > under BSD license that is part of the Morfologik tool
> > (http://sourceforge.net/projects/morfologik). This tool encodes the
> > dictionary in a finite-state automata, allowing a very efficient access
> > and a compact dictionary.
> > The tool also provide a efficient way of encoding and accessing lexical
> > dictionaries.
> >
> > The LanguageTools members wrote a tutorial on how to use Morfologik for
> > this: http://wiki.languagetool.org/developing-a-tagger-dictionary
> >
> > On Wed, Apr 10, 2013 at 9:02 AM, Rodrigo Agerri <[email protected]> wrote:
> >
> >> On Wed, Apr 10, 2013 at 1:00 PM, Jörn Kottmann <[email protected]> wrote:
> >> >
> >> > +1, it would be nice to have control over the dictionary, maybe we can
> >> > come up with a format to store it in. That will allow us to easily
> >> > include it in our models as a resource for feature generation and
> >> > eliminates the dependency on external libraries.
> >>
> >> I do not know yet which dictionary format will be best, but I can try
> >> to come up with a proposal independent of WordNet or other third party
> >> resources, when I have it working, and then discuss it.
> >>
> >> >
> >> > +1
> >> >
> >> > We should define an interface which allows to use different
> >> > implementations like we did for the other components.
> >>
> >> OK.
> >>
> >> Cheers,
> >>
> >> Rodrigo
> >>
