Yes, that would be a good step. But I was actually talking about the lexical dictionary of the POS Tagger all along. Its default XML-based implementation is called POSDictionary, and the interface is named TagDictionary. Because it is an interface, implementing it with a Morfologik FSA was easy. I also created a Featurizer component for my thesis, but since I was in a hurry, I did not follow the OpenNLP structure for that.
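To make the point about the interface concrete, here is a minimal sketch of swapping the HashMap-backed dictionary for a more compact read-only structure behind the same lookup method. Everything in it is an assumption for illustration: `TagLookup` is a hypothetical stand-in that only mirrors the `getTags(String)` lookup of TagDictionary, and `SortedArrayTagDictionary` uses binary search over sorted parallel arrays, which is not an FSA and not the Morfologik-based code from the thesis. Returning null for unknown words is this sketch's convention.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in that mirrors only the lookup method of TagDictionary.
interface TagLookup {
    String[] getTags(String word);
}

// Read-only dictionary backed by two parallel arrays instead of a HashMap.
// Assumption: binary search stands in here for the automaton lookup;
// a real FSA would also share prefixes and suffixes between entries.
final class SortedArrayTagDictionary implements TagLookup {
    private final String[] words;  // sorted, unique
    private final String[][] tags; // tags[i] holds the tags of words[i]

    SortedArrayTagDictionary(Map<String, String[]> entries) {
        // TreeMap orders keys with String.compareTo, the same order
        // Arrays.binarySearch relies on below.
        TreeMap<String, String[]> sorted = new TreeMap<>(entries);
        words = sorted.keySet().toArray(new String[0]);
        tags = sorted.values().toArray(new String[0][]);
    }

    @Override
    public String[] getTags(String word) {
        int i = Arrays.binarySearch(words, word);
        // The stored array is returned directly, so a lookup allocates nothing.
        // Callers must treat the result as read-only.
        return i >= 0 ? tags[i] : null;
    }

    public static void main(String[] args) {
        Map<String, String[]> entries = new HashMap<>();
        entries.put("run", new String[] {"NN", "VB"});
        entries.put("the", new String[] {"DT"});

        TagLookup dict = new SortedArrayTagDictionary(entries);
        System.out.println(Arrays.toString(dict.getTags("run")));
        System.out.println(dict.getTags("unknown") == null);
    }
}
```

Because callers only see the lookup method, the backing structure can be replaced without touching the tagger code, which is the property that made the FSA variant easy to plug in.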
I don't know if we can use the Morfologik FSA dictionary for the conventional Dictionary class we have, whose entries consist of multiple tokens. That class is used for the abbreviation dictionaries and for the Name Finder. Maybe we can use an FSA there too, but we would have to adapt it.

On Wed, Apr 10, 2013 at 5:16 PM, Jörn Kottmann <[email protected]> wrote:

> The serializer we have currently uses a StringList as key for the
> dictionary and then encodes the stored information in the Entry object,
> we could move this up to the dictionary level, e.g.:
>
> interface Dictionary {
>   Entry get(StringList key);
> }
>
> Would such an abstraction work for the Morfologik FSA dictionary?
>
> We have to see how we can make the interface efficient, there should no
> expensive object creation involved for a lookup.
>
> Jörn
>
>
> On 04/10/2013 05:24 PM, William Colen wrote:
>
>> On Wed, Apr 10, 2013 at 11:22 AM, Jörn Kottmann <[email protected]>
>> wrote:
>>
>>> Is the memory issue is caused by the fact the dictionaries (e.g.
>>> POSDictionary) are using the Java HashMap and String key/values?
>>>
>>> Yes. The dictionary I have has 800k entries. It is a huge hashmap.
>>
>> Did you implement your own POSDictionary for your thesis?
>> Yes, using Morfologik FSA.
>>
>>> The current dictionary package has an API to read and serialize a
>>> dictionary from and to the XML format. That could be changed to some
>>> binary based format which could be much faster.
>>> But as far as I understand is the main issue we have is the
>>> representation of the dictionary in memory and not the serialization
>>> of it.
>>>
>>
>> When instantiated, the dictionary XML is loaded to a hashtable. This
>> process takes a few seconds for a 800k entries dictionary, and depending
>> on the requirements it might be an issue.
>>
>> I like the XML implementation, and looks like it works for most of the
>> OpenNLP users. But a binary option would be a plus for the ones that need
>> it.
>>
>> I could store Morfologik FSA dictionaries to the model using the custom
>> factory API, so it is quite transparent for the users, which can load the
>> model even from the CL. The only requirement is to add my Jar with
>> customizations to the classpath.
>>
