The serializer we have currently uses a StringList as the key for the
dictionary and then encodes the stored information in the Entry object. We
could move this up to the dictionary level, e.g.:
interface Dictionary {
    Entry get(StringList key);
}
Would such an abstraction work for the Morfologik FSA dictionary?
We have to see how we can make the interface efficient; there should be no
expensive object creation involved in a lookup.
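To make the idea concrete, here is a minimal sketch of such a lookup abstraction. All names in it (TokenKey, TagEntry, LookupDictionary, HashMapDictionary) are hypothetical stand-ins invented for illustration; they are not the existing OpenNLP StringList/Entry classes and not the Morfologik API. It only shows that callers could depend on a single get method while the storage behind it is swapped.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a StringList-like key: an immutable token sequence.
final class TokenKey {
    private final String[] tokens;
    private final int hash; // computed once so repeated lookups do not recompute it

    TokenKey(String... tokens) {
        this.tokens = tokens.clone();
        this.hash = Arrays.hashCode(this.tokens);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof TokenKey && Arrays.equals(tokens, ((TokenKey) o).tokens);
    }

    @Override
    public int hashCode() {
        return hash;
    }
}

// Hypothetical stand-in for the Entry object: here it only carries the stored tags.
final class TagEntry {
    private final List<String> tags;

    TagEntry(String... tags) {
        this.tags = Collections.unmodifiableList(Arrays.asList(tags.clone()));
    }

    List<String> tags() {
        return tags;
    }
}

// The proposed abstraction: callers only see a lookup.
interface LookupDictionary {
    TagEntry get(TokenKey key); // returns null when the key is unknown
}

// One possible backend: the in-memory hash map approach.
final class HashMapDictionary implements LookupDictionary {
    private final Map<TokenKey, TagEntry> entries = new HashMap<>();

    void put(TokenKey key, TagEntry entry) {
        entries.put(key, entry);
    }

    @Override
    public TagEntry get(TokenKey key) {
        // Hands back the stored entry as-is; nothing is allocated here.
        return entries.get(key);
    }
}
```

An FSA-backed dictionary would be a second implementation of the same interface. Whether it can answer get without building a new entry object on every call depends on how the automaton exposes its data, which this sketch does not attempt to show.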
Jörn
On 04/10/2013 05:24 PM, William Colen wrote:
On Wed, Apr 10, 2013 at 11:22 AM, Jörn Kottmann <[email protected]> wrote:
Is the memory issue caused by the fact that the dictionaries (e.g.
POSDictionary) are using the Java HashMap and String keys/values?
Yes. The dictionary I have has 800k entries. It is a huge hashmap.
Did you implement your own POSDictionary for your thesis?
Yes, using Morfologik FSA.
The current dictionary package has an API to read and serialize a
dictionary from and to the XML format. That could be changed to some
binary format, which could be much faster.
But as far as I understand, the main issue we have is the representation
of the dictionary in memory and not the serialization of it.
When instantiated, the dictionary XML is loaded into a hash table. This
process takes a few seconds for an 800k-entry dictionary, and depending on
the requirements it might be an issue.
I like the XML implementation, and it looks like it works for most of the
OpenNLP users. But a binary option would be a plus for those who need
it.
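As a rough illustration of what a binary option could look like, here is a minimal sketch that writes a word-to-tags map with java.io.DataOutputStream and reads it back. The class name BinaryDictionaryCodec and the record layout (entry count, then word, tag count, and tags per entry) are assumptions made up for this example; it is not an existing OpenNLP format. Whether it actually loads faster than the XML would need to be measured.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical codec; the layout below is an assumption, not an OpenNLP format.
final class BinaryDictionaryCodec {

    // Layout: int entryCount, then per entry: UTF word, int tagCount, UTF tags.
    static void write(Map<String, String[]> dict, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(new BufferedOutputStream(out));
        data.writeInt(dict.size());
        for (Map.Entry<String, String[]> e : dict.entrySet()) {
            data.writeUTF(e.getKey()); // writeUTF is limited to 65535 encoded bytes per string
            data.writeInt(e.getValue().length);
            for (String tag : e.getValue()) {
                data.writeUTF(tag);
            }
        }
        data.flush();
    }

    static Map<String, String[]> read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in));
        int size = data.readInt();
        // Presize the map so that it does not have to rehash while loading.
        Map<String, String[]> dict = new HashMap<>(Math.max(16, (int) (size / 0.75f) + 1));
        for (int i = 0; i < size; i++) {
            String word = data.readUTF();
            String[] tags = new String[data.readInt()];
            for (int t = 0; t < tags.length; t++) {
                tags[t] = data.readUTF();
            }
            dict.put(word, tags);
        }
        return dict;
    }
}
```

Note that this only changes the load step. The result is still a HashMap of Strings, so the in-memory footprint discussed above stays the same.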
I could store Morfologik FSA dictionaries in the model using the custom
factory API, so it is quite transparent for the users, who can load the
model even from the command line. The only requirement is to add my JAR
with the customizations to the classpath.