On Wed, Apr 10, 2013 at 11:22 AM, Jörn Kottmann <[email protected]> wrote:

> Is the memory issue is caused by the fact the dictionaries (e.g.
> POSDictionary) are using the Java HashMap and String key/values?

Yes. The dictionary I have has 800k entries. It is a huge hashmap.

> Did you implement your own POSDictionary for your thesis?

Yes, using Morfologik FSA.

> The current dictionary package has an API to read and serialize a
> dictionary from and to the XML format. That could be changed to some
> binary based format which could be much faster.
> But as far as I understand is the main issue we have is the
> representation of the dictionary in memory and not the serialization
> of it.

When the dictionary is instantiated, the XML is loaded into a hashtable. This takes a few seconds for an 800k-entry dictionary, which might be an issue depending on the requirements.

I like the XML implementation, and it looks like it works for most OpenNLP users. But a binary option would be a plus for those who need it.

I could store Morfologik FSA dictionaries in the model using the custom factory API, so it is quite transparent for users, who can load the model even from the command line. The only requirement is to add my Jar with the customizations to the classpath.
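To make the "binary option" idea concrete, here is a minimal sketch of a binary word-to-tags format using only the JDK. Everything in it is an assumption for illustration: the class name `BinaryTagDictionarySketch`, the `write`/`read` helpers and the byte layout are hypothetical, and it is neither the OpenNLP POSDictionary format nor a Morfologik FSA. It only avoids XML parsing at load time; the result is still a plain HashMap, so it does not address the in-memory representation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: not OpenNLP's POSDictionary and not a Morfologik FSA.
public class BinaryTagDictionarySketch {

    // Assumed layout: int entryCount, then for each entry:
    // UTF word, int tagCount, tagCount UTF tags.
    static byte[] write(Map<String, String[]> dict) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(dict.size());
            for (Map.Entry<String, String[]> entry : dict.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue().length);
                for (String tag : entry.getValue()) {
                    out.writeUTF(tag);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    static Map<String, String[]> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int entries = in.readInt();
            // Pre-size the map so it does not rehash while loading.
            Map<String, String[]> dict =
                new HashMap<>(Math.max(16, (int) (entries / 0.75f) + 1));
            for (int i = 0; i < entries; i++) {
                String word = in.readUTF();
                String[] tags = new String[in.readInt()];
                for (int t = 0; t < tags.length; t++) {
                    // Tag sets are small, so interning shares the tag strings.
                    tags[t] = in.readUTF().intern();
                }
                dict.put(word, tags);
            }
            return dict;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String[]> dict = new LinkedHashMap<>();
        dict.put("casa", new String[] {"n", "v"});
        dict.put("de", new String[] {"prp"});

        Map<String, String[]> loaded = read(write(dict));
        System.out.println(loaded.size() + " entries; casa -> "
            + Arrays.toString(loaded.get("casa")));
    }
}
```

In a real setup the byte array would be a file or a model artifact rather than an in-memory buffer. Whether this is actually faster than the XML reader for 800k entries would need to be measured.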
