Re: English lemmatizer using wordnet

William Colen Wed, 10 Apr 2013 08:25:37 -0700

On Wed, Apr 10, 2013 at 11:22 AM, Jörn Kottmann <[email protected]> wrote:
>
> Is the memory issue caused by the fact that the dictionaries (e.g.
> POSDictionary) are using
> the Java HashMap and String keys/values?
>


Yes. The dictionary I have has 800k entries. It is a huge hashmap.

> Did you implement your own POSDictionary for your thesis?
>

Yes, using Morfologik FSA.

> The current dictionary package has an API to read and serialize a
> dictionary from and to the
> XML format. That could be changed to some binary format, which could
> be much faster.
> But as far as I understand, the main issue we have is the representation
> of the dictionary in memory
> and not the serialization of it.


When instantiated, the dictionary XML is loaded into a hash table. This
process takes a few seconds for a dictionary with 800k entries, and
depending on the requirements that might be an issue.

I like the XML implementation, and it looks like it works for most
OpenNLP users. But a binary option would be a plus for those who need
it.
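
To make the binary option concrete, here is a minimal sketch using only the
JDK's DataOutputStream and DataInputStream. The class name
BinaryDictionarySketch, its write/read methods and the word-to-tags layout
are hypothetical and not part of OpenNLP; this only illustrates the idea.
Note that it would only shorten loading (no XML parsing), since the result
is still the same HashMap in memory.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, NOT an OpenNLP class: stores a word -> tags
// dictionary in a simple length-prefixed binary layout.
class BinaryDictionarySketch {

    // Layout: entry count, then for each entry the word, the number of
    // tags, and the tags themselves (all strings in modified UTF-8).
    static void write(Map<String, String[]> dict, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(dict.size());
        for (Map.Entry<String, String[]> entry : dict.entrySet()) {
            data.writeUTF(entry.getKey());
            String[] tags = entry.getValue();
            data.writeInt(tags.length);
            for (String tag : tags) {
                data.writeUTF(tag);
            }
        }
        data.flush();
    }

    // Reads the layout produced by write() back into a HashMap.
    static Map<String, String[]> read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int size = data.readInt();
        // Pre-size the map so it does not rehash while loading.
        Map<String, String[]> dict = new HashMap<>((int) (size / 0.75f) + 1);
        for (int i = 0; i < size; i++) {
            String word = data.readUTF();
            String[] tags = new String[data.readInt()];
            for (int j = 0; j < tags.length; j++) {
                tags[j] = data.readUTF();
            }
            dict.put(word, tags);
        }
        return dict;
    }
}
```

Writing to a file would be a matter of passing a buffered FileOutputStream
to write(), and the matching FileInputStream to read().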

I could store Morfologik FSA dictionaries in the model using the custom
factory API, so it is quite transparent for users, who can load the
model even from the command line. The only requirement is to add my jar
with the customizations to the classpath.
