Hello,
thanks for the detailed overview of the different implementations. If
possible, it would probably make sense to have all three options.
Option 3 is especially nice for people concerned about throughput,
while option 2 seems to be a good fit for people who want something
that works out of the box and is easy to get running. Option 1 would
be nice to have for the coref component.
I suggest we define an interface for the lemmatizer dictionary and then
provide different implementations of it, similar to what is done for
the POSTagger; have a look at TagDictionary. The dictionary package
contains util classes which can be used to serialize/deserialize to our
XML format.
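Something along these lines could work as a starting point (just a
sketch modeled on TagDictionary; the LemmatizerDictionary name and the
getLemmas signature are only a guess, not existing code):

package opennlp.tools.lemmatizer;

/**
 * Hypothetical lemma lookup interface, analogous to
 * opennlp.tools.postag.TagDictionary.
 */
public interface LemmatizerDictionary {

  /**
   * Returns the lemmas associated with the given word and POS tag,
   * or null if the pair is not contained in the dictionary.
   */
  String[] getLemmas(String word, String postag);
}

The three implementations described below (WordNet/JWNL, plain text
HashMap, Morfologik) could then all sit behind such an interface, and
the util classes in the dictionary package could take care of the
serialization.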
Jörn
On 04/15/2013 06:15 PM, Rodrigo Agerri wrote:
Hello again,
I now have three different implementations of dictionary-based
lemmatization working for English and Spanish:
1. As mentioned in the first email, using Princeton WordNet and JWNL.
2. Loading the dictionary from plain text and performing HashMap .get
lookups on it to obtain the lemmas.
3. Loading a Morfologik dictionary and performing HashMap .get lookups
on a small HashMap built from the Morfologik dictionary lookup.
In the project there seem to be several Dictionary implementations. I
can try to use any of those (in particular, the POSDictionary seems
the most relevant for the task at hand), but in any case I think that
for lemmatization the implementation can be very simple. We only need
a plain-text tab-separated Word Postag Lemma dictionary and a
Dictionary interface that defines a lemmatize(String word, String
postag) method.
Each of the approaches above then implements this interface, yielding
three different lemmatizers:
1. A JWNLemmatizer constructs a JWNLDictionary (as in the OpenNLP coref
package) and then uses the JWNL API to obtain the lemmas.
2. A SimpleLemmatizer using a plain-text tab-separated dictionary in
the form Word Postag Lemma that is loaded into a
HashMap<List<String>,String> dictionary (see the sketch after this
list). If the dictionary is large, this is where memory and speed seem
to be most affected. Lemmatization performs .get operations on the
HashMap to obtain the lemma.
3. A MorfologikLemmatizer using a binary dictionary created from the
plain-text tab-separated dictionary of the previous point. Morfologik
DictionaryLookup is used to obtain the associated lemmas and postags
for each word (see the second sketch after this list); from this a
HashMap dictionary is created and .get obtains the lemma for each
word,postag pair. Loading the Morfologik dictionary is very fast and
memory-cheap.
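To make approach 2. concrete, here is a rough sketch of what the
SimpleLemmatizer could look like, using the lemmatize(String word,
String postag) signature proposed above (class and field names are
just placeholders, error handling omitted):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of approach 2: a lemmatizer backed by a plain-text
// tab-separated "word<TAB>postag<TAB>lemma" dictionary loaded into a HashMap.
public class SimpleLemmatizer {

  private final Map<List<String>, String> dictMap =
      new HashMap<List<String>, String>();

  public SimpleLemmatizer(InputStream dictionary) throws IOException {
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(dictionary, "UTF-8"));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] elems = line.split("\t");
      if (elems.length == 3) {
        // key = (word, postag), value = lemma
        dictMap.put(Arrays.asList(elems[0], elems[1]), elems[2]);
      }
    }
  }

  // Returns the lemma for the word/postag pair, or null if the pair
  // is not in the dictionary.
  public String lemmatize(String word, String postag) {
    return dictMap.get(Arrays.asList(word, postag));
  }
}

For approach 3., the per-word lookup on the Morfologik binary
dictionary goes roughly like this; this is only meant to illustrate
the idea, and the exact morfologik-stemming calls (Dictionary.read,
DictionaryLookup, WordData) may differ slightly between versions. The
"english.dict" path is just a placeholder:

import java.io.File;
import java.io.IOException;

import morfologik.stemming.Dictionary;
import morfologik.stemming.DictionaryLookup;
import morfologik.stemming.WordData;

// Sketch of approach 3: look up lemma and postag candidates for a word
// in a Morfologik binary dictionary.
public class MorfologikLookupExample {

  public static void main(String[] args) throws IOException {
    Dictionary dict =
        Dictionary.read(new File("english.dict").toURI().toURL());
    DictionaryLookup lookup = new DictionaryLookup(dict);

    // Each WordData entry carries the stem (lemma) and the tag for the word.
    for (WordData entry : lookup.lookup("houses")) {
      System.out.println(entry.getStem() + "\t" + entry.getTag());
    }
  }
}

From these (word, postag, lemma) triples, the mini HashMap mentioned
above can be filled in the same way as in the SimpleLemmatizer sketch.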
The dictionaries used in 2. and 3. come from those distributed by
LanguageTool. Approach 2. is dependency-free, while 3. requires
morfologik-stemming to create and access the dictionary. Furthermore,
the LanguageTool dictionaries are more complete than WordNet and
easier to maintain for lemmatization.
I have measured execution time and memory usage while analyzing a
standard article from the Guardian for English and another one from a
Spanish newspaper. You can look at the numbers below. Using the
Morfologik dictionary is the fastest and the cheapest in memory,
although on a good machine the differences in speed become much
smaller.
What do you think?
Cheers,
Rodrigo
English POS and lemmatization:
- English dictionary, 350K entries:
+ Plain text: 7.9 MB
+ Morfologik: 1.2 MB
$ wc english.txt
33 684 4225
Spanish POS and lemmatization:
- Spanish dictionary, 650K entries:
+ Plain text: 19 MB
+ Morfologik: 600 KB
$ wc spanish.txt
52 947 5857
DELL Optiplex 790, Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz, 16GB RAM
-------------------------------------------------------------------------------------------------------------
ENGLISH
--------------
[JWNL and WordNet]
real 0m1.652s
user 0m2.063s
sys 0m0.215s
[plain text hashmap lookup]
real 0m1.237s
user 0m2.113s
sys 0m0.122s
RAM 250MB ~
[morfologik stemming]
real 0m1.134s
user 0m1.869s
sys 0m0.090s
RAM 40MB ~
SPANISH
--------------
[plain text hashmap lookup]
real 0m2.764s
user 0m9.654s
sys 0m1.022s
400 MB RAM
[morfologik stemming]
real 0m1.209s
user 0m2.043s
sys 0m0.093s
40 MB RAM
NETBOOK Acer Aspire One AMD Dual-Core C60 APU, 4GB RAM
-----------------------------------------------------------------------------------------------
ENGLISH
--------------
[using JWNL and WordNet]
real 0m9.410s
user 0m12.105s
sys 0m1.004s
[using plain text and hashmap lookup]
real 0m10.273s
user 0m13.857s
sys 0m0.540s
[using morfologik stemming]
real 0m7.834s
user 0m10.977s
sys 0m0.556s
SPANISH
--------------
[plain text hashmap lookup]
real 0m17.233s
user 0m23.597s
sys 0m1.320s
RAM 400MB ~
[morfologik stemming]
real 0m8.107s
user 0m12.181s
sys 0m0.408s
RAM 40MB ~
On Fri, Apr 12, 2013 at 4:13 PM, Rodrigo Agerri <[email protected]> wrote:
Sure,
I am looking at it.
Cheers,
Rodrigo
On Fri, Apr 12, 2013 at 3:36 PM, Aliaksandr Autayeu
<[email protected]> wrote:
I do not know yet which dictionary format will be best, but I can try
to come up with a proposal independent of WordNet or other third-party
resources when I have it working, and then discuss it.
Rodrigo, it would be nice if you wrote as soon as you have come up with
the data structures, before implementation. This will allow more
languages to be taken into account.