On 05/02/2012 07:00 PM, [email protected] wrote:
Hi,
I am thinking of adding a new feature for the POS Tagger component and I
would appreciate some comments.
The POS Tagger is much more effective with a POSDictionary, but today the
only option is for the user to provide one. It would be nice if we could
induce the dictionary from the training data, or expand an existing
dictionary with the training data.
To activate this, the user could pass in a cutoff value. Only word + tag
pairs with a frequency higher than the cutoff would be added to the dictionary.
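A minimal sketch of the cutoff idea described above, in plain Java. It assumes nothing about the OpenNLP API: `TaggedSentence` is a hypothetical stand-in for POSSample, and the dictionary is modelled as a simple map from word to allowed tags rather than the real POSDictionary.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TagDictionaryInducer {

    /** Hypothetical stand-in for a tagged sentence; not the OpenNLP POSSample class. */
    public static class TaggedSentence {
        final List<String> words;
        final List<String> tags;

        public TaggedSentence(List<String> words, List<String> tags) {
            this.words = words;
            this.tags = tags;
        }
    }

    /**
     * Counts every word + tag pair in the training data and keeps only the
     * pairs whose frequency is strictly higher than the cutoff.
     */
    public static Map<String, Set<String>> induce(List<TaggedSentence> training, int cutoff) {
        // word -> (tag -> number of occurrences)
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (TaggedSentence sentence : training) {
            for (int i = 0; i < sentence.words.size(); i++) {
                counts.computeIfAbsent(sentence.words.get(i), w -> new HashMap<>())
                      .merge(sentence.tags.get(i), 1, Integer::sum);
            }
        }

        // word -> tags that passed the cutoff
        Map<String, Set<String>> dictionary = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> wordEntry : counts.entrySet()) {
            for (Map.Entry<String, Integer> tagEntry : wordEntry.getValue().entrySet()) {
                if (tagEntry.getValue() > cutoff) {
                    dictionary.computeIfAbsent(wordEntry.getKey(), w -> new TreeSet<>())
                              .add(tagEntry.getKey());
                }
            }
        }
        return dictionary;
    }
}
```

Expanding an existing dictionary would work the same way, except that the result map would start out with the existing entries instead of being empty.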
While performing cross validation we should keep in mind that we can only
expand / create a dictionary using the training portion of the corpus.
The only problem I see now is how we should create / expand this dictionary
if we are using the new Factory mechanism. One issue is that the tools
cannot access the dictionary directly. Also, depending on the dictionary
implementation in use, maybe the factory itself should perform the task of
populating it. The base Factory implementation would implement this for the
default POSDictionary.
In this case, I would add the following methods to the POSTaggerFactory:
1) expandPOSDictionary( TrainingSampleStream<POSSample> samples, Integer
cutoff, boolean keepOriginal );
This method would expand / create the dictionary using the data from
samples, respecting the cutoff. The keepOriginal argument tells the
implementation to back up the original dictionary.
2) restorePOSDictionary();
Restores the dictionary backup so that another cross-validation run can start.
What do you think? I am not sure this feature would help others. I also
don't like the POSTaggerFactory taking on this responsibility, but I can't
see a cleaner option right now.
Well, what you need is a mutable dictionary.
A user who provides a custom dictionary must also provide
serialization support for it. They could decide to implement an
interface that makes the dictionary mutable (just one option for how that
could be done).
In the cross-validation case I would just create a new dictionary from the
original data with the help of the factory.
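To illustrate the suggestion, here is a rough sketch in plain Java. All names in it are hypothetical and none of them are existing OpenNLP types: `MutableTagDictionary` is one possible shape for the optional interface, and the `Supplier` stands in for the factory recreating the dictionary from its original data.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

public class FoldDictionarySketch {

    /** Hypothetical interface a custom dictionary could implement to become mutable. */
    public interface MutableTagDictionary {
        void put(String word, String tag);
        Set<String> getTags(String word);
    }

    /** Simple in-memory implementation, for illustration only. */
    public static class MapTagDictionary implements MutableTagDictionary {
        private final Map<String, Set<String>> entries = new HashMap<>();

        @Override
        public void put(String word, String tag) {
            entries.computeIfAbsent(word, w -> new TreeSet<>()).add(tag);
        }

        @Override
        public Set<String> getTags(String word) {
            Set<String> tags = entries.get(word);
            return tags == null
                    ? Collections.<String>emptySet()
                    : Collections.unmodifiableSet(tags);
        }
    }

    /**
     * Builds the dictionary for one cross-validation run: a fresh instance is
     * created from the original data, then expanded with word + tag pairs
     * taken from that run's training portion only.
     */
    public static MutableTagDictionary dictionaryForFold(
            Supplier<MutableTagDictionary> originalDictionaryFactory,
            List<String[]> trainingPairs) {
        MutableTagDictionary dictionary = originalDictionaryFactory.get();
        for (String[] pair : trainingPairs) {
            dictionary.put(pair[0], pair[1]); // pair[0] = word, pair[1] = tag
        }
        return dictionary;
    }
}
```

Because every run starts from a newly created dictionary, nothing has to be backed up or restored between runs.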
What do you think?
Jörn