Hi all,
we have spoken about it here and there already: to ensure that OpenNLP
stays competitive with other NLP libraries, I am proposing to make the
machine learning pluggable.
The extensions should not make OpenNLP harder to use: when a user loads
a model, OpenNLP should be capable of setting everything up by itself,
without forcing the user to write custom integration code for the
ml implementation.
We already solved this problem with the extension mechanism we built to
support the customization of our components; I suggest that we reuse it
to load an ml implementation. To use a custom ml implementation, the
user specifies the class name of its factory in the Algorithm field of
the params file, which is available both at training and at tagging
time.
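For illustration, a params file could then look like this (the factory
class name is made up):

    Algorithm=org.example.LiblinearTrainerFactory
    Iterations=100
    Cutoff=5

When the Algorithm value is not one of the built-in names, the
extension mechanism could be used to instantiate the named factory.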
Most components in the tools package use the maxent library to do
classification. The Java interfaces for this are currently located in
the maxent package; to be able to swap the implementation, the
interfaces should be defined inside the tools package. To make things
easier, I propose to move the maxent and perceptron implementations as
well.
Throughout the code base we use AbstractModel, which is a bit
unfortunate, because the only reason for this is the lack of model
serialization support in the MaxentModel interface. A serialization
method should be added to the interface, and it could perhaps be
renamed to ClassificationModel. This will break backward compatibility
in non-standard use cases.
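A rough sketch of how the renamed interface could look; the method name
and signature are just a suggestion:

    import java.io.IOException;
    import java.io.OutputStream;

    public interface ClassificationModel {

        // evaluates a context and returns the outcome probabilities
        double[] eval(String[] context);

        // returns the outcome with the highest probability
        String getBestOutcome(double[] outcomes);

        // ... the remaining methods of the current MaxentModel ...

        // new: lets components persist a model without casting
        // down to AbstractModel
        void serialize(OutputStream out) throws IOException;
    }

With this in place the components could program against the interface
only, and AbstractModel would become an implementation detail of the
maxent package.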
To be able to test the extension mechanism, I suggest that we implement
an addon which integrates the liblinear and Apache Mahout classifiers.
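To make that concrete, a liblinear addon could plug in through a
factory roughly like the one below; all names are made up, since the
extension interface still has to be designed (ClassificationModel is
the interface sketched above):

    import java.util.Map;

    // Hypothetical extension point a pluggable ml implementation
    // would have to provide.
    interface ClassificationModelTrainer {
        ClassificationModel train(Map<String, String> trainParams);
    }

    class LiblinearTrainer implements ClassificationModelTrainer {
        public ClassificationModel train(Map<String, String> trainParams) {
            // call into liblinear here and wrap the resulting model
            // in a ClassificationModel implementation
            throw new UnsupportedOperationException("sketch only");
        }
    }

    // The class named in the Algorithm field; OpenNLP would
    // instantiate it via the extension mechanism.
    public class LiblinearTrainerFactory {
        public ClassificationModelTrainer createTrainer() {
            return new LiblinearTrainer();
        }
    }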
There are still a few deprecated 1.4 constructors and methods in OpenNLP
which directly reference interfaces and classes in the maxent library;
these need to be removed before the interfaces can be moved to the tools
package.
Any opinions?
Jörn