I have written the test cases and verified that the Naive
Bayes classifier works correctly.
The latest patch (attached) includes the implementation and the test cases.
In implementing the Naive Bayes classifier, we tried to *ensure minimal
disruption* to existing code.
The *only* changes to existing code are as follows:
1. The opennlp.tools.ml.model.AbstractModel class has been changed to
include a new model type:
line 35: *public enum ModelType {Maxent,Perceptron,MaxentQn,NaiveBayes};*
2. The opennlp.tools.ml.model.GenericModelReader class has been changed in
one place:
line 53:
*else if (modelType.equals("NaiveBayes")) { delegateModelReader = new
NaiveBayesModelReader(this.dataReader); }*
3. The opennlp.tools.ml.model.GenericModelWriter class has been changed in
two places:
line 79:
*if (model.getModelType() == ModelType.NaiveBayes) { delegateWriter = new
BinaryNaiveBayesModelWriter(model,dos); }*
line 91:
*if (model.getModelType() == ModelType.NaiveBayes) { delegateWriter = new
PlainTextNaiveBayesModelWriter(model,bw); }*
4. The initializer of the opennlp.tools.ml.TrainerFactory class has been
changed in one place to add the Naive Bayes trainer:
line 51:
*_trainers.put(NaiveBayesTrainer.NAIVE_BAYES_VALUE,
NaiveBayesTrainer.class);*
That was it!
We didn't change anything else in the existing OpenNLP code.
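To show the shape of these changes without opening the patch, here is a
minimal, self-contained sketch of the pattern: one new enum constant plus one
extra branch in a type-keyed dispatch. The class ModelTypeDispatchSketch, the
DelegateReader interface and the readerFor method are hypothetical stand-ins
written for this mail; they are not the OpenNLP classes, and the real readers
take a data reader argument that is left out here.

```java
// Hypothetical illustration only; not the actual OpenNLP source.
final class ModelTypeDispatchSketch {

    // Same constants as the patched enum quoted above: NaiveBayes is appended last.
    enum ModelType { Maxent, Perceptron, MaxentQn, NaiveBayes }

    // Stand-in for a delegate model reader; only its name matters in this sketch.
    interface DelegateReader {
        String name();
    }

    // One extra branch handles the new model type; existing branches are untouched.
    static DelegateReader readerFor(String modelType) {
        if (modelType.equals("Perceptron")) {
            return () -> "PerceptronModelReader";
        } else if (modelType.equals("NaiveBayes")) {
            return () -> "NaiveBayesModelReader";
        }
        throw new IllegalArgumentException("Unknown model format: " + modelType);
    }

    public static void main(String[] args) {
        // Selects the Naive Bayes delegate by its type string.
        System.out.println(readerFor("NaiveBayes").name());
    }
}
```

Because the new constant is appended at the end of the enum and the new branch
is a separate else-if, the behaviour for the existing model types stays as it
was.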
All the new code for the Naive Bayesian classifier sits in the package
opennlp.tools.ml.naivebayes - just above the perceptron package.
The code for the document categorizer using the Naive Bayesian classifier
can be found in opennlp.tools.doccat (we didn't have to change any existing
code). The new doccat is called opennlp.tools.doccat.DocumentCategorizerNB
(reflecting the name of the maxent document categorizer, which is
DocumentCategorizerME).
Proof of correctness!
I have included two test cases:
1. A test to validate the document categorizer - under the tests folder,
you will find opennlp.tools.doccat.DocumentCategorizerNBTest - which runs
the same tests that were run on the ME document categorizer, but on the
Naive Bayes categorizer instead (all tests passed).
2. A test to check the mathematical correctness of the Naive Bayes
implementation can be found in
opennlp.tools.ml.naivebayes.NaiveBayesCorrectnessTest.
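For readers who want a feel for what a hand-checkable correctness test of a
multinomial Naive Bayes model can look like, here is a small standalone
sketch. It is not the code in the patch: the class
MultinomialNaiveBayesSketch, its methods, the toy "spam"/"ham" data and the
add-one (Laplace) smoothing are all assumptions made for illustration, and the
patch may smooth differently.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the classifier shipped in the patch.
final class MultinomialNaiveBayesSketch {
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Record one training document: its label and its words.
    void train(String label, String... words) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    // log P(label) + sum of log P(word | label), with add-one smoothing (an assumption).
    double logScore(String label, String... words) {
        double score = Math.log(docCounts.get(label) / (double) totalDocs);
        Map<String, Integer> counts = wordCounts.get(label);
        double denominator = totalWords.getOrDefault(label, 0) + vocabulary.size();
        for (String w : words) {
            score += Math.log((counts.getOrDefault(w, 0) + 1) / denominator);
        }
        return score;
    }

    // Posterior over all labels seen in training, normalized to sum to 1.
    Map<String, Double> posterior(String... words) {
        Map<String, Double> result = new HashMap<>();
        double sum = 0.0;
        for (String label : docCounts.keySet()) {
            double p = Math.exp(logScore(label, words));
            result.put(label, p);
            sum += p;
        }
        final double total = sum;
        result.replaceAll((label, p) -> p / total);
        return result;
    }

    public static void main(String[] args) {
        MultinomialNaiveBayesSketch nb = new MultinomialNaiveBayesSketch();
        nb.train("spam", "buy", "buy");
        nb.train("ham", "hello", "buy");
        // By hand: P(buy|spam) = (2+1)/(2+2) = 0.75, P(buy|ham) = (1+1)/(2+2) = 0.5,
        // priors are 0.5 each, so P(spam|buy) = 0.375 / (0.375 + 0.25) = 0.6.
        System.out.println(nb.posterior("buy").get("spam"));
    }
}
```

The value computed in main should agree with the hand calculation in the
comment up to floating-point rounding, which is the kind of comparison a
correctness test can assert on.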
So including this code will have minimal impact on existing code, and the
patch contains a multinomial Naive Bayesian classifier whose correctness is
checked by the tests listed above.
Is there anything else I have to do to have this patch pulled into the
OpenNLP code base (say, for 1.7.0)?
Cohan Sujay Carlos
CEO, Aiaioo Labs
+91-77605-80015
On Tue, May 19, 2015 at 7:21 PM, Cohan Sujay Carlos <[email protected]> wrote:
>
>> Tommaso,
>>
>> I have created the Jira issue:
>> https://issues.apache.org/jira/browse/OPENNLP-777
>>
>>