A small update to the patch (I removed a superfluous piece of code). In the earlier patch, I had used a subclass of opennlp.tools.doccat.DoccatModel called opennlp.tools.doccat.DoccatModelNB that was functionally identical. I removed that subclass since it wasn't essential (DoccatModel does the trick just fine).
Is there anything else I need to do? Is someone on the dev team going to be responsible for incorporating the patch into the codebase? Can I mark this Jira issue fixed (for target 1.6.1)?

Cohan Sujay Carlos
CEO, Aiaioo Labs
+91-77605-80015

On Sat, Jul 18, 2015 at 6:02 PM, Cohan Sujay Carlos <[email protected]> wrote:

> I have gone ahead and written the test-cases and verified that the Naive
> Bayes Classifier works correctly.
>
> Here is the latest patch (attached) with the test-cases and everything.
>
> In implementing the Naive Bayes classifier, we tried to *ensure minimal
> disruption* to existing code.
>
> The *only* changes to existing code are as follows:
>
> 1. The opennlp.tools.ml.model.AbstractModel class has been changed to
> include a new model type:
>
> line 35:
> public enum ModelType {Maxent,Perceptron,MaxentQn,NaiveBayes};
>
> 2. The opennlp.tools.ml.model.GenericModelReader class has been changed
> in one place:
>
> line 53:
> else if (modelType.equals("NaiveBayes")) { delegateModelReader = new
> NaiveBayesModelReader(this.dataReader); }
>
> 3. The opennlp.tools.ml.model.GenericModelWriter class has been changed
> in two places:
>
> line 79:
> if (model.getModelType() == ModelType.NaiveBayes) { delegateWriter =
> new BinaryNaiveBayesModelWriter(model,dos); }
>
> line 91:
> if (model.getModelType() == ModelType.NaiveBayes) { delegateWriter =
> new PlainTextNaiveBayesModelWriter(model,bw); }
>
> 4. The initializer of the opennlp.tools.ml.TrainerFactory class has been
> changed in one place to add the Naive Bayes trainer:
>
> line 51:
> _trainers.put(NaiveBayesTrainer.NAIVE_BAYES_VALUE,
> NaiveBayesTrainer.class);
>
> That was it!
>
> We didn't change anything else in the existing OpenNLP code.
>
> All the new code for the Naive Bayesian classifier sits in the package
> opennlp.tools.ml.naivebayes - just above the perceptron.
>
> The code for the document categorizer using the Naive Bayesian classifier
> can be found in opennlp.tools.doccat (we didn't have to change any
> existing code). The new doccat is called
> opennlp.tools.doccat.DocumentCategorizerNB (reflecting the name of the
> maxent document categorizer, which is DocumentCategorizerME).
>
> Proof of correctness!
>
> I have included two testcases:
>
> 1. A test to validate the document categorizer - under the tests folder,
> you will find opennlp.tools.doccat.DocumentCategorizerNBTest - which runs
> the same tests that were run on the ME document categorizer, but on the
> Naive Bayes categorizer instead (all tests passed).
>
> 2. A test to check the mathematical correctness of the Naive Bayes
> implementation can be found in
> opennlp.tools.ml.naivebayes.NaiveBayesCorrectnessTest.
>
> So, the inclusion of this code will minimally impact any existing code.
>
> And the code in this patch contains a multinomial Naive Bayesian
> classifier that is verifiably correct.
>
> Is there anything else I have to do to have this patch pulled into the
> OpenNLP code base (for say 1.7.0)?
>
> Cohan Sujay Carlos
> CEO, Aiaioo Labs
> +91-77605-80015
>
> On Tue, May 19, 2015 at 7:21 PM, Cohan Sujay Carlos <[email protected]>
>> wrote:
>>
>>> Tommaso,
>>>
>>> I have created the Jira issue:
>>> https://issues.apache.org/jira/browse/OPENNLP-777
>>>
>>>
