[ 
https://issues.apache.org/jira/browse/OPENNLP-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632413#comment-14632413
 ] 

Cohan Sujay Carlos commented on OPENNLP-777:
--------------------------------------------

In implementing the Naive Bayes classifier, we tried to ensure minimal 
disruption to existing code.

The only changes to existing code are as follows:

1.  The opennlp.tools.ml.model.AbstractModel class has been changed to include 
a new model type:

line 35:  public enum ModelType {Maxent,Perceptron,MaxentQn,NaiveBayes};

2.  The opennlp.tools.ml.model.GenericModelReader class has been changed in one 
place:

line 53:
    else if (modelType.equals("NaiveBayes")) {
        delegateModelReader = new NaiveBayesModelReader(this.dataReader);
    }

3.  The opennlp.tools.ml.model.GenericModelWriter class has been changed in two 
places:

line 79:
    if (model.getModelType() == ModelType.NaiveBayes) {
        delegateWriter = new BinaryNaiveBayesModelWriter(model,dos);
    }

line 91:
    if (model.getModelType() == ModelType.NaiveBayes) {
        delegateWriter = new PlainTextNaiveBayesModelWriter(model,bw);
    }

4.  The initializer of the opennlp.tools.ml.TrainerFactory class has been 
changed in one place to add in the built-in Naive Bayes trainer:

line 51:
    _trainers.put(NaiveBayesTrainer.NAIVE_BAYES_VALUE, NaiveBayesTrainer.class);

That was it!

We didn't change anything else in the existing OpenNLP code.
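
The four changes above share one pattern: add a new type name, then add one 
branch (or one registry entry) that hands off to a Naive Bayes delegate.  
Here is a minimal, self-contained sketch of that pattern.  Every name in it 
(DispatchSketch, Reader, readerFor and the two reader classes) is a 
hypothetical stand-in, not an OpenNLP class; only the enum constants are 
copied from the comment.

```java
// Hypothetical stand-ins for illustration; none of these are OpenNLP classes.
class DispatchSketch {

    // Same constants as the enum quoted above, with NaiveBayes appended last.
    enum ModelType { Maxent, Perceptron, MaxentQn, NaiveBayes }

    // Assumed minimal reader interface: reports which model type it handles.
    interface Reader {
        ModelType reads();
    }

    static final class PerceptronReader implements Reader {
        public ModelType reads() { return ModelType.Perceptron; }
    }

    static final class NaiveBayesReader implements Reader {
        public ModelType reads() { return ModelType.NaiveBayes; }
    }

    // Picks a delegate from the type name stored in the model file.
    // Supporting a new model type costs exactly one extra branch.
    static Reader readerFor(String modelType) {
        if (modelType.equals("Perceptron")) {
            return new PerceptronReader();
        } else if (modelType.equals("NaiveBayes")) {
            return new NaiveBayesReader();
        }
        throw new IllegalArgumentException("Unknown model type: " + modelType);
    }

    public static void main(String[] args) {
        System.out.println(readerFor("NaiveBayes").reads());
    }
}
```

Because existing branches are untouched, models of the older types take 
exactly the same path as before.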

All the new code for the Naive Bayesian classifier sits in the package 
opennlp.tools.ml.naivebayes, which sorts just above the perceptron package ;)

The code for the document categorizer using the Naive Bayesian classifier can 
be found in opennlp.tools.doccat (we didn't have to change any existing code).  
The new categorizer is called opennlp.tools.doccat.DocumentCategorizerNB 
(mirroring the name of the maxent document categorizer, which is 
DocumentCategorizerME).

Proof of correctness!

I have included two test cases:

1.  A test to validate the document categorizer: under the tests folder, you 
will find opennlp.tools.doccat.DocumentCategorizerNBTest, which runs the same 
tests that were run on the ME document categorizer, but on the Naive Bayes 
categorizer instead (all tests passed).

2.  A test to check the mathematical correctness of the Naive Bayes 
implementation can be found in 
opennlp.tools.ml.naivebayes.NaiveBayesCorrectnessTest.
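
To give a feel for what a mathematical check can look like, here is a 
small hand-verifiable sketch of a multinomial Naive Bayes classifier with 
add-one (Laplace) smoothing.  It is not the code from the patch and does 
not reproduce NaiveBayesCorrectnessTest; the class TinyMultinomialNB, its 
methods and the toy data are all assumptions made for illustration.  It 
also uses Java 8 conveniences, unlike the Java 1.5-compatible patch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch; not a class from the patch.
class TinyMultinomialNB {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Records one training document with the given label.
    void train(String label, String... tokens) {
        docsPerLabel.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
            tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            tokensPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // log P(label) + sum over tokens of log P(token | label), where
    // P(token | label) = (count + 1) / (tokens in label + vocabulary size).
    // The label must have been seen during training.
    double logScore(String label, String... tokens) {
        Map<String, Integer> counts = tokenCounts.get(label);
        double score = Math.log((double) docsPerLabel.get(label) / totalDocs);
        double denominator =
            tokensPerLabel.getOrDefault(label, 0) + vocabulary.size();
        for (String token : tokens) {
            score += Math.log((counts.getOrDefault(token, 0) + 1.0) / denominator);
        }
        return score;
    }

    // Returns the label with the highest log score.
    String classify(String... tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            double score = logScore(label, tokens);
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyMultinomialNB nb = new TinyMultinomialNB();
        nb.train("sports", "ball", "goal", "ball");
        nb.train("politics", "vote", "law");
        System.out.println(nb.classify("ball")); // prints "sports"
    }
}
```

With this toy data the vocabulary has 4 tokens and each prior is 1/2, so 
P(ball | sports) = (2 + 1) / (3 + 4) = 3/7 and the joint score for 
"sports" is 1/2 * 3/7 = 3/14, against 1/2 * 1/6 = 1/12 for "politics".  
Numbers this small can be confirmed on paper.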

So including this code has minimal impact on existing code, and the two 
tests above let anyone verify the correctness of the latest patch.

> Naive Bayesian Classifier
> -------------------------
>
>                 Key: OPENNLP-777
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-777
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Machine Learning
>         Environment: J2SE 1.5 and above
>            Reporter: Cohan Sujay Carlos
>            Priority: Minor
>              Labels: NBClassifier, bayes, bayesian, classifier, multinomial, 
> naive
>         Attachments: 
> naive-bayes-classifier-for-opennlp-1.6.0-rc6-with-test-cases.patch, 
> naive-bayes-patch-for-opennlp-1.6.0-rc6.patch, topics.train
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> I thought it would be nice to have a Naive Bayesian classifier in OpenNLP (it 
> lacks one at present).
> Implementation details:  We have a production-hardened piece of Java code for 
> a multinomial Naive Bayesian classifier (with default Laplace smoothing) that 
> we'd like to contribute.  The code is Java 1.5 compatible.  I'd have to write 
> an adapter to make the interface compatible with the ME classifier in 
> OpenNLP.  I expect the patch to be available 1 to 3 weeks from now.
> Below is the email trail of a discussion in the dev mailing list around this 
> dated May 19th, 2015.
> <snip>
> Tommaso Teofili via opennlp.apache.org 
> to dev 
> Hi Cohan,
> I think that'd be a very valuable contribution, as NB is one of the
> foundation algorithms, often used as basis for comparisons.
> It would be good if you could create a Jira issue and provide more details
> about the implementation and, eventually, a patch.
> Thanks and regards,
> Tommaso
> </snip>
> 2015-05-19 9:57 GMT+02:00 Cohan Sujay Carlos 
> > I have a question for the OpenNLP project team.
> >
> > I was wondering if there is a Naive Bayesian classifier implementation in
> > OpenNLP that I've not come across, or if there are plans to implement one.
> >
> > If it is the latter, I should love to contribute an implementation.
> >
> > There is an ME classifier already available in OpenNLP, of course, but I
> > felt that there was an unmet need for a Naive Bayesian (NB) classifier
> > implementation to be offered as well.
> >
> > An NB classifier could be bootstrapped up with partially labelled training
> > data as explained in the Nigam, McCallum, et al paper of 2000 "Text
> > Classification from Labeled and Unlabeled Documents using EM".
> >
> > So, if there isn't an NB code base out there already, I'd be happy to
> > contribute a very solid implementation that we've used in production for a
> > good 5 years.
> >
> > I'd have to adapt it to load the same training data format as the ME
> > classifier, but I guess that shouldn't be very difficult to do.
> >
> > I was wondering if there was some interest in adding an NB implementation
> > and I'd love to know who could I coordinate with if there is?
> >
> > Cohan Sujay Carlos
> > CEO, Aiaioo Labs, India



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
