[ 
https://issues.apache.org/jira/browse/OPENNLP-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cohan Sujay Carlos updated OPENNLP-777:
---------------------------------------
    Attachment: topics.train

The attached training file, 'topics.train', can be used to train a Naive Bayes 
classifier model. The file has to be placed in a directory named 
'corpora/topics'. The code to train and save a model looks as follows:

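import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
// DocumentCategorizerNB is added by the attached patch; its package is
// assumed here to be opennlp.tools.doccat.
import opennlp.tools.doccat.DocumentCategorizerNB;

public class D1TopicClassifierTrainingDemoNB {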
        public static void main(String[] args) {
                
                DoccatModel model = null;

                InputStream dataIn = null;
                try {
                  dataIn = new FileInputStream("corpora/topics/topics.train");
                  ObjectStream<String> lineStream =
                                new PlainTextByLineStream(dataIn, "UTF-8");
                  ObjectStream<DocumentSample> sampleStream =
                                new DocumentSampleStream(lineStream);

                  model = DocumentCategorizerNB.train("en", sampleStream);
                }
                catch (IOException e) {
                  // Failed to read or parse training data, training failed
                  e.printStackTrace();
                }
                finally {
                  if (dataIn != null) {
                    try {
                      dataIn.close();
                    }
                    catch (IOException e) {
                      // Not an issue, training already finished.
                      // The exception should be logged and investigated
                      // if part of a production system.
                      e.printStackTrace();
                    }
                  }
                }
                if (model == null) {
                  // Training failed, so there is no model to save
                  return;
                }
                String modelFile = "models/topics_nb.bin";
                OutputStream modelOut = null;
                try {
                  modelOut = new BufferedOutputStream(
                                new FileOutputStream(modelFile));
                  model.serialize(modelOut);
                }
                catch (IOException e) {
                  // Failed to save model
                  e.printStackTrace();
                }
                finally {
                  if (modelOut != null) {
                    try {
                       modelOut.close();
                    }
                    catch (IOException e) {
                      // Failed to correctly save model.
                      // Written model might be invalid.
                      e.printStackTrace();
                    }
                  }
                }
        }
}
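
A note on the input: the training file is read here through 
DocumentSampleStream, which treats each line as one document whose first 
whitespace-separated token is the category and whose remaining tokens are the 
document text. The sketch below only illustrates that layout. The class name 
and the sample line are made up for this example and are not taken from 
'topics.train' or from OpenNLP.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of OpenNLP or the patch. */
public class TrainingLineDemo {

    /** Returns the category: the first whitespace-separated token. */
    static String category(String line) {
        return line.trim().split("\\s+")[0];
    }

    /** Returns the document tokens: everything after the category. */
    static String[] tokens(String line) {
        String[] parts = line.trim().split("\\s+");
        return Arrays.copyOfRange(parts, 1, parts.length);
    }

    public static void main(String[] args) {
        // Made-up sample line in "category token token ..." form
        String line = "sports Goals from two strikers ensured a comfortable night";
        System.out.println(category(line) + " -> " + Arrays.toString(tokens(line)));
    }
}
```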

The model will be written to the directory "models" (which must already exist, 
since FileOutputStream does not create missing directories) and can be loaded 
and used as follows:

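import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.ml.model.AbstractModel;
import opennlp.tools.ml.model.IndexHashTable;
import opennlp.tools.util.InvalidFormatException;
// DocumentCategorizerNB is added by the attached patch; its package is
// assumed here to be opennlp.tools.doccat.
import opennlp.tools.doccat.DocumentCategorizerNB;

public class D1TopicClassifierUsageDemoNB {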
        public static void main(String[] args) {
                
                // String paragraph = "Although the outfit has been banned, no "
                //     + "restriction has been imposed on movement of its leaders outside "
                //     + "Pakistan-occupied-Kashmir (PoK) but they could not conduct their organisational "
                //     + "activities in the country, Interior Minister Faisel Saleh Hayat said.";
                String paragraph = "Rumours before the game suggested the "
                    + "Portuguese would be out at the end of the season if Inter failed to progress "
                    + "but in the end there was little to worry about as goals from Samuel Eto'o and "
                    + "Mario Balotelli ensured a comfortable night.";
                
                // Always start with a model; a model is learned from training data
                InputStream is = null;
                try {
                        is = new FileInputStream("models/topics_nb.bin");
                        
                        DoccatModel model = new DoccatModel(is);
                        
                        AbstractModel internalModel =
                                (AbstractModel) model.getMaxentModel();
                        
                        System.out.println("ModelType: " + internalModel.getModelType());
                        System.out.println("Model Outcomes: ");
                        Object[] data = internalModel.getDataStructures();
                        for (String val : (String[]) internalModel.getDataStructures()[2]) {
                                System.out.println(val);
                        }
                        IndexHashTable<String> pmap = (IndexHashTable<String>) data[1];
                    //String[] PRED_LABELS = new String[pmap.size()];
                    //pmap.toArray(PRED_LABELS);
                        //Context[] contexts = (Context[])data[0];
                        //System.out.println("Pred labels: ");
                        //for (String label : PRED_LABELS) {
                        //      System.out.println(label + " " + pmap.get(label)
                        //              + " " + contexts[pmap.get(label)].getOutcomes().length
                        //              + " " + contexts[pmap.get(label)].getOutcomes()[0]
                        //              + " " + contexts[pmap.get(label)].getParameters()[0]);
                        //}
                        System.out.println("Running the classifier: ");
                        
                        DocumentCategorizerNB categorizer =
                                new DocumentCategorizerNB(model);
                        
                        double[] results = categorizer.categorize(paragraph);

                        String bestResult = categorizer.getBestCategory(results);
                        
                        System.out.println(bestResult);
                        
                        results = categorizer.categorize("government");

                        bestResult = categorizer.getBestCategory(results);
                        
                        System.out.println(bestResult);
                        
                        results = categorizer.categorize("");

                        bestResult = categorizer.getBestCategory(results);
                        
                        System.out.println(bestResult);
                        
                } catch (FileNotFoundException e) {
                        e.printStackTrace();
                } catch (InvalidFormatException e) {
                        e.printStackTrace();
                } catch (IOException e) {
                        e.printStackTrace();
                } finally {
                        if (is != null) {
                                try {
                                        is.close();
                                }
                                catch (IOException e) {
                                        // Nothing to do: the model has already been read.
                                }
                        }
                }
        }
}
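
For readers who have not met the technique, the issue description quoted 
below refers to a multinomial Naive Bayesian classifier with Laplace 
smoothing. The following is a minimal, self-contained sketch of that idea 
under simple assumptions (add-one smoothing, whitespace tokens, log-space 
scores). It is not the code in the attached patch, the class and method names 
are hypothetical, and it uses Java 8 features, unlike the Java 1.5 compatible 
contribution.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not the classifier in the attached patch. */
public class TinyMultinomialNB {
    private final Map<String, Integer> docsPerCategory = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerCategory = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one training document: a category label and its tokens. */
    public void add(String category, String[] tokens) {
        docsPerCategory.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
            tokenCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            tokensPerCategory.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** log P(category) + sum of log P(token | category), add-one smoothed. */
    public double logScore(String category, String[] tokens) {
        double score = Math.log((double) docsPerCategory.get(category) / totalDocs);
        Map<String, Integer> counts = tokenCounts.get(category);
        int total = tokensPerCategory.getOrDefault(category, 0);
        for (String token : tokens) {
            int count = counts.getOrDefault(token, 0);
            // Laplace smoothing: unseen tokens still get a non-zero probability
            score += Math.log((count + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    /** Returns the category with the highest log score. */
    public String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docsPerCategory.keySet()) {
            double score = logScore(category, tokens);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyMultinomialNB nb = new TinyMultinomialNB();
        nb.add("sports", "goals game season".split(" "));
        nb.add("politics", "minister government country".split(" "));
        System.out.println(nb.classify("government minister".split(" "))); // prints "politics"
    }
}
```

With no tokens at all, the score in this sketch reduces to the category prior, 
so an empty input is assigned to whichever category has the most training 
documents.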


> Naive Bayesian Classifier
> -------------------------
>
>                 Key: OPENNLP-777
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-777
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Machine Learning
>         Environment: J2SE 1.5 and above
>            Reporter: Cohan Sujay Carlos
>            Priority: Minor
>              Labels: NBClassifier, bayes, bayesian, classifier, multinomial, 
> naive
>         Attachments: naive-bayes-patch-for-opennlp-1.6.0-rc6.patch, 
> topics.train
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> I thought it would be nice to have a Naive Bayesian classifier in OpenNLP (it 
> lacks one at present).
> Implementation details:  We have a production-hardened piece of Java code for 
> a multinomial Naive Bayesian classifier (with default Laplace smoothing) that 
> we'd like to contribute.  The code is Java 1.5 compatible.  I'd have to write 
> an adapter to make the interface compatible with the ME classifier in 
> OpenNLP.  I expect the patch to be available 1 to 3 weeks from now.
> Below is the email trail of a discussion in the dev mailing list around this 
> dated May 19th, 2015.
> <snip>
> Tommaso Teofili via opennlp.apache.org 
> to dev 
> Hi Cohan,
> I think that'd be a very valuable contribution, as NB is one of the
> foundation algorithms, often used as basis for comparisons.
> It would be good if you could create a Jira issue and provide more details
> about the implementation and, eventually, a patch.
> Thanks and regards,
> Tommaso
> </snip>
> 2015-05-19 9:57 GMT+02:00 Cohan Sujay Carlos 
> > I have a question for the OpenNLP project team.
> >
> > I was wondering if there is a Naive Bayesian classifier implementation in
> > OpenNLP that I've not come across, or if there are plans to implement one.
> >
> > If it is the latter, I should love to contribute an implementation.
> >
> > There is an ME classifier already available in OpenNLP, of course, but I
> > felt that there was an unmet need for a Naive Bayesian (NB) classifier
> > implementation to be offered as well.
> >
> > An NB classifier could be bootstrapped up with partially labelled training
> > data as explained in the Nigam, McCallum, et al paper of 2000 "Text
> > Classification from Labeled and Unlabeled Documents using EM".
> >
> > So, if there isn't an NB code base out there already, I'd be happy to
> > contribute a very solid implementation that we've used in production for a
> > good 5 years.
> >
> > I'd have to adapt it to load the same training data format as the ME
> > classifier, but I guess that shouldn't be very difficult to do.
> >
> > I was wondering if there was some interest in adding an NB implementation
> > and I'd love to know who could I coordinate with if there is?
> >
> > Cohan Sujay Carlos
> > CEO, Aiaioo Labs, India



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
