The Naive Bayes classifier is ready! I have created a patch (over OpenNLP 1.6.0 rc6).
The patch is attached to the issue we opened (OPENNLP-777) on Jira for this feature:
https://issues.apache.org/jira/browse/OPENNLP-777

Here are a couple of scripts you can use to exercise the Naive Bayes classifier. You can train a model as follows:

public class D1TopicClassifierTrainingDemoNB {
    public static void main(String[] args) {
        DoccatModel model = null;
        InputStream dataIn = null;
        try {
            dataIn = new FileInputStream("corpora/topics/topics.train");
            ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
            model = DocumentCategorizerNB.train("en", sampleStream);
        } catch (IOException e) {
            // Failed to read or parse training data, training failed
            e.printStackTrace();
        } finally {
            if (dataIn != null) {
                try {
                    dataIn.close();
                } catch (IOException e) {
                    // Not an issue, training already finished.
                    // The exception should be logged and investigated
                    // if part of a production system.
                    e.printStackTrace();
                }
            }
        }

        String modelFile = "models/topics_nb.bin";
        OutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
            model.serialize(modelOut);
        } catch (IOException e) {
            // Failed to save model
            e.printStackTrace();
        } finally {
            if (modelOut != null) {
                try {
                    modelOut.close();
                } catch (IOException e) {
                    // Failed to correctly save model.
                    // Written model might be invalid.
                    e.printStackTrace();
                }
            }
        }
    }
}

The training data ('topics.train') is also attached to the Jira issue. It has to be placed in a directory named 'corpora/topics' under the current directory. Running the script will create a model file named 'topics_nb.bin' in the 'models' folder, also under the current directory.
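For anyone trying this out, each line of the training file follows the usual OpenNLP doccat format: a whitespace-delimited category label, then the document text. A quick sketch of how such a line splits into a label and text (the category name and sentence here are made up for illustration):

```java
public class TrainingLineDemo {

    // Returns the category, i.e. the first whitespace-delimited token.
    static String categoryOf(String line) {
        return line.split("\\s+", 2)[0];
    }

    // Returns the document text that follows the category label.
    static String textOf(String line) {
        String[] parts = line.split("\\s+", 2);
        return parts.length > 1 ? parts[1] : "";
    }

    public static void main(String[] args) {
        // A hypothetical line from a doccat training file.
        String line = "Sports Inter beat Chelsea with goals from Eto'o and Balotelli";
        System.out.println("category: " + categoryOf(line));
        System.out.println("text: " + textOf(line));
    }
}
```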
The following script lets you run some classification tasks using the above model:

public class D1TopicClassifierUsageDemoNB {
    public static void main(String[] args) {
        String paragraph = "Rumours before the game suggested the Portuguese would be out at the end of the season if Inter failed to progress but in the end there was little to worry about as goals from Samuel Eto'o and Mario Balotelli ensured a comfortable night.";

        // always start with a model, a model is learned from training data
        InputStream is = null;
        try {
            is = new FileInputStream("models/topics_nb.bin");
            DoccatModel model = new DoccatModel(is);

            AbstractModel internalModel = (AbstractModel) model.getMaxentModel();
            System.out.println("ModelType: " + internalModel.getModelType());

            System.out.println("Model Outcomes: ");
            Object[] data = internalModel.getDataStructures();
            for (String val : (String[]) data[2]) {
                System.out.println(val);
            }

            IndexHashTable<String> pmap = (IndexHashTable<String>) data[1];
            //String[] PRED_LABELS = new String[pmap.size()];
            //pmap.toArray(PRED_LABELS);
            //Context[] contexts = (Context[]) data[0];
            //System.out.println("Pred labels: ");
            //for (String label : PRED_LABELS) {
            //    System.out.println(label + " " + pmap.get(label) + " "
            //            + contexts[pmap.get(label)].getOutcomes().length + " "
            //            + contexts[pmap.get(label)].getOutcomes()[0] + " "
            //            + contexts[pmap.get(label)].getParameters()[0]);
            //}

            System.out.println("Running the classifier: ");
            DocumentCategorizerNB categorizer = new DocumentCategorizerNB(model);
            double[] results = categorizer.categorize(paragraph);
            String bestResult = categorizer.getBestCategory(results);
            System.out.println(bestResult);

            results = categorizer.categorize("government");
            bestResult = categorizer.getBestCategory(results);
            System.out.println(bestResult);

            results = categorizer.categorize("");
            bestResult = categorizer.getBestCategory(results);
            System.out.println(bestResult);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (InvalidFormatException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (is != null) {
                try {
                    is.close();
                } catch (IOException e) {
                    // Ignored; nothing left to do with the stream.
                }
            }
        }
    }
}

The output looks mathematically correct, but I haven't done any exhaustive testing. If you can take a look at the organization of this code and OK it, I'll proceed with creating the verification scripts and test cases for it.

Warm regards,

Cohan Sujay Carlos
CEO, Aiaioo Labs
+91-77605-80015 +91-80-4125-0730

On Tue, May 19, 2015 at 7:21 PM, Cohan Sujay Carlos <[email protected]> wrote:

> Tommaso,
>
> I have created the Jira issue:
> https://issues.apache.org/jira/browse/OPENNLP-777
>
> The details of the Java version compatibility and the classifier's
> internals are as follows:
>
> "Implementation details: We have a production-hardened piece of Java code
> for a multinomial Naive Bayesian classifier (with default Laplace
> smoothing) that we'd like to contribute. The code is Java 1.5 compatible.
> I'd have to write an adapter to make the interface compatible with the ME
> classifier in OpenNLP. I expect the patch to be available 1 to 3 weeks
> from now."
>
> This is the default configuration, but the code is well refactored and
> you can actually plug in any smoothing algorithm and any feature set. It
> also has some support for succinct memory models, and I later plan to add
> a multivariate Bernoulli implementation as well (I wanted to start with
> the multinomial version because the advantages of the multinomial model
> will make it the better performer for most NLP projects).
>
> I could not figure out how to assign the issue to myself. The patch will
> be available 1 to 3 weeks from now.
>
> Thanks and regards,
>
> Cohan Sujay Carlos
>
>
> On Tue, May 19, 2015 at 5:26 PM, Tommaso Teofili <
> [email protected]> wrote:
>
>> Hi Cohan,
>>
>> I think that'd be a very valuable contribution, as NB is one of the
>> foundation algorithms, often used as a basis for comparisons.
>> It would be good if you could create a Jira issue and provide more
>> details about the implementation and, eventually, a patch.
>>
>> Thanks and regards,
>> Tommaso
>>
>> 2015-05-19 9:57 GMT+02:00 Cohan Sujay Carlos <[email protected]>:
>>
>> > I have a question for the OpenNLP project team.
>> >
>> > I was wondering if there is a Naive Bayesian classifier implementation
>> > in OpenNLP that I've not come across, or if there are plans to
>> > implement one.
>> >
>> > If it is the latter, I should love to contribute an implementation.
>> >
>> > There is an ME classifier already available in OpenNLP, of course, but
>> > I felt that there was an unmet need for a Naive Bayesian (NB)
>> > classifier implementation to be offered as well.
>> >
>> > An NB classifier could be bootstrapped up with partially labelled
>> > training data as explained in the Nigam, McCallum, et al. paper of
>> > 2000, "Text Classification from Labeled and Unlabeled Documents using
>> > EM".
>> >
>> > So, if there isn't an NB code base out there already, I'd be happy to
>> > contribute a very solid implementation that we've used in production
>> > for a good 5 years.
>> >
>> > I'd have to adapt it to load the same training data format as the ME
>> > classifier, but I guess that shouldn't be very difficult to do.
>> >
>> > I was wondering if there was some interest in adding an NB
>> > implementation, and I'd love to know who I could coordinate with if
>> > there is.
>> >
>> > Cohan Sujay Carlos
>> > CEO, Aiaioo Labs, India
>> > +91-77605-80015 +91-80-4125-0730
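For readers following the thread, the core computation being discussed, multinomial Naive Bayes with Laplace (add-one) smoothing, can be sketched in a few lines of Java. This is a simplified illustration only, not the code in the attached patch (and it uses Java 8 conveniences for brevity, unlike the Java 1.5-compatible contribution):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Count token occurrences per category, plus per-category document counts.
    public void train(String category, String[] tokens) {
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Pick the category maximizing log P(c) + sum_i log P(token_i | c),
    // where P(token | c) uses Laplace (add-one) smoothing.
    public String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            double score = Math.log((double) docCounts.get(category) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String token : tokens) {
                int count = counts.getOrDefault(token, 0);
                // add-one smoothing: unseen tokens get a small nonzero probability
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("Sports", new String[] {"goal", "match", "league"});
        nb.train("Politics", new String[] {"government", "election", "vote"});
        System.out.println(nb.classify(new String[] {"goal", "league"}));
    }
}
```

The smoothing is what makes an unseen token contribute a small finite penalty instead of zeroing out a category's probability, which is why a pluggable smoothing algorithm (as mentioned above) is a sensible extension point.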
