Hi,

First of all, I'm sorry that my previous mail was vague and poorly formulated. Yes, Suneel understood exactly what I was asking. Both options will address my requirement.

Thanks a lot.
-Tharindu

On Mar 19, 2014 8:51 AM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
> Tharindu,
>
> If I understand what you are trying to do:
>
> a) You have a trained Bayes model.
> b) You would like to classify new documents using this trained model.
> c) You were trying to use TestNaiveBayesDriver to classify the documents
>    in (b).
>
> Option 1:
> -----------
>
> You could write a custom MapReduce job that creates sequence files from
> the documents (without the label key). You could feed these sequence files
> to seq2sparse to generate your vectors, then call TestNaiveBayes with this
> input. Let me know if you need code for the earlier part.
>
> Option 2:
> -----------
>
> Work with your existing tf-idf vectors generated from seqdirectory ->
> seq2sparse. But instead of invoking Mahout TestNaiveBayes, create a custom
> MapReduce job (or a plain Java program if that's fine with you) that
> basically does the following:
>
> a) Instantiate a classifier with the trained model (pseudo code below):
>
>   NaiveBayesModel naiveBayesModel = NaiveBayesModel.materialize(
>       new Path(outputDir.getAbsolutePath()), conf);
>
>   AbstractVectorClassifier classifier =
>       new StandardNaiveBayesClassifier(naiveBayesModel);
>
>   // Parse through the input tf-idf vectors <Text, VectorWritable> and
>   // feed them to the classifier
>   for (Pair<Text,VectorWritable> record :
>       new SequenceFileDirIterable<Text,VectorWritable>(getInputPath(),
>           PathType.LIST, PathFilters.logsCRCFilter(), null, true, conf)) {
>     // invoke the classifier on the incoming vector
>     Vector result = classifier.classifyFull(record.getSecond().get());
>     // emit the document key together with its per-label scores
>     context.write(record.getFirst(), new VectorWritable(result));
>   }
>
> You can have the above code as part of a mapper in an MR job.
>
> On Tuesday, March 18, 2014 5:49 PM, Kevin Moulart <kevinmoul...@gmail.com>
> wrote:
>
> To use Naive Bayes you need a SequenceFile <Text, VectorWritable> with the
> key formatted like this, "label/label", for some reason. I checked the
> sources to be sure, and it parses the key looking for a '/'.
>
> When you used seqdirectory, it told Naive Bayes to classify the content of
> each file (e.g. file1.txt) with the label corresponding to its name (here,
> file1.txt). So when you tried testing with input0.txt, it failed because
> input0.txt was not a valid label.
>
> I designed a MapReduce Java job that transforms a CSV with numeric values
> into a proper SequenceFile. If you want, you can take the source and update
> it to suit your needs: https://github.com/kmoulart/hadoop_mahout_utils
>
> Good luck.
>
> Kévin Moulart
>
> 2014-03-18 20:13 GMT+01:00 Frank Scholten <fr...@frankscholten.nl>:
>
> > Hi Tharindu,
> >
> > If I understand correctly, seqdirectory creates labels based on the file
> > name, but this is not what you want. What do you want the labels to be?
> >
> > Cheers,
> >
> > Frank
> >
> > On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira
> > <tharindurus...@gmail.com> wrote:
> >
> > > Hi everyone,
> > > I'm developing an application where I need to train a Naive Bayes
> > > classification model and use this model to classify new entities (in
> > > this case, text files based on their content).
> > >
> > > I observed that the seqdirectory command always adds the file/directory
> > > name as the "key" field for each document, which will be used as the
> > > label in classification jobs.
> > > This makes sense when I need to train a model and create the labelindex,
> > > since I have organized my training data according to their labels in
> > > separate directories.
> > >
> > > Now I'm trying to use this model and infer the best label for an unknown
> > > document.
> > > My requirement is to ask Mahout to read my new file and output the
> > > predicted category by looking at the labelindex and the tf-idf vector of
> > > the new content.
> > > I tried creating vectors from the new content (seqdirectory and
> > > seq2sparse), and then using these vectors to run the testnb command.
> > > But unfortunately the seqdirectory command adds file names as labels,
> > > which does not make sense in classification.
> > >
> > > The following error message further demonstrates this behavior.
> > > input0.txt is the file name of my new document.
> > >
> > > [main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while classifying documents
> > > java.lang.IllegalArgumentException: Label not found: input0.txt
> > >     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
> > >     at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
> > >     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
> > >     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
> > >     at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
> > >     at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:70)
> > >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.analyzeResults(TestNaiveBayesDriver.java:160)
> > >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:125)
> > >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.main(TestNaiveBayesDriver.java:66)
> > >
> > > So how can I achieve what I'm trying to do here?
> > >
> > > Thanks,
> > >
> > > --
> > > M.P. Tharindu Rusira Kumara
> > >
> > > Department of Computer Science and Engineering,
> > > University of Moratuwa,
> > > Sri Lanka.
> > > +94757033733
> > > www.tharindu-rusira.blogspot.com
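
A note on Option 2 from this thread: the vector returned by classifyFull holds one score per label, so one more step is needed to turn it into a predicted category. Below is a minimal sketch of that step in plain Java, without any Mahout dependency. It assumes the scores have already been copied into a double[] and that the label index is available as a Map<Integer, String> (in Mahout this map could probably be obtained with BayesUtils.readLabelIndex, which is an assumption about the version in use and is not shown here). The class name LabelPicker, the method bestLabel, and the sample labels and scores are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: maps a score vector to a label.
final class LabelPicker {

    // scores[i] is the classifier score for the label with index i.
    // Returns the label with the highest score, or null if scores is empty.
    static String bestLabel(double[] scores, Map<Integer, String> labelIndex) {
        if (scores.length == 0) {
            return null;
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return labelIndex.get(best);
    }

    public static void main(String[] args) {
        // Made-up label index and scores, for illustration only.
        Map<Integer, String> labelIndex = new HashMap<>();
        labelIndex.put(0, "business");
        labelIndex.put(1, "sports");
        labelIndex.put(2, "tech");

        double[] scores = {-120.5, -98.2, -143.0};

        // The document key (here the file name) is only an identifier,
        // not a label, so no "Label not found" check is involved.
        System.out.println("input0.txt -> " + bestLabel(scores, labelIndex));
    }
}
```

With this approach the key written by seqdirectory stays a plain document identifier, and the predicted category comes only from the scores and the label index built during training.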