Hi Filipponi,
In this case testnb will not work, because the last part of its code needs
the label in order to print the confusion matrix.

If you want to use your model to predict the possible outcomes for
unlabeled documents, you have to take the class "TestNaiveBayesDriver.java"
as a starting point and write that yourself.

In that class, comment out this section:

/*
      if (bestIdx != Integer.MIN_VALUE) {
        ClassifierResult classifierResult =
            new ClassifierResult(labelMap.get(bestIdx), bestScore);
        analyzer.addInstance(pair.getFirst().toString(), classifierResult);
      }
*/
In that case the output file of BayesTestMapper will store the values for
you. If you run seqdumper on it, you can get the values for the key
"471685156584292353".
Or you can classify the vector in your own code. Suppose you have

Key: /471685156584292353/
Value: /471685156584292353/:{1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086}

Then:

    // "output" is the path of the model written by trainnb
    NaiveBayesModel model = NaiveBayesModel.materialize(output, conf);
    ComplementaryNaiveBayesClassifier classifier =
        new ComplementaryNaiveBayesClassifier(model);
    // "vector" is the tf-idf vector from the Value above
    Vector scores = classifier.classifyFull(vector);

classifyFull returns a vector of scores with one entry per label (n
entries, not n-1). The label with the highest score is the prediction.
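
To turn that score vector into a label, you can do the same thing the
section you comment out does: take the index with the highest score and
look it up in the label index. Below is a minimal sketch in plain Java, so
it runs without Mahout. The double[] stands in for the vector returned by
classifyFull, and the class name, the scores and the index numbering
(0 = negativo, 1 = neutrale, 2 = positivo) are made up for illustration,
not taken from your model.

```java
import java.util.HashMap;
import java.util.Map;

public class BestLabelSketch {

    /** Index of the highest score, or -1 if there are no scores. */
    static int bestIndex(double[] scores) {
        int bestIdx = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > bestScore) {
                bestScore = scores[i];
                bestIdx = i;
            }
        }
        return bestIdx;
    }

    /** Label for the highest score, or null if there are no scores. */
    static String bestLabel(double[] scores, Map<Integer, String> labelMap) {
        int idx = bestIndex(scores);
        return idx < 0 ? null : labelMap.get(idx);
    }

    public static void main(String[] args) {
        // Hypothetical label index; the real one is written by trainnb.
        Map<Integer, String> labelMap = new HashMap<>();
        labelMap.put(0, "negativo");
        labelMap.put(1, "neutrale");
        labelMap.put(2, "positivo");

        // Made-up scores standing in for the classifyFull result.
        double[] scores = {-12.5, -9.1, -10.3};
        System.out.println(bestLabel(scores, labelMap)); // prints "neutrale"
    }
}
```

In your driver you would write the document id together with this label
instead of calling analyzer.addInstance.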
Thanks
Vaibhav.
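
P.S. About the "Label not found: 471685156584292353" exception:
BayesTestMapper takes whatever sits between the first two slashes of the
key as the expected label (the SLASH.split(key.toString())[1] line quoted
below), so for your unlabeled file it picks up the document id. Here is a
small plain-Java sketch of just that split. The class and method names are
mine, and I am assuming SLASH is simply the pattern "/".

```java
import java.util.regex.Pattern;

public class KeyLabelSketch {

    // Assumption: same idea as the SLASH pattern used in BayesTestMapper.
    private static final Pattern SLASH = Pattern.compile("/");

    /** Second "/"-separated field of the key, i.e. what testnb treats as the label. */
    static String expectedLabel(String key) {
        // A key starting with "/" yields an empty first field, so the
        // field after the first slash is at index 1.
        return SLASH.split(key)[1];
    }

    public static void main(String[] args) {
        // Labeled training key: the label comes out as expected.
        System.out.println(expectedLabel("/negativo/468437278663409666"));
        // Unlabeled key: the document id is mistaken for the label.
        System.out.println(expectedLabel("/471685156584292353/"));
    }
}
```

The second call returns the document id, which is not in the label index,
and that is why the confusion-matrix code complains.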








On Tue, Jul 29, 2014 at 9:06 PM, Luca Filipponi <luca.filippon...@gmail.com>
wrote:

> I appreciate your help, but because of my lack of knowledge I didn't
> understand it.
>
> I'll try to explain my problem better :D
>
> What I've done is create a sequence file starting from a CSV like this
> (they are Italian tweets :D):
>
> negativo,471685156584292353, @beppe_grillo intanto .. Piangi tu ... Per
> adesso io rido !!!!!
>
> positivo,471685170698149888,RT @carlucci_cc: @valy_s renzie si preoccupa
> di chi gli garantisce voti...ma stanno scoprendo il prezzo di quei
> fottutissimi #80euro dagli ...
>
> neutrale,471685174426886144,Di #elezioni, di venditori di fumo e di altre
> schifezze... http://t.co/euFbtP7hQ1 ... #Europee2014 via
>
> So I create the sequence file in this way:
>
>     // Split into at most 3 fields, so commas inside the tweet are kept
>     String[] tokens = line.split(",", 3);
>     String label = tokens[0];
>     String id = tokens[1];
>     String message = tokens[2];
>     key.set("/" + label + "/" + id);
>     value.set(message);
>     writer.append(key, value);
>
> So I'm creating a sequence file of the form <Text,Text>, where the key has
> the form "/label/documentID" and the value contains the original text of
> the document.
>
> After this step I create the tf-idf vectors using the Mahout utilities, so
> I have a sequence file <Text,VectorWritable> like this:
>
> Key: /negativo/468437278663409666
> Value:/negativo/468437278663409666:{143:0.2884088933275849,233:0.2884088933275849,241:0.2772479861583959,309:0.22061363650715415}
>
> Then I run this command on the newly created vectors:
>
> ./mahout trainnb -i tfidf-vectors -el -li labelindex -o model -ow -c
>
> And then:
>
> ./mahout testnb -i tfidf-vector -m model -l labelindex -ow -o
> trainingVectorTest-result -c
>
> and this is the output:
>
> 14/07/25 15:44:04 INFO test.TestNaiveBayesDriver: Complementary Results:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :        112    99,115%
> Incorrectly Classified Instances        :          1    0,885%
> Total Classified Instances              :        113
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a    b    c    <--Classified as
> 47   0    0     |  47    a     = negativo
> 0    41   0     |  41    b     = neutrale
> 0    1    24    |  25    c     = positivo
>
> =======================================================
> Statistics
> -------------------------------------------------------
> Kappa                                       0,9361
> Accuracy                                    99,115%
> Reliability                                     74%
> Reliability (standard deviation)            0,4937
>
>
> What I want to do now is use the classifier on a new, unlabeled dataset,
> so I have a CSV like this:
>
> 471685156584292353,@beppe_grillo intanto .. Piangi tu ... Per adesso io
> rido !!!!!
>
> So I wrote a sequence file with:
>
> key = /documentID/, value = content of the document
>
> and then used the Mahout utilities to create the tf-idf vectors:
>
> Key: /471685156584292353/
> Value:/471685156584292353/:{1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086
> ...
>
> But when I use the command testnb on this new dataset I get this exception:
>
> java.lang.IllegalArgumentException: Label not found: 471685156584292353
>
> I know this is because the document ID is taken as the label, but I
> don't know how to resolve it. It would be great if you could provide a
> similar example, because I can't find anything similar.
>
> Thank you so much in advance, your help is really appreciated.
>
> Luca Filipponi.
>
>
> On 29 Jul 2014, at 16:43, vaibhav srivastava <vaibhavcs...@gmail.com>
> wrote:
>
> > Hi,
> > The sequence file format will be <Text, VectorWritable>.
> > Suppose you have test documents named 1, 2, 3, 4.
> > Then the sequence file can have entries like
> >   Key: /test/1  Value: <vectors1>
> >   Key: /test/2  Value: <vectors2>
> >
> > This line in BayesTestMapper shows how the key is used:
> >
> >    // the key is the expected value
> >    context.write(new Text(SLASH.split(key.toString())[1]),
> >        new VectorWritable(result));
> >
> > TestNaiveBayesDriver.java might also help you. If you remove the part
> > below from its code, you will not get the confusion matrix, and the
> > initial labels are not required.
> >
> >     if (bestIdx != Integer.MIN_VALUE) {
> >       ClassifierResult classifierResult =
> >           new ClassifierResult(labelMap.get(bestIdx), bestScore);
> >       analyzer.addInstance(pair.getFirst().toString(), classifierResult);
> >     }
> >
> >
> > Your output file will then contain your document name (say, 1) and the
> > label vector with its values.
> >
> >
> > Hope this helps.
> >
> > Thanks,
> >
> > Vaibhav
> >
> > vaibhavcs...@gmail.com
> >
> >
> >
> >
> > On Tue, Jul 29, 2014 at 7:16 PM, Luca Filipponi
> > <luca.filippon...@gmail.com> wrote:
> >
> >> I am using Mahout 0.9. Which part of the source code should I look at?
> >>
> >> My problem is that I don't know how the sequence file without the label
> >> should be structured.
> >>
> >> Do you have any hints?
> >>
> >> On 29 Jul 2014, at 15:24, vaibhav srivastava <vaibhavcs...@gmail.com>
> >> wrote:
> >>
> >>> Hi,
> >>> If you want to create a test set and do not want to measure accuracy,
> >>> you can create an instance of the classifier, load your model into it,
> >>> and then find the best score.
> >>> Look at the naive Bayes test code.
> >>> Hope this helps. Thanks.
> >>> On 29 Jul 2014 12:53, "Luca Filipponi" <luca.filippon...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi, I am trying to develop sentiment analysis on Italian tweets from
> >>>> Twitter using the naive Bayes classifier, but I've run into some
> >>>> trouble.
> >>>>
> >>>> My idea was to classify a lot of tweets as positive, negative or
> >>>> neutral and use them as the training set for the classifier. To do
> >>>> that I've written a sequence file in the format <Text,Text>, where the
> >>>> key is /label/tweetID and the value is the text; the text of the whole
> >>>> dataset is then converted to tf-idf vectors using the Mahout utilities.
> >>>>
> >>>> Then I'm using the command:
> >>>>
> >>>> ./mahout trainnb and ./mahout testnb to check the classifier, and the
> >>>> score is right (I get nearly 100% because the test set is the same as
> >>>> the training set).
> >>>>
> >>>> My question is: if I want to use a test set that is unlabeled, how
> >>>> should it be created? If the format isn't like
> >>>>
> >>>> key = /label/
> >>>>
> >>>> the classifier can't find the label and I get an exception.
> >>>>
> >>>> But a new dataset will obviously be unlabeled, because I need to
> >>>> classify it, so I don't know what to put in the key of the sequence
> >>>> file.
> >>>>
> >>>> I've searched online for examples, but the only ones I've found use
> >>>> the split command on the original dataset and then test on part of it,
> >>>> which isn't my case.
> >>>>
> >>>>
> >>>> Any idea for developing a better sentiment analysis is welcome. Thanks
> >>>> in advance for the help.
> >>>>
> >>>>
> >>
> >>
> >
> >
> > --
> > Thanks and Regards,
> > Vaibhav Srivastava
> > Email-id: vaibhavcs...@gmail.com
> > Mobile no.: 9552543029
>
>

