Hi,

The sequence file should use Text keys and VectorWritable values.
Suppose you have test documents named 1, 2, 3 and 4.
Then the sequence file would contain entries such as:
Key : /test/1 Value : <vectors1>
Key : /test/2 Value : <vectors2>

See this line in BayesTestMapper:

    // the key is the expected value
    context.write(new Text(SLASH.split(key.toString())[1]), new VectorWritable(result));
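
To make it easier to see what that line does with a key, here is a small standalone sketch using only the JDK. It assumes SLASH is a plain "/" regex pattern (I believe that is how the mapper declares it, but please check your Mahout version); the class name KeySplitDemo and the helper splitKey are made up for illustration only.

```java
import java.util.regex.Pattern;

final class KeySplitDemo {
    // Assumption: same pattern as the SLASH constant used by the mapper.
    private static final Pattern SLASH = Pattern.compile("/");

    // Hypothetical helper: split a sequence-file key into its path segments.
    static String[] splitKey(String key) {
        return SLASH.split(key);
    }

    public static void main(String[] args) {
        String[] parts = splitKey("/test/1");
        for (int i = 0; i < parts.length; i++) {
            System.out.println(i + " -> \"" + parts[i] + "\"");
        }
        // prints:
        // 0 -> ""
        // 1 -> "test"
        // 2 -> "1"
    }
}
```

Because the key starts with a slash, index 0 is an empty string, index 1 is the first segment ("test" here) and index 2 is the document name. So which index ends up as the output key depends on how you lay out the key.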


TestNaiveBayesDriver.java might also help you. If you remove the following
part from its code, you will not get a confusion matrix, and the initial
labels are no longer required:




    if (bestIdx != Integer.MIN_VALUE) {
      ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
      analyzer.addInstance(pair.getFirst().toString(), classifierResult);
    }


Your output file will then contain the document name (for example, 1) and
the label vector with its values.
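
Once you have the label vector for a document, picking the predicted label is just a matter of taking the index with the highest score. Below is a rough standalone sketch of that step with plain arrays instead of Mahout vectors. The class name BestLabelDemo, the method bestIndex, the label map and the scores are all made up for illustration, and I start from negative infinity rather than whatever initial value the driver uses.

```java
import java.util.HashMap;
import java.util.Map;

final class BestLabelDemo {
    // Returns the index of the highest score, or Integer.MIN_VALUE
    // when there are no scores (the same sentinel checked in the driver snippet).
    static int bestIndex(double[] scores) {
        int bestIdx = Integer.MIN_VALUE;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > bestScore) {
                bestScore = scores[i];
                bestIdx = i;
            }
        }
        return bestIdx;
    }

    public static void main(String[] args) {
        // Hypothetical label index; in practice it comes from your trained model.
        Map<Integer, String> labelMap = new HashMap<>();
        labelMap.put(0, "negative");
        labelMap.put(1, "neutral");
        labelMap.put(2, "positive");

        // Made-up scores for one document, one entry per label.
        double[] scores = {-12.5, -9.1, -10.3};

        int bestIdx = bestIndex(scores);
        if (bestIdx != Integer.MIN_VALUE) {
            System.out.println("predicted label: " + labelMap.get(bestIdx));
        }
    }
}
```

With this you only need the document name from the key and the vector from the value; no expected label is involved anywhere.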


Hope this helps.

Thanks,

Vaibhav

vaibhavcs...@gmail.com




On Tue, Jul 29, 2014 at 7:16 PM, Luca Filipponi <luca.filippon...@gmail.com>
wrote:

> I am using Mahout 0.9. Which part of the source code should I look at?
>
> My problem is that I don't know how the sequence file without the label
> should be structured.
>
> Do you have any hints?
>
> On 29 Jul 2014, at 15:24, vaibhav srivastava <
> vaibhavcs...@gmail.com> wrote:
>
> > Hi,
> > If you want to create a test set and do not want to measure accuracy,
> > then you can create an instance of the classifier, load your model into
> > that classifier, and then find the best score.
> > Look at the naive Bayes test code.
> > Hope this helps. Thanks.
> > On 29 Jul 2014 12:53, "Luca Filipponi" <luca.filippon...@gmail.com> wrote:
> >
> >> Hi, I am trying to develop sentiment analysis on Italian tweets from
> >> Twitter using the naive Bayes classifier, but I've had some trouble.
> >>
> >> My idea was to classify a lot of tweets as positive, negative or neutral,
> >> and use that as the training set for the classifier. To do that I've
> >> written a sequence file in the format <Text,Text>, where the key is
> >> /label/tweetID and the value is the text, and then the text of the whole
> >> dataset is converted to TF-IDF vectors using the Mahout utilities.
> >>
> >> Then I'm using the command:
> >>
> >> ./mahout trainnb and ./mahout testnb to check the classifier, and the
> >> score is right (I've got nearly 100% because the test set is the same as
> >> the train set)
> >>
> >> My question is: if I want to use a test set that is unlabeled, how
> >> should it be created? Because if the format isn't like:
> >>
> >> key = /label/  the classifier can't find the label and I get an
> >> exception.
> >>
> >> But in a new dataset this will obviously be unlabeled, because I need to
> >> classify it, so I don't know what to put in the key of the sequence file.
> >>
> >> I've searched online for examples, but the only ones I've found
> >> use the split command on the original dataset and then test on part of
> >> that, which isn't my case.
> >>
> >>
> >> Any ideas for developing a better sentiment analysis are welcome. Thanks
> >> in advance for the help.
> >>
> >>
>
>


-- 
Thanks and Regards,
Vaibhav Srivastava
Email-id: vaibhavcs...@gmail.com
Mobile no.: 9552543029
