[
https://issues.apache.org/jira/browse/MAHOUT-605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000838#comment-13000838
]
Robin Swezey commented on MAHOUT-605:
-------------------------------------
Hello, this is Robin S.
I am sorry, it looks like I was not very clear in my post and comments.
In this comment I will restate the problem we believe we are seeing, using a
more understandable example.
We have uploaded all the files and logs for this example here:
1. Training CBayes tutorial log http://pastebin.com/5N7cQsKU
2. Testing Bayes tutorial log http://pastebin.com/Q4XscCgz
3. Testing CBayes tutorial log http://pastebin.com/F7rBReag
4. Our testing class for weights http://pastebin.com/VMUVGmUd
5. Output of the testing class http://pastebin.com/LPLZ6LRA
A/
Using the tutorial on https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html we
have:
- generated train and test datasets for the 20 newsgroups example
- trained a classifier on them on a Hadoop cluster (8 nodes) with the following
command (see file 1):
{quote}
$MAHOUT_HOME/bin/mahout trainclassifier -i scope/run_24/train-input -o
scope/run_24/model -type cbayes -ng 1 -source hdfs &>
../logs/run_24/train_output.txt
{quote}
- tested the model twice locally with the following commands (see files 2 and
3):
{quote}
hadoop dfs -get scope/run_24/model ./
export HADOOP_HOME=""
$MAHOUT_HOME/bin/mahout testclassifier -m model -d test-input -type bayes -ng
1 -source hdfs -method sequential &> ../logs/run_24/test_output.txt
{quote}
(this is a Bayes test, which according to our own Mahout tests and what Robin A
said, can be run against a CBayes model)
{quote}
$MAHOUT_HOME/bin/mahout testclassifier -m model -d test-input -type cbayes -ng
1 -source hdfs -method sequential &> ../logs/run_25/test_output.txt
{quote}
(a CBayes test, using the same trained model and test input as run_24 just
above)
Here are the outputs:
{quote}
Bayes-test of CBayes-trained classifier on 20-newsgroups: (see file 2)
Correctly Classified Instances : 6003 79.6999%
Incorrectly Classified Instances : 1529 20.3001%
Total Classified Instances : 7532
{quote}
{quote}
CBayes-test of CBayes-trained classifier on 20-newsgroups: (see file 3)
Correctly Classified Instances : 6401 84.9841%
Incorrectly Classified Instances : 1131 15.0159%
Total Classified Instances : 7532
{quote}
B/
Then we wrote a test class (see file 4) that classifies a sample document which
clearly has the most affinity with comp.graphics (file 4, lines 57-61):
{quote}
String[] doc = {"mspublisher", "parallax", "polaroid", "corel", "illustrator",
"coreldraw"};
SinglyClassifier2 sc = new SinglyClassifier2();
List<ClassifierResult> results = sc.classifyDocument(doc, "comp.graphics", 47);
System.out.println(results);
{quote}
But when we run this test, we can see that the weight increases with class
affinity: comp.graphics comes last in the list, with the highest score (file 5):
bq. [ClassifierResult{category='sci.med', score=71.12823038989241},
ClassifierResult{category='talk.politics.mideast', score=71.12905966433597},
ClassifierResult{category='sci.crypt', score=71.13190725486677},
ClassifierResult{category='soc.religion.christian', score=71.133650306131},
ClassifierResult{category='talk.politics.guns', score=71.13395246918788},
ClassifierResult{category='rec.sport.hockey', score=71.135412697019},
ClassifierResult{category='rec.motorcycles', score=71.13588019314241},
ClassifierResult{category='talk.politics.misc', score=71.13646012313777},
ClassifierResult{category='rec.autos', score=71.13665909470443},
ClassifierResult{category='rec.sport.baseball', score=71.14030022524815},
ClassifierResult{category='comp.sys.mac.hardware', score=71.14259436929609},
ClassifierResult{category='alt.atheism', score=71.14375467011567},
ClassifierResult{category='talk.religion.misc', score=71.14396375715604},
ClassifierResult{category='misc.forsale', score=71.15726130106582},
ClassifierResult{category='comp.sys.ibm.pc.hardware', score=71.23220093257258},
ClassifierResult{category='sci.space', score=71.2370437832205},
ClassifierResult{category='comp.windows.x', score=71.48765607132557},
ClassifierResult{category='sci.electronics', score=71.50156901557527},
ClassifierResult{category='comp.os.ms-windows.misc', score=71.79695095091601},
ClassifierResult{category='comp.graphics', score=73.46107190566988}]
So the weight clearly increases as class affinity increases, whereas we would
expect a complement weight to decrease with it.
No complementary weight seems to be used.
However, files 2 and 3 show that the CNB test performs better than the NB
test. Even so, we are unsure whether what we have is _really_ a CNB, regardless
of the better performance of whatever algorithm is run in the case of file 3.
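For reference, here is a minimal sketch of the textbook decision rules being compared: standard naive Bayes picks the class with the largest summed log weight, while complement naive Bayes picks the class with the smallest summed complement log weight. The class name, the toy term counts, the smoothing value and the uniform class priors are all assumptions made for illustration. This is not Mahout's code and says nothing about how Mahout 0.4 computes or signs its scores.

```java
import java.util.Arrays;

public class ComplementRuleSketch {

    // Standard NB weight: log P(term | class), Laplace-smoothed with alpha.
    static double[][] standardWeights(int[][] counts, double alpha) {
        int classes = counts.length, terms = counts[0].length;
        double[][] w = new double[classes][terms];
        for (int c = 0; c < classes; c++) {
            double total = 0;
            for (int t = 0; t < terms; t++) total += counts[c][t];
            for (int t = 0; t < terms; t++) {
                w[c][t] = Math.log((counts[c][t] + alpha) / (total + alpha * terms));
            }
        }
        return w;
    }

    // Complement weight: log P(term | all classes except c), same smoothing.
    static double[][] complementWeights(int[][] counts, double alpha) {
        int classes = counts.length, terms = counts[0].length;
        double[][] w = new double[classes][terms];
        for (int c = 0; c < classes; c++) {
            double[] comp = new double[terms];
            double total = 0;
            for (int o = 0; o < classes; o++) {
                if (o == c) continue; // skip the class itself
                for (int t = 0; t < terms; t++) {
                    comp[t] += counts[o][t];
                    total += counts[o][t];
                }
            }
            for (int t = 0; t < terms; t++) {
                w[c][t] = Math.log((comp[t] + alpha) / (total + alpha * terms));
            }
        }
        return w;
    }

    // Per-class score: sum over terms of (term frequency in doc) * weight.
    static double[] scores(double[][] w, int[] doc) {
        double[] s = new double[w.length];
        for (int c = 0; c < w.length; c++) {
            for (int t = 0; t < doc.length; t++) s[c] += doc[t] * w[c][t];
        }
        return s;
    }

    static int argMax(double[] s) {
        int best = 0;
        for (int i = 1; i < s.length; i++) if (s[i] > s[best]) best = i;
        return best;
    }

    static int argMin(double[] s) {
        int best = 0;
        for (int i = 1; i < s.length; i++) if (s[i] < s[best]) best = i;
        return best;
    }

    public static void main(String[] args) {
        // Toy counts (made up): rows = classes, columns = terms.
        int[][] counts = {{8, 1, 1}, {1, 8, 1}, {1, 1, 8}};
        int[] doc = {2, 0, 0}; // document containing term 0 twice
        double[] nb = scores(standardWeights(counts, 1.0), doc);
        double[] cnb = scores(complementWeights(counts, 1.0), doc);
        System.out.println("NB  " + Arrays.toString(nb) + " -> class " + argMax(nb));
        System.out.println("CNB " + Arrays.toString(cnb) + " -> class " + argMin(cnb));
    }
}
```

On this toy data both rules select class 0, but the best class has the highest score under the standard rule and the lowest score under the complement rule. That difference in direction is what we are trying to check in the output of file 5.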
We believe there is a problem somewhere, either in
- the CBayesThetaNormalizer?
- the InMemoryBayesDataStore?
- the way we are using the classifier or following the tutorial?
Is the training really CNB? Or is it a problem with the CNB testing?
(I hope this comment is clearer than the previous ones. We have an accepted
paper that makes use of Mahout, and we need to clarify this matter before
submitting the revised version.)
> Array returned by classifier.bayes.algorithm.CBayesAlgorithm.classifyDocument
> is sorted ascendant
> -------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-605
> URL: https://issues.apache.org/jira/browse/MAHOUT-605
> Project: Mahout
> Issue Type: Bug
> Components: Classification
> Affects Versions: 0.4
> Environment: Linux
> Reporter: Robin Swezey
> Assignee: Robin Anil
> Priority: Minor
> Labels: bayesian, classification
> Fix For: 0.5
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The array returned for an n-best call to classifyDocument is sorted in
> ascending instead of descending order.
> Ex:
> {quote}
> 47-best: [ClassifierResult{category='香川県', score=32.28281232047167},
> ClassifierResult{category='宮崎県', score=32.28969992600906}, ......,
> ClassifierResult{category='愛知県', score=32.487981016587796},
> ClassifierResult{category='東京都', score=32.49189358054859},
> ClassifierResult{category='北海道', score=32.49811200756193}]
> {quote}
> (classification of documents for Japanese prefectures)
> Inside the classifyDocument method, just before the return statement we found
> this line:
> {quote}
> Collections.reverse(result);
> {quote}
> Is this a mistake or a design choice? (we are not sure, hence the "Minor"
> priority)
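To make the ordering question concrete, here is a small sketch of ranking n-best results with an explicit comparator, so that the order does not depend on the input order or on a trailing Collections.reverse call. ScoredLabel and bestFirst are hypothetical names for illustration, not Mahout's ClassifierResult API. The sketch assumes Java 8 or later and assumes that a higher score means a better match, which is exactly the point in question for CBayes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RankingOrderSketch {

    // Hypothetical stand-in for a (category, score) result pair.
    static final class ScoredLabel {
        final String label;
        final double score;

        ScoredLabel(String label, double score) {
            this.label = label;
            this.score = score;
        }

        @Override
        public String toString() {
            return label + "=" + score;
        }
    }

    // Returns a copy sorted best-first, ASSUMING higher score = better match.
    static List<ScoredLabel> bestFirst(List<ScoredLabel> results) {
        List<ScoredLabel> copy = new ArrayList<>(results);
        copy.sort(Comparator.comparingDouble((ScoredLabel r) -> r.score).reversed());
        return copy;
    }

    public static void main(String[] args) {
        // Scores rounded from the output quoted in this issue.
        List<ScoredLabel> ascending = Arrays.asList(
                new ScoredLabel("sci.med", 71.128),
                new ScoredLabel("sci.space", 71.237),
                new ScoredLabel("comp.graphics", 73.461));
        System.out.println(bestFirst(ascending));
    }
}
```

If it turns out that a lower score is better, dropping the reversed() call gives the opposite order. Either way, the intended direction is stated in one place.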