[ 
https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Palumbo updated MAHOUT-1527:
-----------------------------------

    Attachment: MAHOUT-1527.patch

The patch fixes WikipediaToSequenceFile.java.  It also changes 
WikipediaMapper.java so that it emits keys of the form 
/Category/document_name. 

I'm not sure if this is the way to go, but it could be the convention for now. 
The new Naive Bayes expects input like this, so it works well for documents 
classified by their respective directories and keeps it easy to use 
seqdirectory.
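
A minimal sketch of the key convention, written in Python purely for 
illustration (it is not part of the patch). It assumes the category is the 
first path component after the leading slash; the helper names make_key and 
label_from_key are hypothetical.

```python
def make_key(category, document_name):
    """Build a key of the form /Category/document_name."""
    return "/" + category + "/" + document_name


def label_from_key(key):
    """Recover the category from a /Category/document_name key.

    Assumption: the label is the first path component after the
    leading slash. Anything after it is treated as the document name.
    """
    parts = key.split("/")
    if len(parts) < 3 or parts[0] != "" or not parts[1]:
        raise ValueError("expected /Category/document_name, got %r" % key)
    return parts[1]


if __name__ == "__main__":
    key = make_key("Canada", "Toronto")
    print(key, "->", label_from_key(key))
```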

I basically modified classify-20Newsgroups.sh to run the Wikipedia CBayes 
example on a medium-sized Wikipedia XML dump.  I'm not sure if we're looking 
for new example scripts, but I included it in the patch anyway.

I've set it up to only look at 10 countries for a couple of reasons:
  1. The confusion matrix fits on the screen.
  2. When using all countries, split will almost always put a label into the 
test set that was not encountered in the training set, which will crash 
testnb at the confusion matrix.
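
The failure condition in the second point can be checked before running 
testnb. The following is a rough sketch in Python, for illustration only and 
not part of the patch. It assumes keys of the form /Category/document_name; 
the function name unseen_test_labels is hypothetical.

```python
def unseen_test_labels(train_keys, test_keys):
    """Return the labels that occur in the test set but not in training.

    Assumption: every key looks like /Category/document_name, so the
    label is the first path component after the leading slash.
    """
    train_labels = {key.split("/")[1] for key in train_keys}
    test_labels = {key.split("/")[1] for key in test_keys}
    return sorted(test_labels - train_labels)


if __name__ == "__main__":
    train = ["/Canada/doc1", "/France/doc2"]
    test = ["/Canada/doc3", "/Japan/doc4"]
    # A non-empty result means the split contains a test-only label.
    print(unseen_test_labels(train, test))
```

An empty list means every test label was also seen during training.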

This script will occasionally run into the same problem of having a split that 
crashes the confusion matrix, but rarely with these settings.

I have another script here that gives the option to use all of the countries 
and a constant-size test set (the same number of docs in the test set for each 
category), but it needs a little work.  For now I've included the simple one.  
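
One way a constant-size test set could be drawn is sketched below in Python. 
This is an illustration under stated assumptions, not the script mentioned 
above: keys are assumed to look like /Category/document_name, and the names 
split_constant_test and test_per_category are hypothetical. Categories with 
too few documents are kept entirely in the training set, so no label ends up 
in the test set only.

```python
import random
from collections import defaultdict


def split_constant_test(keys, test_per_category, seed=0):
    """Split keys into (train, test) with a fixed test count per category."""
    by_label = defaultdict(list)
    for key in keys:
        # Assumption: the label is the first component after the leading slash.
        by_label[key.split("/")[1]].append(key)

    rng = random.Random(seed)  # seeded so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        docs = sorted(by_label[label])
        if len(docs) <= test_per_category:
            # Too few documents: keep them all for training.
            train.extend(docs)
            continue
        rng.shuffle(docs)
        test.extend(docs[:test_per_category])
        train.extend(docs[test_per_category:])
    return train, test


if __name__ == "__main__":
    keys = ["/Canada/doc%d" % i for i in range(5)]
    keys += ["/France/doc%d" % i for i in range(5)]
    keys += ["/Japan/doc0"]
    train, test = split_constant_test(keys, test_per_category=2)
    print(len(train), "training docs,", len(test), "test docs")
```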

If this script is going to be added, let me know and I'll do some work on it.
  

> Fix wikipedia classifier example
> --------------------------------
>
>                 Key: MAHOUT-1527
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1527
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification, Documentation, Examples
>    Affects Versions: 0.7, 0.8, 0.9
>            Reporter: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1527.patch
>
>
> The examples package has a classification showcase for predicting the labels 
> of Wikipedia pages. Unfortunately, the example is totally broken: 
> it relies on the old NB implementation, which has been removed; it suggests 
> using the whole of Wikipedia as input, which will not work well on a single 
> machine; and the documentation uses commands that have long been removed 
> from bin/mahout. 
> The example needs to be updated to use the current naive bayes implementation 
> and documentation on the website needs to be written.



