[ https://issues.apache.org/jira/browse/MAHOUT-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Palumbo updated MAHOUT-1527:
-----------------------------------
    Attachment: MAHOUT-1527.patch

The patch fixes WikipediaToSequenceFile.java. It also changes WikipediaMapper.java to emit keys of the form /Category/document_name. I'm not sure if this is the way to go, but it could be the convention for now. The new Naive Bayes expects input like this, so it works well for documents classified by their respective directories and keeps it easy to use seqdirectory.

I basically modified classify-20Newsgroups.sh to run the wikipedia CBayes example on a medium-sized Wikipedia XML dump. I'm not sure if we're looking for new example scripts, but I included it in the patch anyway.

I've set it up to only look at 10 countries, for a couple of reasons:

1. The confusion matrix fits on the screen.
2. When using all countries, split will almost always put a label into the test set that was not encountered in the training set, and will thus crash testnb at the confusion matrix.

This script will occasionally run into the same problem of a split that crashes the confusion matrix, but rarely with these settings.

I have another script here that gives the option to use all of the countries and a constant-size test set (the same number of docs in the test set for each category), but it needs a little work. For now I've included the simple one. If this script is going to be added, let me know and I'll do some work on it.

> Fix wikipedia classifier example
> --------------------------------
>
>                 Key: MAHOUT-1527
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1527
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification, Documentation, Examples
>    Affects Versions: 0.7, 0.8, 0.9
>            Reporter: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1527.patch
>
>
> The examples package has a classification showcase for predicting the labels
> of wikipedia pages.
> Unfortunately, the example is totally broken:
> It relies on the old NB implementation which has been removed, suggests to
> use the whole wikipedia as input, which will not work well on a single
> machine and the documentation uses commands that have long been removed from
> bin/mahout.
> The example needs to be updated to use the current naive bayes implementation
> and documentation on the website needs to be written.

--
This message was sent by Atlassian JIRA
(v6.2#6252)