[ https://issues.apache.org/jira/browse/MAHOUT-60?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robin Anil updated MAHOUT-60:
-----------------------------
Attachment: MAHOUT-60.patch
To split the Wikipedia XML dump into small XML chunks:
{noformat}
hadoop jar build/apache-mahout-0.1-dev-ex.jar \
  org.apache.mahout.examples.classifiers.cbayes.WikipediaXmlSplitter \
  -d /home/robin/data/wikipedia/articles/enwiki-latest-pages-articles.xml \
  -o /home/robin/data/wikipedia/chunks/ -c 64
{noformat}
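To illustrate the idea behind this step, here is a minimal Python sketch of splitting a dump into chunks made of whole pages. It is not the WikipediaXmlSplitter code. It assumes that each article sits between `<page>` and `</page>` lines and that the `-c` value is a target chunk size; the function name `split_pages` and its parameters are hypothetical.

```python
from typing import Iterable, Iterator, List


def split_pages(lines: Iterable[str], max_chunk_bytes: int) -> Iterator[str]:
    """Yield chunks built from whole <page>...</page> elements.

    A chunk is emitted as soon as its size reaches max_chunk_bytes, so a
    page is never cut in half. Lines outside any page are skipped.
    """
    chunk: List[str] = []  # completed pages in the current chunk
    page: List[str] = []   # lines of the page currently being read
    size = 0               # UTF-8 size of the current chunk in bytes
    in_page = False

    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<page>"):
            in_page = True
        if in_page:
            page.append(line)
        if in_page and stripped.endswith("</page>"):
            in_page = False
            text = "".join(page)
            page = []
            chunk.append(text)
            size += len(text.encode("utf-8"))
            if size >= max_chunk_bytes:
                yield "".join(chunk)
                chunk, size = [], 0

    # Emit whatever is left as a final, smaller chunk.
    if chunk:
        yield "".join(chunk)
```

With a real dump, `lines` could be the open file object and `max_chunk_bytes` something like `64 * 1024 * 1024`, with each yielded chunk written to its own file in the output directory.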
Put the chunks into the DFS:
{noformat}
hadoop dfs -put /home/robin/data/wikipedia/chunks/ wikipediadump
{noformat}
Create the country-based split of the Wikipedia dataset (see the attached
country.txt file):
{noformat}
hadoop jar build/apache-mahout-0.1-dev-ex.jar \
  org.apache.mahout.examples.classifiers.cbayes.WikipediaDatasetCreator \
  -i wikipediadump -o wikipediainput -c pathto/country.txt
{noformat}
Train the classifier on the country-based split of Wikipedia:
{noformat}
$bin/hadoop jar <MAHOUT_HOME>/build/apache-mahout-0.1-dev-ex.jar \
  org.apache.mahout.examples.classifiers.cbayes.TrainTwentyNewsgroups \
  -t -i wikipediainput -o wikipediamodel
{noformat}
Fetch the input files for testing:
{noformat}
hadoop dfs -get wikipediainput wikipediainput
{noformat}
Test the classifier:
{noformat}
$bin/hadoop jar <MAHOUT_HOME>/build/apache-mahout-0.1-dev-ex.jar \
  org.apache.mahout.examples.classifiers.cbayes.TestTwentyNewsgroups \
  -p wikipediamodel -t wikipediainput
{noformat}
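For readers who want to see roughly what the train and test steps compute, here is a small Python sketch of the complement naive Bayes rule from the paper linked in the issue description. It is an illustration only and not the code in the patch. The names `train_cnb` and `classify` and the smoothing parameter `alpha` are hypothetical, and the sketch leaves out the TF-IDF transforms and weight normalization that the paper also describes.

```python
import math
from collections import Counter, defaultdict


def train_cnb(docs, alpha=1.0):
    """Train complement naive Bayes weights.

    docs is a list of (label, tokens) pairs. For each label, the weight of
    a term is the smoothed log-probability of that term in all documents
    that do NOT carry the label (the "complement" of the class).
    """
    per_class = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        per_class[label].update(tokens)
        vocab.update(tokens)

    # Term counts over the whole training set.
    total = Counter()
    for counts in per_class.values():
        total.update(counts)

    weights = {}
    for label, counts in per_class.items():
        # Counts of each term in every class except this one.
        comp = {t: total[t] - counts[t] for t in vocab}
        denom = sum(comp.values()) + alpha * len(vocab)
        weights[label] = {
            t: math.log((comp[t] + alpha) / denom) for t in vocab
        }
    return weights


def classify(weights, tokens):
    """Return the label whose complement fits the document worst."""
    freq = Counter(tokens)

    def score(label):
        w = weights[label]
        # Terms never seen in training are ignored.
        return sum(f * w[t] for t, f in freq.items() if t in w)

    # Lowest complement score wins.
    return min(weights, key=score)
```

A toy run with two country labels might look like this:

```python
docs = [
    ("france", ["paris", "seine", "paris"]),
    ("japan", ["tokyo", "sushi", "tokyo"]),
]
model = train_cnb(docs)
print(classify(model, ["paris"]))
```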
> Complementary Naive Bayes
> -------------------------
>
> Key: MAHOUT-60
> URL: https://issues.apache.org/jira/browse/MAHOUT-60
> Project: Mahout
> Issue Type: Sub-task
> Components: Classification
> Reporter: Robin Anil
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 0.1
>
> Attachments: MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch,
> MAHOUT-60.patch, twcnb.jpg
>
>
> The focus is to implement an improved text classifier based on this paper
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.