[ 
https://issues.apache.org/jira/browse/MAHOUT-60?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Anil updated MAHOUT-60:
-----------------------------

    Attachment: MAHOUT-60.patch

Split the Wikipedia XML dump into small XML chunks
 {noformat} 
hadoop jar build/apache-mahout-0.1-dev-ex.jar \
  org.apache.mahout.examples.classifiers.cbayes.WikipediaXmlSplitter \
  -d /home/robin/data/wikipedia/articles/enwiki-latest-pages-articles.xml \
  -o /home/robin/data/wikipedia/chunks/ -c 64
  {noformat} 
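
For readers who want a feel for what chunking a dump involves, here is a rough sketch in plain shell and awk. It is not the WikipediaXmlSplitter implementation: the function name split_pages, the chunk-NNNN.xml file naming, and the MAX_CHARS parameter are all hypothetical, and MAX_CHARS is not claimed to correspond to the -c option above. It simply copies whole <page>...</page> blocks into numbered files and starts a new file once the current one would grow past the limit. The output files hold bare pages with no enclosing root element.

```shell
# Sketch only, not the Mahout splitter.
# split_pages FILE OUTDIR MAX_CHARS  (all three names are hypothetical)
# Copies complete <page>...</page> blocks from FILE into
# OUTDIR/chunk-0000.xml, chunk-0001.xml, ... and rotates to a new file
# when adding the next page would push the current chunk past MAX_CHARS.
# Output is appended, so OUTDIR should be empty before the call.
split_pages() {
  mkdir -p "$2"
  awk -v dir="$2" -v max="$3" '
    /<page>/   { inpage = 1 }              # a page starts on this line
    inpage     { buf = buf $0 "\n" }       # collect the lines of the page
    /<\/page>/ {
      inpage = 0
      # Rotate only if the current chunk already holds at least one page.
      if (size > 0 && size + length(buf) > max) {
        close(out); n++; size = 0
      }
      out = sprintf("%s/chunk-%04d.xml", dir, n)
      printf "%s", buf >> out
      size += length(buf)
      buf = ""
    }
  ' "$1"
}
```

A call such as split_pages dump.xml chunks 1000000 would then leave the pages in chunks/. A page is never cut in half, so a single chunk can exceed the limit when one page is larger than MAX_CHARS.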

Put the chunks into the DFS
 {noformat} 
 hadoop dfs -put /home/robin/data/wikipedia/chunks/ wikipediadump
 {noformat} 

Create the country-based split of the Wikipedia dataset (see the attached 
country.txt file)
{noformat}
 hadoop jar build/apache-mahout-0.1-dev-ex.jar \
   org.apache.mahout.examples.classifiers.cbayes.WikipediaDatasetCreator \
   -i wikipediadump -o wikipediainput -c pathto/country.txt
{noformat}

Train the classifier on the country-based split of Wikipedia
{noformat}
$bin/hadoop jar <MAHOUT_HOME>/build/apache-mahout-0.1-dev-ex.jar \
  org.apache.mahout.examples.classifiers.cbayes.TrainTwentyNewsgroups \
  -t -i wikipediainput -o wikipediamodel
{noformat}


Fetch the Input Files for Testing
{noformat}
 hadoop dfs -get wikipediainput wikipediainput 
{noformat}

Test the Classifier
{noformat}
$bin/hadoop jar <MAHOUT_HOME>/build/apache-mahout-0.1-dev-ex.jar \
  org.apache.mahout.examples.classifiers.cbayes.TestTwentyNewsgroups \
  -p wikipediamodel -t wikipediainput
{noformat}
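
The six steps above can be chained into one script. The following is only a sketch under a few assumptions: hadoop is on the PATH, the script is started from the Mahout checkout so that the relative jar path resolves, and the variable and function names (JAR, PKG, DUMP, CHUNKS, COUNTRIES, DRY_RUN, run, pipeline) are hypothetical. The class names, options, and paths are copied from the commands in this comment. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: chains the commands from this comment into one script.
# All variable and function names are hypothetical; the values are the
# paths and class names used in the comment.
JAR=build/apache-mahout-0.1-dev-ex.jar
PKG=org.apache.mahout.examples.classifiers.cbayes
DUMP=/home/robin/data/wikipedia/articles/enwiki-latest-pages-articles.xml
CHUNKS=/home/robin/data/wikipedia/chunks/
COUNTRIES=pathto/country.txt
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands (default), 0 = run them

# Print the command; execute it only when DRY_RUN=0 and stop on failure.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@" || exit 1
  fi
}

pipeline() {
  # 1. Split the dump into chunks
  run hadoop jar "$JAR" "$PKG.WikipediaXmlSplitter" -d "$DUMP" -o "$CHUNKS" -c 64
  # 2. Put the chunks into the DFS
  run hadoop dfs -put "$CHUNKS" wikipediadump
  # 3. Create the country-based split
  run hadoop jar "$JAR" "$PKG.WikipediaDatasetCreator" -i wikipediadump -o wikipediainput -c "$COUNTRIES"
  # 4. Train the classifier
  run hadoop jar "$JAR" "$PKG.TrainTwentyNewsgroups" -t -i wikipediainput -o wikipediamodel
  # 5. Fetch the input files for testing
  run hadoop dfs -get wikipediainput wikipediainput
  # 6. Test the classifier
  run hadoop jar "$JAR" "$PKG.TestTwentyNewsgroups" -p wikipediamodel -t wikipediainput
}

pipeline
```

Running it with DRY_RUN=0 in the environment executes each step in order and aborts at the first command that fails.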



> Complementary Naive Bayes
> -------------------------
>
>                 Key: MAHOUT-60
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-60
>             Project: Mahout
>          Issue Type: Sub-task
>          Components: Classification
>            Reporter: Robin Anil
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.1
>
>         Attachments: MAHOUT-60.patch, MAHOUT-60.patch, MAHOUT-60.patch, 
> MAHOUT-60.patch, twcnb.jpg
>
>
> The focus is to implement an improved text classifier based on this paper 
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
