RE: any pointer to run wikipedia bayes example

Andrew Palumbo Fri, 22 Aug 2014 10:54:39 -0700





(1) I guess the NB wiki example doesn't work with the Mahout 0.9 release. I 
need a rather stable Mahout release, because I have made some changes to my 
local Mahout 0.9 download to tailor it to my own requirements. Would it be 
better to just apply some patches (e.g., the patch mentioned in 
https://issues.apache.org/jira/browse/MAHOUT-1527) to make Mahout 0.9 work with 
the NB wiki example? The NB wiki example is the only benchmark that I want that 
is not available in Mahout 0.9. Or do you think I'd better just check out the 
current trunk and start over?


NB wiki does not work with Mahout 0.9.

I would definitely suggest checking out the current trunk from GitHub.  There 
have been a lot of changes since 0.9, including a renaming of the Mahout-Core 
module to Mahout-MapReduce-Legacy, so some of the patches may not be compatible 
with the 0.9 codebase (or will need at least some manual filename changes).  
There have also been several bugfixes and enhancements to the Naive Bayes 
algorithm itself.  As Suneel and Pat mentioned, building from the current trunk 
should solve your Hadoop 2.x problems as well.

If you did need to apply the patches to the 0.9 source, you would need at least 
MAHOUT-1527, MAHOUT-1558, MAHOUT-1555, MAHOUT-1503 and MAHOUT-1504.

I do think that you would be much better off checking out the current master.  
You can keep different versions side by side. To switch between them, point 
$MAHOUT_HOME at the desired directory and run 
$ mvn clean install -DskipTests 
in that directory.



(2) For the 20 Newsgroups Classifier example, the original data set has already 
been labeled (i.e., the documents are nicely placed in their own category's 
directory). The Wikipedia example seems to need some preprocessing to generate 
the label for each document first, as you mentioned, by providing a category 
file. I am curious: if exactMatch/all is turned on, would it make the "unknown" 
category particularly large?

The wikipedia documents are labeled with a [[Category: xxx]] tag in the 
document text itself. Documents can have more than one category. The label is 
extracted from the document via the WikipediaMapper job [1]: 

    $ grep Category enwiki-latest-pages-articles.xml

may yield results like:
    ...
    [[Category:United States]]
        // Labeled as United States whether exactMatch is on or off.
    [[Category:United States Air Force]]
        // Labeled as United States if exactMatch is off; otherwise both
        // category and document are rejected from the dataset.
    [[Category:Baseball]]
        // Labeled as unknown if -all is set; otherwise both category and
        // document are rejected from the dataset.
    ...

Yes, if the -all option is set you will get the full dataset, which will likely 
be highly skewed towards the unknown category (depending on your category.txt 
file).

The exactMatch option, when set, will likely return a slightly smaller set, 
since documents with no Category label that exactly matches one of the provided 
categories are rejected from the set.


What's the algorithm used for fuzzy labeling (i.e., without exactMatch turned 
on)? Turning off exactMatch sounds like an unsupervised classification 
algorithm to me.


The mapper simply parses each document for "[[Category:" and checks whether the 
category is contained in the category.txt file. If exactMatch is not set, it 
uses Java's String.contains() to test whether the document's category string 
contains one of the categories from the file. See findMatchingCategory() in 
[1], line 128.
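The matching described above can be sketched roughly as follows. This is only 
an illustration, not the actual WikipediaMapper source: the class and method 
names are hypothetical, case handling is left out, and the "unknown" label for 
-all is taken from the description in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the labeling logic; not the Mahout implementation.
final class CategoryMatchSketch {
  private static final String MARKER = "[[Category:";

  /**
   * Scans the document for [[Category:...]] tags and returns the first
   * category from the category file that matches one of them.
   * Case handling is omitted here; the real mapper may normalise case.
   */
  static Optional<String> findMatch(String document, List<String> categories,
                                    boolean exactMatchOnly) {
    int start = 0;
    int idx;
    while ((idx = document.indexOf(MARKER, start)) != -1) {
      int begin = idx + MARKER.length();
      int end = document.indexOf("]]", begin);
      if (end < 0) {
        break; // unterminated tag, stop scanning
      }
      String tag = document.substring(begin, end).trim();
      for (String category : categories) {
        String wanted = category.trim();
        if (wanted.isEmpty()) {
          continue; // ignore blank lines in the category file
        }
        boolean matches = exactMatchOnly ? tag.equals(wanted) : tag.contains(wanted);
        if (matches) {
          return Optional.of(wanted);
        }
      }
      start = end;
    }
    return Optional.empty();
  }

  /** With -all, unmatched documents are kept as "unknown" instead of rejected. */
  static Optional<String> label(String document, List<String> categories,
                                boolean exactMatchOnly, boolean all) {
    Optional<String> match = findMatch(document, categories, exactMatchOnly);
    if (match.isPresent() || !all) {
      return match; // empty means the document is rejected
    }
    return Optional.of("unknown");
  }

  public static void main(String[] args) {
    List<String> categories = Arrays.asList("United States", "United Kingdom");
    String doc = "... [[Category:United States Air Force]] ...";
    System.out.println(label(doc, categories, false, false)); // Optional[United States]
    System.out.println(label(doc, categories, true, false));  // Optional.empty
    System.out.println(label(doc, categories, true, true));   // Optional[unknown]
  }
}
```

So there is no learning involved in the labeling step: with exactMatch off it 
is a plain substring test against each line of the category file.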


What is the recommended practice -- to turn exactMatch/all on or to turn it off 
-- when preparing the wikipedia dataset ?

I would recommend using neither exactMatch nor all. 
  

Also, is it legit to provide a (sensible) category file other than a country 
list, e.g., (sports, science, history ...)?


Yes, you should be able to create a category file using any categories. You can 
get an idea of how pages are labeled by browsing wikipedia pages.  The 
categories are at the bottom of the page. 


(3) Could you provide some pointers to how the Wikipedia XML file is formed 
(i.e., is there any document that discusses its structure), so that I can try 
to extract more information than just the raw text out of it?

There is some information here:
http://meta.wikimedia.org/wiki/Help:Export#Export_format
http://en.wikipedia.org/wiki/Wikipedia:Database_download


For example, if I also want to extract some other information (e.g., creation 
date, modified date, last edit date) from the Wikipedia dataset, is there a 
Mahout API to do it, or do I need to write my own (possibly by modeling it on 
how the exactMatch functionality is implemented)?


There is no Mahout API for this.  You could customize the WikipediaMapper [1], 
e.g. match on modification date by extracting the <revision><timestamp> value 
from the document and matching the categories to that.
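A minimal sketch of pulling the <revision><timestamp> value out of a page, 
assuming the page XML is available as a String that still contains its 
<revision> element and that the timestamp is in the ISO 8601 form used by the 
export format (e.g. 2014-08-01T12:34:56Z). The class and method names are 
hypothetical and this is not part of Mahout; how to hook it into a customized 
mapper is left open.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the Mahout API.
final class RevisionTimestampSketch {
  // First <timestamp> that follows a <revision> opening tag.
  private static final Pattern REVISION_TIMESTAMP = Pattern.compile(
      "<revision>.*?<timestamp>\\s*([^<]+?)\\s*</timestamp>", Pattern.DOTALL);

  /** Returns the revision timestamp of a page, or empty if absent or malformed. */
  static Optional<Instant> revisionTimestamp(String pageXml) {
    Matcher m = REVISION_TIMESTAMP.matcher(pageXml);
    if (!m.find()) {
      return Optional.empty();
    }
    try {
      return Optional.of(Instant.parse(m.group(1)));
    } catch (DateTimeParseException e) {
      return Optional.empty(); // not an ISO 8601 instant
    }
  }

  public static void main(String[] args) {
    String page = "<page><title>Example</title><revision><id>1</id>"
        + "<timestamp>2014-08-01T12:34:56Z</timestamp><text>...</text></revision></page>";
    System.out.println(revisionTimestamp(page));
  }
}
```

A regular expression keeps the sketch short; for anything beyond a single 
field, a streaming XML parser would probably be the more robust choice.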

[1] https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/WikipediaMapper.java


Thank you very much !



Wei  









From:   Andrew Palumbo <ap....@outlook.com>

To:     "user@mahout.apache.org" <user@mahout.apache.org>

Date:   08/21/2014 02:28 PM

Subject:        RE: any pointer to run wikipedia bayes example








Hello,



Yes, if you work off of the current trunk, you can use the classify-wiki.sh 
example.  There is currently no documentation on the Mahout site for this.



You can run this script to build and test an NB classifier on either 10 
arbitrary countries (option 1) or 2 countries, United States and United Kingdom 
(option 2).



By default the script is set to run on a medium-sized Wikipedia XML dump.  To 
run on the full set you'll have to change the download by commenting out line 
78 and uncommenting line 80 [1].  *Be sure to clean your work directory 
(option 3) when changing datasets.*





The step-by-step process for creating a Naive Bayes classifier for the 
Wikipedia XML dump is very similar to creating the 20 Newsgroups classifier.  
The only difference is that instead of running $ mahout seqdirectory on the 
unzipped 20 Newsgroups file, you'll run $ mahout seqwiki on the unzipped 
Wikipedia XML dump.



$ mahout seqwiki invokes WikipediaToSequenceFile.java, which accepts a text 
file of categories [2] and starts an MR job to parse each document in the XML 
file.  The job extracts documents with a category that matches a line in the 
category file (exactly, if the exactMatchOnly option is set).  If no match is 
found and the -all option is set, the document is dumped into an "unknown" 
category.

The documents will then be written out as a <Text,Text> sequence file of the 
form (K: /category/document_title , V: document) .
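
To make the key layout concrete, here is a small sketch that splits a key of 
the form /category/document_title back into its two parts. The class and 
method names are hypothetical, and the sample key is made up for illustration; 
the only thing taken from the description above is the /category/title shape.

```java
// Hypothetical helper for keys of the form "/category/document_title".
final class SeqWikiKeySketch {
  /** Returns {category, title}; the title may itself contain '/'. */
  static String[] splitKey(String key) {
    if (!key.startsWith("/")) {
      throw new IllegalArgumentException("key must start with '/': " + key);
    }
    int sep = key.indexOf('/', 1);
    if (sep < 0) {
      throw new IllegalArgumentException("key has no title part: " + key);
    }
    return new String[] {key.substring(1, sep), key.substring(sep + 1)};
  }

  public static void main(String[] args) {
    String[] parts = splitKey("/United States/Some article");
    System.out.println("category=" + parts[0] + ", title=" + parts[1]);
  }
}
```

Only the first separator after the leading slash is treated as the boundary, 
so a slash inside an article title does not end up in the category.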



There are 3 different example category files available in the 
/examples/src/test/resources directory:  country.txt, country10.txt and 
country2.txt.



The CLI options for seqwiki are as follows:



    -input          (-i)    input pathname String
    -output         (-o)    the output pathname String
    -categories     (-c)    the file containing the Wikipedia categories
    -exactMatchOnly (-e)    if set, then the Wikipedia category must match
                            exactly instead of simply containing the
                            category string
    -all            (-all)  if set, select all categories



From there you just need to run seq2sparse, split, trainnb and testnb, as in 
the example script.



Especially for the binary classification problem, you should get better results 
using 3- or 4-grams and a low maxDF cutoff such as 30.



[1] https://github.com/apache/mahout/blob/master/examples/bin/classify-wiki.sh

[2] https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt





Subject: Re: any pointer to run wikipedia bayes example

To: user@mahout.apache.org

From: w...@us.ibm.com

Date: Wed, 20 Aug 2014 09:50:42 -0400





hi, 







After doing a bit more searching, I found 
https://issues.apache.org/jira/browse/MAHOUT-1527



The version of Mahout that I have been working on is Mahout 0.9 (from 
http://mahout.apache.org/general/downloads.html), which I downloaded in April.



Albeit the latest stable release, it doesn't include the patch mentioned in 
https://issues.apache.org/jira/browse/MAHOUT-1527







Then I realized that had I cloned the latest Mahout, I would have gotten the 
classify-wiki.sh script, and could probably start from there.







 Sorry for the spam! 







Thanks,



Wei














From:            Wei Zhang/Watson/IBM@IBMUS



To:              user@mahout.apache.org



Date:            08/19/2014 06:18 PM



Subject:                 any pointer to run wikipedia bayes example

























Hi,







I have been able to run the bayesian network 20news group example provided
at Mahout website.

I am interested in running the Wikipedia bayes example, as it is a much
larger dataset.
From several googling attempts,  I figured it is a bit different workflow
than running the 20news group example -- e.g., I would need to provide a
categories.txt file, and invoke WikipediaXmlSplitter,  call
wikipediaDataSetCreator and etc.

I am wondering is there a document somewhere that describes the process of
running Wikipedia bayes example ?
https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html  seems no
longer work.







Greatly appreciated!







Wei