Hi, I'm a newbie trying to run the Wikipedia Bayes example on EC2 (using a CDH 4.5 setup). I've searched the archives and haven't been able to find anything on this, so I apologize if this is a duplicate question.
The Cloudera install comes with Mahout 0.7, and I've run into a few snags on the first step (chunking the data into pieces).

The first snag was that it couldn't find wikipediaXMLSplitter. Substituting the fully qualified class name org.apache.mahout.text.wikipedia.WikipediaXmlSplitter in the command got past that error (just fixing the capitalization wasn't enough).

Now I'm stuck on a java.lang.OutOfMemoryError: Java heap space. I upped MAHOUT_HEAPSIZE to 5000 and am still getting the same error; the full stack trace is here: http://pastebin.com/P5PYuR8U (I added a print statement to bin/mahout just to confirm that my export of MAHOUT_HEAPSIZE was actually being detected.)

I'm wondering whether some other setting is overriding MAHOUT_HEAPSIZE, perhaps one of the Hadoop- or Cloudera-specific ones? Does anyone have experience with this or suggestions? The exact command and settings I'm using are in the postscript below.

Thank you,
Jessie Wright
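P.S. For reference, this is roughly the command I'm running; the dump path, output path, and chunk size below are stand-ins for my actual values:

    export MAHOUT_HEAPSIZE=5000
    bin/mahout org.apache.mahout.text.wikipedia.WikipediaXmlSplitter \
        -d /mnt/enwiki-latest-pages-articles.xml \
        -o /mnt/wikipedia/chunks \
        -c 64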
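The print statement I mentioned adding to bin/mahout looks something like this (paraphrasing the 0.7 script from memory; the surrounding lines are the stock heap check):

    # check envvars which might override default args
    if [ "$MAHOUT_HEAPSIZE" != "" ]; then
      echo "Using MAHOUT_HEAPSIZE=$MAHOUT_HEAPSIZE"   # <-- my added debug line
      JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
    fi

The debug line does print when I run the command, so the variable is definitely reaching the script.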
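When I ask about Hadoop- or Cloudera-specific settings taking precedence, I mean things like the client-heap variables from hadoop-env.sh. I haven't confirmed that bin/mahout hands the JVM launch off to the hadoop script, so this is a guess on my part:

    # e.g. in /etc/hadoop/conf/hadoop-env.sh (or exported before running mahout)
    export HADOOP_HEAPSIZE=5000
    export HADOOP_CLIENT_OPTS="-Xmx5000m"

Would settings like these override MAHOUT_HEAPSIZE on a CDH setup?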