Hi, I'm a newbie trying to run the Wikipedia Bayes example on EC2 (using a CDH 4.5 setup). I've searched the archives and haven't been able to find anything on this, so I apologize if this is a duplicate question.
The Cloudera install comes with Mahout 0.7, and I've run into a few snags on the first step (chunking the data into pieces).

The first snag was that it couldn't find wikipediaXMLSplitter. Substituting the fully qualified class name org.apache.mahout.text.wikipedia.WikipediaXmlSplitter in the command got past that error (just fixing the capitalization wasn't enough).

Now I'm stuck on a java.lang.OutOfMemoryError: Java heap space. I upped MAHOUT_HEAPSIZE to 5000 and am still getting the same error; the full stack trace is here: http://pastebin.com/P5PYuR8U (I added a print statement to bin/mahout just to confirm that my export of MAHOUT_HEAPSIZE was actually being detected.)

I'm wondering whether some other setting is overriding MAHOUT_HEAPSIZE, perhaps one of the Hadoop- or Cloudera-specific ones? Does anyone have experience with this or suggestions? The exact command and settings I'm using are in the postscript below.

Thank you,
Jessie Wright
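P.S. For reference, this is roughly the command I'm running; the dump path, output path, and chunk size below are stand-ins for my actual values:

    export MAHOUT_HEAPSIZE=5000
    bin/mahout org.apache.mahout.text.wikipedia.WikipediaXmlSplitter \
        -d /mnt/enwiki-latest-pages-articles.xml \
        -o /mnt/wikipedia/chunks \
        -c 64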
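The print statement I mentioned adding to bin/mahout looks something like this (paraphrasing the 0.7 script from memory; the surrounding lines are the stock heap check):

    # check envvars which might override default args
    if [ "$MAHOUT_HEAPSIZE" != "" ]; then
      echo "Using MAHOUT_HEAPSIZE=$MAHOUT_HEAPSIZE"   # <-- my added debug line
      JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
    fi

The debug line does print when I run the command, so the variable is definitely reaching the script.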
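When I ask about Hadoop- or Cloudera-specific settings taking precedence, I mean things like the client-heap variables from hadoop-env.sh. I haven't confirmed that bin/mahout hands the JVM launch off to the hadoop script, so this is a guess on my part:

    # e.g. in /etc/hadoop/conf/hadoop-env.sh (or exported before running mahout)
    export HADOOP_HEAPSIZE=5000
    export HADOOP_CLIENT_OPTS="-Xmx5000m"

Would settings like these override MAHOUT_HEAPSIZE on a CDH setup?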