The extracted size is about 960MB (enwiki-latest-pages-articles10.xml). With 4GB of RAM allocated to the OS and -Xmx128m for Hadoop, it took 77 seconds to create the 64MB chunks. I can see 15 chunks with "hadoop dfs -ls".
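A quick way to sanity-check the output, for anyone reproducing this (the chunk file name below is an assumption; wikipediaXMLSplitter's naming may differ across releases):

  # list the generated chunks: a ~960MB input at 64MB per chunk should yield 15 files
  hadoop dfs -ls wikipedia/chunks

  # peek at the start of one chunk ("chunk-0001.xml" is a hypothetical name)
  hadoop dfs -cat wikipedia/chunks/chunk-0001.xml | head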
P.S.: Whenever I modify the -Xmx value in mapred-site.xml, I run

  $HADOOP/sbin/stop-all.sh && $HADOOP/sbin/start-all.sh

Is that necessary?

Regards,
Mahmood


On Monday, March 10, 2014 5:30 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

Morning Mahmood,

Please first try running this on a smaller dataset like 'enwiki-latest-pages-articles10.xml' as opposed to running on the entire English Wikipedia.


On Monday, March 10, 2014 2:59 AM, Mahmood Naderan <nt_mahm...@yahoo.com> wrote:

Thanks for the update. The thing is, while that command is running, I run 'top' in another terminal and see that the java process uses less than 1GB of memory. As another test, I increased the memory to 48GB (since I am working with VirtualBox) and set the heap size to -Xmx45000m; I still get the heap error. I would expect a more meaningful error message saying *who* needs more heap space: Hadoop, Mahout, Java, ...?

Regards,
Mahmood


On Monday, March 10, 2014 1:31 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

Mahmood,

Firstly, thanks for starting this email thread and for highlighting the issues with the wikipedia example. Since you raised this issue, I updated the new wikipedia examples page at http://mahout.apache.org/users/classification/wikipedia-bayes-example.html and also responded to a similar question on StackOverflow at http://stackoverflow.com/questions/19505422/mahout-error-when-try-out-wikipedia-examples/22286839#22286839.

I am assuming that you are running this locally on your machine and are just trying out the examples. Try Sebastian's suggestion, or else try running the example on a much smaller dataset of wikipedia articles.

Lastly, we do realize that you have been struggling with this for about 3 days now. Mahout presently lacks an entry for 'wikipediaXmlSplitter' in driver.classes.default.props; I am not sure at what point in time and in which release that happened. Please file a Jira for this and submit a patch.


On Sunday, March 9, 2014 2:25 PM, Mahmood Naderan <nt_mahm...@yahoo.com> wrote:

Hi Suneel,

Do you have any idea? Searching the web shows many questions regarding the heap size for wikipediaXMLSplitter. I have increased the memory size to 16GB and still get that error. I have to say that, using the 'top' command, I see only 1GB of memory in use, so I wonder why it reports such an error. Is this a problem with Java, Mahout, Hadoop, ...?

Regards,
Mahmood


On Sunday, March 9, 2014 4:00 PM, Mahmood Naderan <nt_mahm...@yahoo.com> wrote:

Excuse me, I added the -Xmx option and restarted the Hadoop services using

  sbin/stop-all.sh && sbin/start-all.sh

but I still get the heap size error. How can I find the correct, needed heap size?

Regards,
Mahmood


On Sunday, March 9, 2014 1:37 PM, Mahmood Naderan <nt_mahm...@yahoo.com> wrote:

OK, I found that I have to add this property to mapred-site.xml:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
  </property>

Regards,
Mahmood


On Sunday, March 9, 2014 11:39 AM, Mahmood Naderan <nt_mahm...@yahoo.com> wrote:

Hello,

I ran this command:

  ./bin/mahout wikipediaXMLSplitter -d examples/temp/enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64

but got this error:

  Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

There are many web pages about this, and the suggested solution is to add "-Xmx2048m", for example. My question is: that option should be passed to the java command, not to Mahout. Accordingly, running "./bin/mahout -Xmx2048m" shows that there is no such option. What should I do?

Regards,
Mahmood
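A note for readers hitting the same OutOfMemoryError: wikipediaXMLSplitter runs in the client-side JVM launched by bin/mahout, not in a Hadoop task, so mapred.child.java.opts (which only sizes the map/reduce child JVMs) does not apply here, and -Xmx is a JVM flag that bin/mahout will not accept as its own option. A minimal sketch of one way to raise the driver heap, assuming a Mahout 0.x bin/mahout that honors the MAHOUT_HEAPSIZE environment variable (value in MB); verify against your copy of the script:

  # MAHOUT_HEAPSIZE sizes the JVM that bin/mahout starts (assumed behavior of
  # the 0.x launcher script; check bin/mahout in your installation)
  export MAHOUT_HEAPSIZE=2048
  ./bin/mahout wikipediaXMLSplitter \
      -d examples/temp/enwiki-latest-pages-articles10.xml \
      -o wikipedia/chunks -c 64

If bin/mahout delegates to "hadoop jar" (HADOOP_HOME set and MAHOUT_LOCAL unset), the client heap is governed by Hadoop instead; in that case raising HADOOP_HEAPSIZE in hadoop-env.sh is the usual knob, which may be the "-Xmx128m for Hadoop" referred to in the successful run at the top of this thread.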