The strange thing is that whether I use -Xmx128m or -Xmx16384m, the process stops at chunk #571 (571*64MB = ~36.5GB). I still haven't figured out whether this is a problem with the JVM, Hadoop, or Mahout.
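One possible explanation for the -Xmx value making no difference: wikipediaXMLSplitter appears to run inside the client-side JVM launched by bin/mahout (rather than as a map/reduce job), so a heap size passed only through the mapred task options may never reach it. A rough sketch of setting the client heap instead, assuming bin/mahout honours a MAHOUT_HEAPSIZE variable and the hadoop launcher honours HADOOP_CLIENT_OPTS (both are assumptions worth checking against the installed scripts):

  # Heap for the Mahout client JVM, in MB; 8192 is picked only for illustration.
  export MAHOUT_HEAPSIZE=8192
  # If bin/mahout delegates to "hadoop jar", the client JVM heap is usually taken from here instead.
  export HADOOP_CLIENT_OPTS="-Xmx8192m"
  $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter \
      -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml \
      -o wikipedia/chunks -c 64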
I have tested various parameters on 16GB RAM:

  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx2048m</value>
  </property>
  <property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx4096m</value>
  </property>

Is there a relation between these parameters and the amount of available memory? I also see a HADOOP_HEAPSIZE setting in hadoop-env.sh which is commented out by default. What is that?

Regards,
Mahmood


On Tuesday, March 11, 2014 11:57 PM, Mahmood Naderan <nt_mahm...@yahoo.com> wrote:

As I posted earlier, here is the result of a successful test: a 5.4GB XML file (which is larger than enwiki-latest-pages-articles10.xml) with 4GB of RAM and -Xmx128m took 5 minutes to complete. I didn't find a larger Wikipedia XML file; I still need to test 10GB, 20GB and 30GB files.

Regards,
Mahmood


On Tuesday, March 11, 2014 11:41 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:

Can you please try running this on a smaller file first, per Suneel's comment a while back: "Please first try running this on a smaller dataset like 'enwiki-latest-pages-articles10.xml' as opposed to running on the entire english wikipedia."

On Tue, Mar 11, 2014 at 12:56 PM, Mahmood Naderan <nt_mahm...@yahoo.com> wrote:

> Hi,
> Recently I have faced a heap size error when I run
>
>   $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter \
>       -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml \
>       -o wikipedia/chunks -c 64
>
> Here are the specs:
> 1- XML file size = 44GB
> 2- System memory = 54GB (on VirtualBox)
> 3- Heap size = 51GB (-Xmx51000m)
>
> At the time of failure, I see that 571 chunks have been created (hadoop dfs -ls),
> so 36GB of the original file has been processed. Now here are my questions:
>
> 1- Is there any way to resume the process? As stated before, 571 chunks
> have been created, so by resuming, it could create the rest of the chunks
> (572 onwards).
>
> 2- Is it possible to parallelize the process? Assume 100GB of heap is
> required to process the XML file and my system cannot afford that. Then we
> could create 20 threads, each requiring 5GB of heap. By feeding the first
> 10 threads we would use the available 50GB of heap, and after they complete,
> we could feed the next set of threads.
>
> Regards,
> Mahmood
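A rough way to see how far an interrupted run got, related to question 1 above -- whether the splitter itself can resume is a separate matter that I have not verified. This assumes the chunks land under wikipedia/chunks with names like chunk-0001.xml:

  # Count the chunk files already written to HDFS (571 in the failing run above).
  hadoop dfs -ls wikipedia/chunks | grep -c 'chunk-'
  # Approximate amount of input already covered, in MB (64MB per chunk).
  echo $(( 571 * 64 ))

On question 2, note that threads inside a single JVM all share that JVM's one heap, so "20 threads with 5GB each" would in practice mean 20 separate processes, each with its own -Xmx.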