Hi,
Recently I ran into a heap size error when running:

  $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter \
    -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml \
    -o wikipedia/chunks \
    -c 64

Here are the specs:
1- XML file size = 44GB
2- System memory = 54GB (on VirtualBox)
3- Heap size = 51GB (-Xmx51000m)

At the time of failure, 571 chunks had been created (according to hadoop dfs -ls), so 
about 36GB of the original file had been processed. Now here are my questions:

1- Is there any way to resume the process? As stated above, 571 chunks have already 
been created, so a resumed run would only need to create the remaining chunks (572 onwards).

2- Is it possible to parallelize the process? Suppose 100GB of heap were required to 
process the XML file and my system could not afford that. We could then create 20 
workers, each requiring 5GB of heap, feed in the first 10 to use the available 50GB 
of heap, and once they complete, feed in the next set (a rough sketch of this batching 
idea follows below).
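
For illustration, here is a minimal sketch of that batching idea in Java. It assumes a 
hypothetical processChunkRange(first, last) worker; wikipediaXMLSplitter itself does not 
expose such a call, so this only shows the scheduling side, not the actual splitting. A 
fixed pool of 10 threads works through 20 range tasks, so at most 10 run at a time:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;

  public class BatchedSplitSketch {

      // Hypothetical worker: split chunks [first, last) of the dump.
      static void processChunkRange(int first, int last) {
          System.out.printf("processing chunks %d to %d%n", first, last - 1);
      }

      public static void main(String[] args) throws Exception {
          int totalChunks = 704;   // roughly 44GB split into 64MB chunks
          int tasks = 20;          // 20 workers, each sized for ~5GB of heap
          int concurrent = 10;     // only 10 run at once, i.e. ~50GB in use

          int perTask = (totalChunks + tasks - 1) / tasks;
          ExecutorService pool = Executors.newFixedThreadPool(concurrent);
          List<Future<?>> futures = new ArrayList<>();

          for (int t = 0; t < tasks; t++) {
              final int first = t * perTask;
              final int last = Math.min(first + perTask, totalChunks);
              futures.add(pool.submit(() -> processChunkRange(first, last)));
          }
          for (Future<?> f : futures) {
              f.get();             // block until every batch has finished
          }
          pool.shutdown();
      }
  }

Note that threads inside a single JVM share one heap, so a real 5GB-per-worker limit 
would need separate JVM processes, each launched with its own -Xmx.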

 
Regards,
Mahmood
