As I posted earlier, here is the result of a successful test:

A 5.4GB XML file (which is larger than enwiki-latest-pages-articles10.xml), 
processed with 4GB of RAM and -Xmx128m, took 5 minutes to complete.

I couldn't find a larger Wikipedia XML file, so I still need to test 10GB, 
20GB and 30GB files.
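
For reference, this is roughly how the test was invoked. The MAHOUT_HEAPSIZE 
variable is my assumption about how bin/mahout passes the heap setting (it 
should end up as -Xmx128m), and the input path is just a placeholder for the 
5.4GB test file:

  export MAHOUT_HEAPSIZE=128   # assumed to become -Xmx128m inside bin/mahout
  $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter \
    -d /path/to/test-5.4GB.xml \
    -o wikipedia/chunks -c 64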


 
Regards,
Mahmood



On Tuesday, March 11, 2014 11:41 PM, Andrew Musselman 
<andrew.mussel...@gmail.com> wrote:
 
Can you please try running this on a smaller file first, per Suneel's
comment a while back:

"Please first try running this on a smaller dataset like
'enwiki-latest-pages-articles10.xml' as opposed to running on the entire
english wikipedia."



On Tue, Mar 11, 2014 at 12:56 PM, Mahmood Naderan <nt_mahm...@yahoo.com> wrote:

> Hi,
> Recently I have been hitting a heap size error when I run
>
>   $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter \
>     -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml \
>     -o wikipedia/chunks -c 64
>
> Here are the specs:
> 1- XML file size = 44GB
> 2- System memory = 54GB (on VirtualBox)
> 3- Heap size = 51GB (-Xmx51000m)
>
> At the time of failure, I see that 571 chunks have been created (hadoop dfs -ls),
> so roughly 36GB (571 x 64MB) of the original file has been processed. Now here are my questions:
>
> 1- Is there any way to resume the process? As stated, 571 chunks have
> already been created, so by resuming, it could create the remaining chunks
> (572 onwards).
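>
> (Roughly what I have in mind for working out where it stopped; the .xml
> name pattern is my assumption about how the splitter names its output:)
>
>   CHUNKS=$(hadoop dfs -ls wikipedia/chunks | grep -c '\.xml')
>   echo "chunks written so far: $CHUNKS"
>   echo "approx bytes already processed: $(( CHUNKS * 64 * 1024 * 1024 ))"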
>
> 2- Is it possible to parallelize the process? Suppose 100GB of heap is
> required to process the XML file and my system cannot afford that. Then we
> could create 20 threads, each requiring 5GB of heap. By feeding the first
> 10 threads we could use the available 50GB of heap, and after they
> complete, we could feed the next set of threads.
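>
> (A sketch of just the scheduling part, using separate JVMs rather than
> threads, and assuming the 44GB dump could somehow be pre-split into valid
> stand-alone XML pieces, which is really the open question; the piece-*.xml
> names and the MAHOUT_HEAPSIZE handling are hypothetical:)
>
>   # run at most 10 splitter instances at a time, ~5GB heap each
>   ls piece-*.xml | xargs -P10 -I{} \
>     env MAHOUT_HEAPSIZE=5000 $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter \
>       -d {} -o wikipedia/chunks-{} -c 64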
>
>
> Regards,
> Mahmood
