I am pretty sure that there is something wrong with Hadoop/Mahout/Java. With 
any configuration, it gets stuck at chunk #571. The previous chunks are created 
rapidly, but it waits for about 30 minutes on #571 and that is what leads to 
the heap size error.

I will try to submit a bug report.
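
Before filing it, I will try to grab a thread dump while the process sits on 
chunk #571 and attach it to the report. Assuming the splitter runs inside the 
local mahout JVM (not as a MapReduce job), something like this should show 
where it is stuck:

jps -l                             # find the pid of the mahout driver process
jstack <pid> > splitter-stall.txt  # thread dump; replace <pid> with the pid shown by jps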

 
Regards,
Mahmood



On Thursday, March 13, 2014 2:31 PM, Mahmood Naderan <nt_mahm...@yahoo.com> 
wrote:
 
The strange thing is that whether I use -Xmx128m or -Xmx16384m, the process 
stops at chunk #571 (571 * 64MB = ~36.5GB).
I still haven't figured out whether this is a problem with the JVM, Hadoop or 
Mahout.

I have tested various parameters on a machine with 16GB of RAM:


<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>

<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>

Is there a relation between these parameters and the amount of available memory?
I also see a HADOOP_HEAPSIZE setting in hadoop-env.sh which is commented out by 
default. What is that for?
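
(My rough understanding, which may well be wrong: each map/reduce task runs in 
its own child JVM, so with the default 2 map and 2 reduce slots per node the 
task heaps alone could grow to 2 x 2048m + 2 x 4096m = 12GB, which would not 
leave much of the 16GB for the Hadoop daemons themselves.)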
 
Regards,
Mahmood



On Tuesday, March 11, 2014 11:57 PM, Mahmood Naderan <nt_mahm...@yahoo.com> 
wrote:
 
As I posted earlier, here is the result of a successful test:

A 5.4GB XML file (which is larger than enwiki-latest-pages-articles10.xml) with 
4GB of RAM and -Xmx128m took 5 minutes to complete.

I didn't find a larger Wikipedia XML file, so I still need to test 10GB, 20GB 
and 30GB files; one idea for fabricating such a file is sketched below.
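
Something like this might work to build a bigger test file (my own untested 
sketch, assuming the dump is a single <mediawiki> document with the closing tag 
on its last line; big-test.xml is just a name I made up):

SRC=enwiki-latest-pages-articles10.xml
# keep everything except the closing </mediawiki> of the first copy (GNU head)
head -n -1 "$SRC" > big-test.xml
# append only the <page>...</page> blocks from a few extra copies
for i in 1 2 3; do
  sed -n '/<page>/,/<\/page>/p' "$SRC" >> big-test.xml
done
# close the root element again
echo '</mediawiki>' >> big-test.xml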


 
Regards,
Mahmood




On Tuesday, March 11, 2014 11:41 PM, Andrew Musselman 
<andrew.mussel...@gmail.com> wrote:

Can you please try running this on a smaller file first, per Suneel's
comment a while back:

"Please first try running this on a smaller dataset like
'enwiki-latest-pages-articles10.xml' as opposed to running on the entire
english wikipedia."



On Tue, Mar 11, 2014 at 12:56 PM, Mahmood Naderan <nt_mahm...@yahoo.com> wrote:

> Hi,
> Recently I have faced a heap size error when I run
>
>   $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter \
>     -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml \
>     -o wikipedia/chunks -c 64
>
> Here are the specs:
> 1- XML file size = 44GB
> 2- System memory = 54GB (on VirtualBox)
> 3- Heap size = 51GB (-Xmx51000m)
>
> At the time of failure, I see that 571 chunks have been created (hadoop dfs -ls),
> so about 36GB of the original file has been processed. Now here are my questions:
>
> 1- Is there any way to resume the process? As stated before, 571 chunks have
> been created, so by resuming it could create the rest of the chunks (572
> onwards).
>
> 2- Is it possible to parallelize the process? Assume 100GB of heap is
> required to process the XML file and my system cannot afford that. Then we
> could create 20 threads, each requiring 5GB of heap. By feeding the first 10
> threads we would use the available 50GB of heap, and after they complete we
> could feed the next set of threads.
>
>
> Regards,
> Mahmood
