Hi,

I've been trying to run Mahout (with Hadoop) on our data for quite some time
now. Everything is fine on relatively small data sets, but when I try to do
K-Means clustering seeded by Canopy on roughly 300,000 documents, I can't
even get past the canopy generation because of OOM errors. We're going to
cluster similar news articles, so T1 and T2 are set to 0.84 and 0.6 (those
values lead to the desired results on sample data).
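
In case it helps, the canopy step is invoked roughly like this (the paths
are placeholders and the cosine distance measure is only shown because the
thresholds fall between 0 and 1, so take the exact flags with a grain of
salt):

    mahout canopy \
      -i news-vectors/tfidf-vectors \
      -o news-canopies \
      -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
      -t1 0.84 -t2 0.6 \
      -ow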

I tried setting both "mapred.map.child.java.opts" and
"mapred.reduce.child.java.opts" to "-Xmx4096M", and I also
exported HADOOP_HEAPSIZE=4000, but I'm still having issues.
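
Concretely, those settings live in mapred-site.xml, along these lines (a
sketch of what I set; the same properties could also be passed as -D
options on the command line):

    <property>
      <name>mapred.map.child.java.opts</name>
      <value>-Xmx4096M</value>
    </property>
    <property>
      <name>mapred.reduce.child.java.opts</name>
      <value>-Xmx4096M</value>
    </property>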

I'm running all of this in Hadoop's single-node, pseudo-distributed mode on
a machine with 16GB of RAM.

Searching the Internet for solutions, I found this [1]. One of the bullet
points states that:

    "In all of the algorithms, all clusters are retained in memory by the
mappers and reducers"

So my question is: does Mahout on Hadoop only help with distributing
CPU-bound operations? What should one do if they have a large dataset and
only a handful of low-RAM commodity nodes?

I'm obviously a newbie; thanks for bearing with me.

[1]
http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%3c506307eb.3090...@windwardsolutions.com%3E

Cheers,

Amir
