Suneel,

Thanks!

I tried Streaming K-Means, and now I have two naive questions:

1) If I understand correctly, to use the results of streaming k-means I need
to iterate over all of my vectors again and assign each one to the cluster
whose centroid is closest to it, right? (There's a rough sketch of what I
mean below, after question 2.)

2) When clustering news, the number of clusters isn't known beforehand. We
used to use canopy as a fast approximate clustering technique, but as I
understand it, streaming k-means requires "K" in advance. How can I avoid
guessing K?
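
To make question 1 concrete, here is roughly what I have in mind, assuming
the final centroids from the streaming k-means output have already been
loaded into a list (the loading step is omitted and all names here are just
placeholders I made up):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.mahout.common.distance.CosineDistanceMeasure;
    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.math.Vector;

    public class NearestCentroidAssignment {

      // For each document vector, record the index of the closest centroid.
      // 'centroids' is assumed to hold the final centroids produced by
      // streaming k-means; reading them from the job output is not shown here.
      public static List<Integer> assign(List<Vector> docVectors,
                                         List<Vector> centroids) {
        DistanceMeasure dm = new CosineDistanceMeasure();
        List<Integer> clusterIds = new ArrayList<Integer>();
        for (Vector doc : docVectors) {
          int best = -1;
          double bestDistance = Double.MAX_VALUE;
          for (int i = 0; i < centroids.size(); i++) {
            double d = dm.distance(centroids.get(i), doc);
            if (d < bestDistance) {
              bestDistance = d;
              best = i;
            }
          }
          clusterIds.add(best);
        }
        return clusterIds;
      }
    }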

Regards,

Amir



On Wed, Dec 4, 2013 at 6:27 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

> Amir,
>
>
> This has been reported before by several others (and it has been my
> experience too). The OOM happens during the Canopy Generation phase of
> Canopy clustering because that phase runs with only a single reducer.
>
> If you are using Mahout 0.8 (or trunk), I suggest that you look at the new
> Streaming KMeans clustering, which is quicker and more efficient than the
> traditional Canopy -> KMeans.
>
> See the following link for how to run Streaming KMeans.
>
>
> http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means
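>
> If you'd rather kick the job off from Java instead of the shell, the sketch
> below shows roughly what that looks like. Note that I'm writing the option
> names and class names from memory, so please verify them against the output
> of "mahout streamingkmeans --help", and treat all paths and numbers as
> placeholders:
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.util.ToolRunner;
>     import org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver;
>
>     public class RunStreamingKMeans {
>       public static void main(String[] args) throws Exception {
>         // Drives the same job as the "mahout streamingkmeans" command line;
>         // the arguments mirror the CLI options.
>         ToolRunner.run(new Configuration(), new StreamingKMeansDriver(), new String[] {
>             "-i", "/path/to/tfidf-vectors",           // placeholder input path
>             "-o", "/path/to/streamingkmeans-output",  // placeholder output path
>             "-k", "100",                              // number of clusters you want
>             "-km", "1000",                            // estimated number of map-side clusters
>             "-dm", "org.apache.mahout.common.distance.CosineDistanceMeasure",
>             "-ow"                                     // overwrite any previous output
>         });
>       }
>     }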
>
> On Wednesday, December 4, 2013 1:19 PM, Amir Mohammad Saied <
> amirsa...@gmail.com> wrote:
>
> Hi,
>
> I've been trying to run Mahout (with Hadoop) on our data for quite some time
> now. Everything is fine on relatively small data sets, but when I try to do
> K-Means clustering with the aid of Canopy on roughly 300,000 documents, I
> can't even get past the canopy generation because of an OOM. We're clustering
> similar news stories, so T1 and T2 are set to 0.84 and 0.6 (those values led
> to the desired results on sample data).
>
> I tried setting both "mapred.map.child.java.opts" and
> "mapred.reduce.child.java.opts" to "-Xmx4096M", and I also exported
> HADOOP_HEAPSIZE=4000, but I'm still having issues.
>
> I'm running all of this in Hadoop's single-node, pseudo-distributed mode on
> a machine with 16GB of RAM.
>
> Searching the Internet for solutions, I found this[1]. One of the bullet
> points states that:
>
>     "In all of the algorithms, all clusters are retained in memory by the
> mappers and reducers"
>
> So my question is: does Mahout on Hadoop only help with distributing
> CPU-bound operations? What should one do if they have a large dataset and
> only a handful of low-RAM commodity nodes?
>
> I'm obviously a newbie; thanks for bearing with me.
>
> [1]
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%3c506307eb.3090...@windwardsolutions.com%3E
>
> Cheers,
>
> Amir
>
