This is not right.  The sequential version would have finished long before
this for any reasonable value of k.

I do note, however, that you have set k = 200,000 where you only have
300,000 documents.  Depending on which value you set (I don't have the
code handy), this may actually be increased inside streaming k-means when
it computes the number of sketch centroids, by a factor of roughly
2 log N ≈ 2 * 18 = 36 for N = 300,000.  That gives far more sketch
centroids than you have data points, which is silly.
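
To make that concrete (a back-of-the-envelope estimate, assuming the
sketch size really is about k * 2 log2(N)):

    k = 200,000; N = 300,000
    2 * log2(300,000) ≈ 2 * 18 = 36
    sketch centroids ≈ 200,000 * 36 = 7,200,000

That is about 24 sketch centroids per input document.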

Try again with a more reasonable value of k, such as 1000.

On Wed, Dec 11, 2013 at 7:02 AM, Amir Mohammad Saied <amirsa...@gmail.com> wrote:

> Hi,
>
> I first tried Streaming K-means with about 5,000 news stories, and it worked
> just fine. Then I tried it on 300,000 news stories and gave it 10GB of
> RAM. After more than 43 hours it was still in the last merge pass, and I
> eventually decided to stop it.
>
> I set K to 200000 and KM to 2522308 (it's for detecting similar/related
> news stories). Using these values, is it expected to take so long?
>
> Cheers,
>
> amir
>
>
> On Thu, Dec 5, 2013 at 3:38 PM, Amir Mohammad Saied <amirsa...@gmail.com> wrote:
>
> > Suneel,
> >
> > Thanks!
> >
> > I tried Streaming K-Means, and now I have two naive questions:
> >
> > 1) If I understand correctly, to use the results of streaming k-means I
> > need to iterate over all of my vectors again and assign each one to the
> > cluster with the closest centroid, right?
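> >
> > Something like this sketch is what I have in mind (rough Java; it assumes
> > Mahout 0.8's Vector and DistanceMeasure classes, and the helper class and
> > names here are just illustrative):
> >
> >     import java.util.List;
> >
> >     import org.apache.mahout.common.distance.DistanceMeasure;
> >     import org.apache.mahout.math.Vector;
> >
> >     public class NearestCentroid {
> >       // Return the index of the centroid closest to v.
> >       public static int nearest(Vector v, List<Vector> centroids,
> >                                 DistanceMeasure measure) {
> >         int best = -1;
> >         double bestDistance = Double.POSITIVE_INFINITY;
> >         for (int i = 0; i < centroids.size(); i++) {
> >           double d = measure.distance(centroids.get(i), v);
> >           if (d < bestDistance) {
> >             bestDistance = d;
> >             best = i;
> >           }
> >         }
> >         return best;
> >       }
> >     }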
> >
> > 2) In clustering news, the number of clusters isn't known beforehand. We
> > used to use Canopy as a fast approximate clustering technique, but as I
> > understand it, streaming k-means requires "K" in advance. How can I avoid
> > guessing K?
> >
> > Regards,
> >
> > Amir
> >
> >
> >
> > On Wed, Dec 4, 2013 at 6:27 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
> >
> >> Amir,
> >>
> >>
> >> This has been reported before by several others (and has been my
> >> experience too). The OOM happens during the Canopy generation phase of
> >> Canopy clustering because that phase runs with a single reducer.
> >>
> >> If you are using Mahout 0.8 (or trunk), I suggest that you look at the
> >> new Streaming KMeans clustering, which is quicker and more efficient
> >> than the traditional Canopy -> KMeans pipeline.
> >>
> >> See the following link for how to run Streaming KMeans.
> >>
> >> http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means
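> >>
> >> As a sketch, the invocation looks something like this (illustrative
> >> only; the paths are placeholders, so double-check the exact flags
> >> against 'mahout streamingkmeans --help' for your version, and note
> >> that -km is usually set to about k * log2(N)):
> >>
> >>     mahout streamingkmeans \
> >>       -i /path/to/tfidf-vectors \
> >>       -o /path/to/streaming-kmeans-output \
> >>       -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> >>       -k 1000 -km 18000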
> >>
> >> On Wednesday, December 4, 2013 1:19 PM, Amir Mohammad Saied
> >> <amirsa...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> I've been trying to run Mahout (with Hadoop) on our data for quite some
> >> time now. Everything is fine on relatively small data sets, but when I
> >> try to do K-Means clustering with the aid of Canopy on about 300,000
> >> documents, I can't even get past the canopy generation because of an
> >> OOM. We're going to cluster similar news, so T1 and T2 are set to 0.84
> >> and 0.6 (those values lead to the desired results on sample data).
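> >>
> >> For reference, the invocation is along these lines (paths are
> >> placeholders; the flags are the canopy driver's usual -t1/-t2 and
> >> distance-measure options):
> >>
> >>     mahout canopy \
> >>       -i /path/to/tfidf-vectors \
> >>       -o /path/to/canopy-centroids \
> >>       -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
> >>       -t1 0.84 -t2 0.6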
> >>
> >> I tried setting both "mapred.map.child.java.opts" and
> >> "mapred.reduce.child.java.opts" to "-Xmx4096M", and I also exported
> >> HADOOP_HEAPSIZE as 4000, but I'm still having issues.
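> >>
> >> Concretely, in mapred-site.xml that means something like the following
> >> (these are the exact values I tried):
> >>
> >>     <property>
> >>       <name>mapred.map.child.java.opts</name>
> >>       <value>-Xmx4096M</value>
> >>     </property>
> >>     <property>
> >>       <name>mapred.reduce.child.java.opts</name>
> >>       <value>-Xmx4096M</value>
> >>     </property>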
> >>
> >> I'm running all of this in Hadoop's single-node, pseudo-distributed
> >> mode on a machine with 16GB of RAM.
> >>
> >> Searching the Internet for solutions, I found this[1]. One of the
> >> bullet points states that:
> >>
> >>     "In all of the algorithms, all clusters are retained in memory by
> >>     the mappers and reducers"
> >>
> >> So my question is: does Mahout on Hadoop only help in distributing
> >> CPU-bound operations? What should one do with a large dataset and only
> >> a handful of low-RAM commodity nodes?
> >>
> >> I'm obviously a newbie; thanks for bearing with me.
> >>
> >> [1]
> >> http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%3c506307eb.3090...@windwardsolutions.com%3E
> >>
> >> Cheers,
> >>
> >> Amir
> >>
> >
> >
>
