This is not right. The sequential version would have finished long before this for any reasonable value of k.
I do note, however, that you have set k = 200,000 where you only have 300,000 documents. Depending on which value you set (I don't have the code handy), k may actually be increased inside streaming k-means when it computes the number of sketch centroids, by a factor of roughly 2 log N \approx 2 * 18. That gives far more clusters than you have data points, which is silly. Try again with a more reasonable value of k, such as 1000.

On Wed, Dec 11, 2013 at 7:02 AM, Amir Mohammad Saied <amirsa...@gmail.com> wrote:

> Hi,
>
> I first tried streaming k-means with about 5,000 news stories, and it
> worked just fine. Then I tried it on 300,000 news stories and gave it
> 10 GB of RAM. After more than 43 hours, it was still in the last merge
> pass when I eventually decided to stop it.
>
> I set K to 200000 and KM to 2522308 (it's for detecting similar/related
> news stories). Using these values, is it expected to take so long?
>
> Cheers,
>
> Amir
>
>
> On Thu, Dec 5, 2013 at 3:38 PM, Amir Mohammad Saied <amirsa...@gmail.com> wrote:
>
> > Suneel,
> >
> > Thanks!
> >
> > I tried streaming k-means, and now I have two naive questions:
> >
> > 1) If I understand correctly, to use the results of streaming k-means
> > I need to iterate over all of my vectors again and assign each one to
> > the cluster with the closest centroid, right?
> >
> > 2) In clustering news, the number of clusters isn't known beforehand.
> > We used to use canopy as a fast approximate clustering technique, but
> > as I understand it, streaming k-means requires "K" in advance. How can
> > I avoid guessing K?
> >
> > Regards,
> >
> > Amir
> >
> >
> > On Wed, Dec 4, 2013 at 6:27 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
> >
> >> Amir,
> >>
> >> This has been reported before by several others (and has been my
> >> experience too). The OOM happens during the canopy-generation phase of
> >> canopy clustering because it only runs with a single reducer.
> >> If you are using Mahout 0.8 (or trunk), I suggest that you look at
> >> the new streaming k-means clustering, which is quicker and more
> >> efficient than the traditional Canopy -> KMeans.
> >>
> >> See the following link for how to run streaming k-means:
> >>
> >> http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means
> >>
> >>
> >> On Wednesday, December 4, 2013 1:19 PM, Amir Mohammad Saied <
> >> amirsa...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> I've been trying to run Mahout (with Hadoop) on our data for quite
> >> some time now. Everything is fine on relatively small data sets, but
> >> when I try to do k-means clustering with the aid of canopy on some
> >> 300000 documents, I can't even get past the canopy generation because
> >> of OOM. We're going to cluster similar news, so T1 and T2 are set to
> >> 0.84 and 0.6 (those values lead to the desired results on sample data).
> >>
> >> I tried setting both "mapred.map.child.java.opts" and
> >> "mapred.reduce.child.java.opts" to "-Xmx4096M", I also exported
> >> HADOOP_HEAPSIZE as 4000, and I'm still having issues.
> >>
> >> I'm running all of this in Hadoop's single-node, pseudo-distributed
> >> mode on a machine with 16 GB of RAM.
> >>
> >> Searching the Internet for solutions, I found this[1]. One of the
> >> bullet points states that:
> >>
> >>     "In all of the algorithms, all clusters are retained in memory
> >>     by the mappers and reducers"
> >>
> >> So my question is: does Mahout on Hadoop only help in distributing
> >> CPU-bound operations? What should one do if they have a large dataset
> >> and only a handful of low-RAM commodity nodes?
> >>
> >> I'm obviously a newbie, thanks for bearing with me.
> >>
> >> [1] http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%3c506307eb.3090...@windwardsolutions.com%3E
> >>
> >> Cheers,
> >>
> >> Amir
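P.S. To make the arithmetic in my reply concrete, here is a quick back-of-the-envelope sketch in Python. The function name is made up and the formula (k * log2(N) sketch centroids) is only the commonly cited estimate; Mahout's internal computation may differ in its constants.

```python
import math

def approx_sketch_centroids(k, n):
    """Rough estimate of the number of sketch centroids streaming
    k-means keeps internally: k * log2(N).
    Illustrative only; Mahout's exact formula may differ."""
    return int(k * math.log2(n))

n_docs = 300_000
# With k = 200,000 the sketch wants ~3.6 million centroids --
# more than ten times the number of documents.
print(approx_sketch_centroids(200_000, n_docs))
# With k = 1,000 the sketch is a sane ~18,000 centroids.
print(approx_sketch_centroids(1_000, n_docs))
```

The first number exceeds the dataset size by an order of magnitude, which is why the merge pass never finishes; the second is small enough to cluster quickly.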
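On question 1 in the quoted thread: yes, streaming k-means produces only centroids, so a final pass assigns each vector to its nearest centroid. A minimal pure-Python sketch of that pass (illustrative only -- not Mahout's API, and the function name is made up):

```python
def assign_to_nearest(vectors, centroids):
    """Return, for each vector, the index of the closest centroid
    by squared Euclidean distance. Illustrative sketch only."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda i: sqdist(v, centroids[i]))
            for v in vectors]

# Toy example: four points, two centroids.
vectors = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [4.9, 5.1]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
print(assign_to_nearest(vectors, centroids))  # [0, 0, 1, 1]
```

In practice this pass is embarrassingly parallel (one map over the vectors), so it is cheap compared to the clustering itself.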