Success! Woo hoo.
On Jun 26, 2009, at 10:42 PM, Grant Ingersoll wrote:
So, the problem I'm having lies in RandomSeedGenerator: it writes out a Cluster, which calls Cluster.write(), which does:
AbstractVector.writeVector(out, computeCentroid());
Now computeCentroid() does:
if (numPoints == 0)
  return pointTotal; // numPoints is 0 for a freshly constructed Cluster
else if (centroid == null) {
  // lazy compute new centroid
  centroid = pointTotal.divide(numPoints);
  Vector stds = pointSquaredTotal.times(numPoints)
      .minus(pointTotal.times(pointTotal))
      .assign(new SquareRootFunction())
      .divide(numPoints);
  std = stds.zSum() / 2;
}
return centroid;
In the case of the RandomSeedGenerator, numPoints is always == 0
because the Cluster doesn't have any points added to it.
Furthermore, pointTotal is an empty Vector of the same size as the
center, due to the Cluster constructor:
super();
this.id = nextClusterId++;
this.center = center;
this.numPoints = 0;
this.pointTotal = center.like();  // empty vector, same cardinality as center
this.pointSquaredTotal = center.like();
The semantics of constructing a Cluster seem odd to me. Do I always
have to add a point to the Cluster immediately for it to be "real",
despite having supplied a center? Isn't supplying a center effectively
giving the Cluster one point?
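If not, the workaround that suggests itself is for RandomSeedGenerator to add the center back to the cluster right after constructing it, so that numPoints > 0 and computeCentroid() returns the center instead of the empty pointTotal. A minimal sketch of that idea (the addPoint() call and the package names are my assumptions about trunk, not verified):

// Sketch only: assumes Cluster(Vector) and an addPoint(Vector) method on
// org.apache.mahout.clustering.kmeans.Cluster, and DenseVector living in
// org.apache.mahout.matrix -- both assumptions about current trunk.
import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.matrix.DenseVector;
import org.apache.mahout.matrix.Vector;

public class SeedClusterSketch {
  public static void main(String[] args) {
    Vector center = new DenseVector(new double[] {1.0, 2.0, 3.0});
    Cluster cluster = new Cluster(center);
    // Without this, numPoints stays 0 and computeCentroid() hands back
    // the all-zero pointTotal, which is what Cluster.write() serializes.
    cluster.addPoint(center);
  }
}

That would also make the constructor semantics match the intuition that supplying a center gives the cluster its first point.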
On Jun 26, 2009, at 8:45 PM, Grant Ingersoll wrote:
Still no dice.
On Jun 26, 2009, at 7:59 PM, Grant Ingersoll wrote:
We need to handle that separately from the various jobs, then. That
was one of the things that was different about the KMeansJob call.
On Jun 26, 2009, at 7:45 PM, Jeff Eastman wrote:
Found that the call in syntheticcontrol/kmeans.Job had true for the
overwrite-output flag. I don't think that was your problem, but
something similar must be at work.
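For reference, an overwrite flag in a Hadoop job of this vintage usually boils down to a recursive delete of the output path before the run, which would take $output/data with it. A rough sketch of the usual pattern (the Mahout wiring is an assumption; the FileSystem calls are stock Hadoop):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OverwriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path output = new Path(args[0]);
    FileSystem fs = output.getFileSystem(conf);
    boolean overwrite = true; // what syntheticcontrol/kmeans.Job was passing
    if (overwrite && fs.exists(output)) {
      fs.delete(output, true); // recursive: removes $output/data as well
    }
  }
}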
Jeff Eastman wrote:
Running the latest trunk, I get a file-not-found exception on the
$output/data file when running synthetic control. It looks like the
output got deleted somewhere, but I haven't discovered where yet.
Perhaps Canopy is broken, or KMeans is purging its output?
Grant Ingersoll wrote:
I'm running trunk, using the data at http://people.apache.org/wikipedia/n2.tar.gz
(a dump of 2302 documents from a Lucene index of Wikipedia; the
chunks file in the same directory contains the original files).
The vectors are normalized using L2.
When I run K-Means on it via:

org.apache.mahout.clustering.kmeans.KMeansDriver \
  --input /Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/part-full.txt \
  --clusters /Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/clusters \
  --k 10 \
  --output /Users/grantingersoll/projects/lucene/solr/wikipedia/devWorks/n2/k-output \
  --distance org.apache.mahout.utils.CosineDistanceMeasure
I get the two directories seen in n2-output. The clusters-0 and
clusters-1 files both contain a single vector that is all zeros.
I've also tried SquaredEuclidean, but to no avail.
Any insight into what I'm doing wrong would be appreciated.
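One way to confirm what actually landed in clusters-0 is to dump the SequenceFile directly. A quick sketch, assuming the standard Hadoop SequenceFile layout (the key/value classes are read from the file header rather than hard-coded):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class DumpClusters {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]); // e.g. .../n2/k-output/clusters-0
    FileSystem fs = path.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value); // the all-zero vectors show up here
      }
    } finally {
      reader.close();
    }
  }
}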
Thanks,
Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search