Should be close.  The matrixMult step may be redundant if you want to
cluster the same data that you decomposed.  That would make the second
transpose unnecessary as well.
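For reference, the algebra behind the pipeline discussed below: projecting an n x d input matrix A onto a k x d eigenvector matrix E yields the n x k matrix S = A * E^T, which is why a direct timesTranspose would need no explicit transposes at all. A minimal plain-Java sketch (illustrative only, not the Mahout API; all names here are made up):

```java
// Illustrative sketch: S = A * E^T, where A is n x d and E is k x d.
// This is what the DSVD+kmeans pipeline computes via the extra transposes.
public class ProjectionSketch {

    // Multiplies A (n x d) by the transpose of E (k x d), giving S (n x k).
    static double[][] timesTranspose(double[][] a, double[][] e) {
        int n = a.length, d = a[0].length, k = e.length;
        double[][] s = new double[n][k];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < k; j++)
                for (int c = 0; c < d; c++)
                    s[i][j] += a[i][c] * e[j][c];
        return s;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};      // 2 x 2 data matrix
        double[][] e = {{1, 0}};              // single 1 x 2 "eigenvector"
        double[][] s = timesTranspose(a, e);  // 2 x 1 projected data
        System.out.println(s[0][0] + " " + s[1][0]); // prints 1.0 3.0
    }
}
```

Each output row is the original data row expressed in the reduced eigenvector basis, which is the input kmeans then clusters.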

On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll <[email protected]> wrote:

> To put this in bin/mahout speak, this would look like, munging some names
> and taking liberties with the actual arguments to be passed in:
>
> bin/mahout svd (original -> svdOut)
> bin/mahout cleansvd ...
> bin/mahout transpose svdOut -> svdT
> bin/mahout transpose original -> originalT
> bin/mahout matrixmult originalT svdT -> newMatrix
> bin/mahout kmeans newMatrix
>
> Is that about right?
>
>
> On Sep 3, 2010, at 11:19 AM, Jeff Eastman wrote:
>
> > Ok, the transposed computation seems to work and the cast exception was
> caused by my unit test writing LongWritable keys to the testdata file. The
> following test produces a comparable answer to the non-distributed case. I
> still want to rename the method to transposeTimes for clarity. And better,
> implement timesTranspose to make this particular computation more efficient:
> >
> >  public void testKmeansDSVD() throws Exception {
> >    DistanceMeasure measure = new EuclideanDistanceMeasure();
> >    Path output = getTestTempDirPath("output");
> >    Path tmp = getTestTempDirPath("tmp");
> >    Path eigenvectors = new Path(output, "eigenvectors");
> >    int desiredRank = 13;
> >    DistributedLanczosSolver solver = new DistributedLanczosSolver();
> >    Configuration config = new Configuration();
> >    solver.setConf(config);
> >    Path testData = getTestTempDirPath("testdata");
> >    int sampleDimension = sampleData.get(0).get().size();
> >    solver.run(testData, tmp, eigenvectors, sampleData.size(), sampleDimension, false, desiredRank);
> >
> >    // now multiply the testdata matrix and the eigenvector matrix
> >    DistributedRowMatrix svdT = new DistributedRowMatrix(eigenvectors, tmp, desiredRank - 1, sampleDimension);
> >    JobConf conf = new JobConf(config);
> >    svdT.configure(conf);
> >    DistributedRowMatrix a = new DistributedRowMatrix(testData, tmp, sampleData.size(), sampleDimension);
> >    a.configure(conf);
> >    DistributedRowMatrix sData = a.transpose().times(svdT.transpose());
> >    sData.configure(conf);
> >
> >    // now run the Canopy job to prime kMeans canopies
> >    CanopyDriver.runJob(sData.getRowPath(), output, measure, 8, 4, false, false);
> >    // now run the KMeans job
> >    KMeansDriver.runJob(sData.getRowPath(), new Path(output, "clusters-0"), output, measure, 0.001, 10, 1, true, false);
> >    // run ClusterDumper
> >    ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-2"), new Path(output, "clusteredPoints"));
> >    clusterDumper.printClusters(termDictionary);
> >  }
> >
> > On 9/3/10 7:54 AM, Jeff Eastman wrote:
> >> Looking at the single unit test of DMR.times() it seems to be
> implementing Matrix expected = inputA.transpose().times(inputB), and not
> inputA.times(inputB.transpose()), so the bounds checking is correct as
> implemented. But the method still has the wrong name and AFAICT is not
> useful for performing this particular computation. Should I use this
> instead?
> >>
> >> DistributedRowMatrix sData = a.transpose().t[ransposeT]imes(svdT.transpose())
> >>
> >> ugh! And it still fails with:
> >>
> >> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
> >>    at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:1)
> >>    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>
> --------------------------
> Grant Ingersoll
> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>
>
