Performance issue in CanopyClusterer

WangRamon Mon, 07 Nov 2011 00:30:27 -0800



Hi All,

I'm using CanopyClusterer. The input is vectors of type
RandomAccessSparseVector, and each vector may have 1 to 99 attributes. When I
run CanopyClusterer on Hadoop it is very slow, so I took a stack trace of the
map tasks and found the following output:

        at org.apache.mahout.clustering.AbstractCluster.formatVector(AbstractCluster.java:301)
        at org.apache.mahout.clustering.canopy.CanopyClusterer.addPointToCanopies(CanopyClusterer.java:161)

Line 161 of CanopyClusterer is just a log output statement. It should be
wrapped in something like "if (log.isDebugEnabled())" so that it does not run
when the log level is not debug. That is not the root cause, though: the root
cause in my case is that AbstractCluster.formatVector is very slow to complete.
After I commented out the "AbstractCluster.formatVector" call, everything ran
well. Could anybody have a look at this? Thank you very much.

Cheers,
Ramon
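
To illustrate the log guard suggested above, here is a minimal sketch. It is not the Mahout source: the class name GuardedLogging, the formatVector stand-in and the formatCalls counter are all hypothetical, and it uses java.util.logging so that it runs without extra dependencies (isLoggable(Level.FINE) plays the role that log.isDebugEnabled() would play with the logger named in the post). It only shows that an unguarded log call still pays for building the message, while a guarded one skips it; it does not make the formatting itself any faster.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardedLogging {
    private static final Logger LOG = Logger.getLogger(GuardedLogging.class.getName());

    // Counts how often the (hypothetical) expensive formatter actually runs.
    static int formatCalls = 0;

    // Hypothetical stand-in for an expensive formatter such as
    // AbstractCluster.formatVector: it visits every element to build a string.
    static String formatVector(double[] v) {
        formatCalls++;
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < v.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(String.format("%.3f", v[i]));
        }
        return sb.append(']').toString();
    }

    // Unguarded: the message string is built before the logger checks its
    // level, so formatVector runs even when FINE (debug) output is disabled.
    static void addPointUnguarded(double[] point) {
        LOG.fine("Added point: " + formatVector(point));
    }

    // Guarded: formatVector runs only when FINE (debug) output is enabled.
    static void addPointGuarded(double[] point) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("Added point: " + formatVector(point));
        }
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // debug-level output disabled
        double[] p = {1.0, 2.0, 3.0};

        for (int i = 0; i < 1000; i++) {
            addPointUnguarded(p);
        }
        System.out.println("unguarded format calls: " + formatCalls);

        formatCalls = 0;
        for (int i = 0; i < 1000; i++) {
            addPointGuarded(p);
        }
        System.out.println("guarded format calls: " + formatCalls);
    }
}
```

With the level set to INFO, the unguarded loop still calls the formatter once per point, while the guarded loop never calls it.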
