Hi all,

I'm using CanopyClusterer with input vectors of type RandomAccessSparseVector; each vector has between 1 and 99 attributes. When I run CanopyClusterer on Hadoop it is very slow, so I took stack traces of the map tasks and found the following frames:

    at org.apache.mahout.clustering.AbstractCluster.formatVector(AbstractCluster.java:301)
    at org.apache.mahout.clustering.canopy.CanopyClusterer.addPointToCanopies(CanopyClusterer.java:161)

Line 161 of CanopyClusterer is just a log statement. It should be wrapped in a check such as "if (log.isDebugEnabled())" so that it does not run when the log level is not debug. That is not the root cause, though: the root cause in my case is that AbstractCluster.formatVector is very slow to complete. After I commented out the call to AbstractCluster.formatVector, everything ran well. Could anybody have a look at this? Thank you very much.

Cheers,
Ramon
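
P.S. To illustrate the kind of guard I mean, here is a minimal, self-contained sketch. It is not the Mahout code: GuardedLogDemo, expensiveFormat, addPointUnguarded, addPointGuarded and countFormatCalls are hypothetical names, expensiveFormat only stands in for AbstractCluster.formatVector, and I use java.util.logging (isLoggable(Level.FINE)) in place of the log.isDebugEnabled() check so that it runs without extra jars. It just counts how often the formatter is invoked when the debug level is off.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical demo class; not part of Mahout.
class GuardedLogDemo {
    private static final Logger LOG = Logger.getLogger(GuardedLogDemo.class.getName());

    // Counts how often the expensive formatter actually runs.
    static int formatCalls = 0;

    // Stand-in (assumption) for a costly formatter such as AbstractCluster.formatVector.
    static String expensiveFormat(double[] values) {
        formatCalls++;
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(i).append(':').append(values[i]);
        }
        return sb.append(']').toString();
    }

    // Unguarded: the message argument is built before the logger checks its level.
    static void addPointUnguarded(double[] point) {
        LOG.fine("Added point: " + expensiveFormat(point));
    }

    // Guarded: the formatter is only called when FINE (debug-like) logging is enabled.
    static void addPointGuarded(double[] point) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("Added point: " + expensiveFormat(point));
        }
    }

    // Runs the chosen variant with debug-level logging disabled and
    // returns how many times the formatter was invoked.
    static int countFormatCalls(boolean guarded, int points) {
        LOG.setLevel(Level.INFO); // FINE messages are disabled
        formatCalls = 0;
        double[] point = {1.0, 2.0, 3.0};
        for (int i = 0; i < points; i++) {
            if (guarded) {
                addPointGuarded(point);
            } else {
                addPointUnguarded(point);
            }
        }
        return formatCalls;
    }

    public static void main(String[] args) {
        System.out.println("unguarded format calls: " + countFormatCalls(false, 1000)); // 1000
        System.out.println("guarded format calls:   " + countFormatCalls(true, 1000));  // 0
    }
}
```

With the level set to INFO, the unguarded variant still pays for formatting every point even though nothing is logged, while the guarded variant skips it entirely. The guard does not make the formatter itself any faster; it only avoids calling it when debug logging is off.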