I see the error below:

On Tuesday, March 10, 2015 11:45 AM, Suneel Marthi <suneel.mar...@gmail.com> wrote:

Try

./mahout kmeans -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-00000 -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters -c <some-folder> -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 5 -ow -cl -k 25

I don't have a machine in front of me, so I have no way to try this out. But IIRC the way this works is:

a) You specify an initial seed of centroids via -c. You then don't need to specify k, since the number of centroids specified as the seed is k.
b) You let the algorithm choose random centroids by specifying -k. It needs -c as a place to write the random centroids to, hence -c is needed with -k.

On Tue, Mar 10, 2015 at 2:09 AM, Raghuveer <alwaysra...@yahoo.com> wrote:

OK, so if -c is required, how can I give it, or at least is there a way to remove -k itself?

./mahout kmeans -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-00000 -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 5 -ow -cl -k 25

and

./mahout kmeans -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-00000 -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 5 -ow -cl

both still give the same exception. Kindly suggest.

On Tuesday, March 10, 2015 11:35 AM, Suneel Marthi <suneel.mar...@gmail.com> wrote:

Oops! I meant to say that -c is required for the random centroid initialization if -k is specified. It initializes k random centroids in the folder specified by -c, so yes, -c is required.

On Tue, Mar 10, 2015 at 1:42 AM, Raghuveer <alwaysra...@yahoo.com.invalid> wrote:

No, I have removed the -c option now, so I get the mentioned exception that -c is mandatory.

On Tuesday, March 10, 2015 11:06 AM, Suneel Marthi <suneel.mar...@gmail.com> wrote:

Are you still specifying the -c option? It's only needed if you have initial centroids to launch KMeans from; otherwise KMeans picks random centroids. Also, CosineDistanceMeasure doesn't make sense with KMeans, which works in Euclidean space. Try using the SquaredEuclidean or Euclidean distance instead.

On Tue, Mar 10, 2015 at 1:27 AM, Raghuveer <alwaysra...@yahoo.com.invalid> wrote:

> Hi All,
> I am trying to run the command:
>
> ./mahout kmeans -i hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-00000 -o hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer -c hdfs://master:54310/user/netlog/upload/mahoutoutput -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25 -xm mapreduce
>
> Since I don't have any clusters yet to give it as input, the forums suggested I can remove it. But now I get the error:
>
> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> MAHOUT-JOB: /home/raghuveer/trunk/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
> 15/03/10 10:52:53 ERROR common.AbstractJob: Missing required option --clusters
> Missing required option --clusters
>
> Usage:
>  [--input <input> --output <output> --distanceMeasure <distanceMeasure>
>  --clusters <clusters> --numClusters <k> --randomSeed <randomSeed1>
>  [<randomSeed2> ...] --convergenceDelta <convergenceDelta> --maxIter <maxIter>
>  --overwrite --clustering --method <method> --outlierThreshold
>  <outlierThreshold> --help --tempDir <tempDir> --startPhase <startPhase>
>  --endPhase <endPhase>]
>   --clusters (-c) clusters    The input centroids, as Vectors. Must be a
>                               SequenceFile of Writable, Cluster/Canopy. If k is
>                               also specified, then a random set of vectors will
>                               be selected and written out to this path first
> 15/03/10 10:52:53 INFO driver.MahoutDriver: Program took 370 ms (Minutes: 0.006166666666666667)
>
> Kindly help me out.
> Thanks
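
The two seeding modes discussed in this thread (seed centroids already present at the -c location, or -k given so that random vectors are picked and written to the -c location first) can be sketched as a toy model in Python. This is not Mahout's code: the function name `seed_centroids`, its parameters, and the use of an in-memory list as a stand-in for the -c path are all assumptions made purely for illustration.

```python
import random


def seed_centroids(vectors, clusters, k=None, rng=None):
    """Return the initial centroids for a toy k-means run.

    vectors  -- list of input points (hypothetical stand-in for the -i data)
    clusters -- mutable list (hypothetical stand-in for the -c path)
    k        -- optional number of clusters (stand-in for -k)
    rng      -- optional random.Random instance, for reproducible picks
    """
    if k is not None:
        if not 0 < k <= len(vectors):
            raise ValueError("k must be between 1 and the number of input vectors")
        rng = rng or random.Random()
        # Mode (b): pick k random input vectors and write them to the
        # -c stand-in first, replacing whatever was there.
        clusters[:] = rng.sample(vectors, k)
    elif not clusters:
        # Neither seeds nor k: nothing to start from.
        raise ValueError("without k, the clusters location must already hold seed centroids")
    # Mode (a): the number of seeds found at the -c stand-in is the effective k.
    return list(clusters)
```

In this model the clusters argument is needed in both modes, which matches the behaviour reported above: in mode (a) it is read from, and in mode (b) it is written to before clustering starts.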