I see the error below:

Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/raghuveer/trunk/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
15/03/10 11:50:20 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[hdfs://master:54310/user/netlog/upload/mahoutoutput], --convergenceDelta=[0.5], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-00000], --maxIter=[5], --method=[mapreduce], --numClusters=[25], --output=[hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
15/03/10 11:50:21 INFO common.HadoopUtil: Deleting hdfs://master:54310/user/netlog/upload/mahoutoutput
15/03/10 11:50:21 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/03/10 11:50:21 INFO compress.CodecPool: Got brand-new compressor [.deflate]
15/03/10 11:50:21 INFO kmeans.RandomSeedGenerator: Wrote 25 Klusters to hdfs://master:54310/user/netlog/upload/mahoutoutput/part-randomSeed
15/03/10 11:50:21 INFO kmeans.KMeansDriver: Input: hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-00000 Clusters In: hdfs://master:54310/user/netlog/upload/mahoutoutput/part-randomSeed Out: hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer
15/03/10 11:50:21 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 5
15/03/10 11:50:21 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Exception in thread "main" java.lang.IllegalStateException: No input clusters found in hdfs://master:54310/user/netlog/upload/mahoutoutput/part-randomSeed. Check your -c argument.
    at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:213)
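The exception names the seed file, although the log reports that 25 clusters were written to it. A minimal sketch for looking at what is actually on HDFS might be the following. It assumes the `hadoop` and `mahout` launchers are on the PATH; the `RUN` variable and the `inspect` function are hypothetical conveniences for this sketch, not part of either tool. By default it only prints the commands.

```shell
#!/bin/sh
# Paths copied from the log above.
SEED=hdfs://master:54310/user/netlog/upload/mahoutoutput/part-randomSeed
INPUT=hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-00000

# RUN is a hypothetical dry-run switch for this sketch:
#   unset            -> the commands are only printed
#   set to "" (RUN=) -> the commands are executed
RUN=${RUN-echo}

inspect() {
  # Does the seed file exist, and how large is it?
  $RUN hadoop fs -ls "$SEED"
  # Dump the contents of the seed file and of the input vectors.
  $RUN mahout seqdumper -i "$SEED"
  $RUN mahout seqdumper -i "$INPUT"
}

inspect
```

Saved under a name of your choice (say inspect.sh), `RUN= sh inspect.sh` would execute the three commands instead of printing them.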
On Tuesday, March 10, 2015 11:53 AM, Raghuveer <alwaysra...@yahoo.com.INVALID> wrote:

I see the error below:

On Tuesday, March 10, 2015 11:45 AM, Suneel Marthi <suneel.mar...@gmail.com> wrote:

Try

./mahout kmeans -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-00000 -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters -c <some-folder> -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 5 -ow -cl -k 25

I don't have a machine in front of me, so I have no way to try this out. But IIRC the way this works is:

a) You specify an initial seed of centroids via -c. You then don't need to specify -k, since the number of centroids given as the seed is the k.
b) You let the algorithm choose random centroids by specifying -k. It needs -c as the place to write the random centroids to, hence -c is needed with -k.

On Tue, Mar 10, 2015 at 2:09 AM, Raghuveer <alwaysra...@yahoo.com> wrote:

OK, so if -c is required, how can I give it, or at least is there a way to remove -k itself?

./mahout kmeans -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-00000 -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 5 -ow -cl -k 25

and

./mahout kmeans -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-00000 -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 5 -ow -cl

both still give the same exception. Kindly suggest.

On Tuesday, March 10, 2015 11:35 AM, Suneel Marthi <suneel.mar...@gmail.com> wrote:

Oops! I meant to say that -c is required for the random centroid initialization if -k is specified. It initializes k random centroids in the folder specified by -c, so yes, -c is required.
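The two seeding modes discussed in this thread (existing centroids under -c, or -k random centroids written to -c) can be put side by side in a short sketch. The paths and flags are the ones already used in the thread; the helper functions `kmeans_cmd`, `with_given_seeds` and `with_random_seeds` are hypothetical names for illustration only, and they print the command line rather than run it, so nothing here has been tested against a cluster.

```shell
#!/bin/sh
# Paths and flags copied from the thread; adjust them to your own cluster.
INPUT=hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-00000
OUTPUT=hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer
SEEDS=hdfs://master:54310/user/netlog/upload/mahoutoutput
DM=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure

# Hypothetical helper: prints a kmeans command line instead of running it.
# Any arguments are appended after the flags common to both modes.
kmeans_cmd() {
  echo "./mahout kmeans -i $INPUT -o $OUTPUT -c $SEEDS -dm $DM -x 5 -ow -cl${1:+ $*}"
}

# Case (a): $SEEDS already holds the initial centroids, so -k is left out.
with_given_seeds() { kmeans_cmd; }

# Case (b): -k 25 asks for 25 random centroids, which are written to $SEEDS first.
with_random_seeds() { kmeans_cmd -k 25; }

with_given_seeds
with_random_seeds
```

In both modes -c stays on the command line; only the presence of -k differs. To launch a job for real, the printed line would be run from Mahout's bin directory.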
On Tue, Mar 10, 2015 at 1:42 AM, Raghuveer <alwaysra...@yahoo.com.invalid> wrote:

No, I have removed the -c option now, so I get the mentioned exception that -c is mandatory.

On Tuesday, March 10, 2015 11:06 AM, Suneel Marthi <suneel.mar...@gmail.com> wrote:

Are you still specifying the -c option? It's only needed if you have initial centroids to launch KMeans from; otherwise KMeans picks random centroids. Also, CosineDistanceMeasure doesn't make sense with k-means, which is in Euclidean space. Try using SquaredEuclidean or Euclidean distances.

On Tue, Mar 10, 2015 at 1:27 AM, Raghuveer <alwaysra...@yahoo.com.invalid> wrote:

> Hi All,
> I am trying to run the command:
>
> ./mahout kmeans -i hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-00000 -o hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer -c hdfs://master:54310/user/netlog/upload/mahoutoutput -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25 -xm mapreduce
>
> Since I don't have any clusters yet to give it as input, forums suggested that I can remove -c. But now I get the error:
>
> Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
> MAHOUT-JOB: /home/raghuveer/trunk/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
> 15/03/10 10:52:53 ERROR common.AbstractJob: Missing required option --clusters
> Missing required option --clusters
>
> Usage:
>  [--input <input> --output <output> --distanceMeasure <distanceMeasure>
>  --clusters <clusters> --numClusters <k> --randomSeed <randomSeed1>
>  [<randomSeed2> ...] --convergenceDelta <convergenceDelta> --maxIter <maxIter>
>  --overwrite --clustering --method <method> --outlierThreshold
>  <outlierThreshold> --help --tempDir <tempDir> --startPhase <startPhase>
>  --endPhase <endPhase>]
>  --clusters (-c) clusters    The input centroids, as Vectors. Must be a
>                              SequenceFile of Writable, Cluster/Canopy. If k is
>                              also specified, then a random set of vectors will
>                              be selected and written out to this path first
> 15/03/10 10:52:53 INFO driver.MahoutDriver: Program took 370 ms (Minutes: 0.006166666666666667)
>
> Kindly help me out.
> Thanks