Re: k-means issues

Suneel Marthi Thu, 01 Aug 2013 11:59:54 -0700

You also need to specify the distance measure '-dm' to clusterdump. This is the 
Distance Measure that was used for clustering.
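
As a rough sketch, this is the clusterdump command quoted further down in this 
thread with -dm appended. It assumes the CosineDistanceMeasure that was passed 
to kmeans in the original post; every other flag and path is copied unchanged 
from that command and may need adjusting for your setup. The variable names are 
only for illustration, and the snippet just prints the command (a dry run) so 
it can be checked before it is run.

```shell
# Assumption: clustering used CosineDistanceMeasure, as in the original kmeans call.
DISTANCE_MEASURE=org.apache.mahout.common.distance.CosineDistanceMeasure

# The clusterdump command from this thread, with -dm added at the end.
# Paths are the ones quoted in the thread, not verified locations.
CLUSTERDUMP_CMD="mahout clusterdump \
-d mahout/vectors/dictionary.file-0 -dt sequencefile \
-i mahout/kmeans-clusters/clusters-1-final/part-r-00000 \
-n 20 -b 100 -o cdump.txt \
-p mahout/kmeans-clusters/clusteredPoints \
-dm $DISTANCE_MEASURE"

# Dry run: print the assembled command instead of executing it.
echo "$CLUSTERDUMP_CMD"
```

Once the printed command looks right, it can be pasted into a shell where 
mahout is on the PATH.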

(Again, look at the example in /examples/bin/cluster-reuters.sh; it has all the 
steps you are trying to accomplish.)

________________________________
From: Marco <zentrop...@yahoo.co.uk>
To: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi 
<suneel_mar...@yahoo.com> 
Sent: Thursday, August 1, 2013 2:51 PM
Subject: Re: k-means issues

 mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i 
mahout/kmeans-clusters/clusters-1-final/part-r-00000 -n 20 -b 100 -o cdump.txt 
-p mahout/kmeans-clusters/clusteredPoints

----- Original message -----
From: Suneel Marthi <suneel_mar...@yahoo.com>
To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco 
<zentrop...@yahoo.co.uk>
Cc: 
Sent: Thursday, August 1, 2013 17:24
Subject: Re: k-means issues

Could you post the command line you are using for clusterdump?

________________________________
From: Marco <zentrop...@yahoo.co.uk>
To: "user@mahout.apache.org" <user@mahout.apache.org>; Suneel Marthi 
<suneel_mar...@yahoo.com> 
Sent: Thursday, August 1, 2013 10:29 AM
Subject: Re: k-means issues

OK, I added -cl and got clusteredPoints, but when I then run clusterdump I 
always get "Wrote 0 clusters".

----- Original message -----
From: Suneel Marthi <suneel_mar...@yahoo.com>
To: "user@mahout.apache.org" <user@mahout.apache.org>; Marco 
<zentrop...@yahoo.co.uk>
Cc: 
Sent: Thursday, August 1, 2013 16:04
Subject: Re: k-means issues

Check examples/bin/cluster-reuters.sh for kmeans (it exists in Mahout 0.7 too 
:))

You need to specify the clustering option -cl in your kmeans command. 
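
A minimal sketch of what that looks like, assuming the kmeans command from the 
original post below with -cl appended. All other flags and paths are copied 
from that post unchanged; the variable name is only for illustration, and the 
snippet prints the command (a dry run) rather than running it.

```shell
# The kmeans command from the original post, with -cl added so that the
# clustering step runs and clusteredPoints is written.
# Paths are the ones quoted in the thread, not verified locations.
KMEANS_CMD="mahout kmeans \
-i mahout/vectors/tfidf-vectors/ -k 10 \
-o mahout/kmeans-clusters \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-x 10 --clusters mahout/tmp \
-cl"

# Dry run: print the assembled command instead of executing it.
echo "$KMEANS_CMD"
```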

________________________________
From: Marco <zentrop...@yahoo.co.uk>
To: "user@mahout.apache.org" <user@mahout.apache.org> 
Sent: Thursday, August 1, 2013 9:55 AM
Subject: k-means issues

So I've got 13000 text files representing topics in certain newspaper articles.
Each file is just a tab-separated list of topics (so something like "china    
japan    senkaku    dispute" or "italy   lampedusa   immigration").

I want to run k-means clustering on them.

Here's what I do (I'm actually doing it on a subset of 100 files):

1) run seqdirectory to produce sequence file from raw text files
2) run seq2sparse to produce vectors from sequence file 

If I run seqdumper on tfidf-vectors/part-r-00000 I get something like:
Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476}
and if I run it on dictionary.file-0 I get:
Key class: class org.apache.hadoop.io.Text Value Class: class 
org.apache.hadoop.io.IntWritable
Key: china: Value: 0
Key: japan: Value: 1

3) I run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o 
mahout/kmeans-clusters -dm 
org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters 
mahout/tmp)
The first thing I notice here is that it logs:
INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 
org.apache.mahout.math.VectorWritable Input Vectors: {}
The "Input Vectors: {}" part puzzles me. 

Even worse, this doesn't seem to create the clusteredPoints directory at all.

What am I doing wrong?
