Hi Valerio,

All the Mahout clustering implementations operate over Hadoop sequence files of the Mahout type VectorWritable. These files represent dense or sparse numeric vectors, which can additionally be wrapped in NamedVector to attach vector names to the data set. If you can run Hadoop jobs or call Java from Weka, then you may be able to use our code directly; look at the driver class under each algorithm for entry points. If all else fails, we also have a command-line interface.
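For the command-line route, a k-means run might look roughly like the sketch below. The HDFS paths and parameter values are placeholders I've made up for illustration; check `bin/mahout kmeans --help` in your Mahout install for the exact options your version supports.

```shell
# Run k-means over a directory of VectorWritable sequence files.
# All paths below are hypothetical examples.
bin/mahout kmeans \
  -i /user/valerio/vectors \
  -c /user/valerio/initial-clusters \
  -o /user/valerio/kmeans-output \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -k 10 \
  -x 20 \
  -cl
```

Here `-i` is the input vectors directory, `-c` the initial cluster centers (with `-k` given, k random seeds are chosen for you), `-o` the output, `-x` the maximum number of iterations, and `-cl` asks for the clusteredPoints output as well.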

All the clustering jobs accept VectorWritable input files and produce Hadoop directories (clusters-i) containing the clusters produced by each clustering iteration, plus an optional directory (clusteredPoints) containing sequence files of clustered points. These are keyed by clusterId and hold WeightedVectorWritable wrappers around the original input vectors; the weight encodes the pdf of the cluster assignment.
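To see what actually landed in those directories, Mahout ships dump utilities you can point at the output. Again, the paths below are placeholders, and the exact directory name of the final iteration depends on how many iterations ran:

```shell
# Dump the raw clustered points (clusterId -> WeightedVectorWritable).
# Paths are hypothetical examples.
bin/mahout seqdumper -i /user/valerio/kmeans-output/clusteredPoints

# Pretty-print the clusters from the last iteration together with
# their assigned points.
bin/mahout clusterdump \
  -i /user/valerio/kmeans-output/clusters-10 \
  -p /user/valerio/kmeans-output/clusteredPoints \
  -o clusters.txt
```

seqdumper is handy for sanity-checking that your input vectors were written correctly in the first place, too.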

Hope this helps,
Jeff

On 8/27/10 12:06 PM, Valerio wrote:
hi all,

I need a guide that explains how to use Mahout with the k-means algorithm, and first of all, what type of dataset does Mahout use?
I'm doing my thesis and I must run k-means clustering in Weka, but Weka must call Hadoop in the background to parallelize the job. I discovered that Mahout runs k-means on Hadoop, so I will call it from Weka, but I don't understand what type of files Mahout's k-means reads as input and how it works.

Can someone help me?

Thanks all,
Valerio Ceraudo
