Re: K-Means on Hadoop Cluster

Adri Gómez Mon, 26 May 2014 11:53:30 -0700

I think I understand.

If I format my data like the synthetic data example, with one vector per
line, do you think I could run that example on my data?



2014-05-25 18:39 GMT+02:00 Pat Ferrel <pat.fer...@gmail.com>:

> You need to create a Mahout distributed row matrix, which is one or more
> SequenceFiles of:
> <IntWritable>: <VectorWritable>
>
> The vector will have all your values; the IntWritable holds the Mahout
> ID/key for the vector. It is a positive ordinal. Usually this corresponds
> to some ID you already have for the vector, so you create a Mahout int for
> each new vector and put it in a dictionary that maps your ID to/from the
> Mahout ID. Then after clustering you map the Mahout ID back to yours.
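
A minimal sketch of the ID dictionary described above, in plain Java with no
Mahout or Hadoop classes. The class name IdDictionary and its methods are
hypothetical, and starting the ordinals at 0 is an assumption. The int
returned by keyFor is the value you would wrap in the IntWritable key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Mahout): maps your own IDs to int keys and back.
final class IdDictionary {
    private final Map<String, Integer> toKey = new HashMap<>();
    private final List<String> toId = new ArrayList<>();

    // Returns the key already assigned to externalId, or assigns the next ordinal.
    // Assumption: ordinals start at 0.
    int keyFor(String externalId) {
        Integer existing = toKey.get(externalId);
        if (existing != null) {
            return existing;
        }
        int key = toId.size();
        toKey.put(externalId, key);
        toId.add(externalId);
        return key;
    }

    // Maps an int key back to your own ID, e.g. after clustering.
    String idFor(int key) {
        return toId.get(key);
    }

    // Number of distinct IDs seen so far.
    int size() {
        return toId.size();
    }
}
```

Call keyFor once per input vector while writing the data, keep the dictionary
around, and call idFor on the keys that come back in the clustering output.
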
>
> The VectorWritable is created with a Vector. From what you describe, you
> would use a DenseVector implementation. If you have a lot of 0s you may
> want to give your columns Mahout IDs too and use sparse vectors to create a
> sparse matrix. All missing values are assumed to have a 0 value. This may
> improve the performance. It will also allow you to use an implementation of
> Vector called NamedVector, which allows you to put your ID in the Vector as
> a string to follow the vector through the calculations.
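
To illustrate the sparse idea above (store only the non-zero columns and
treat every missing column as 0), here is a rough sketch in plain Java. It
does not use Mahout's Vector classes; SparseRow and its methods are
hypothetical names, and it assumes each input line holds whitespace-separated
numbers like the sample row in the original question.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration (not a Mahout class): one row stored sparsely.
final class SparseRow {
    // Column index -> value, for non-zero entries only.
    private final Map<Integer, Double> nonZero = new TreeMap<>();
    private final int cardinality;

    private SparseRow(int cardinality) {
        this.cardinality = cardinality;
    }

    // Parses one line of whitespace-separated numbers, keeping only non-zero values.
    // Assumption: the line is not empty; an empty line throws NumberFormatException.
    static SparseRow parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        SparseRow row = new SparseRow(tokens.length);
        for (int col = 0; col < tokens.length; col++) {
            double value = Double.parseDouble(tokens[col]);
            if (value != 0.0) {
                row.nonZero.put(col, value);
            }
        }
        return row;
    }

    // Missing columns are read back as 0, as described above.
    double get(int col) {
        if (col < 0 || col >= cardinality) {
            throw new IndexOutOfBoundsException("column " + col);
        }
        return nonZero.getOrDefault(col, 0.0);
    }

    // Total number of columns in the row, including zeros.
    int cardinality() {
        return cardinality;
    }

    // Number of entries actually stored.
    int storedEntries() {
        return nonZero.size();
    }
}
```

For a row that is mostly zeros, storedEntries() is much smaller than
cardinality(), which is where the saving comes from. Whether that pays off
with 38 columns depends on how many of them are actually zero in your data.
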
>
>
> On May 24, 2014, at 11:35 AM, Adri Gómez <adri12...@gmail.com> wrote:
>
> Hello.
>
> First, sorry for my English.
>
> I'm new to Mahout and Hadoop. I want to run k-means clustering on Hadoop
> in pseudo-distributed mode. I have 5 million vectors in a .mat file, with
> 38 numeric features for each vector, like this: 0 0 1 0 0 0 0 0 0 0 0
> 0 ...
>
> I've run the examples that I've found, like Reuters (
> https://mahout.apache.org/users/clustering/k-means-clustering.html) or
> synthetic data. I know I have to convert these vectors to a SequenceFile,
> but I don't know if I have to do anything else first.
>
> I'm using Mahout 0.7 and Hadoop 1.2.1.
>
> Thanks.
>
> --
> *Gómez Muñoz, Adrián.*
>
>


-- 
*Gómez Muñoz, Adrián.*
