I think I understand. If I format my data like the synthetic data example, with one row per line, do you think I could run the example on my data?
2014-05-25 18:39 GMT+02:00 Pat Ferrel <pat.fer...@gmail.com>:

> You need to create a Mahout distributed row matrix, which is one or more
> SequenceFiles of:
> <IntWritable>: <VectorWritable>
>
> The vector will have all your values; the first IntWritable has the Mahout
> ID/key for the vector. It is a positive ordinal. Usually this corresponds
> to some ID you have for the vector, so you create a Mahout Int for each new
> vector and put it in a dictionary that maps your ID to/from the Mahout ID.
> Then after clustering you map the Mahout ID back to yours.
>
> The VectorWritable is created with a Vector. As you have stated things, you
> would use a DenseVector implementation. If you have a lot of 0s you may
> want to give your columns Mahout IDs too and use sparse vectors to create a
> sparse matrix. All missing values are assumed to have a 0 value. This may
> improve the performance. It will also allow you to use an implementation of
> Vector called NamedVector, which allows you to put your ID in the Vector as
> a string to follow the vector through the calculations.
>
>
> On May 24, 2014, at 11:35 AM, Adri Gómez <adri12...@gmail.com> wrote:
>
> Hello.
>
> First, sorry for my English.
>
> I'm a noob in Mahout and Hadoop. I want to run k-means clustering in
> Hadoop pseudo-distributed mode. I have 5 million vectors in a .mat file,
> with 38 numeric features for each vector, like this: 0 0 1 0 0 0 0 0 0 0 0
> 0 ...
>
> I've run the examples that I've found, like Reuters
> (https://mahout.apache.org/users/clustering/k-means-clustering.html) or
> synthetic data. I know I have to convert these vectors to a SequenceFile,
> but I don't know whether I need to do anything else first.
>
> I'm using Mahout 0.7 and Hadoop 1.2.1.
>
> Thanks.
>
> --
> *Gómez Muñoz, Adrián.*

--
*Gómez Muñoz, Adrián.*
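
For reference, the preparation step described in the quoted reply (assign each row an integer key, keep a dictionary back to your own ID, and optionally keep only the non-zero columns) could be sketched roughly as below. This is only a plain-Java illustration, not Mahout code: the class `RowMatrixPrep` and its methods are hypothetical names, starting the keys at 0 is an assumption, and the final step of wrapping each row in a Mahout Vector and writing <IntWritable>: <VectorWritable> pairs to a SequenceFile is deliberately left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of Mahout): prepares rows before they are
// wrapped in Mahout vectors and written to a SequenceFile.
class RowMatrixPrep {
    // Dictionary between your own row IDs and the integer keys.
    private final Map<String, Integer> toKey = new HashMap<>();
    private final List<String> toId = new ArrayList<>();

    // Returns the key already assigned to this ID, or assigns the next ordinal.
    // Assumption: keys start at 0 and increase by 1 per new ID.
    int keyFor(String externalId) {
        Integer key = toKey.get(externalId);
        if (key == null) {
            key = toId.size();
            toKey.put(externalId, key);
            toId.add(externalId);
        }
        return key;
    }

    // Maps an integer key back to your own ID, e.g. after clustering.
    String idFor(int key) {
        return toId.get(key);
    }

    // Parses one whitespace-separated line of numbers into dense values.
    static double[] parseDense(String line) {
        String[] tokens = line.trim().split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    // Keeps only the non-zero entries as (column index -> value);
    // columns that are absent from the map are treated as 0.
    static Map<Integer, Double> toSparse(double[] dense) {
        Map<Integer, Double> sparse = new TreeMap<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                sparse.put(i, dense[i]);
            }
        }
        return sparse;
    }

    public static void main(String[] args) {
        RowMatrixPrep prep = new RowMatrixPrep();
        // "row-a" is a made-up ID; the values are the sample row from the thread.
        int key = prep.keyFor("row-a");
        double[] dense = parseDense("0 0 1 0 0 0 0 0 0 0 0 0");
        System.out.println(key + " -> " + toSparse(dense) + " (" + prep.idFor(key) + ")");
    }
}
```

Whether the sparse form is worth it depends on how many of the 38 features are usually 0; with mostly non-zero rows the dense arrays are the simpler choice.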