I think you are on the right track but I have some suggestions:

- How many shops do you have in your DB? Unless you have billions of them, you can likely run the sequential (-xm sequential) algorithms, which run locally and are much faster.
- You will want to produce NamedVectors from your database, with the shop_id as the name and the category vectors as the delegate. I'm not sure whether the Mahout ARFF converter will do this for you. It may be simpler to write your own converter using org.apache.mahout.clustering.conversion.InputDriver/Mapper as prototypes. These convert space-delimited files to Mahout Vectors but do not produce NamedVectors. Nor do they produce a dictionary file, but your categories seem simple enough to forgo that.
- Once you have created a directory of NamedVector sequence files, you should be able to cluster them easily.
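To make the second suggestion more concrete, here is a minimal sketch of the pivot step in plain Java: it turns (shop_id, customer_category, num_of_purchases) rows into one array of counts per shop. The class ShopVectors, its nested Purchase type, and the pivot method are hypothetical names made up for this illustration; they are not part of Mahout, and the fixed category list is an assumption standing in for a dictionary file.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Mahout): pivots database rows into
// one dense array of purchase counts per shop.
final class ShopVectors {

    // One database row: shop_id, customer_category, num_of_purchases.
    static final class Purchase {
        final String shopId;
        final String category;
        final int count;

        Purchase(String shopId, String category, int count) {
            this.shopId = shopId;
            this.category = category;
            this.count = count;
        }
    }

    // Returns shop_id -> counts, where counts[i] belongs to categories.get(i).
    // Categories with no row for a given shop stay at 0.
    static Map<String, double[]> pivot(List<Purchase> rows, List<String> categories) {
        // Assumed fixed category order; it plays the role of a dictionary.
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String c : categories) {
            index.put(c, index.size());
        }

        Map<String, double[]> vectors = new LinkedHashMap<>();
        for (Purchase p : rows) {
            Integer i = index.get(p.category);
            if (i == null) {
                throw new IllegalArgumentException("Unknown category: " + p.category);
            }
            double[] v = vectors.computeIfAbsent(p.shopId, k -> new double[categories.size()]);
            v[i] += p.count; // repeated (shop, category) rows are summed
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<Purchase> rows = List.of(
            new Purchase("shop #1", "A", 1),
            new Purchase("shop #1", "B", 10),
            new Purchase("shop #1", "D", 20));

        Map<String, double[]> vectors = pivot(rows, List.of("A", "B", "C", "D"));
        vectors.forEach((shop, v) -> System.out.println(shop + " = " + Arrays.toString(v)));
    }
}
```

Each resulting array could then be wrapped in a NamedVector with the shop_id as its name and written to a sequence file. That step is left out here because it depends on the Mahout and Hadoop versions you are running.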
Smooth sailing,
Jeff

-----Original Message-----
From: Clément Notin [mailto:[email protected]]
Sent: Wednesday, August 03, 2011 7:03 AM
To: [email protected]
Subject: Am I starting right with clustering ?

Hello,

I'm new to the Mahout world and it seems really nice, but it's hard to find easy documentation :(

I'm trying to run some clustering. Let me explain what I'm trying to achieve.

I have a DB with columns:
shop_id (string), customer_category (string), num_of_purchases (integer)

What I want to do is discover groups of shops which are related because they have some customer categories in common.

I think the vectors should be:
"shop #1" = (1, 10, 0, 20)
which means that customer category A has bought 1 thing in the shop, customer category B has bought 10 things in the shop, and so on.

In my DB for this example I have:

shop_id   | customer_category | num_of_purchases
----------+-------------------+-----------------
"shop #1" | "A"               | 1
"shop #1" | "B"               | 10
"shop #1" | "D"               | 20

I think I must convert this to an ARFF file like:

@RELATION purchases

@ATTRIBUTE shop_id STRING
@ATTRIBUTE catA NUMERIC
@ATTRIBUTE catB NUMERIC
@ATTRIBUTE catC NUMERIC
@ATTRIBUTE catD NUMERIC

@DATA
"shop #1",1,10,0,20
...

Why an ARFF file? Because I can use the helpful sparse syntax. But it's difficult to build this file. I think I should write a script.

My question is, am I heading in the right direction? I would appreciate some help!

Thanks :)

Regards,

--
Clément Notin
