Re: [Edit] Approach for Clustering Data

2014-02-17 Thread Bikash Gupta
Hi, Just to clarify my question below, I am citing another example. Let's say I will be clustering on each user's monthly summarized data: UserID, Transaction, Quantity, Discount. Question 1) If I input UserID, Transaction, Quantity, Discount into k-means, will the output be accurate, as ideally

Re: [Edit] Approach for Clustering Data

2014-02-17 Thread Ted Dunning
Think about the question in terms of whether this will define a reasonable kind of distance between items or users. Can you first define what you want to do? Are you clustering users? Are you clustering items? If users, how could the data you provide give any kind of idea about which users are

Re: [Edit] Approach for Clustering Data

2014-02-17 Thread Bikash Gupta
Hi Ted, Thanks, this helped align my thinking. However, a new question came to mind. Let's say I am clustering users and providing their profile data to discover the similarity between two users. So my input would be [UserId, Location, Age, Gender, Time Created]. Now if my UserId length is of minimum

Re: [Edit] Approach for Clustering Data

2014-02-17 Thread Ted Dunning
On Mon, Feb 17, 2014 at 9:00 AM, Bikash Gupta bikash.gupt...@gmail.com wrote: Let's say I am clustering users and providing their profile data to discover the similarity between two users. So my input would be [UserId, Location, Age, Gender, Time Created]. Now if my UserId length is of minimum 10

Re: [Edit] Approach for Clustering Data

2014-02-17 Thread Bikash Gupta
OK, so UserId is not a good field for this combination, but if I want user clustering, what should the combination be (just for understanding)? On Tue, Feb 18, 2014 at 1:44 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Mon, Feb 17, 2014 at 9:00 AM, Bikash Gupta bikash.gupt...@gmail.com wrote:

Re: [Edit] Approach for Clustering Data

2014-02-17 Thread Ted Dunning
That really depends on what you want to do. What is it that you want? On Mon, Feb 17, 2014 at 12:25 PM, Bikash Gupta bikash.gupt...@gmail.com wrote: OK, so UserId is not a good field for this combination, but if I want user clustering, what should the combination be (just for understanding)?

reduce is too slow in StreamingKmeans

2014-02-17 Thread Sylvia Ma
I am using Mahout 0.8 embedded in CDH 5.0.0 provided by Cloudera and found that the reduce phase of Mahout streamingkmeans is extremely slow. For example: with a dataset of 200 objects and 128 variables, I would like to get 1 cluster. The command executed is as follows: mahout streamingkmeans

Re: [Edit] Approach for Clustering Data

2014-02-17 Thread Bikash Gupta
Basically, I am trying to achieve customer segmentation. Now, to measure customer similarity within a cluster, I need to understand which two customers are similar. Assumption: To identify these customers uniquely, I need to provide their CustomerId. Is my assumption correct? If yes, then will

Re: [Edit] Approach for Clustering Data

2014-02-17 Thread Peter Jaumann
Bikash, As Ted already pointed out, you can cluster on all variables except your customer_id. That's your identifier. Customers within a cluster are 'similar'; how similar depends on the fidelity of your clustering. The clustering algorithm uses (you'll choose) some kind of distance, or
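
A minimal sketch of the point above, assuming hypothetical field names and made-up values (none of them come from the thread): the identifier is carried alongside each record as a label, and only the remaining variables enter the distance computation. Euclidean distance is used here purely for illustration; the thread does not say which distance was chosen.

```python
import math

# Hypothetical monthly summaries; field names and values are invented for illustration.
records = [
    {"customer_id": 1000000001, "transactions": 12, "quantity": 30, "discount": 5.0},
    {"customer_id": 9999999999, "transactions": 11, "quantity": 28, "discount": 4.5},
    {"customer_id": 1000000002, "transactions": 2, "quantity": 3, "discount": 0.0},
]

# Only these fields are used as clustering features; customer_id is deliberately excluded.
FEATURES = ["transactions", "quantity", "discount"]


def split_id_and_features(record):
    """Return (identifier, feature vector); the identifier is a label, not a feature."""
    return record["customer_id"], [float(record[name]) for name in FEATURES]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


ids, vectors = zip(*(split_id_and_features(r) for r in records))

# The first two customers have very different ids but similar behaviour,
# so they end up close once the id is kept out of the distance.
print(ids[0], ids[1], euclidean(vectors[0], vectors[1]))
print(ids[0], ids[2], euclidean(vectors[0], vectors[2]))
```

If the id were included as a feature, its magnitude would dominate the distance and customers would be grouped by id value rather than by behaviour. After clustering, the ids are still available to report which customers landed in which cluster.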

Re: [Edit] Approach for Clustering Data

2014-02-17 Thread Ted Dunning
Bikash, Peter is just right. Yes, you can cluster on these few variables that you have. Probably you should translate location to x,y,z coordinates so that you don't have strange geometry problems, but location, gender and age are quite reasonable characteristics. You will get a fairly weak
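
One way to read the suggestion about translating location to x,y,z coordinates is sketched below, assuming that location is available as latitude and longitude in degrees (the thread does not say how location is stored). Points are mapped onto the unit sphere so that Euclidean distance behaves sensibly across the date line and near the poles. The function name is hypothetical.

```python
import math


def latlon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return (x, y, z)


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Two points on the equator on either side of the date line.
east = latlon_to_xyz(0.0, 179.9)
west = latlon_to_xyz(0.0, -179.9)

# Raw longitudes differ by 359.8 degrees, yet the points are physically close;
# the x,y,z representation reflects that.
print(euclidean(east, west))
```

With raw latitude and longitude as features, these two points would look almost maximally far apart in longitude, which is the kind of geometry problem the x,y,z form avoids.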