Hi,
Just to clarify my question below, here is another example.
Let's say I am clustering on each user's monthly summarized data:
UserID, Transaction, Quantity, Discount
Question 1) If I input UserID, Transaction, Quantity, Discount into
k-means, will the output be accurate as ideally
Think about the question in terms of whether this will define a reasonable
kind of distance between items or users.
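To make the distance question concrete, here is a minimal sketch in plain Python. The three users and all their values are made up for illustration, and plain Euclidean distance is assumed as the measure. It compares distances computed with and without the UserID column.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical monthly summaries: (UserID, Transaction, Quantity, Discount).
alice = (1000000001, 12, 30, 5.0)
bob = (1000900001, 12, 30, 5.0)      # same behaviour as alice, unrelated ID
carol = (1000000002, 90, 400, 40.0)  # very different behaviour, adjacent ID

# With the ID included, the arbitrary identifier dominates the distance.
with_id_ab = euclidean(alice, bob)
with_id_ac = euclidean(alice, carol)

# Without the ID, only the behavioural fields contribute.
no_id_ab = euclidean(alice[1:], bob[1:])
no_id_ac = euclidean(alice[1:], carol[1:])

print("with ID:    alice-bob", with_id_ab, " alice-carol", with_id_ac)
print("without ID: alice-bob", no_id_ab, " alice-carol", no_id_ac)
```

With the ID included, alice looks closer to carol than to bob, even though alice and bob behave identically. Dropping the ID from the feature vector, and keeping it only as a label for reading the results, removes that effect.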
Can you first define what you want to do? Are you clustering users? Are
you clustering items?
If users, how could the data you provide give any kind of idea about which
users are
Hi Ted,
Thanks, this helped align my thinking. However, a new question came to mind.
Let's say I am clustering users, and I am providing their profile data to
discover similarity between two users.
So my input would be [UserId, Location, Age, Gender, Time Created]
Now if my UserId length is of minimum
On Mon, Feb 17, 2014 at 9:00 AM, Bikash Gupta bikash.gupt...@gmail.com wrote:
Let's say I am clustering users, and I am providing their profile data to
discover similarity between two users.
So my input would be [UserId, Location, Age, Gender, Time Created]
Now if my UserId length is of minimum 10
OK, so UserId is not a good field for this combination. But if I want
user clustering, what should the combination be (just for
understanding)?
On Tue, Feb 18, 2014 at 1:44 AM, Ted Dunning ted.dunn...@gmail.com wrote:
On Mon, Feb 17, 2014 at 9:00 AM, Bikash Gupta bikash.gupt...@gmail.com wrote:
That really depends on what you want to do.
What is it that you want?
On Mon, Feb 17, 2014 at 12:25 PM, Bikash Gupta bikash.gupt...@gmail.com wrote:
OK, so UserId is not a good field for this combination. But if I want
user clustering, what should the combination be (just for
understanding)?
I am using Mahout 0.8 embedded in CDH 5.0.0 provided by Cloudera and found
that the reduce phase of Mahout streamingkmeans is extremely slow.
For example:
With a dataset of 200 objects, 128 variables, I would like to get 1
clusters.
The command executed is as follows.
mahout streamingkmeans
Basically, I am trying to achieve customer segmentation.
Now, to measure customer similarity within a cluster, I need to
understand which two customers are similar.
Assumption: to identify these customers uniquely, I need to provide
their CustomerId.
Is my assumption correct? If so, will
Bikash,
As Ted already pointed out:
You can cluster on all variables except your customer_id. That's your
identifier.
Customers within a cluster are 'similar'; how similar depends on the
fidelity of your clustering.
The clustering algorithm uses (you'll choose) some kind of distance, or
Bikash,
Peter is exactly right.
Yes, you can cluster on these few variables that you have. Probably you
should translate location to x,y,z coordinates so that you don't have
strange geometry problems, but location, gender and age are quite
reasonable characteristics. You will get a fairly weak
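One way to read the suggestion about translating location to x,y,z coordinates is sketched below. It assumes the location is available as latitude and longitude in degrees, which the thread does not state, and the helper name `to_unit_sphere` is hypothetical. It maps each point onto the unit sphere so that points on either side of the 180° meridian come out close together instead of far apart.

```python
import math


def to_unit_sphere(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to (x, y, z) on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two made-up points on the equator, 0.2 degrees apart across the 180° meridian.
east = (0.0, 179.9)
west = (0.0, -179.9)

# Treating raw degrees as coordinates makes them look almost 360 units apart.
raw_distance = euclidean(east, west)

# On the unit sphere they are neighbours, as they are on the globe.
xyz_distance = euclidean(to_unit_sphere(*east), to_unit_sphere(*west))

print("raw degrees:", raw_distance)
print("unit sphere:", xyz_distance)
```

The x, y, z values all lie in [-1, 1], so they would still need to be scaled against age and the other fields before clustering; how to weight them is a modelling choice the thread leaves open.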