Re: [Edit] Approach for Clustering Data

Suneel Marthi Tue, 18 Feb 2014 00:46:28 -0800




On Tuesday, February 18, 2014 3:37 AM, Bikash Gupta <bikash.gupt...@gmail.com> 
wrote:
 
Ted/Peter,

Thanks for the response.

This is exactly what I am trying to achieve. Maybe I was not able to
state my questions clearly.

I am clustering on a few variables of the Customer/User (excluding the
customer_id/user_id) and storing the customer_id/user_id list in a
separate place.

Question) What is the approach for identifying each member of each cluster
by its unique ID?
Answer) I have to run a script post-clustering that maps the
customer_id/user_id onto the clustered output to identify each member
uniquely.

>> If you are working off of Mahout 0.9, you don't have to do that. The clustered 
>> output should display the vectors with the vector ID (user_id in your case) 
>> that belong to a specific cluster, along with the distance of each vector from 
>> the cluster center.

Correct me if I am wrong :)


On Tue, Feb 18, 2014 at 10:53 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Bikash,
>
> Peter is just right.
>
> Yes, you can cluster on these few variables that you have.  Probably you
> should translate location to x,y,z coordinates so that you don't have
> strange geometry problems, but location, gender and age are quite
> reasonable characteristics.  You will get a fairly weak clustering since
> these characteristics actually tell very little about people, but it is a
> start.
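
A minimal sketch of the location-to-x,y,z translation mentioned above, assuming the location is available as latitude and longitude in degrees (the thread does not say how location is stored). The function name latlon_to_xyz and the sample points are illustrative only. Mapping onto the unit sphere avoids the wrap-around at the 180-degree meridian that raw longitudes suffer from.

```python
import math

def latlon_to_xyz(lat_deg, lon_deg):
    """Map a latitude/longitude pair (in degrees) to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return (x, y, z)

# Two hypothetical points on either side of the 180-degree meridian.
# Their raw longitudes differ by 359 degrees, yet they are neighbours on the globe.
west = latlon_to_xyz(0.0, -179.5)
east = latlon_to_xyz(0.0, 179.5)

# Straight-line (chord) distance between the two points in x,y,z space.
chord = math.dist(west, east)
print(f"chord distance: {chord:.4f}")
```

The three coordinates can then be used as ordinary numeric features alongside age and gender.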
>
> You *don't* want to cluster using user ID for exactly the reasons that
> Peter mentioned.  Another way to put it is that the user ID tells you
> absolutely nothing about the person and thus is not useful for the
> clustering.
>
> You *do* have to retain the assignment of users to cluster and that
> assignment is usually stored as a list of user IDs for each cluster.  This
> does not at all imply, however, that the user ID was used to *form* the
> cluster.
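
A minimal sketch of that separation, not tied to Mahout: the IDs travel next to the feature vectors but never enter the distance computation, and the result is a list of IDs per cluster. The tiny k-means below, the customer IDs, and the two pre-scaled feature values are all hypothetical stand-ins for whatever clustering tool and data are actually used.

```python
import math
from collections import defaultdict

def kmeans(points, k, iterations=20):
    """Tiny Lloyd-style k-means; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical records: (customer_id, feature vector). Only the features are clustered.
records = [
    ("cust-001", (0.10, 0.20)),
    ("cust-002", (0.15, 0.22)),
    ("cust-003", (0.90, 0.85)),
    ("cust-004", (0.88, 0.91)),
]
ids = [cid for cid, _ in records]
features = [feats for _, feats in records]

labels, centers = kmeans(features, k=2)

# Re-attach the IDs by position: one list of customer IDs per cluster.
members_by_cluster = defaultdict(list)
for cid, label in zip(ids, labels):
    members_by_cluster[label].append(cid)

for label, members in sorted(members_by_cluster.items()):
    print(label, members)
```

The join back to the IDs relies only on the input order being preserved, which is why no post-hoc matching on feature values is needed.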
>
>
>
>
> On Mon, Feb 17, 2014 at 9:01 PM, Peter Jaumann 
> <peter.jauma...@gmail.com> wrote:
>
>> Bikash,
>> As Ted pointed out already......
>> You can cluster on all variables except your customer_id. That's your
>> identifier.
>> Customers within a cluster are 'similar'; how similar depends on the
>> fidelity of your clustering.
>> The clustering algorithm uses (you'll choose) some kind of distance, or
>> similarity/dissimilarity
>> measure (which one to use depends on the type of data you have). This
>> measure will,
>> eventually, determine how separate/how unique your clusters are. The goal is
>> to have your clusters distinct
>> from each other but have the cluster members, within a cluster, as similar
>> as possible.
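
One way to picture "depends on the type of data": a rough sketch of a Gower-style dissimilarity for records that mix numeric and categorical fields. This is only one possible choice and not something the thread prescribes. The function name, the field names, and the assumed age range of 60 years are hypothetical.

```python
def mixed_dissimilarity(a, b, numeric_ranges):
    """Average per-field dissimilarity between two records given as dicts.

    Fields listed in numeric_ranges contribute |a - b| / range;
    every other field is treated as categorical and contributes 0 or 1.
    """
    total = 0.0
    for field in a:
        if field in numeric_ranges:
            total += abs(a[field] - b[field]) / numeric_ranges[field]
        else:
            total += 0.0 if a[field] == b[field] else 1.0
    return total / len(a)

ranges = {"age": 60.0}  # assumed span of ages in the data set

p = {"age": 25, "gender": "F"}
q = {"age": 31, "gender": "F"}  # close in age, same gender
r = {"age": 25, "gender": "M"}  # same age, different gender

print(mixed_dissimilarity(p, q, ranges))
print(mixed_dissimilarity(p, r, ranges))
```

Scaling each numeric field by its range keeps any single field from dominating simply because of its units.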
>>
>> In the output, each member in each cluster is uniquely identified by its
>> customer_id, the cluster it belongs to,
>> and a distance measure that shows (usually) how close, or not, the
>> customer_id is from its cluster center.
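
A small sketch of what such output rows could look like, assuming the cluster centers are already known from an earlier run. The centers, customer IDs, and feature values are made up for illustration, and the layout is not the format of any particular tool.

```python
import math

# Hypothetical cluster centers from a previous clustering run.
centers = {0: (0.125, 0.21), 1: (0.89, 0.88)}

# Hypothetical customers: ID -> feature vector (the ID is not a feature).
customers = {
    "cust-001": (0.10, 0.20),
    "cust-003": (0.90, 0.85),
}

def describe(customer_id, features, centers):
    """Return (customer_id, nearest cluster, distance to that cluster's center)."""
    cluster, center = min(centers.items(),
                          key=lambda item: math.dist(features, item[1]))
    return (customer_id, cluster, math.dist(features, center))

report = [describe(cid, feats, centers) for cid, feats in customers.items()]
for cid, cluster, distance in report:
    print(f"{cid}\tcluster={cluster}\tdistance={distance:.3f}")
```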
>>
>> In terms of what you want to do, my assumption is that you'd like to find
>> out a structure, or patterns,
>> within your customer base, based on a set of variables that you have. This
>> is often called a segmentation.
>>
>> Hope this helps! What you want to do is a pretty basic and straightforward
>> application of clustering.
>> Good luck,
>> -Peter
>>
>>
>>
>> On Mon, Feb 17, 2014 at 9:48 PM, Bikash Gupta <bikash.gupt...@gmail.com>
>> wrote:
>>
>> > Basically I am trying to achieve customer segmentation.
>> >
>> > Now to measure customer similarity within a cluster I need to
>> > understand which two customers are similar.
>> >
>> > Assumption: To identify these customers uniquely I need to provide
>> > their CustomerId.
>> >
>> > Is my assumption correct? If yes, will the CustomerId affect the
>> > clustering output?
>> >
>> > If no, how can I identify each customer uniquely?
>> >
>> > On Tue, Feb 18, 2014 at 2:55 AM, Ted Dunning <ted.dunn...@gmail.com>
>> > wrote:
>> > > That really depends on what you want to do.
>> > >
>> > > What is it that you want?
>> > >
>> > >
>> > > On Mon, Feb 17, 2014 at 12:25 PM, Bikash Gupta
>> > > <bikash.gupt...@gmail.com> wrote:
>> > >
>> > >> OK, so UserId is not a good field for this combination, but if I want
>> > >> user clustering, what should the combination be (just for my
>> > >> understanding)?
>> > >>
>> > >> On Tue, Feb 18, 2014 at 1:44 AM, Ted Dunning <ted.dunn...@gmail.com>
>> > >> wrote:
>> > >> > On Mon, Feb 17, 2014 at 9:00 AM, Bikash Gupta
>> > >> > <bikash.gupt...@gmail.com> wrote:
>> > >> >
>> > >> >> Let's say I am clustering users, and I am providing their profile data
>> > >> >> to discover the similarity between two users.
>> > >> >>
>> > >> >> So my input would be [UserId, Location, Age, Gender, Time Created].
>> > >> >>
>> > >> >> Now, my UserId is at least 10 characters long, which makes it a very
>> > >> >> large number compared with the other categorical data.
>> > >> >>
>> > >> >
>> > >> > User id is not a good field for clustering.
>> > >> >
>> > >> > Location is fine if you want geographical clustering.
>> > >> >
>> > >> > Location + age + gender is fine for geo-demographic clustering.
>> > >> >
>> > >> > Adding time created might give a tiny bit of insight.
>> > >> >
>> > >> > But these fields are not going to lead to great insights.
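
To make the user ID point concrete, here is a small illustration with made-up rows, assuming a plain Euclidean distance and a numeric 10-digit ID. If the ID is left in the vector, it swamps every other field: two users with identical profiles look far apart, while two different users with neighbouring IDs look close.

```python
import math

# Hypothetical rows: (user_id, age, gender_code)
alice = (1000000001, 25, 0)
bob = (9000000002, 25, 0)    # same profile as alice, very different ID
carol = (1000000003, 60, 1)  # different profile, ID close to alice's

def dist_with_id(a, b):
    """Euclidean distance over all fields, ID included."""
    return math.dist(a, b)

def dist_without_id(a, b):
    """Euclidean distance over the profile fields only."""
    return math.dist(a[1:], b[1:])

print("with ID:    alice-bob", dist_with_id(alice, bob),
      " alice-carol", dist_with_id(alice, carol))
print("without ID: alice-bob", dist_without_id(alice, bob),
      " alice-carol", dist_without_id(alice, carol))
```

With the ID included, carol appears closer to alice than bob does; with it dropped, alice and bob coincide, which matches their identical profiles.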
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> Thanks & Regards
>> > >> Bikash Kumar Gupta

>> > >>
>> >
>> >
>> >
>> > --
>> > Thanks & Regards
>> > Bikash Kumar Gupta
>> >
>>



-- 
Thanks & Regards
Bikash Kumar Gupta