Re: [Edit] Approach for Clustering Data

Bikash Gupta Tue, 18 Feb 2014 06:22:48 -0800

Thanks Sean.

I will check how to support 0.9 with CDH4.


However, 0.9 has solved my problem.

On Tue, Feb 18, 2014 at 7:45 PM, Sean Owen <sro...@gmail.com> wrote:
> FYI, CDH5 includes version 0.8 + patches. But 0.9 should work fine
> with CDH4. You do have to build with the Hadoop 2.x profile, as usual.
>
> On Tue, Feb 18, 2014 at 2:06 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> Bikash,
>>
>> Don't use that version.  Use a more recent release.  We can't help that
>> Cloudera has an old version.
>>
>> On Tue, Feb 18, 2014 at 1:26 AM, Bikash Gupta
>> <bikash.gupt...@gmail.com> wrote:
>>
>>> Suneel,
>>>
>>> Thanks for the information.
>>>
>>> I am using 0.7, as packaged with CDH.
>>>
>>> On Tue, Feb 18, 2014 at 2:14 PM, Suneel Marthi <suneel_mar...@yahoo.com>
>>> wrote:
>>> >
>>> > On Tuesday, February 18, 2014 3:37 AM, Bikash Gupta
>>> > <bikash.gupt...@gmail.com> wrote:
>>> >
>>> > Ted/Peter,
>>> >
>>> > Thanks for the response.
>>> >
>>> > This is exactly what I am trying to achieve. Maybe I did not put my
>>> > questions clearly.
>>> >
>>> > I am clustering on a few variables of the Customer/User (excluding their
>>> > customer_id/user_id) and storing the customer_id/user_id list in a
>>> > separate place.
>>> >
>>> > Question) What is the approach to identify each member in each cluster
>>> > by its unique id?
>>> > Answer) I have to run a script post-clustering to map
>>> > customer_id/user_id onto the clustered output to identify each member
>>> > uniquely.
>>> >
>>> >>> If you are working off of Mahout 0.9 you don't have to do that. The
>>> clustered output should display the vectors with the vector id (user_id in
>>> your case) that belong to a specific cluster, along with the distance of
>>> each vector from the cluster center.
>>> >
>>> > Correct me if I am wrong :)
>>> >
>>> >
>>> > On Tue, Feb 18, 2014 at 10:53 AM, Ted Dunning <ted.dunn...@gmail.com>
>>> > wrote:
>>> >> Bikash,
>>> >>
>>> >> Peter is just right.
>>> >>
>>> >> Yes, you can cluster on these few variables that you have.  Probably you
>>> >> should translate location to x,y,z coordinates so that you don't have
>>> >> strange geometry problems, but location, gender and age are quite
>>> >> reasonable characteristics.  You will get a fairly weak clustering since
>>> >> these characteristics actually tell very little about people, but it is
>>> >> a start.
>>> >>
>>> >> You *don't* want to cluster using user ID for exactly the reasons that
>>> >> Peter mentioned.  Another way to put it is that the user ID tells you
>>> >> absolutely nothing about the person and thus is not useful for the
>>> >> clustering.
>>> >>
>>> >> You *do* have to retain the assignment of users to cluster and that
>>> >> assignment is usually stored as a list of user IDs for each cluster.
>>> >> This does not at all imply, however, that the user ID was used to
>>> >> *form* the cluster.
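
A minimal sketch of the advice above, in Python. The thread itself contains
no code, and this is not Mahout's API. It assumes that "location" is
available as latitude/longitude in degrees; the record layout, the sample
users, the age and gender encodings, and the helper name latlon_to_xyz are
all hypothetical. It shows two things: translating location to x,y,z
coordinates on a unit sphere, and keeping the user IDs in a separate list so
that they never enter the feature vectors.

```python
import math


def latlon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# Hypothetical records: (user_id, latitude, longitude, age, gender).
users = [
    ("u0000000001", 37.77, -122.42, 34, "F"),
    ("u0000000002", 51.51, -0.13, 52, "M"),
    ("u0000000003", -33.87, 151.21, 27, "F"),
]

# The IDs go into their own list; they are not part of the feature vectors.
user_ids = [u[0] for u in users]

# Assumed encoding, for illustration only: age divided by 100, gender as 0/1.
features = [
    latlon_to_xyz(lat, lon) + (age / 100.0, 1.0 if gender == "F" else 0.0)
    for _, lat, lon, age, gender in users
]
```

With this mapping, longitudes such as 179.9 and -179.9 end up close together,
which raw degree values would not give. Row i of features still corresponds
to user_ids[i], so the assignment of users to clusters can be recovered
afterwards by position.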
>>> >>
>>> >> On Mon, Feb 17, 2014 at 9:01 PM, Peter Jaumann
>>> >> <peter.jauma...@gmail.com> wrote:
>>> >>
>>> >>> Bikash,
>>> >>> As Ted pointed out already......
>>> >>> You can cluster on all variables except your customer_id. That's your
>>> >>> identifier.
>>> >>> Customers within a cluster are 'similar'; how similar depends on the
>>> >>> fidelity of your clustering.
>>> >>> The clustering algorithm uses (you'll choose) some kind of distance, or
>>> >>> similarity/dissimilarity measure (which one to use depends on the type
>>> >>> of data you have). This measure will, eventually, determine how
>>> >>> separate/how unique your clusters are. The goal is to have your
>>> >>> clusters distinct from each other but have the cluster members, within
>>> >>> a cluster, as similar as possible.
>>> >>>
>>> >>> In the output, each member in each cluster is uniquely identified by
>>> >>> its customer_id, the cluster it belongs to, and a distance measure
>>> >>> that shows (usually) how close, or not, the customer_id is to its
>>> >>> cluster center.
>>> >>>
>>> >>> In terms of what you want to do, my assumption is that you'd like to
>>> >>> find out a structure, or patterns, within your customer base, based on
>>> >>> a set of variables that you have. This is often called a segmentation.
>>> >>>
>>> >>> Hope this helps! What you want to do is a pretty basic and
>>> >>> straightforward application of clustering.
>>> >>> Good luck,
>>> >>> -Peter
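
The output described above (customer_id, cluster, distance to the cluster
center) can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: it is a plain Lloyd-style k-means with
Euclidean distance written from scratch, not Mahout's implementation or
output format, and the customer IDs, feature values, and initial centers are
made up. The IDs are carried alongside the feature vectors and are used only
to label the result.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda c: euclidean(point, centers[c]))


def kmeans(points, centers, iterations=10):
    """Plain Lloyd-style k-means; returns the final list of centers."""
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for c, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # New center is the per-dimension mean of the members.
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep the old center if a cluster ends up empty.
                updated.append(old)
        centers = updated
    return centers


# Hypothetical data: IDs kept in one list, feature vectors in another.
customer_ids = ["c1", "c2", "c3", "c4"]
features = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]

# Initial centers are picked by hand here, purely for illustration.
centers = kmeans(features, centers=[features[0], features[2]])

# One row per customer: (customer_id, cluster index, distance to its center).
report = []
for cid, point in zip(customer_ids, features):
    cluster = nearest(point, centers)
    report.append((cid, cluster, euclidean(point, centers[cluster])))
```

The customer_id appears only in the final report, so it identifies each
member without influencing which cluster that member falls into.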
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Mon, Feb 17, 2014 at 9:48 PM, Bikash Gupta
>>> >>> <bikash.gupt...@gmail.com> wrote:
>>> >>>
>>> >>> > Basically I am trying to achieve customer segmentation.
>>> >>> >
>>> >>> > Now to measure customer similarity within a cluster I need to
>>> >>> > understand which two customers are similar.
>>> >>> >
>>> >>> > Assumption: To identify these customers uniquely I need to provide
>>> >>> > their CustomerId.
>>> >>> >
>>> >>> > Is my assumption correct? If yes, will CustomerId affect the
>>> >>> > clustering output?
>>> >>> >
>>> >>> > If no, how can I identify each customer uniquely?
>>> >>> >
>>> >>> > On Tue, Feb 18, 2014 at 2:55 AM, Ted Dunning <ted.dunn...@gmail.com>
>>> >>> > wrote:
>>> >>> > > That really depends on what you want to do.
>>> >>> > >
>>> >>> > > What is it that you want?
>>> >>> > >
>>> >>> > >
>>> >>> > > On Mon, Feb 17, 2014 at 12:25 PM, Bikash Gupta
>>> >>> > > <bikash.gupt...@gmail.com> wrote:
>>> >>> > >
>>> >>> > >> Ok...so UserId is not a good field for this combination, but if I
>>> >>> > >> want User Clustering, what should the combination be (just for
>>> >>> > >> understanding)?
>>> >>> > >>
>>> >>> > >> On Tue, Feb 18, 2014 at 1:44 AM, Ted Dunning
>>> >>> > >> <ted.dunn...@gmail.com> wrote:
>>> >>> > >> > On Mon, Feb 17, 2014 at 9:00 AM, Bikash Gupta
>>> >>> > >> > <bikash.gupt...@gmail.com> wrote:
>>> >>> > >> >
>>> >>> > >> >> Let's say I am clustering users, and I am providing their profile
>>> >>> > >> >> data to discover similarity between two users.
>>> >>> > >> >>
>>> >>> > >> >> So my input would be [UserId, Location, Age, Gender, Time Created]
>>> >>> > >> >>
>>> >>> > >> >> Now if my UserId is at least 10 characters long, which is a
>>> >>> > >> >> comparatively very large number next to the other categorical data.
>>> >>> > >> >>
>>> >>> > >> >
>>> >>> > >> > User id is not a good field for clustering.
>>> >>> > >> >
>>> >>> > >> > Location is fine if you want geo-graphical clustering.
>>> >>> > >> >
>>> >>> > >> > Location + age + gender is fine for geo-demo-graphical clustering.
>>> >>> > >> >
>>> >>> > >> > Adding time created might give a tiny bit of insight.
>>> >>> > >> >
>>> >>> > >> > But these fields are not going to lead to great insights.
>>> >>> > >>
>>> >>> > >>
>>> >>> > >>
>>> >>> > >> --
>>> >>> > >> Thanks & Regards
>>> >>> > >> Bikash Kumar Gupta
>>> >>> > >>
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > --
>>> >>> > Thanks & Regards
>>> >>> > Bikash Kumar Gupta
>>> >>> >
>>> >>>
>>> >
>>> >
>>> >
>>> > --
>>> > Thanks & Regards
>>> > Bikash Kumar Gupta
>>>
>>>
>>>
>>> --
>>> Thanks & Regards
>>> Bikash Kumar Gupta
>>>



-- 
Thanks & Regards
Bikash Kumar Gupta
