Thanks Sean. I will check how to support 0.9 with CDH4.
However, 0.9 has solved my problem.

On Tue, Feb 18, 2014 at 7:45 PM, Sean Owen <sro...@gmail.com> wrote:
> FYI, CDH5 includes version 0.8 + patches. But 0.9 should work fine
> with CDH4. You do have to build with the Hadoop 2.x profile, as usual.
>
> On Tue, Feb 18, 2014 at 2:06 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> Bikash,
>>
>> Don't use that version. Use a more recent release. We can't help that
>> Cloudera has an old version.
>>
>> On Tue, Feb 18, 2014 at 1:26 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>>
>>> Suneel,
>>>
>>> Thanks for the information.
>>>
>>> I am using 0.7 packaged with CDH.
>>>
>>> On Tue, Feb 18, 2014 at 2:14 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>>> >
>>> > On Tuesday, February 18, 2014 3:37 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>>> >
>>> > Ted/Peter,
>>> >
>>> > Thanks for the response.
>>> >
>>> > This is exactly what I am trying to achieve. Maybe I was not able to
>>> > put my questions clearly.
>>> >
>>> > I am clustering on a few variables of Customer/User (except their
>>> > customer_id/user_id) and storing the customer_id/user_id list in a
>>> > separate place.
>>> >
>>> > Question) What is the approach to identify each member in each cluster
>>> > by its unique id?
>>> > Answer) I have to run a script post-clustering to map
>>> > customer_id/user_id for the clustered output to identify the member
>>> > uniquely.
>>> >
>>>
>>> If u r working off of Mahout 0.9 u don't have to do that. The
>>> Clustered output should display the vectors with the vectorid (user_id in
>>> ur case) that belong to a specific cluster along with the distance of that
>>> vector from the cluster center.
>>> >
>>> > Correct me if I am wrong :)
>>> >
>>> > On Tue, Feb 18, 2014 at 10:53 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> >> Bikash,
>>> >>
>>> >> Peter is just right.
>>> >>
>>> >> Yes, you can cluster on these few variables that you have.
>>> >> Probably you should translate location to x,y,z coordinates so that
>>> >> you don't have strange geometry problems, but location, gender and
>>> >> age are quite reasonable characteristics. You will get a fairly weak
>>> >> clustering since these characteristics actually tell very little
>>> >> about people, but it is a start.
>>> >>
>>> >> You *don't* want to cluster using user ID for exactly the reasons that
>>> >> Peter mentioned. Another way to put it is that the user ID tells you
>>> >> absolutely nothing about the person and thus is not useful for the
>>> >> clustering.
>>> >>
>>> >> You *do* have to retain the assignment of users to cluster and that
>>> >> assignment is usually stored as a list of user IDs for each cluster.
>>> >> This does not at all imply, however, that the user ID was used to
>>> >> *form* the cluster.
>>> >>
>>> >> On Mon, Feb 17, 2014 at 9:01 PM, Peter Jaumann <peter.jauma...@gmail.com> wrote:
>>> >>
>>> >>> Bikash,
>>> >>> As Ted pointed out already...
>>> >>> You can cluster on all variables except your customer_id. That's your
>>> >>> identifier.
>>> >>> Customers within a cluster are 'similar'; how similar depends on the
>>> >>> fidelity of your clustering.
>>> >>> The clustering algorithm uses (you'll choose) some kind of distance, or
>>> >>> similarity/dissimilarity measure (which one to use depends on the type
>>> >>> of data you have). This measure will, eventually, determine how
>>> >>> separate/how unique your clusters are. The goal is to have your
>>> >>> clusters distinct from each other but have the cluster members,
>>> >>> within a cluster, as similar as possible.
>>> >>>
>>> >>> In the output, each member in each cluster is uniquely identified by
>>> >>> its customer_id, the cluster it belongs to, and a distance measure
>>> >>> that shows (usually) how close, or not, the customer_id is from its
>>> >>> cluster center.
>>> >>>
>>> >>> In terms of what you want to do, my assumption is that you'd like to
>>> >>> find out a structure, or patterns, within your customer base, based
>>> >>> on a set of variables that you have. This is often called a
>>> >>> segmentation.
>>> >>>
>>> >>> Hope this helps! What you want to do is a pretty basic and
>>> >>> straightforward application of clustering.
>>> >>> Good luck,
>>> >>> -Peter
>>> >>>
>>> >>> On Mon, Feb 17, 2014 at 9:48 PM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>>> >>>
>>> >>> > Basically I am trying to achieve customer segmentation.
>>> >>> >
>>> >>> > Now to measure customer similarity within a cluster I need to
>>> >>> > understand which two customers are similar.
>>> >>> >
>>> >>> > Assumption: To identify these customers uniquely I need to provide
>>> >>> > their CustomerId.
>>> >>> >
>>> >>> > Is my assumption correct? If yes, then will customerId affect the
>>> >>> > clustering output?
>>> >>> >
>>> >>> > If no, then how can I identify a customer uniquely?
>>> >>> >
>>> >>> > On Tue, Feb 18, 2014 at 2:55 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> >>> > > That really depends on what you want to do.
>>> >>> > >
>>> >>> > > What is it that you want?
>>> >>> > >
>>> >>> > > On Mon, Feb 17, 2014 at 12:25 PM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>>> >>> > >
>>> >>> > >> Ok...so UserId is not a good field for this combination, but if I
>>> >>> > >> want User Clustering, what should the combination be (just for
>>> >>> > >> understanding)?
>>> >>> > >>
>>> >>> > >> On Tue, Feb 18, 2014 at 1:44 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> >>> > >> > On Mon, Feb 17, 2014 at 9:00 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>>> >>> > >> >
>>> >>> > >> >> Let's say I am clustering users, and I am providing their profile
>>> >>> > >> >> data to discover similarity between two users.
>>> >>> > >> >>
>>> >>> > >> >> So my input would be [UserId, Location, Age, Gender, Time Created]
>>> >>> > >> >>
>>> >>> > >> >> Now if my UserId length is of minimum 10 characters, which is
>>> >>> > >> >> comparatively a very large number next to the other categorical data.
>>> >>> > >> >>
>>> >>> > >> >
>>> >>> > >> > User id is not a good field for clustering.
>>> >>> > >> >
>>> >>> > >> > Location is fine if you want geo-graphical clustering.
>>> >>> > >> >
>>> >>> > >> > Location + age + gender is fine for geo-demo-graphical clustering.
>>> >>> > >> >
>>> >>> > >> > Adding time created might give a tiny bit of insight.
>>> >>> > >> >
>>> >>> > >> > But these fields are not going to lead to great insights.
>>> >>> > >>
>>> >>> > >> --
>>> >>> > >> Thanks & Regards
>>> >>> > >> Bikash Kumar Gupta
>>> >>> >
>>> >>> > --
>>> >>> > Thanks & Regards
>>> >>> > Bikash Kumar Gupta
>>> >
>>> > --
>>> > Thanks & Regards
>>> > Bikash Kumar Gupta
>>>
>>> --
>>> Thanks & Regards
>>> Bikash Kumar Gupta

--
Thanks & Regards
Bikash Kumar Gupta
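
For anyone reading this thread later, here is a rough sketch of the approach Ted and Peter describe: location is turned into x,y,z coordinates, the user ID is kept out of the feature vector, and the ID is carried alongside so each row of output is (user_id, cluster, distance to center). This is plain Python, not Mahout. The field names (user_id, lat, lon, age, gender), the age scaling, the 0/1 gender encoding and the sample centers are all assumptions made up for illustration, and it assumes location is available as latitude/longitude in degrees.

```python
import math


def latlon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def to_features(user):
    """Build the feature vector. The user ID is deliberately left out."""
    x, y, z = latlon_to_xyz(user["lat"], user["lon"])
    age = user["age"] / 100.0                        # crude scaling (assumption)
    gender = 1.0 if user["gender"] == "F" else 0.0   # simple encoding (assumption)
    return (x, y, z, age, gender)


def assign(users, centers):
    """Return one (user_id, cluster_index, distance) row per user.

    The ID only labels the row; it plays no part in the distance.
    """
    rows = []
    for user in users:
        vec = to_features(user)
        dists = [math.dist(vec, center) for center in centers]
        best = min(range(len(centers)), key=dists.__getitem__)
        rows.append((user["user_id"], best, dists[best]))
    return rows


# Hypothetical sample data; the centers would normally come from a
# clustering run (k-means or similar), here two users stand in for them.
users = [
    {"user_id": "U000000001", "lat": 40.7, "lon": -74.0, "age": 30, "gender": "F"},
    {"user_id": "U000000002", "lat": 40.8, "lon": -73.9, "age": 32, "gender": "F"},
    {"user_id": "U000000003", "lat": 51.5, "lon": -0.1, "age": 55, "gender": "M"},
]
centers = [to_features(users[0]), to_features(users[2])]

for row in assign(users, centers):
    print(row)
```

Euclidean distance is used here only because it is the simplest choice; as Peter notes, which measure fits depends on the type of data, and the relative scaling of age and gender against the coordinates will change the result.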