Re: User based recommender
On Wed, Dec 3, 2014 at 6:22 AM, Yash Patel yashpatel1...@gmail.com wrote:

I have multiple different columns such as category, shipping location, item price, online user, etc. How can I use all these different columns to improve recommendation quality (i.e. calculate a more precise similarity between users by using location and item price)?

For some kinds of information, you can build cross recommenders off of that other information. That incorporates the other information into an item-based system.

Simply hand-coding a similarity usually doesn't work well. The problem is that you don't really know which factors represent actionable and non-redundant user similarity.
Re: User based recommender
Calculating similarity using multiple column values is what I had in mind. I looked through the example, but content filtering is only mentioned there; it is not implemented explicitly. Can you point me to a working example, or do I need to use classifier or clustering algorithms? And if I do, can I feed the results into the recommenders provided in Mahout?

Best Regards,
Yash Patel

On Wed, Dec 3, 2014 at 3:43 PM, parnab kumar parnab.2...@gmail.com wrote:

1. Why not use the other columns as evidence and come up with a preference score (UID, ITEMID, PREF_SCORE), then see if the results improve? You could use other machine learning algorithms to generate such preference scores.

2. Another way may be to implement a custom similarity score, instead of the ones that ship with Mahout, in which you use these column values to decide on the similarity of the users. Have a look at Mahout in Action; it has an example for dating recommendations. Your USERID, ITEMID problem can be mapped back to that one. Try implementing the similarity score using the other column values.

Maybe some expert in this area can come up with a better solution, but if I were you I would certainly test the waters the way I mentioned.

Parnab
CSE, IIT Kharagpur
BIS, University College Cork
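Editor's sketch of suggestion 2, in plain Python rather than Mahout's API. It blends item overlap with shipping-location overlap into one user-to-user score. The rows, the column choice, and the 0.8 weight are all made up for illustration, and, as noted elsewhere in this thread, hand-coded similarities often do not work well, so treat it as something to test rather than a recommended design.

```python
# Hypothetical transaction rows: (user_id, item_id, shipping_location).
TRANSACTIONS = [
    ("u1", "car_seat", "Cork"),
    ("u1", "stroller", "Cork"),
    ("u2", "car_seat", "Cork"),
    ("u2", "crib_mobile", "Cork"),
    ("u3", "laptop", "Dublin"),
]


def build_profiles(rows):
    """Collect each user's purchased items and shipping locations."""
    profiles = {}
    for user, item, location in rows:
        profile = profiles.setdefault(user, {"items": set(), "locations": set()})
        profile["items"].add(item)
        profile["locations"].add(location)
    return profiles


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def user_similarity(p1, p2, item_weight=0.8):
    """Weighted blend of item overlap and location overlap (the weight is a guess)."""
    return (item_weight * jaccard(p1["items"], p2["items"])
            + (1.0 - item_weight) * jaccard(p1["locations"], p2["locations"]))


profiles = build_profiles(TRANSACTIONS)
sim_12 = user_similarity(profiles["u1"], profiles["u2"])
sim_13 = user_similarity(profiles["u1"], profiles["u3"])
print(sim_12, sim_13)
```

Whether the location term helps or hurts is an empirical question; comparing recommendation quality with `item_weight=1.0` against a blended weight is one way to find out.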
Few Questions related Mahout used for Text Clustering
Hi Mahout Users!

Firstly, this community is great, and I appreciate all the Q&A back and forth!

I am currently working on text clustering, using Mahout and its clustering algorithms (kmeans, krunner, canopy, etc.). If anyone has worked on a similar project, please let me know. I have two questions:

1. To choose the optimal k, I am running krunner across my vectorized dataset. To pick the right “k”, I am trying to understand the spread of my observations across all clusters and to minimize cluster 1 (which looks like the catch-all bucket; can anyone confirm?). However, I am observing that the final count varies depending on k. See below (please ignore the blank cells). Any idea why the final count varies depending on the chosen k?

[image: Inline image 1]

2. I also noticed that some of my clusters have just n=1 observation. That doesn’t make sense to me. Is there a way to avoid this, such as a particular parameter I can tweak?

Thank you, and I look forward to your reply.

Cheers,
Viral
Process UnStructured Data in Mahout for Clustering
Hi All,

I have been trying Mahout clustering on unstructured data, i.e. human-written text. I have tried clustering algorithms such as Kmeans, Canopy+Kmeans, and LDA, but the results are not helpful. I think the problem lies in the way the data is written. Can someone please give me some pointers on how to handle unstructured data for clustering? I have also written an analyzer that applies lower-casing and a stop-word filter.

Thanks :)

Regards,
Shaikh Shahid G.
+91 9503954781
Re: Process UnStructured Data in Mahout for Clustering
Hi,

It depends on the nature of the data you are clustering. If you know your data, you can interpret the results and set the clustering parameters, such as the number of topics or the number of clusters, accordingly.

Cheers,
Donni

On Thu, Dec 4, 2014 at 2:38 PM, Shahid Shaikh shaikhshah...@gmail.com wrote:

Hi All, I have been trying Mahout clustering on unstructured data, i.e. human-written text. I have tried clustering algorithms such as Kmeans, Canopy+Kmeans, and LDA, but the results are not helpful. I think the problem lies in the way the data is written. Can someone please give me some pointers on how to handle unstructured data for clustering? I have also written an analyzer that applies lower-casing and a stop-word filter. Thanks :)
Re: User based recommender
Cross recommenders don't seem applicable, because this dataset doesn't represent different actions by a user; it only contains transaction history (i.e. customer ID, item ID, shipping location, sales amount of that item, item category, etc.).

Maybe location or sales per item could be used, since similarity on those might reveal people who share the same purchasing patterns.

On Wed, Dec 3, 2014 at 5:28 PM, Ted Dunning ted.dunn...@gmail.com wrote:

For some kinds of information, you can build cross recommenders off of that other information. That incorporates the other information into an item-based system. Simply hand-coding a similarity usually doesn't work well. The problem is that you don't really know which factors represent actionable and non-redundant user similarity.
Re: Process UnStructured Data in Mahout for Clustering
Hey Donni,

Thanks, but I have already used those configurations and obtained the clusters, and the results are not promising enough. I was wondering whether there are any known techniques I can follow, specifically when generating the vectors.

Thanks

On Thursday, December 4, 2014, Donni Khan prince.don...@googlemail.com wrote:

Hi, it depends on the nature of the data you are clustering. If you know your data, you can interpret the results and set the clustering parameters, such as the number of topics or the number of clusters, accordingly. Cheers, Donni

--
Regards,
Shaikh Shahid G.
+91 9503954781
Re: Process UnStructured Data in Mahout for Clustering
My experience has been that it's best to leave the data processing to Python. I strongly suggest you rewrite your ETL and let Mahout do only the clustering. The built-in vectorization routines are fairly primitive.

Then I would wash the features, basically setting up your own list of stop words or phrases, before you let Mahout do anything.

On Dec 4, 2014, at 8:38 AM, Shahid Shaikh shaikhshah...@gmail.com wrote:

Hey Donni, thanks, but I have already used those configurations and obtained the clusters, and the results are not promising enough. I was wondering whether there are any known techniques I can follow, specifically when generating the vectors. Thanks
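Editor's sketch of what "washing the features" in Python could look like before anything is handed to Mahout. It uses only the standard library; the stop-word and stop-phrase lists, the sample documents, and the plain TF-IDF weighting are assumptions for illustration, not what the poster actually ran.

```python
import math
import re
from collections import Counter

# Hypothetical domain-specific stop words and phrases; tune these for your corpus.
STOP_WORDS = {"the", "a", "is", "to", "and", "please", "thanks"}
STOP_PHRASES = ["kind regards", "sent from my phone"]


def clean_tokens(text):
    """Lower-case, strip stop phrases, tokenize, and drop stop words."""
    text = text.lower()
    for phrase in STOP_PHRASES:
        text = text.replace(phrase, " ")
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    tokenized = [clean_tokens(d) for d in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "Please reset the router. Kind regards",
    "The router is broken, thanks",
    "Invoice payment failed",
]
vectors = tfidf(docs)
print(vectors[0])
```

Terms that survive the cleaning but appear in most documents get a weight near zero, which is a hint that they belong on the stop list. Converting these sparse vectors into the input format Mahout expects is a separate step and is not shown here.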
Re: User based recommender
User1 purchases = infant car seat, infant stroller
User2 purchases = infant car seat, infant stroller, infant crib mobile

The obvious recommendation for User1 is an infant crib mobile. From the purchase history the users look similar. Here similarity is in “taste”. User or item information that does not relate to taste may be misleading for recs.

If you look at their profiles:

User1: male, 55 years old, upper 75% income
User2: female, 29 years old, lower 25% income

User1 is actually a doting grandfather, User2 a doting mother. Their profiles are quite dissimilar though their taste is similar. The point is that those other pieces of data may not relate to user similarity *of taste*. The cross-recommendation process applies cooccurrence analysis to the data, checking whether the secondary data correlates in an important way with the action you know is important.

For this reason it’s usually best to start out ignoring that information and using just UID, ITEMID for the important action. Later you may find uses for the extra data, or may consider viewing or purchasing from a certain category as a secondary action and use cross-recommendations to improve things.

On Dec 4, 2014, at 7:17 AM, Yash Patel yashpatel1...@gmail.com wrote:

Cross recommenders don't seem applicable, because this dataset doesn't represent different actions by a user; it only contains transaction history (i.e. customer ID, item ID, shipping location, sales amount of that item, item category, etc.). Maybe location or sales per item could be used, since similarity on those might reveal people who share the same purchasing patterns.
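Editor's sketch of the "UID, ITEMID only" starting point, using the purchase histories from the car seat example plus a hypothetical User3. It counts how often items are bought together and scores unseen items by raw cooccurrence. This is a simplification for illustration and not Mahout's implementation; a real system would normally apply a statistical test to the counts rather than use them directly.

```python
from collections import defaultdict
from itertools import permutations

# Purchase histories from the example in this message; User3 is hypothetical.
purchases = {
    "User1": {"infant car seat", "infant stroller"},
    "User2": {"infant car seat", "infant stroller", "infant crib mobile"},
    "User3": {"laptop", "laptop bag"},
}


def cooccurrence_counts(histories):
    """Count, for each ordered item pair, how many users bought both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in permutations(items, 2):
            counts[a][b] += 1
    return counts


def recommend(user, histories, counts):
    """Score items the user lacks by cooccurrence with items they own."""
    owned = histories[user]
    scores = defaultdict(int)
    for item in owned:
        for other, n in counts[item].items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


counts = cooccurrence_counts(purchases)
recs = recommend("User1", purchases, counts)
print(recs)
```

No age, gender, or income column enters the calculation: User1 and User2 end up linked purely through what they bought, which is the similarity "of taste" described in the message.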
Mahout used for Text Clustering
Hi Mahout Users!

I am currently working on text clustering, using Mahout and its clustering algorithms (kmeans, LDA, canopy, etc.). I have the following questions:

1. Why is Mahout producing clusters with only 1 observation?
2. Is cluster 1 always the catch-all cluster?
3. When I change k in kmeans and run clusterdump, the total number of observations changes as k changes. Why is that? Am I missing anything?
4. Does normalization (when creating the vectors) lead to better clustering results, especially for unstructured data? In my case it's text data.

Thank you in advance for your help!

Cheers,
V
Topological data analysis
Any interest in a topological data analysis package in Mahout?

https://www.google.com/search?q=topological+data+analysis
http://danifold.net/mapper/introduction.html
http://danifold.net/mapper

It would be nice to be able to run jobs and export to JSON for consumption in D3 or other plotting/visualization tools.
Re: Topological data analysis
Though I don't have an immediate use case, I'd +1 the idea!

On Dec 4, 2014, at 3:11 PM, Andrew Musselman andrew.mussel...@gmail.com wrote:

Any interest in a topological data analysis package in Mahout?

https://www.google.com/search?q=topological+data+analysis
http://danifold.net/mapper/introduction.html
http://danifold.net/mapper

It would be nice to be able to run jobs and export to JSON for consumption in D3 or other plotting/visualization tools.