Re: parallelALS and RMSE TEST

2013-05-06 Thread William
Ted Dunning writes: > > Without more information it is impossible to comment. > > What experiments? > > On Fri, May 3, 2013 at 8:45 AM, William wrote: > > > I'm trying to get some recommendations with three algorithms: > > 1. parallelALS > > 2. evaluateFactorization > >
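
For readers trying to reproduce the RMSE test under discussion, here is a minimal sketch of the usual split/factorize/evaluate pipeline, assuming the Mahout 0.7-era Hadoop drivers. The input file, output paths, and parameter values are placeholders, and the flag names and output layout should be checked against mahout parallelALS --help for your version:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.als.DatasetSplitter;
    import org.apache.mahout.cf.taste.hadoop.als.FactorizationEvaluator;
    import org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob;

    public class AlsRmsePipeline {
      public static void main(String[] args) throws Exception {
        // 1. Hold out ~10% of the ratings as a probe set.
        ToolRunner.run(new DatasetSplitter(), new String[] {
            "--input", "ratings.csv", "--output", "split",
            "--trainingPercentage", "0.9", "--probePercentage", "0.1"});
        // 2. Factorize the training set with ALS-WR.
        ToolRunner.run(new ParallelALSFactorizationJob(), new String[] {
            "--input", "split/trainingSet", "--output", "als",
            "--numFeatures", "20", "--numIterations", "10", "--lambda", "0.065"});
        // 3. RMSE of the reconstruction on the held-out probe set
        //    (the result is written under the output path).
        ToolRunner.run(new FactorizationEvaluator(), new String[] {
            "--input", "split/probeSet", "--output", "rmse",
            "--userFeatures", "als/U", "--itemFeatures", "als/M"});
      }
    }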

Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
If you have no ratings, how are you using RMSE? This typically measures error in reconstructing ratings. I think you are probably measuring something meaningless. On Mon, May 6, 2013 at 10:17 AM, William wrote: > I have a dataset about users and movies (no ratings). But I want to get some > recommendati

Re: parallelALS and RMSE TEST

2013-05-06 Thread William
Sean Owen writes: > > If you have no ratings, how are you using RMSE? This typically > measures error in reconstructing ratings. > I think you are probably measuring something meaningless. > I suppose the rating of seen movies is 1. Is that right? If I use Collaborative Filtering with

Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
ALS-WR weights the error on each term differently, so the average error doesn't really have meaning here, even if you are comparing the difference with "1". I think you will need to fall back to mean average precision or something. On Mon, May 6, 2013 at 11:24 AM, William wrote: > Sean Owen gmai
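
Since the thread is about boolean data, here is a minimal sketch of getting precision and recall at N out of Mahout's in-memory evaluator instead of RMSE (mean average precision itself is not shipped, but precision@N is close in spirit). The file name and the choice of log-likelihood item similarity are assumptions:

    import java.io.File;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.eval.IRStatistics;
    import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
    import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public class PrecisionAtN {
      public static void main(String[] args) throws Exception {
        // "userID,movieID" pairs with no rating column ("boolean prefs").
        DataModel model = new FileDataModel(new File("user_movie.csv"));
        RecommenderBuilder builder = new RecommenderBuilder() {
          @Override
          public Recommender buildRecommender(DataModel trainingModel) throws TasteException {
            return new GenericBooleanPrefItemBasedRecommender(
                trainingModel, new LogLikelihoodSimilarity(trainingModel));
          }
        };
        GenericRecommenderIRStatsEvaluator evaluator =
            new GenericRecommenderIRStatsEvaluator();
        // Evaluate top-10 recommendations against held-out interactions.
        IRStatistics stats = evaluator.evaluate(builder, null, model, null, 10,
            GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
        System.out.println("precision@10 = " + stats.getPrecision());
        System.out.println("recall@10    = " + stats.getRecall());
      }
    }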

Clustering Categorical Data

2013-05-06 Thread Florents Tselai
Hello, are there any suggestions on which Mahout algorithms to use for clustering categorical data?

Re: Clustering Categorical Data

2013-05-06 Thread Ted Dunning
It really depends on your data, but anything that works on text has at least a potential for working on categorical data. It is common to use a 1-of-n encoding for categorical data and then simply use Euclidean distance with something like k-means. Can you say something about how many variables a
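
A minimal sketch of the 1-of-n encoding Ted describes, using Mahout's sparse vectors. The two categorical variables here are hypothetical; the resulting vectors would then be written as VectorWritable values to a SequenceFile and fed to the kmeans driver with a Euclidean distance measure:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class OneOfNEncoding {
      // Hypothetical example: two categorical variables,
      // color in {red, green, blue} and size in {S, M, L}.
      static final String[] COLORS = {"red", "green", "blue"};
      static final String[] SIZES = {"S", "M", "L"};

      // One dimension per category value; 1.0 marks the observed value.
      static Vector encode(String color, String size) {
        Vector v = new RandomAccessSparseVector(COLORS.length + SIZES.length);
        v.set(indexOf(COLORS, color), 1.0);
        v.set(COLORS.length + indexOf(SIZES, size), 1.0);
        return v;
      }

      static int indexOf(String[] values, String value) {
        for (int i = 0; i < values.length; i++) {
          if (values[i].equals(value)) {
            return i;
          }
        }
        throw new IllegalArgumentException("unknown category value: " + value);
      }

      public static void main(String[] args) {
        System.out.println(encode("green", "L"));
      }
    }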

Re: Clustering Categorical Data

2013-05-06 Thread Florents Tselai
I'm working on Market Basket Analysis. The "small" data set consists of 4 transactions (baskets) and 35 categories, while the large data set is about 30 million baskets and 400 categories. On Mon, May 6, 2013 at 9:17 PM, Ted Dunning wrote: > It really depends on your data, but anything tha

Clustering product views and sales

2013-05-06 Thread Dominik Hübner
I am currently working on a dataset containing product views and sales of about 10^7 users and 6000 items for my master's thesis in CS. My goal is to build product clusters from this. As expected, the item (row) vectors are VERY sparse. My current approach is to implement PCA using the SVDSolver cla

Re: Clustering Categorical Data

2013-05-06 Thread Ted Dunning
So this isn't really categorical data. But that is good news. You can still use the binary representation and there is a good possibility that these data will cluster reasonably, especially with spectral techniques. What I would recommend, however, is that cooccurrence analysis might give you a
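
A minimal sketch of such a cooccurrence analysis with Mahout's ItemSimilarityJob, treating each basket as a "user" and each category as an "item", with log-likelihood scoring to sparsify the result. The input/output paths and the similarity cap are placeholders:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

    public class BasketCooccurrence {
      public static void main(String[] args) throws Exception {
        // Input: "basketID,categoryID" pairs (no preference value).
        ToolRunner.run(new ItemSimilarityJob(), new String[] {
            "--input", "baskets.csv",
            "--output", "category_similarity",
            "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
            "--booleanData", "true",
            "--maxSimilaritiesPerItem", "50"});
      }
    }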

Re: Clustering product views and sales

2013-05-06 Thread Ted Dunning
On Mon, May 6, 2013 at 11:29 AM, Dominik Hübner wrote: > Oh, and I forgot how the views and sales are used to build product > vectors. As of now, I implemented binary vectors, vectors counting the > number of views and sales (e.g. 1 view = 1 count, 1 sale = 10 counts) and ordinary > vectors ( view => 1, sa

Re: Clustering product views and sales

2013-05-06 Thread Dominik Hübner
And would you run the clustering on the cooccurrence matrix, or do PCA by removing eigenvalues/vectors? On May 6, 2013, at 8:52 PM, Ted Dunning wrote: > On Mon, May 6, 2013 at 11:29 AM, Dominik Hübner wrote: > >> Oh, and I forgot how the views and sales are used to build product >> vectors. As of

Re: Clustering product views and sales

2013-05-06 Thread Ted Dunning
I don't even think that clustering is all that necessary. The reduced cooccurrence matrix will give you items related to each item. You can use something like PCA, but SVD is just as good here due to near zero mean. You could use SSVD or ALS from Mahout to do this analysis and then use k-means on th
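
A rough sketch of that reduce-then-cluster idea with the Mahout 0.7-era drivers. The SSVD output layout, the seed-cluster path, and all parameter values are assumptions, so check mahout ssvd --help and mahout kmeans --help before relying on the exact flags:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli;

    public class ReduceThenCluster {
      public static void main(String[] args) throws Exception {
        // 1. Reduce the sparse item vectors to a rank-50 representation.
        ToolRunner.run(new SSVDCli(), new String[] {
            "--input", "item_vectors", "--output", "ssvd", "--rank", "50"});
        // 2. Run k-means on the reduced vectors (here assumed to live under ssvd/U).
        ToolRunner.run(new KMeansDriver(), new String[] {
            "--input", "ssvd/U",
            "--output", "clusters",
            "--clusters", "initial_centroids",   // seed path; filled with random seeds when -k is given
            "--numClusters", "20",
            "--maxIter", "10",
            "--distanceMeasure", "org.apache.mahout.common.distance.CosineDistanceMeasure",
            "--clustering"});
      }
    }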

Re: parallelALS and RMSE TEST

2013-05-06 Thread Tevfik Aytekin
This problem is called the one-class classification problem. In the domain of collaborative filtering it is called one-class collaborative filtering (since what you have are only positive preferences). You may search the web with these keywords to find papers providing solutions. I'm not sure whether

Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
Yes, it goes by the name 'boolean prefs' in the project since target variables don't have values -- they just exist or don't. So, yes, it's certainly supported, but the question here is how to evaluate the output. On Mon, May 6, 2013 at 8:29 PM, Tevfik Aytekin wrote: > This problem is called one-cl
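
A minimal sketch of the 'boolean prefs' route in the in-memory (Taste) API, assuming a "userID,movieID" input file with no rating column; the neighborhood size and similarity choice are arbitrary:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class BooleanPrefsExample {
      public static void main(String[] args) throws Exception {
        // Lines of "userID,movieID" -- interactions only, no ratings.
        DataModel model = new FileDataModel(new File("user_movie.csv"));
        UserSimilarity similarity = new LogLikelihoodSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, similarity, model);
        GenericBooleanPrefUserBasedRecommender recommender =
            new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
        // Top-10 recommendations for user 42.
        List<RecommendedItem> top10 = recommender.recommend(42L, 10);
        for (RecommendedItem item : top10) {
          System.out.println(item.getItemID() + "\t" + item.getValue());
        }
      }
    }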

Re: Clustering product views and sales

2013-05-06 Thread Dominik Hübner
Well, as you might already have guessed, I am building a product recommender system for my thesis. I am planning to evaluate ALS (both implicit and explicit) as well as item-similarity recommendation for users with at least a few known products. Nevertheless, the majority of users only has s

Re: Clustering product views and sales

2013-05-06 Thread Koobas
Since Dominik mentioned item-based and ALS, let me throw in a question here. I believe that one of the Netflix prize solutions combined KNN and ALS. 1) What is the best way to combine the results of both? 2) Is there really merit to this approach? 3) Are there other combinations that make sense?

Re: parallelALS and RMSE TEST

2013-05-06 Thread Tevfik Aytekin
Hi Sean, aren't boolean preferences supported in the context of memory-based recommendation algorithms in Mahout? Are there matrix factorization algorithms in Mahout which can work with this kind of data (that is, the kind of data which consists of users and the movies they have seen)? On Mon

Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
Parallel ALS is exactly an example of where you can use matrix factorization for "0/1" data. On Mon, May 6, 2013 at 9:22 PM, Tevfik Aytekin wrote: > Hi Sean, > Isn't boolean preferences is supported in the context of memory-based > recommendation algorithms in Mahout? > Are there matrix factoriza
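
A minimal sketch of running parallelALS on such 0/1 data with the implicit-feedback option (the Hu/Koren/Volinsky weighting): every observed user-movie pair is written with value 1, and everything unobserved is treated as an implicit 0. Paths and hyperparameters are placeholders, and the availability of --implicitFeedback/--alpha should be checked against your Mahout version:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob;

    public class ImplicitAls {
      public static void main(String[] args) throws Exception {
        // Input: "userID,movieID,1" -- in implicit mode the value acts as a
        // confidence weight, not a rating to be reconstructed.
        ToolRunner.run(new ParallelALSFactorizationJob(), new String[] {
            "--input", "user_movie.csv",
            "--output", "als",
            "--numFeatures", "20",
            "--numIterations", "10",
            "--lambda", "0.065",
            "--implicitFeedback", "true",
            "--alpha", "40"});
      }
    }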

Re: parallelALS and RMSE TEST

2013-05-06 Thread Tevfik Aytekin
But the data under consideration here is not 0/1 data; it contains only 1's. On Mon, May 6, 2013 at 11:29 PM, Sean Owen wrote: > Parallel ALS is exactly an example of where you can use matrix > factorization for "0/1" data. > > On Mon, May 6, 2013 at 9:22 PM, Tevfik Aytekin > wrote: >> Hi Sean,

Re: parallelALS and RMSE TEST

2013-05-06 Thread Sean Owen
Yes, that's really what I mean. ALS factors, among other things, a matrix of 1 where an interaction occurs and nothing (implicitly 0) everywhere else. On Mon, May 6, 2013 at 9:40 PM, Tevfik Aytekin wrote: > But the data under consideration here is not 0/1 data, it contains only 1's. > > On Mon, M

Re: Clustering product views and sales

2013-05-06 Thread Ted Dunning
Are you looking to build a product recommender based on your own design? Or do you want to build one based on existing methods? If you want to use existing methods, clustering has essentially no role. I think that composite approaches that use item meta-data and different kinds of behavioral cue

Re: Clustering product views and sales

2013-05-06 Thread Ted Dunning
On Mon, May 6, 2013 at 12:50 PM, Koobas wrote: > Since Dominik mentioned item-based and ALS, let me throw in a question > here. > I believe that one of the Netflix prize solutions combined KNN and ALS. > > 1) What is the best way to combine the results of both? > I think that combinations are im

Re: Clustering product views and sales

2013-05-06 Thread Koobas
I think I see the picture now. Thanks! On Mon, May 6, 2013 at 5:25 PM, Ted Dunning wrote: > On Mon, May 6, 2013 at 12:50 PM, Koobas wrote: > > > Since Dominik mentioned item-based and ALS, let me throw in a question > > here. > > I believe that one of the Netflix prize solutions combined KNN a

Re: Clustering product views and sales

2013-05-06 Thread Dominik Hübner
The clustering was mostly intended to tackle the cold start problem for new users. I want to build a recommender based on existing components, or, to be precise, a combination of them. Unfortunately, the only product meta-data I currently have is the product price. Furthermore, this is a project

Re: Clustering product views and sales

2013-05-06 Thread Sean Owen
It sounds like you don't quite have a cold start problem. You have a few behaviors, a few views or clicks, not zero. So you really just need to find an approach that's quite comfortable with sparse input. A low-rank factorization model like ALS works fine in this case, for example. There's a circu
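
For completeness, a minimal sketch of the in-memory counterpart: ALS with implicit feedback via ALSWRFactorizer plus SVDRecommender. The implicit-feedback constructor variant and all hyperparameters here are assumptions; if your Mahout version lacks that constructor, the basic (model, numFeatures, lambda, numIterations) form still applies:

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
    import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class SparseImplicitAls {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("views_and_sales.csv"));
        // 20 features, lambda 0.065, 10 iterations, implicit feedback, alpha 40.
        ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 20, 0.065, 10, true, 40);
        SVDRecommender recommender = new SVDRecommender(model, factorizer);
        // Top-10 recommendations for user 123.
        System.out.println(recommender.recommend(123L, 10));
      }
    }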

Re: Clustering product views and sales

2013-05-06 Thread Ted Dunning
A truly cold start is best handled by recommending the most popular items. If you know *anything* at all, such as geo or browser or OS, then you can use that to recommend using conventional techniques (that is, you can recommend for the characteristics rather than for the person). Within a very few
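
A minimal sketch of that most-popular fallback using only the Taste DataModel API, counting how many users have interacted with each item and keeping the top of the list; the input file name is a placeholder:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.model.DataModel;

    public class MostPopularFallback {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("views_and_sales.csv"));
        // For every item, count how many users have any interaction with it.
        List<long[]> counts = new ArrayList<long[]>();
        LongPrimitiveIterator items = model.getItemIDs();
        while (items.hasNext()) {
          long itemID = items.nextLong();
          counts.add(new long[] {itemID, model.getNumUsersWithPreferenceFor(itemID)});
        }
        // Sort by descending popularity; the head of the list is the cold-start recommendation.
        Collections.sort(counts, new Comparator<long[]>() {
          public int compare(long[] a, long[] b) {
            return b[1] < a[1] ? -1 : (b[1] > a[1] ? 1 : 0);
          }
        });
        for (int i = 0; i < 10 && i < counts.size(); i++) {
          System.out.println(counts.get(i)[0] + "\t" + counts.get(i)[1]);
        }
      }
    }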

Re: Clustering product views and sales

2013-05-06 Thread Dominik Hübner
One more thing for now @Ted: what are you referring to with sparsification and reconstruction? On May 7, 2013, at 12:19 AM, Ted Dunning wrote: > A truly cold start is best handled by recommending the most popular items. > > If you know *anything* at all, such as geo or browser or OS, then you can > use

Re: Clustering product views and sales

2013-05-06 Thread Johannes Schulte
Hi! As a starting point I remember this conversation containing both elements (although the reconstruction part is rather small, hint!) http://markmail.org/message/5cfewal3oyt6vw2k On Tue, May 7, 2013 at 1:00 AM, Dominik Hübner wrote: > One more thing for now @Ted: > What do you refer to with