I thought topK would save us. For each user we have a 1 x rank vector, and the movie factors are an RDD. We pick the topK movie factors based on vector norm; with K = 50, we end up with 50 vectors * num_executors in an RDD. Then, with the user's 1 x rank vector, we do a distributed dot product using the RowMatrix APIs...
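A toy check of the norm-based pruning idea, as a minimal pure-Python sketch rather than anything Spark-specific. The factor values and the helper names (`exact_top_k`, `norm_pruned_top_k`) are made up for illustration, and plain dicts stand in for the RDD of movie factors:

```python
import heapq
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def exact_top_k(user, items, k):
    """Score every item against the user vector and keep the k best ids."""
    return heapq.nlargest(k, items, key=lambda i: dot(user, items[i]))

def norm_pruned_top_k(user, items, k):
    """Keep the k items with the largest L2 norm, then rank them by dot product."""
    candidates = heapq.nlargest(
        k, items, key=lambda i: math.sqrt(dot(items[i], items[i])))
    return sorted(candidates, key=lambda i: dot(user, items[i]), reverse=True)

# Hypothetical rank-2 factors: "b" and "c" have large norms but point
# almost orthogonally to the user vector.
user = [1.0, 0.0]
items = {
    "a": [0.9, 0.0],
    "b": [0.0, 5.0],
    "c": [0.1, 4.0],
    "d": [0.8, 0.1],
}

exact = exact_top_k(user, items, 2)         # ranks by the user's actual scores
pruned = norm_pruned_top_k(user, items, 2)  # ranks only the large-norm items
print(exact, pruned)
```

In this example the two lists do not overlap at all. By Cauchy-Schwarz the norm only bounds the score (|u.v| <= ||u|| ||v||), so pruning by norm alone can drop the items a particular user would rank highest.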
Maybe we can't find topK using the vector norm on the movie factors...

On Thu, Oct 30, 2014 at 1:12 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Looking at
> https://github.com/apache/spark/blob/814a9cd7fabebf2a06f7e2e5d46b6a2b28b917c2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L82
>
> For each user in the test set, you generate an Array of top K predicted item
> ids (Int or String probably), and an Array of ground truth item ids (the
> known rated or liked items in the test set for that user), and pass that to
> precisionAt(k) to compute MAP@k. (Actually this method name is a bit
> misleading - it should be meanAveragePrecisionAt, where the other method
> there is without a cutoff at k. However, both compute MAP.)
>
> The challenge at scale is actually computing all the top Ks for each user,
> as it requires broadcasting all the item factors (unless there is a smarter
> way?)
>
> I wonder if it is possible to extend the DIMSUM idea to computing a top K
> matrix multiply between the user and item factor matrices, as opposed to
> all-pairs similarity of one matrix?
>
> On Thu, Oct 30, 2014 at 5:28 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> Is there an example of how to use RankingMetrics?
>>
>> Let's take the user, document example... we get user x topic and
>> document x topic matrices as the model...
>>
>> Now for each user, we can generate the topK documents by doing a sort on
>> (1 x topic)dot(topic x document) and picking topK...
>>
>> Is it possible to validate such a topK finding algorithm using
>> RankingMetrics?
>>
>>
>> On Wed, Oct 29, 2014 at 12:14 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> > Let's narrow the context from matrix factorization to recommendation
>> > via ALS. It adds extra complexity if we treat it as a multi-class
>> > classification problem. ALS only outputs a single value for each
>> > prediction, which is hard to convert to a probability distribution over
>> > the 5 rating levels. Treating it as a binary classification problem or
>> > a ranking problem does make sense. RankingMetrics is in master.
>> > Feel free to add prec@k and ndcg@k to examples.MovielensALS. ROC
>> > should be good to add as well. -Xiangrui
>> >
>> >
>> > On Wed, Oct 29, 2014 at 11:23 AM, Debasish Das <debasish.da...@gmail.com>
>> > wrote:
>> > > Hi,
>> > >
>> > > In the current factorization flow, we cross validate on the test dataset
>> > > using the RMSE number, but there are some other measures which are worth
>> > > looking into.
>> > >
>> > > If we consider the problem as a regression problem and the ratings 1-5
>> > > are considered as 5 classes, it is possible to generate a confusion
>> > > matrix using MulticlassMetrics.scala
>> > >
>> > > If the ratings are only 0/1 (like from the Spotify demo from Spark
>> > > Summit) then it is possible to use the binary classification metrics to
>> > > come up with the ROC curve...
>> > >
>> > > For topK user/products we should also look into prec@k and ndcg@k as
>> > > the metric...
>> > >
>> > > Does it make sense to add the multiclass metric and prec@k, ndcg@k in
>> > > examples.MovielensALS along with RMSE?
>> > >
>> > > Thanks.
>> > > Deb
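
For reference, the per-user evaluation Nick describes in the quoted thread (a top K predicted id list plus a ground-truth id list per user) can be sketched in plain Python. This is not the Spark RankingMetrics API: the function names, the binary-relevance definitions, and the sample ids below are assumptions for illustration, and other conventions exist (for example, different normalizers for average precision):

```python
import math

def precision_at_k(predicted, actual, k):
    """Fraction of the first k predicted ids that are in the ground truth."""
    relevant = set(actual)
    return sum(1 for item in predicted[:k] if item in relevant) / k

def average_precision_at_k(predicted, actual, k):
    """Average of precision@rank over the ranks (<= k) where a hit occurs.

    Assumes `predicted` has no duplicate ids; normalizes by min(|actual|, k).
    """
    relevant = set(actual)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / min(len(relevant), k)

def ndcg_at_k(predicted, actual, k):
    """Binary-relevance NDCG: DCG of the list divided by the best possible DCG."""
    relevant = set(actual)
    if not relevant:
        return 0.0
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(predicted[:k], start=1)
              if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal

def mean_over_users(metric, per_user, k):
    """Average a per-user metric over (predicted, actual) pairs."""
    return sum(metric(p, a, k) for p, a in per_user) / len(per_user)

# Hypothetical test set: (top-3 predicted ids, ground-truth ids) per user.
per_user = [
    (["m1", "m2", "m3"], ["m1", "m3"]),
    (["m4", "m5", "m6"], ["m6"]),
]
prec = mean_over_users(precision_at_k, per_user, 3)
map_k = mean_over_users(average_precision_at_k, per_user, 3)
ndcg = mean_over_users(ndcg_at_k, per_user, 3)
print(prec, map_k, ndcg)
```

Only the per-user (predicted, actual) pairs are needed here, so the expensive part remains what Nick points out: producing the top K list for every user in the first place.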