I thought topK would save us...for each user we have a 1 x rank vector...now
our movie factors are an RDD...we pick the topK movie factors based on vector
norm...with K = 50, we will have 50 vectors * num_executors in an RDD...with
the user's 1 x rank vector we do a distributed dot product using the RowMatrix
APIs...

Maybe we can't find the topK using the vector norm of the movie factors...
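
Something like this is what I have in mind (itemFactors, userFactor and rank
are made-up names; assumes multiply keeps the row order so the zip lines up):

import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// itemFactors: RDD[(Int, Array[Double])] of (movieId, factors of length rank)
// userFactor: Array[Double] -- the 1 x rank factor for a single user
val itemMatrix = new RowMatrix(itemFactors.map { case (_, f) => Vectors.dense(f) })

// (numMovies x rank) times (rank x 1) gives one score per movie
val scores = itemMatrix.multiply(Matrices.dense(rank, 1, userFactor))

// zip the scores back onto the movie ids and keep the top 50
val topK = itemFactors.keys
  .zip(scores.rows.map(_(0)))
  .top(50)(Ordering.by[(Int, Double), Double](_._2))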

On Thu, Oct 30, 2014 at 1:12 AM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Looking at
> https://github.com/apache/spark/blob/814a9cd7fabebf2a06f7e2e5d46b6a2b28b917c2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L82
>
> For each user in the test set, you generate an Array of the top K predicted
> item ids (Int or String probably) and an Array of ground truth item ids (the
> known rated or liked items in the test set for that user), and pass those to
> precisionAt(k) to compute MAP@k. (Actually this method name is a bit
> misleading - it should be meanAveragePrecisionAt, while the other method
> there is without a cutoff at k. However, both compute MAP.)
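>
> A minimal usage sketch, assuming predictedTopK and groundTruth are RDDs of
> (userId, Array[Int]) that you have already built (the names are made up):
>
> import org.apache.spark.SparkContext._
> import org.apache.spark.mllib.evaluation.RankingMetrics
>
> // pair each user's predicted top K with that user's ground truth items
> val predictionsAndLabels = predictedTopK.join(groundTruth).values
> val metrics = new RankingMetrics(predictionsAndLabels)
> println(metrics.precisionAt(10))       // the cutoff-at-k method above
> println(metrics.meanAveragePrecision)  // the variant without a cutoff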
>
> The challenge at scale is actually computing all the top Ks for each user,
> as it requires broadcasting all the item factors (unless there is a smarter
> way?)
>
> I wonder if it is possible to extend the DIMSUM idea to computing top K
> matrix multiply between the user and item factor matrices, as opposed to
> all-pairs similarity of one matrix?
>
> On Thu, Oct 30, 2014 at 5:28 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> Is there an example of how to use RankingMetrics ?
>>
>> Let's take the user/document example...we get user x topic and document x
>> topic matrices as the model...
>>
>> Now for each user, we can generate the topK documents by sorting on (1 x
>> topic) dot (topic x document) and picking the topK...
>>
>> Is it possible to validate such a topK finding algorithm using
>> RankingMetrics ?
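>>
>> Roughly what I am doing now (userFactors, docFactors and k are made-up
>> names, dot is a small helper; assumes the document factors fit in memory
>> on each executor):
>>
>> import org.apache.spark.SparkContext._
>>
>> def dot(a: Array[Double], b: Array[Double]): Double =
>>   a.zip(b).map { case (x, y) => x * y }.sum
>>
>> // broadcast the document factors and score every document per user locally
>> val docFactorsLocal = sc.broadcast(docFactors.collect())
>> val predictedTopK = userFactors.mapValues { u =>
>>   docFactorsLocal.value
>>     .map { case (docId, d) => (docId, dot(u, d)) }
>>     .sortBy(-_._2)
>>     .take(k)
>>     .map(_._1)
>> }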
>>
>>
>> On Wed, Oct 29, 2014 at 12:14 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> > Let's narrow the context from matrix factorization to recommendation
>> > via ALS. It adds extra complexity if we treat it as a multi-class
>> > classification problem. ALS only outputs a single value for each
>> > prediction, which is hard to convert to a probability distribution over
>> > the 5 rating levels. Treating it as a binary classification problem or
>> > a ranking problem does make sense. RankingMetrics is in master.
>> > Feel free to add prec@k and ndcg@k to examples.MovielensALS. ROC
>> > should be good to add as well. -Xiangrui
>> >
>> >
>> > On Wed, Oct 29, 2014 at 11:23 AM, Debasish Das <debasish.da...@gmail.com>
>> > wrote:
>> > > Hi,
>> > >
>> > > In the current factorization flow, we cross validate on the test dataset
>> > > using the RMSE number, but there are some other measures which are worth
>> > > looking into.
>> > >
>> > > If we consider the problem as a classification problem and the ratings
>> > > 1-5 are considered as 5 classes, it is possible to generate a confusion
>> > > matrix using MulticlassMetrics.scala.
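>> > >
>> > > A quick sketch of that (predictionAndLabels is a made-up
>> > > RDD[(Double, Double)] of raw predicted rating and true rating):
>> > >
>> > > import org.apache.spark.mllib.evaluation.MulticlassMetrics
>> > >
>> > > // snap each predicted rating to the nearest class in 1..5
>> > > val rounded = predictionAndLabels.map { case (pred, label) =>
>> > >   (math.max(1.0, math.min(5.0, math.round(pred).toDouble)), label)
>> > > }
>> > > println(new MulticlassMetrics(rounded).confusionMatrix)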
>> > >
>> > > If the ratings are only 0/1 (like in the Spotify demo from Spark Summit),
>> > > then it is possible to use BinaryClassificationMetrics to come up with
>> > > the ROC curve...
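>> > >
>> > > For example (scoreAndLabels is a made-up RDD[(Double, Double)] of
>> > > predicted score and 0/1 label):
>> > >
>> > > import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>> > >
>> > > val binMetrics = new BinaryClassificationMetrics(scoreAndLabels)
>> > > val roc = binMetrics.roc()  // (false positive rate, true positive rate)
>> > > println(binMetrics.areaUnderROC())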
>> > >
>> > > For topK user/products we should also look into prec@k and ndcg@k as the
>> > > metrics...
>> > >
>> > > Does it make sense to add the multiclass metric and prec@k, ndcg@k in
>> > > examples.MovielensALS along with RMSE?
>> > >
>> > > Thanks.
>> > > Deb
>> >
>>
>
>
