Re: taking mahout into production

Sean Owen Fri, 20 May 2011 19:05:11 -0700

I agree that ratings contain relatively little data. Here you're not using
direct ratings, but inferring some notion of rating from impressions. Does
your scheme make sense? It's not illogical, but it's not one I would choose.
To me, there is the most "information" in the jump from 0 impressions to 1.
There is a universe of things you don't look at; the fact that you look at
something at all is much more significant. Looking at something 2, 3, 10,
100 times from there means something more, but not much more in comparison.

So, I might suggest using log_2(impressions) or something similar as a
starting point. But I also might try ignoring the impression count itself
entirely.
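
As a rough illustration of both options, here is a minimal Python sketch.
The function names and the input shape are hypothetical, and the
1 + log2(n) form is an assumption of the sketch, not part of the suggestion
above: plain log2(1) is 0, which would make a single view look like no
signal at all. The output lines follow the userID,itemID,preference CSV
layout.

```python
import math

def impressions_to_preference(impressions, boolean=False):
    """Map a raw impression count to a preference value.

    Assumption for this sketch: use 1 + log2(n) so that one impression
    maps to 1.0 rather than 0.0.
    """
    if impressions < 1:
        return None  # never viewed: emit no preference at all
    if boolean:
        return 1.0  # ignore the count entirely; viewed is viewed
    return 1.0 + math.log2(impressions)

def to_csv_lines(counts, boolean=False):
    """counts: dict of (user_id, item_id) -> impressions (hypothetical shape)."""
    lines = []
    for (user_id, item_id), n in sorted(counts.items()):
        pref = impressions_to_preference(n, boolean)
        if pref is not None:
            lines.append(f"{user_id},{item_id},{pref:.3f}")
    return lines
```

With this mapping, going from 1 to 2 impressions adds as much as going from
8 to 16, so heavy repeat viewing no longer dominates the preference value.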

Cold start: before you have any information at all about the user, there's
not much you can do but recommend some canned, fixed list of top items.
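
A canned list like that could be as simple as "most-viewed items". The
sketch below is one way to build it, assuming view logs are available as
(user_id, item_id) pairs; the function names and the choice to rank by
distinct viewers are illustrative assumptions.

```python
from collections import defaultdict

def top_items(views, n=10):
    """Rank items by number of distinct viewers (ties broken by item id)."""
    viewers = defaultdict(set)
    for user_id, item_id in views:
        viewers[item_id].add(user_id)
    ranked = sorted(viewers, key=lambda item: (-len(viewers[item]), item))
    return ranked[:n]

def recommend(user_id, personalized, fallback):
    """Return personalized results if any exist, else the canned list.

    personalized: dict of user_id -> list of item ids (hypothetical shape).
    """
    recs = personalized.get(user_id)
    return recs if recs else fallback
```

Counting distinct viewers rather than raw impressions keeps one heavy user
from pushing an item to the top of the fallback list.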

What do you mean by "parse the entire dataset"? Yes, it's normal to actually
use all your data. No, it's not at all a good idea to read it all every time
you do anything.

I think a recommender based on item-item similarity sounds like a better
starting point here, though either approach might have merit. You can
conceivably use user-user similarities from this domain to create
recommendations in another domain, yes.
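
To make the item-item idea concrete, here is a small in-memory sketch. It
uses Jaccard (Tanimoto) overlap of viewer sets as the similarity, which is
just one possible measure chosen for illustration and not necessarily what
RecommenderJob computes. All names are hypothetical, and the all-pairs loop
is only practical for toy data.

```python
from collections import defaultdict
from itertools import combinations

def item_similarities(views):
    """views: list of (user_id, item_id). Returns {(a, b): jaccard}, a < b."""
    viewers = defaultdict(set)
    for user_id, item_id in views:
        viewers[item_id].add(user_id)
    sims = {}
    for a, b in combinations(sorted(viewers), 2):
        shared = len(viewers[a] & viewers[b])
        if shared:  # keep only pairs with at least one common viewer
            sims[(a, b)] = shared / len(viewers[a] | viewers[b])
    return sims

def recommend_for(user_id, views, sims, n=5):
    """Score unseen items by summed similarity to items the user has seen."""
    seen = {item for u, item in views if u == user_id}
    scores = defaultdict(float)
    for (a, b), s in sims.items():
        if a in seen and b not in seen:
            scores[b] += s
        elif b in seen and a not in seen:
            scores[a] += s
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]
```

The similarities depend only on the items, so they can be computed ahead of
time and reused across requests instead of rereading the data each time.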

On Fri, May 20, 2011 at 6:31 PM, Varnit Khanna <[email protected]> wrote:

> Hi,
> I have been considering using mahout for our recommendation engine
> needs and had a couple of questions about using it in production.
>
> Use Case:
> We need to provide recommendations on video assets (similar to Hulu) to
> a couple of million users, and we have over 100K assets. Since we are
> experiencing growth both in users and assets I am planning to use
> mahout on hadoop.
>
> Preference Data:
> Currently we do not have a ratings system built into our video
> player/page but we do have logs on user impressions on video assets
> which I will be feeding into RecommenderJob. Until we build a ratings
> system I am planning on using the following preference data:
>
> Impressions | Rating
>           1 | (empty)
>           2 | 2
>           3 | 3
>           4 | 4
>         >=5 | 5
>
> Does this preference data make sense? I will be using the standard
> RecommenderJob to generate recommendations until I get a better
> understanding of mahout.
>
> Questions:
> 1) What will be the best approach to deal with cold start on new
> assets and users?
> 2) Is it typical to parse the entire dataset in production to generate
> recommendations for new assets and users or can it be done
> incrementally?
> 3) Which is the better approach for this use case: item-based or
> user-based CF?
> Also at some point in the future we would like to generate
> recommendations on news assets so a single system might be beneficial.
>
> Thanks
> -varnit
>
