On Thu, Feb 21, 2013 at 12:13 PM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> I am quite interested in trying this but have a few questions.
>
> To use/abuse mahout to do this:
>
> A and B can be thought of as having the same size, in other words they
> must be constructed to have the same dimension definitions (userID for
> rows, itemID for columns) as well as row and column rank.
>

I generally only assume the same userID space (i.e. same number of rows).


> So A and B should be constructed to have the same users and items, even
> though a column or row in either matrix may be 0. Some work will be needed
> to produce this so I want to be sure I understand.
>

Less work if you only assume user identity.

If you co-group your transaction data, then you commonly will get a list of
A transactions and a list of B transactions for each user which is all you
should need.
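
As a rough illustration of the co-grouping idea (plain Python rather than
Pig; the function name and sample data are made up for this sketch):

```python
from collections import defaultdict

def cogroup(views, purchases):
    """Group two lists of (user, item) events by user.

    Returns {user: (viewed_items, purchased_items)}. A user who appears in
    only one list gets an empty set for the other.
    """
    grouped = defaultdict(lambda: (set(), set()))
    for user, item in views:
        grouped[user][0].add(item)
    for user, item in purchases:
        grouped[user][1].add(item)
    return dict(grouped)

# Hypothetical sample data: (userID, itemID) pairs.
views = [("u1", "ferrari"), ("u1", "civic"), ("u2", "ferrari")]
purchases = [("u1", "civic"), ("u3", "minivan")]

by_user = cogroup(views, purchases)
```

Only the user IDs have to line up here; the view items and the purchase
items never need to share an ID space.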

> The conclusion is B'A h_v + B'B h_p will produce view-based cross
> recommendations along with purchase based recs.
>

This term only produces purchase recommendations. If you want both, you
need to include the A'A h_v + A'B h_p part of the vector.
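
A small sketch of both halves, assuming A is the user x item view matrix,
B is the user x item purchase matrix, and h_v and h_p are one user's recent
views and purchases. The matrices and helper functions are invented for
illustration only:

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    yt = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in yt] for row in x]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical data. Rows are the same three users in both matrices,
# but the column (item) spaces differ in size.
A = [[1, 1, 0],   # views:     3 users x 3 viewable items
     [1, 0, 1],
     [0, 1, 1]]
B = [[1, 0],      # purchases: 3 users x 2 purchasable items
     [0, 1],
     [1, 0]]

h_v = [1, 0, 0]   # the user recently viewed item 0
h_p = [0, 1]      # and recently purchased item 1

At, Bt = transpose(A), transpose(B)

# B'A h_v + B'B h_p: one score per purchasable item.
purchase_scores = add(matvec(matmul(Bt, A), h_v), matvec(matmul(Bt, B), h_p))
# A'A h_v + A'B h_p: one score per viewable item.
view_scores = add(matvec(matmul(At, A), h_v), matvec(matmul(At, B), h_p))
```

Note that the two result vectors have different lengths, which is why A
and B only have to agree on rows.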


> Using mahout I'd get purchase based recs by constructing the usual
> purchase matrix [user x item] and getting recs from the framework, as I'm
> already doing. Assuming here that B'B is what you are calling the self-join?
>

Yes.


> So the primary work to be done is to calculate B'A. Since the CF/taste
> framework does the self-join to prepare for making recs for users, I would
> have to replace the self-joined matrix with B'A then allow the rest of the
> framework to produce recs in the usual way as if it had calculated the
> self-joined matrix. Does this seem reasonable?
>

Yes.  In Pig, there is a COGROUP operator that would be helpful.
Alternatively, you can group views and purchases separately and then join.
Co-group saves a map-reduce step.
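
Once views and purchases are grouped per user, B'A is just a count of
(purchased item, viewed item) pairs. A minimal sketch in plain Python,
assuming binary data and a hypothetical per-user structure of
(viewed_items, purchased_items):

```python
from collections import Counter
from itertools import product

def cross_cooccurrence(by_user):
    """Count, for each (purchased, viewed) pair, how many users did both.

    With 0/1 matrices this count is the (purchased, viewed) entry of B'A.
    """
    counts = Counter()
    for viewed, purchased in by_user.values():
        for p, v in product(purchased, viewed):
            counts[(p, v)] += 1
    return counts

# Hypothetical co-grouped data: user -> (viewed_items, purchased_items).
by_user = {
    "u1": ({"ferrari", "civic"}, {"civic"}),
    "u2": ({"ferrari", "civic"}, {"civic"}),
    "u3": ({"ferrari"}, set()),
}

b_t_a = cross_cooccurrence(by_user)
```

Users with no purchases (u3 here) contribute nothing to B'A, so they can be
dropped early.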


> Then the question is how to blend B'A h_v with B'B h_p?
>

The range of both of these will be identical.  Each row of [B'A | B'B]
corresponds to a document.  One field (the view=>purchase indicators)
contains a row of B'A and another field (the purchase=>purchase indicators)
will contain a row of B'B.

The query will ultimately contain two fields corresponding to recent views
and recent purchases.  The search engine will combine the scores from these
intelligently without any effort on your part.  You can tune how this
works, but I haven't ever found that very useful.
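
To make the document layout concrete, here is a toy sketch. The field
names, items, and the unweighted scoring function are assumptions for
illustration; a real search engine scores matches with its own weighting:

```python
# Hypothetical index: one document per item, with one field per indicator type.
docs = {
    "civic": {
        "view_indicators": {"ferrari", "civic"},   # a row of B'A
        "purchase_indicators": {"minivan"},        # a row of B'B
    },
    "minivan": {
        "view_indicators": {"minivan"},
        "purchase_indicators": {"civic"},
    },
}

def score(doc, query):
    # Unweighted stand-in: count matching terms per field and sum the fields.
    return sum(len(doc[field] & terms) for field, terms in query.items())

# The query carries the user's recent views and recent purchases.
query = {
    "view_indicators": {"ferrari"},
    "purchase_indicators": {"minivan"},
}

ranked = sorted(docs, key=lambda item: score(docs[item], query), reverse=True)
```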

> Since I'm using values = 1 or 0 the strengths will be on the same scale,
> but are they really comparable?
>

Missing from this formulation are the weights that the search engine places
on things.  This has to do with how many items have each indicator.
Common indicators will have little weight and rare ones great weight.  For
instance, it might be that everybody (well, all the car buffs, anyway)
looks at the Ferraris, but few buy them.  This would mean that the views
would have little weight but the purchase would have a large weight.
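
A rough sketch of that effect, using a textbook inverse document frequency
log(N / df) as a stand-in. Actual search engines use their own variants of
this formula, and the data below is invented:

```python
import math
from collections import Counter

def idf_weights(docs, field):
    """Weight each indicator by log(N / df), where df is the number of
    item documents whose `field` contains that indicator."""
    n = len(docs)
    df = Counter(term for doc in docs.values() for term in doc[field])
    return {term: math.log(n / count) for term, count in df.items()}

# Hypothetical index: every item has "ferrari" as a view indicator,
# but only one item has it as a purchase indicator.
docs = {
    "civic":   {"view_indicators": {"ferrari"}, "purchase_indicators": {"minivan"}},
    "minivan": {"view_indicators": {"ferrari"}, "purchase_indicators": {"civic"}},
    "truck":   {"view_indicators": {"ferrari"}, "purchase_indicators": {"civic"}},
    "yacht":   {"view_indicators": {"ferrari"}, "purchase_indicators": {"ferrari"}},
}

view_w = idf_weights(docs, "view_indicators")
purchase_w = idf_weights(docs, "purchase_indicators")
```

Here the Ferrari view indicator gets zero weight because every item has it,
while the Ferrari purchase indicator gets the largest weight in its field.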

> I'd be inclined to try sorting B'A recs + B'B recs by strength and try some
> other experiments with blending techniques then look at eval metrics.
>

Start simple.

Remember that these are systems with feedback.  That means that once the
system goes live (and if you have dithering set up) it will quickly tune
away the false-positive mistakes it is making by showing the events in
question and seeing that they don't lead to success.  This closed loop
nature makes excessive refinement in weighting schemes largely unnecessary.
