Not following so…

Here is what I've done, in probably too much detail:

1) ingest raw log files and split them up by action
2) turn these into Mahout preference files using Mahout-type IDs, keeping a map 
of IDs
3) run the Mahout item-based recommender using LLR for similarity
4) create a Mahout-style cross-recommender using cooccurrence similarity, 
computed with matrix math
5) given the two similarity matrices and a user history matrix, I am writing 
them to csv files with the Mahout IDs replaced by the original external string 
IDs for users and items

input log file before splitting:
u1      purchase        iphone
u1      purchase        ipad
u2      purchase        nexus-tablet
u2      purchase        galaxy
u3      purchase        surface
u4      purchase        iphone
u4      purchase        ipad
u1      view    iphone
u1      view    ipad
u1      view    nexus-tablet
u1      view    galaxy
u2      view    iphone
u2      view    ipad
u2      view    nexus-tablet
u2      view    galaxy
u3      view    surface
u4      view    iphone
u4      view    ipad
u4      view    nexus-tablet


Input user history DRM after splitting for action "purchase" and translating 
to Mahout IDs (shown here with the external IDs for readability):

B       user/item       iphone  ipad    nexus-tablet    galaxy  surface
u1      1       1       0       0       0
u2      0       0       1       1       0
u3      0       0       0       0       1
u4      1       1       0       0       0

Map of item IDs, Mahout to original/external:
0 -> iphone
1 -> ipad
2 -> nexus-tablet
3 -> galaxy
4 -> surface
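
A minimal sketch of steps 1 and 2, assuming a whitespace-delimited log of user, action, item and assuming integer IDs are handed out in first-seen order (which happens to reproduce the item map above). The function name `split_by_action` and the "userID,itemID" output lines are illustrative, not the actual job code:

```python
from collections import defaultdict

# The sample log from this message: user, action, item (whitespace-delimited).
LOG = """\
u1 purchase iphone
u1 purchase ipad
u2 purchase nexus-tablet
u2 purchase galaxy
u3 purchase surface
u4 purchase iphone
u4 purchase ipad
u1 view iphone
u1 view ipad
u1 view nexus-tablet
u1 view galaxy
u2 view iphone
u2 view ipad
u2 view nexus-tablet
u2 view galaxy
u3 view surface
u4 view iphone
u4 view ipad
u4 view nexus-tablet
"""


def split_by_action(lines):
    """Split log lines by action and map external IDs to integer IDs.

    Hypothetical helper: IDs are assigned in first-seen order.
    Returns (prefs_by_action, user_ids, item_ids).
    """
    user_ids, item_ids = {}, {}
    prefs_by_action = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        user, action, item = fields
        uid = user_ids.setdefault(user, len(user_ids))
        iid = item_ids.setdefault(item, len(item_ids))
        prefs_by_action[action].append((uid, iid))
    return prefs_by_action, user_ids, item_ids


prefs_by_action, user_ids, item_ids = split_by_action(LOG.splitlines())

# One "userID,itemID" line per preference (boolean preferences, no value column).
purchase_lines = [f"{u},{i}" for u, i in prefs_by_action["purchase"]]
```

Inverting `item_ids` gives the Mahout-to-external map needed later when the IDs are translated back for the csv output.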

To be specific the DRM from the RecommenderJob with item-item similarities 
using LLR looks like this:
Input Path: out/p-recs/sims/part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
org.apache.mahout.math.VectorWritable
Key: 0: Value: {1:0.8472157541208549}
Key: 1: Value: {0:0.8472157541208549}
Key: 2: Value: {3:0.8181382096075936}
Key: 3: Value: {2:0.8181382096075936}
Key: 4: Value: {}
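
The values in this dump can be checked by hand. The sketch below is a standalone re-computation, not the RecommenderJob itself: it assumes the log-likelihood ratio is computed from unnormalized entropies of the 2x2 cooccurrence table and that the stored similarity is 1 - 1/(1 + LLR), and it only scores pairs that cooccur at least once. The function names are illustrative:

```python
import math
from itertools import combinations


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy: N*log(N) - sum(k*log(k)).
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    if row + col < mat:
        return 0.0  # guard against round-off producing a negative value
    return 2.0 * (row + col - mat)


# Purchase history from the DRM above: user ID -> set of item IDs.
purchases = {0: {0, 1}, 1: {2, 3}, 2: {4}, 3: {0, 1}}


def llr_similarities(history):
    n_users = len(history)
    users_by_item = {}
    for user, items in history.items():
        for item in items:
            users_by_item.setdefault(item, set()).add(user)
    sims = {item: {} for item in users_by_item}
    for a, b in combinations(sorted(users_by_item), 2):
        k11 = len(users_by_item[a] & users_by_item[b])  # users with both
        if k11 == 0:
            continue  # assumption: only cooccurring pairs are scored
        k12 = len(users_by_item[a]) - k11               # a without b
        k21 = len(users_by_item[b]) - k11               # b without a
        k22 = n_users - k11 - k12 - k21                 # neither
        score = 1.0 - 1.0 / (1.0 + llr(k11, k12, k21, k22))
        sims[a][b] = score
        sims[b][a] = score
    return sims


sims = llr_similarities(purchases)
```

With the four users above, `sims[0][1]` comes out at about 0.8472 and `sims[2][3]` at about 0.8181, matching the dump, and item 4 (surface) has no similar items.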

This will be written to a directory for later Solr indexing as a csv of the 
form:
item_id,similar_items,cross_action_similar_items
iphone,ipad,
ipad,iphone,
nexus-tablet,galaxy,
galaxy,nexus-tablet,
surface,,
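
A rough sketch of how the similarity rows could be turned into that csv, assuming the similar items in each field are space-delimited and ordered by descending strength. The helper name `similarity_rows` is made up for illustration, and the cross-action column is left empty here:

```python
import csv
import io

# Item-item similarities from the dump above, keyed by Mahout item ID.
SIMS = {
    0: {1: 0.8472157541208549},
    1: {0: 0.8472157541208549},
    2: {3: 0.8181382096075936},
    3: {2: 0.8181382096075936},
    4: {},
}
ID_TO_ITEM = {0: "iphone", 1: "ipad", 2: "nexus-tablet", 3: "galaxy", 4: "surface"}


def similarity_rows(sims, id_to_item):
    """Build csv rows with external IDs; similar items sorted strongest first."""
    rows = []
    for item_id in sorted(sims):
        ranked = sorted(sims[item_id], key=lambda other: -sims[item_id][other])
        similar = " ".join(id_to_item[other] for other in ranked)
        rows.append([id_to_item[item_id], similar, ""])  # cross-action field empty
    return rows


rows = similarity_rows(SIMS, ID_TO_ITEM)

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["item_id", "similar_items", "cross_action_similar_items"])
writer.writerows(rows)
csv_text = buffer.getvalue()
```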

If you use a user's history vector as the query, the results are the recommendations.
So if the user is u1, the history vector is:
"iphone ipad"

The Solr results for query "iphone ipad" using field "similar_items" will be:
1. Doc ID, ipad
2. Doc ID, iphone
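
To make the "history as query" idea concrete, here is a toy stand-in for the search step. It is not Solr and does not use Solr's scoring: it simply counts how many query terms appear in each doc's "similar_items" field, which is enough to show which docs match. The names `DOCS` and `query_similar_items` are illustrative:

```python
# Docs as indexed from the csv above: item_id -> "similar_items" field.
DOCS = {
    "iphone": "ipad",
    "ipad": "iphone",
    "nexus-tablet": "galaxy",
    "galaxy": "nexus-tablet",
    "surface": "",
}


def query_similar_items(history, docs):
    """Return doc IDs whose field shares a term with the history, best first.

    Toy scoring: number of overlapping terms (an assumption, not Solr's ranking).
    """
    terms = set(history.split())
    scores = {}
    for doc_id, field in docs.items():
        overlap = len(terms & set(field.split()))
        if overlap:
            scores[doc_id] = overlap
    return sorted(scores, key=lambda doc_id: -scores[doc_id])


recs_u1 = query_similar_items("iphone ipad", DOCS)
```

For u1 both matching docs tie under this toy score, so their relative order here carries no meaning.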

If you want item similarities, for instance when a user is anonymous with no 
history and is looking at an iphone product page, you would fetch the doc for 
id = "iphone" and get:
"ipad"

Perhaps a bad example for ordering, since there is only one ID in the doc, but 
the items in the "similar_items" field would be ordered by similarity strength. 

Likewise for the cross-action similarities, though there the DRM will hold 
cooccurrence [B'A] values.
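
As a sketch of what [B'A] contains for the sample log, taking B as the purchase matrix and A as the view matrix (users as rows, items in the order iphone, ipad, nexus-tablet, galaxy, surface). This is plain Python on dense lists for illustration only, not the distributed matrix math:

```python
# B: purchases, A: views. Rows are u1..u4, columns are items 0..4.
B = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0],
]
A = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
]


def cross_cooccurrence(b, a):
    """Compute B'A: entry [i][j] counts users who purchased i and viewed j."""
    n_b_items, n_a_items = len(b[0]), len(a[0])
    return [
        [sum(b_row[i] * a_row[j] for b_row, a_row in zip(b, a))
         for j in range(n_a_items)]
        for i in range(n_b_items)
    ]


BtA = cross_cooccurrence(B, A)
```

Row 0 (iphone purchases) comes out as [2, 2, 2, 1, 0]: both iphone buyers viewed the iphone, ipad and nexus-tablet, and one of them viewed the galaxy.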

For item similarities there is no need to do more than fetch one doc that 
contains the similarities, right? I've successfully used this method with the 
Mahout recommender but please correct me if something above is wrong. 


On Jul 31, 2013, at 4:52 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

Pat,

See inline


On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> So the XML as CSV would be:
> item_id,similar_items,cross_action_similar_items
> ipad,iphone,iphone nexus
> iphone,ipad,ipad galaxy
> 

Right.  Doesn't matter what format.  Might want quotes around space
delimited lists, but anything will do.


> 
> Note: As I mentioned before the order of the items in the field will
> encode rank of the similarity strength. This is for cases where you want to
> find similar items to a context item. You would fetch the doc for the
> context item by its item ID and show the top k items in the doc. Ted's
> caveat would probably be to dither them.
> 

I always say "dither" so that is an easy one.

But fetching similar items of a center item by fetching the center item and
then fetching each of the referenced items is typically slower by about 2x
than running the search for mentions of the center item.


> Sounds like Ted is generating data. Andrew or M Lyon do either of you want
> to set the demo system up? If so you'll need to find a system--free tier
> AWS, Ted's box, etc. Then install all the needed stuff.
> 
> I'll get the output working to csv.
> 
> On Jul 31, 2013, at 11:51 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> 
> OK and yes. The docs will look like:
> 
> <add>
>   <doc>
>      <field name='item_id'>ipad</field>
>      <field name='similar_items'>iphone</field>
>      <field name='cross_action_similar_items'>iphone nexus</field>
>   </doc>
>  <doc>
>    <field name='item_id'>iphone</field>
>    <field name='similar_items'>ipad</field>
>    <field name='cross_action_similar_items'>ipad galaxy</field>
>  </doc>
> </add>
> 
> 
> On Jul 31, 2013, at 11:42 AM, B Lyon <bradfl...@gmail.com> wrote:
> 
> I'm interested in helping as well.
> Btw I thought that what was stored in the solr fields were the llr-filtered
> items (ids I guess) for the could-be-recommended things.
> 
> 
