Re: Setting up a recommender

2014-04-21 Thread Frank Scholten
Pat and Ted: I am late to the party but this is very interesting!

I am not sure I understand all the steps, though. Do you still create a
cooccurrence matrix and compute LLR scores during this process or do you
only compute matrix multiplication times the history vector: B'B * h and
B'A * h?

Cheers,

Frank


On Tue, Aug 13, 2013 at 7:49 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 I finally got some time to work on this and have a first cut at output to
 Solr working on the github repo. It only works on 2-action input but I'll
 have that cleaned up soon so it will work with one action. Solr indexing
 has not been tested yet and the field names and/or types may need tweaking.

 It takes the result of the previous drop:
 1) DRMs for B (user history or B items action1) and A (user history of A
 items action2)
 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence

 There are two final outputs created using mapreduce but requiring 2
 in-memory hashmaps. I think this will work on a cluster (the hashmaps are
 instantiated on each node) but haven't tried yet. It orders items in #2
 fields by strength of link, which is the similarity value used in [B'B]
 or [B'A]. It would be nice to order #1 by recency but there is no provision
 for passing through timestamps at present so they are ordered by the
 strength of preference. This is probably not useful and so can be ignored.
 Ordering by recency might be useful for truncating queries by recency while
 leaving the training data containing 100% of available history.

 1) It joins #1 DRMs to produce a single set of docs in CSV form, which
 looks like this:
 id,history_b,history_a
 user1,iphone ipad,iphone ipad galaxy
 ...

 2) it joins #2 DRMs to produce a single set of docs in CSV form, which
 looks like this:
 id,b_b_links,b_a_links
 u1,iphone ipad,iphone ipad galaxy
 …

 It may work on a cluster, I haven't tried yet. As soon as someone has some
 large-ish sample log files I'll give them a try. Check the sample input
 files in the resources dir for format.

 https://github.com/pferrel/solr-recommender


 On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:

 When I started looking at this I was a bit skeptical. As a Search engine
 Solr may be peerless, but as yet another NoSQL db?

 However getting further into this I see one very large benefit. It has one
 feature that sets it completely apart from the typical NoSQL db. The type
 of queries you do return fuzzy results--in the very best sense of that
 word. The most interesting queries are based on similarity to some
 exemplar. Results are returned in order of similarity strength, not ordered
 by a sort field.

 Wherever similarity based queries are important I'll look at Solr first.
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's
 probably at least an alternative to using docs and CSVs to import the data
 from Mahout.



 On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Yes.  That would be interesting.




 On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:

  A little digression: Might a Matrix implementation backed by a Solr index
  and uses SolrJ for querying help at all for the Solr recommendation
  approach?
 
  It supports multiple fields of String, Text, or boolean flags.
 
  Best
  Gokhan
 
 
  On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
  Also a question about user history.
 
  I was planning to write these into separate directories so Solr could
  fetch them from different sources but it occurs to me that it would be
  better to join A and B by user ID and output a doc per user ID with
 three
  fields, id, A item history, and B item history. Other fields could be
  added
  for users metadata.
 
  Sound correct? This is what I'll do unless someone stops me.
 
  On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  Once you have a sample or example of what you think the
  log file version will look like, can you post it? It would be great to
  have example lines for two actions with or without the same item IDs.
  I'll
  make sure we can digest it.
 
  I thought more about the ingest part and I don't think the
 one-item-space
  is actually a problem. It just means one item dictionary. A and B will
  have
  the right content, all I have to do is make sure the right ranks are
  input
  to the MM,
  Transpose, and RSJ. This in turn is only one extra count of the # of
  items
  in A's item space. This should be a very easy change If my thinking is
  correct.
 
 
  On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com
 wrote:
 
  4) To add more metadata to the Solr output will be left to the consumer
  for now. If there is a good data set to use we can illustrate how to do
  it
  in the project. Ted may have some data for this from musicbrainz.
 
 
  I am working on this issue now.
 
  The 

Re: Setting up a recommender

2014-04-21 Thread Pat Ferrel
Yes, the cooccurrence item similarity matrix is calculated with LLR using 
Mahout’s RowSimilarityJob. I guess we are calling this an indicator matrix 
these days. 
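
For anyone who wants to see what “using LLR” means concretely, here is a minimal, 
self-contained sketch of the log-likelihood ratio score computed from the 2x2 
cooccurrence counts; Mahout ships essentially the same math in 
org.apache.mahout.math.stats.LogLikelihood.

// k11 = users who did both items, k12/k21 = users who did only one of them,
// k22 = users who did neither. A large LLR means the cooccurrence is unlikely
// to be chance, which is what makes an item a useful "indicator".
public final class Llr {
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }
  // Unnormalized Shannon entropy over raw counts.
  private static double entropy(long... counts) {
    long sum = 0;
    double result = 0.0;
    for (long c : counts) {
      result += xLogX(c);
      sum += c;
    }
    return xLogX(sum) - result;
  }
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Clamp tiny negative values caused by floating point round-off.
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy));
  }
  public static void main(String[] args) {
    // e.g. 13 users did both, 1000 did only one or the other, 100000 did neither
    System.out.println(logLikelihoodRatio(13, 1000, 1000, 100000));
  }
}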

The indicator matrix is then translated from a SequenceFile into a CSV (or 
other text delimited file) which looks like a list of itemIDs—tokens or terms 
in Solr parlance—for each item. These documents are indexed by Solr and the 
query is the user history.

[B’B] is pre-calculated by RowSimilarityJob in Mahout. The user history is 
“multiplied” by the indicator matrix by using the history as the Solr query 
against the indexed indicator matrix, which actually produces a cosine-similarity 
ranked list of items.

You have to squint a little to see the math. Any matrix product can be 
substituted with a row-to-column similarity metric assuming the dimensionality 
is correct, so the product in all the equations should be interpreted as such. 
Getting recs for a user, [B’B]h, is therefore done in two phases: one calculates 
[B’B] and the other is a Solr query that adds the ‘h’ to the equation.
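
To make that second phase concrete, here is a minimal SolrJ 4.x-style sketch; 
the core name and the b_b_links field name are hypothetical and would have to 
match however the indicator docs were actually indexed.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class QueryRecs {
  public static void main(String[] args) throws Exception {
    // One doc per item, with its [B'B] indicator row in the b_b_links field.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
    // h = the user's history of item IDs, used directly as query terms.
    String h = "iphone ipad";
    SolrQuery q = new SolrQuery("b_b_links:(" + h + ")");
    q.setFields("id", "score");
    q.setRows(10); // top 10 recs, ranked by Solr's similarity score
    QueryResponse rsp = solr.query(q);
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("id") + "\t" + doc.getFieldValue("score"));
    }
  }
}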

In this project https://github.com/pferrel/solr-recommender both [B’B] and 
[A’B] are calculated; the latter uses an actual matrix multiply, since we did not 
have a cross-RSJ at the time. Now that we have cross-cooccurrence in the 
Spark Scala Mahout 2 stuff I’ll rewrite the code to use it.

The cross indicator matrix allows you to use two different actions to predict a 
target action. So, for example, views that are similar to purchases can be used 
to recommend purchases. Take a look at the readme on github; it has a quick 
review of the theory.
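
As a sketch with hypothetical field names (matching the CSV examples quoted 
below), the cross-recommendation case only changes the query: the view history 
goes against the cross-indicator field, and both clauses can be combined using 
Solr's default OR.

import org.apache.solr.client.solrj.SolrQuery;

public class CrossQueries {
  public static void main(String[] args) {
    String purchaseHistory = "iphone ipad";    // user's purchase (B-action) history
    String viewHistory = "iphone ipad galaxy"; // user's view (A-action) history
    // Views query the cross field to recommend purchases...
    SolrQuery crossRecs = new SolrQuery("b_a_links:(" + viewHistory + ")");
    // ...or blend both actions in one query; Solr ORs the clauses and ranks by score.
    SolrQuery blendedRecs = new SolrQuery(
        "b_b_links:(" + purchaseHistory + ") b_a_links:(" + viewHistory + ")");
    System.out.println(crossRecs.getQuery());
    System.out.println(blendedRecs.getQuery());
  }
}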

BTW there is a video recommender site that demos some interesting uses of Solr 
to blend collaborative filtering recs with metadata. It even makes recs based 
off of your most recent detail views on the site. That last doesn’t work all 
that well because it is really a cross-recommendation and that isn’t built into 
the site yet. https://guide.finderbots.com


On Apr 21, 2014, at 12:11 PM, Frank Scholten fr...@frankscholten.nl wrote:

Pat and Ted: I am late to the party but this is very interesting!

I am not sure I understand all the steps, though. Do you still create a
cooccurrence matrix and compute LLR scores during this process or do you
only compute matrix multiplication times the history vector: B'B * h and
B'A * h?

Cheers,

Frank


On Tue, Aug 13, 2013 at 7:49 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 I finally got some time to work on this and have a first cut at output to
 Solr working on the github repo. It only works on 2-action input but I'll
 have that cleaned up soon so it will work with one action. Solr indexing
 has not been tested yet and the field names and/or types may need tweaking.
 
 It takes the result of the previous drop:
 1) DRMs for B (user history or B items action1) and A (user history of A
 items action2)
 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
 There are two final outputs created using mapreduce but requiring 2
 in-memory hashmaps. I think this will work on a cluster (the hashmaps are
 instantiated on each node) but haven't tried yet. It orders items in #2
 fields by strength of link, which is the similarity value used in [B'B]
 or [B'A]. It would be nice to order #1 by recency but there is no provision
 for passing through timestamps at present so they are ordered by the
 strength of preference. This is probably not useful and so can be ignored.
 Ordering by recency might be useful for truncating queries by recency while
 leaving the training data containing 100% of available history.
 
 1) It joins #1 DRMs to produce a single set of docs in CSV form, which
 looks like this:
 id,history_b,history_a
 user1,iphone ipad,iphone ipad galaxy
 ...
 
 2) it joins #2 DRMs to produce a single set of docs in CSV form, which
 looks like this:
 id,b_b_links,b_a_links
 u1,iphone ipad,iphone ipad galaxy
 …
 
 It may work on a cluster, I haven't tried yet. As soon as someone has some
 large-ish sample log files I'll give them a try. Check the sample input
 files in the resources dir for format.
 
 https://github.com/pferrel/solr-recommender
 
 
 On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 When I started looking at this I was a bit skeptical. As a Search engine
 Solr may be peerless, but as yet another NoSQL db?
 
 However getting further into this I see one very large benefit. It has one
 feature that sets it completely apart from the typical NoSQL db. The type
 of queries you do return fuzzy results--in the very best sense of that
 word. The most interesting queries are based on similarity to some
 exemplar. Results are returned in order of similarity strength, not ordered
 by a sort field.
 
 Wherever similarity based queries are important I'll look at Solr first.
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's
 probably at least an alternative to using docs and CSVs to import the data
 from Mahout.
 
 
 
 On Aug 12, 2013, at 2:32 PM, Ted Dunning 

Re: Setting up a recommender

2014-04-21 Thread Ted Dunning
RowSimilarityJob is the guts of the work, but ItemSimilarityJob is usually
easier packaging for users.
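
For anyone following along, here is a minimal sketch of driving ItemSimilarityJob 
programmatically with LLR; the paths are placeholders and the flags are the 
standard Mahout 0.8/0.9-era options.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

public class RunItemSimilarity {
  public static void main(String[] args) throws Exception {
    // Input: userID,itemID[,pref] lines; output: itemA,itemB,LLR-score triples.
    ToolRunner.run(new Configuration(), new ItemSimilarityJob(), new String[] {
        "--input", "/tmp/prefs",                // placeholder HDFS path
        "--output", "/tmp/item-similarity",     // placeholder HDFS path
        "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
        "--booleanData", "true",                // ignore preference values
        "--maxSimilaritiesPerItem", "50"
    });
  }
}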




On Mon, Apr 21, 2014 at 1:00 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Yes the cooccurrence item similarity matrix is calculated using LLR using
 Mahout’s RowSimilarityJob. I guess we are calling this and indicator matrix
 these days.

 The indicator matrix is then translated from a SequenceFile into a CSV (or
 other text delimited file) which looks like a list of itemIDs—tokens or
 terms in Solr parlance—for each item. These documents are indexed by Solr
 and the query is the user history.

 [B’B] is pre-calculated by RowSimilarityJob in Mahout. The user history is
 “multiplied” by the indicator matrix by using it as the Solr query against
 the indicator matrix, actually producing a cosine similarity ranked list of
 items.

 You have to squint a little to see the math. Any matrix product can be
 substituted with a row to column similarity metric assuming dimensionality
 is correct. So the product in all the equations should be interpreted as
 such. So to get recs for a user [B’B]h is done in two phases, one
 calculates [B’B] and one is a Solr query that adds the ‘h’ to the equation.

 In this project https://github.com/pferrel/solr-recommender both [B’B]
 and [A’B] are calculated, the later uses actual matrix multiply, since we
 did not have a cross-RSJ at the time. Now that we have a cross cooccurrence
 in the Spark Scala Mahout 2 stuff I’ll rewrite the code to use it.

 The cross indicator matrix allows you to use two different actions to
 predict a target action. So for example views that are similar to purchases
 can be used to recommend purchases. Take a look at the readme on github it
 has a quick review of the theory.

 BTW there is a video recommender site that demos some interesting uses of
 Solr to blend collaborative filtering recs with metadata. It even makes
 recs based of of your most recent detail views on the site. That last
 doesn’t work all that well because it is really a cross recommendation and
 that isn’t built into the site yet. https://guide.finderbots.com


 On Apr 21, 2014, at 12:11 PM, Frank Scholten fr...@frankscholten.nl
 wrote:

 Pat and Ted: I am late to the party but this is very interesting!

 I am not sure I understand all the steps, though. Do you still create a
 cooccurrence matrix and compute LLR scores during this process or do you
 only compute matrix multiplication times the history vector: B'B * h and
 B'A * h?

 Cheers,

 Frank


 On Tue, Aug 13, 2013 at 7:49 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  I finally got some time to work on this and have a first cut at output to
  Solr working on the github repo. It only works on 2-action input but I'll
  have that cleaned up soon so it will work with one action. Solr indexing
  has not been tested yet and the field names and/or types may need
 tweaking.
 
  It takes the result of the previous drop:
  1) DRMs for B (user history or B items action1) and A (user history of A
  items action2)
  2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
  There are two final outputs created using mapreduce but requiring 2
  in-memory hashmaps. I think this will work on a cluster (the hashmaps are
  instantiated on each node) but haven't tried yet. It orders items in #2
  fields by strength of link, which is the similarity value used in [B'B]
  or [B'A]. It would be nice to order #1 by recency but there is no
 provision
  for passing through timestamps at present so they are ordered by the
  strength of preference. This is probably not useful and so can be
 ignored.
  Ordering by recency might be useful for truncating queries by recency
 while
  leaving the training data containing 100% of available history.
 
  1) It joins #1 DRMs to produce a single set of docs in CSV form, which
  looks like this:
  id,history_b,history_a
  user1,iphone ipad,iphone ipad galaxy
  ...
 
  2) it joins #2 DRMs to produce a single set of docs in CSV form, which
  looks like this:
  id,b_b_links,b_a_links
  u1,iphone ipad,iphone ipad galaxy
  …
 
  It may work on a cluster, I haven't tried yet. As soon as someone has
 some
  large-ish sample log files I'll give them a try. Check the sample input
  files in the resources dir for format.
 
  https://github.com/pferrel/solr-recommender
 
 
  On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  When I started looking at this I was a bit skeptical. As a Search engine
  Solr may be peerless, but as yet another NoSQL db?
 
  However getting further into this I see one very large benefit. It has
 one
  feature that sets it completely apart from the typical NoSQL db. The type
  of queries you do return fuzzy results--in the very best sense of that
  word. The most interesting queries are based on similarity to some
  exemplar. Results are returned in order of similarity strength, not
 ordered
  by a sort field.
 
  Wherever similarity based queries are important 

Re: Setting up a recommender

2013-08-19 Thread Pat Ferrel
There are three things I could work on in my free time:

1) test this on a bigger data set gathered from rotten tomatoes, which only has 
B data (movie thumbs up) 
2) begin work on the Solr query and service integration, rather than the 
current loose LucidWorks Search integration.
3) make sure everything is set up for different item spaces in B and A.

Planning to tackle them in this order, unless someone speaks up.

 
On Aug 16, 2013, at 1:39 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Works on a cluster but have only tested on the trivial test data set. 

On Aug 13, 2013, at 4:49 PM, Pat Ferrel p...@occamsmachete.com wrote:

OK single action recs are working so output to Solr with only [B'B] and B.

On Aug 13, 2013, at 10:52 AM, Pat Ferrel pat.fer...@gmail.com wrote:

Corrections inline

 On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 I finally got some time to work on this and have a first cut at output to 
 Solr working on the github repo. It only works on 2-action input but I'll 
 have that cleaned up soon so it will work with one action. Solr indexing has 
 not been tested yet and the field names and/or types may need tweaking. 
 
 It takes the result of the previous drop:
 1) DRMs for B (user history or B items action1) and A (user history of A 
 items action2)
 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
 There are two final outputs created using mapreduce but requiring 2 in-memory 
 hashmaps. I think this will work on a cluster (the hashmaps are instantiated 
 on each node) but haven't tried yet. It orders items in #2 fields by strength 
 of link, which is the similarity value used in [B'B] or [B'A]. It would be 
 nice to order #1 by recency but there is no provision for passing through 
 timestamps at present so they are ordered by the strength of preference. This 
 is probably not useful and so can be ignored. Ordering by recency might be 
 useful for truncating queries by recency while leaving the training data 
 containing 100% of available history.
 
 1) It joins #1 DRMs to produce a single set of docs in CSV form, which looks 
 like this:
 id,history_b,history_a
u1,iphone ipad,iphone ipad galaxy
 ...
 
 2) it joins #2 DRMs to produce a single set of docs in CSV form, which looks 
 like this:
 id,b_b_links,b_a_links
iphone,iphone ipad,iphone ipad galaxy
 …
 
 It may work on a cluster, I haven't tried yet. As soon as someone has some 
 large-ish sample log files I'll give them a try. Check the sample input files 
 in the resources dir for format.
 
 https://github.com/pferrel/solr-recommender
 
 
 On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 When I started looking at this I was a bit skeptical. As a Search engine Solr 
 may be peerless, but as yet another NoSQL db?
 
 However getting further into this I see one very large benefit. It has one 
 feature that sets it completely apart from the typical NoSQL db. The type of 
 queries you do return fuzzy results--in the very best sense of that word. The 
 most interesting queries are based on similarity to some exemplar. Results 
 are returned in order of similarity strength, not ordered by a sort field.
 
 Wherever similarity based queries are important I'll look at Solr first. 
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's 
 probably at least an alternative to using docs and CSVs to import the data 
 from Mahout.
 
 
 
 On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Yes.  That would be interesting.
 
 
 
 
 On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
 A little digression: Might a Matrix implementation backed by a Solr index
 and uses SolrJ for querying help at all for the Solr recommendation
 approach?
 
 It supports multiple fields of String, Text, or boolean flags.
 
 Best
 Gokhan
 
 
 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Also a question about user history.
 
 I was planning to write these into separate directories so Solr could
 fetch them from different sources but it occurs to me that it would be
 better to join A and B by user ID and output a doc per user ID with three
 fields, id, A item history, and B item history. Other fields could be
 added
 for users metadata.
 
 Sound correct? This is what I'll do unless someone stops me.
 
 On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Once you have a sample or example of what you think the
 log file version will look like, can you post it? It would be great to
 have example lines for two actions with or without the same item IDs.
 I'll
 make sure we can digest it.
 
 I thought more about the ingest part and I don't think the one-item-space
 is actually a problem. It just means one item dictionary. A and B will
 have
 the right content, all I have to do is make sure the right ranks are
 input
 to the MM,
 Transpose, and RSJ. This in turn is only one extra count of the # of
 items
 in A's item 

Re: Setting up a recommender

2013-08-19 Thread Ted Dunning
Pat,

That really sounds great.

I should find some time (who needs sleep) to generate music logs for you as
well.


On Mon, Aug 19, 2013 at 8:31 AM, Pat Ferrel p...@occamsmachete.com wrote:

 There are three things I could work on my free time:

 1) test this on a bigger data set gathered from rotten tomatoes, which
 only has B data (movie thumbs up)
 2) begin work on the Solr query and service integration, rather than the
 current loose LucidWorks Search integration.
 3) make sure everything is set up for different item spaces in B and A.

 Planning to tackle in this order, unless someone speaks up.


 On Aug 16, 2013, at 1:39 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Works on a cluster but have only tested on the trivial test data set.

 On Aug 13, 2013, at 4:49 PM, Pat Ferrel p...@occamsmachete.com wrote:

 OK single action recs are working so output to Solr with only [B'B] and B.

 On Aug 13, 2013, at 10:52 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 Corrections inline

  On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
  I finally got some time to work on this and have a first cut at output
 to Solr working on the github repo. It only works on 2-action input but
 I'll have that cleaned up soon so it will work with one action. Solr
 indexing has not been tested yet and the field names and/or types may need
 tweaking.
 
  It takes the result of the previous drop:
  1) DRMs for B (user history or B items action1) and A (user history of A
 items action2)
  2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
  There are two final outputs created using mapreduce but requiring 2
 in-memory hashmaps. I think this will work on a cluster (the hashmaps are
 instantiated on each node) but haven't tried yet. It orders items in #2
 fields by strength of link, which is the similarity value used in [B'B]
 or [B'A]. It would be nice to order #1 by recency but there is no provision
 for passing through timestamps at present so they are ordered by the
 strength of preference. This is probably not useful and so can be ignored.
 Ordering by recency might be useful for truncating queries by recency while
 leaving the training data containing 100% of available history.
 
  1) It joins #1 DRMs to produce a single set of docs in CSV form, which
 looks like this:
  id,history_b,history_a
 u1,iphone ipad,iphone ipad galaxy
  ...
 
  2) it joins #2 DRMs to produce a single set of docs in CSV form, which
 looks like this:
  id,b_b_links,b_a_links
 iphone,iphone ipad,iphone ipad galaxy
  …
 
  It may work on a cluster, I haven't tried yet. As soon as someone has
 some large-ish sample log files I'll give them a try. Check the sample
 input files in the resources dir for format.
 
  https://github.com/pferrel/solr-recommender
 
 
  On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  When I started looking at this I was a bit skeptical. As a Search engine
 Solr may be peerless, but as yet another NoSQL db?
 
  However getting further into this I see one very large benefit. It has
 one feature that sets it completely apart from the typical NoSQL db. The
 type of queries you do return fuzzy results--in the very best sense of that
 word. The most interesting queries are based on similarity to some
 exemplar. Results are returned in order of similarity strength, not ordered
 by a sort field.
 
  Wherever similarity based queries are important I'll look at Solr first.
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's
 probably at least an alternative to using docs and CSVs to import the data
 from Mahout.
 
 
 
  On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  Yes.  That would be interesting.
 
 
 
 
  On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
  A little digression: Might a Matrix implementation backed by a Solr
 index
  and uses SolrJ for querying help at all for the Solr recommendation
  approach?
 
  It supports multiple fields of String, Text, or boolean flags.
 
  Best
  Gokhan
 
 
  On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com
 wrote:
 
  Also a question about user history.
 
  I was planning to write these into separate directories so Solr could
  fetch them from different sources but it occurs to me that it would be
  better to join A and B by user ID and output a doc per user ID with
 three
  fields, id, A item history, and B item history. Other fields could be
  added
  for users metadata.
 
  Sound correct? This is what I'll do unless someone stops me.
 
  On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  Once you have a sample or example of what you think the
  log file version will look like, can you post it? It would be great
 to
  have example lines for two actions with or without the same item IDs.
  I'll
  make sure we can digest it.
 
  I thought more about the ingest part and I don't think the
 one-item-space
  is actually a problem. It just 

Re: Setting up a recommender

2013-08-13 Thread Pat Ferrel
When I started looking at this I was a bit skeptical. As a search engine Solr 
may be peerless, but as yet another NoSQL db?

However, getting further into this I see one very large benefit. It has one 
feature that sets it completely apart from the typical NoSQL db. The type of 
query you run returns fuzzy results--in the very best sense of that word. The 
most interesting queries are based on similarity to some exemplar. Results are 
returned in order of similarity strength, not ordered by a sort field.

Wherever similarity-based queries are important I'll look at Solr first. SolrJ 
looks like an interesting way to get Solr queries on POJOs. It's probably at 
least an alternative to using docs and CSVs to import the data from Mahout.



On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Yes.  That would be interesting.




On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:

 A little digression: Might a Matrix implementation backed by a Solr index
 and uses SolrJ for querying help at all for the Solr recommendation
 approach?
 
 It supports multiple fields of String, Text, or boolean flags.
 
 Best
 Gokhan
 
 
 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Also a question about user history.
 
 I was planning to write these into separate directories so Solr could
 fetch them from different sources but it occurs to me that it would be
 better to join A and B by user ID and output a doc per user ID with three
 fields, id, A item history, and B item history. Other fields could be
 added
 for users metadata.
 
 Sound correct? This is what I'll do unless someone stops me.
 
 On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Once you have a sample or example of what you think the
 log file version will look like, can you post it? It would be great to
 have example lines for two actions with or without the same item IDs.
 I'll
 make sure we can digest it.
 
 I thought more about the ingest part and I don't think the one-item-space
 is actually a problem. It just means one item dictionary. A and B will
 have
 the right content, all I have to do is make sure the right ranks are
 input
 to the MM,
 Transpose, and RSJ. This in turn is only one extra count of the # of
 items
 in A's item space. This should be a very easy change If my thinking is
 correct.
 
 
 On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do
 it
 in the project. Ted may have some data for this from musicbrainz.
 
 
 I am working on this issue now.
 
 The current state is that I can bring in a bunch of track names and links
 to artist names and so on.  This would provide the basic set of items
 (artists, genres, tracks and tags).
 
 There is a hitch in bringing in the data needed to generate the logs
 since
 that part of MB is not Apache compatible.  I am working on that issue.
 
 Technically, the data is in a massively normalized relational form right
 now, but it isn't terribly hard to denormalize into a form that we need.
 
 
 
 



Re: Setting up a recommender

2013-08-13 Thread Pat Ferrel
I finally got some time to work on this and have a first cut at output to Solr 
working on the github repo. It only works on 2-action input but I'll have that 
cleaned up soon so it will work with one action. Solr indexing has not been 
tested yet and the field names and/or types may need tweaking. 

It takes the result of the previous drop:
1) DRMs for B (user history or B items action1) and A (user history of A items 
action2)
2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence

There are two final outputs created using mapreduce but requiring 2 in-memory 
hashmaps. I think this will work on a cluster (the hashmaps are instantiated on 
each node) but haven't tried yet. It orders items in #2 fields by strength of 
link, which is the similarity value used in [B'B] or [B'A]. It would be nice 
to order #1 by recency but there is no provision for passing through timestamps 
at present so they are ordered by the strength of preference. This is probably 
not useful and so can be ignored. Ordering by recency might be useful for 
truncating queries by recency while leaving the training data containing 100% 
of available history.

1) It joins #1 DRMs to produce a single set of docs in CSV form, which looks 
like this:
id,history_b,history_a
user1,iphone ipad,iphone ipad galaxy
...

2) It joins #2 DRMs to produce a single set of docs in CSV form, which looks 
like this:
id,b_b_links,b_a_links
u1,iphone ipad,iphone ipad galaxy
…

It may work on a cluster; I haven't tried it yet. As soon as someone has some 
large-ish sample log files I'll give them a try. Check the sample input files 
in the resources dir for format.

https://github.com/pferrel/solr-recommender
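
In case it saves anyone time, here is a minimal SolrJ sketch of turning one of 
the joined CSV lines above into an indexed doc; the core name is made up and the 
field names follow the CSV header.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexJoinedDocs {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/users");
    // One line of the #1 output, e.g. "user1,iphone ipad,iphone ipad galaxy"
    String[] cols = "user1,iphone ipad,iphone ipad galaxy".split(",");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", cols[0]);
    doc.addField("history_b", cols[1]); // analyzed as whitespace-separated item IDs
    doc.addField("history_a", cols[2]);
    solr.add(doc);
    solr.commit();
  }
}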


On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:

When I started looking at this I was a bit skeptical. As a Search engine Solr 
may be peerless, but as yet another NoSQL db?

However getting further into this I see one very large benefit. It has one 
feature that sets it completely apart from the typical NoSQL db. The type of 
queries you do return fuzzy results--in the very best sense of that word. The 
most interesting queries are based on similarity to some exemplar. Results are 
returned in order of similarity strength, not ordered by a sort field.

Wherever similarity based queries are important I'll look at Solr first. SolrJ 
looks like an interesting way to get Solr queries on POJOs. It's probably at 
least an alternative to using docs and CSVs to import the data from Mahout.



On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Yes.  That would be interesting.




On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:

 A little digression: Might a Matrix implementation backed by a Solr index
 and uses SolrJ for querying help at all for the Solr recommendation
 approach?
 
 It supports multiple fields of String, Text, or boolean flags.
 
 Best
 Gokhan
 
 
 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Also a question about user history.
 
 I was planning to write these into separate directories so Solr could
 fetch them from different sources but it occurs to me that it would be
 better to join A and B by user ID and output a doc per user ID with three
 fields, id, A item history, and B item history. Other fields could be
 added
 for users metadata.
 
 Sound correct? This is what I'll do unless someone stops me.
 
 On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Once you have a sample or example of what you think the
 log file version will look like, can you post it? It would be great to
 have example lines for two actions with or without the same item IDs.
 I'll
 make sure we can digest it.
 
 I thought more about the ingest part and I don't think the one-item-space
 is actually a problem. It just means one item dictionary. A and B will
 have
 the right content, all I have to do is make sure the right ranks are
 input
 to the MM,
 Transpose, and RSJ. This in turn is only one extra count of the # of
 items
 in A's item space. This should be a very easy change If my thinking is
 correct.
 
 
 On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do
 it
 in the project. Ted may have some data for this from musicbrainz.
 
 
 I am working on this issue now.
 
 The current state is that I can bring in a bunch of track names and links
 to artist names and so on.  This would provide the basic set of items
 (artists, genres, tracks and tags).
 
 There is a hitch in bringing in the data needed to generate the logs
 since
 that part of MB is not Apache compatible.  I am working on that issue.
 
 Technically, the data is in a massively normalized relational form right
 now, but it isn't terribly hard to denormalize 

Re: Setting up a recommender

2013-08-13 Thread Pat Ferrel
Corrections inline

 On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 I finally got some time to work on this and have a first cut at output to 
 Solr working on the github repo. It only works on 2-action input but I'll 
 have that cleaned up soon so it will work with one action. Solr indexing has 
 not been tested yet and the field names and/or types may need tweaking. 
 
 It takes the result of the previous drop:
 1) DRMs for B (user history or B items action1) and A (user history of A 
 items action2)
 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
 There are two final outputs created using mapreduce but requiring 2 in-memory 
 hashmaps. I think this will work on a cluster (the hashmaps are instantiated 
 on each node) but haven't tried yet. It orders items in #2 fields by strength 
 of link, which is the similarity value used in [B'B] or [B'A]. It would be 
 nice to order #1 by recency but there is no provision for passing through 
 timestamps at present so they are ordered by the strength of preference. This 
 is probably not useful and so can be ignored. Ordering by recency might be 
 useful for truncating queries by recency while leaving the training data 
 containing 100% of available history.
 
 1) It joins #1 DRMs to produce a single set of docs in CSV form, which looks 
 like this:
 id,history_b,history_a
u1,iphone ipad,iphone ipad galaxy
 ...
 
 2) it joins #2 DRMs to produce a single set of docs in CSV form, which looks 
 like this:
 id,b_b_links,b_a_links
iphone,iphone ipad,iphone ipad galaxy
 …
 
 It may work on a cluster, I haven't tried yet. As soon as someone has some 
 large-ish sample log files I'll give them a try. Check the sample input files 
 in the resources dir for format.
 
 https://github.com/pferrel/solr-recommender
 
 
 On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 When I started looking at this I was a bit skeptical. As a Search engine Solr 
 may be peerless, but as yet another NoSQL db?
 
 However getting further into this I see one very large benefit. It has one 
 feature that sets it completely apart from the typical NoSQL db. The type of 
 queries you do return fuzzy results--in the very best sense of that word. The 
 most interesting queries are based on similarity to some exemplar. Results 
 are returned in order of similarity strength, not ordered by a sort field.
 
 Wherever similarity based queries are important I'll look at Solr first. 
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's 
 probably at least an alternative to using docs and CSVs to import the data 
 from Mahout.
 
 
 
 On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Yes.  That would be interesting.
 
 
 
 
 On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
 A little digression: Might a Matrix implementation backed by a Solr index
 and uses SolrJ for querying help at all for the Solr recommendation
 approach?
 
 It supports multiple fields of String, Text, or boolean flags.
 
 Best
 Gokhan
 
 
 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Also a question about user history.
 
 I was planning to write these into separate directories so Solr could
 fetch them from different sources but it occurs to me that it would be
 better to join A and B by user ID and output a doc per user ID with three
 fields, id, A item history, and B item history. Other fields could be
 added
 for users metadata.
 
 Sound correct? This is what I'll do unless someone stops me.
 
 On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Once you have a sample or example of what you think the
 log file version will look like, can you post it? It would be great to
 have example lines for two actions with or without the same item IDs.
 I'll
 make sure we can digest it.
 
 I thought more about the ingest part and I don't think the one-item-space
 is actually a problem. It just means one item dictionary. A and B will
 have
 the right content, all I have to do is make sure the right ranks are
 input
 to the MM,
 Transpose, and RSJ. This in turn is only one extra count of the # of
 items
 in A's item space. This should be a very easy change If my thinking is
 correct.
 
 
 On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do
 it
 in the project. Ted may have some data for this from musicbrainz.
 
 
 I am working on this issue now.
 
 The current state is that I can bring in a bunch of track names and links
 to artist names and so on.  This would provide the basic set of items
 (artists, genres, tracks and tags).
 
 There is a hitch in bringing in the data needed to generate the logs
 since
 that part of MB is not Apache 

Re: Setting up a recommender

2013-08-13 Thread Pat Ferrel
OK, single-action recs are working, so output to Solr works with only [B'B] and B.

On Aug 13, 2013, at 10:52 AM, Pat Ferrel pat.fer...@gmail.com wrote:

Corrections inline

 On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 I finally got some time to work on this and have a first cut at output to 
 Solr working on the github repo. It only works on 2-action input but I'll 
 have that cleaned up soon so it will work with one action. Solr indexing has 
 not been tested yet and the field names and/or types may need tweaking. 
 
 It takes the result of the previous drop:
 1) DRMs for B (user history or B items action1) and A (user history of A 
 items action2)
 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
 There are two final outputs created using mapreduce but requiring 2 in-memory 
 hashmaps. I think this will work on a cluster (the hashmaps are instantiated 
 on each node) but haven't tried yet. It orders items in #2 fields by strength 
 of link, which is the similarity value used in [B'B] or [B'A]. It would be 
 nice to order #1 by recency but there is no provision for passing through 
 timestamps at present so they are ordered by the strength of preference. This 
 is probably not useful and so can be ignored. Ordering by recency might be 
 useful for truncating queries by recency while leaving the training data 
 containing 100% of available history.
 
 1) It joins #1 DRMs to produce a single set of docs in CSV form, which looks 
 like this:
 id,history_b,history_a
u1,iphone ipad,iphone ipad galaxy
 ...
 
 2) it joins #2 DRMs to produce a single set of docs in CSV form, which looks 
 like this:
 id,b_b_links,b_a_links
iphone,iphone ipad,iphone ipad galaxy
 …
 
 It may work on a cluster, I haven't tried yet. As soon as someone has some 
 large-ish sample log files I'll give them a try. Check the sample input files 
 in the resources dir for format.
 
 https://github.com/pferrel/solr-recommender
 
 
 On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 When I started looking at this I was a bit skeptical. As a Search engine Solr 
 may be peerless, but as yet another NoSQL db?
 
 However getting further into this I see one very large benefit. It has one 
 feature that sets it completely apart from the typical NoSQL db. The type of 
 queries you do return fuzzy results--in the very best sense of that word. The 
 most interesting queries are based on similarity to some exemplar. Results 
 are returned in order of similarity strength, not ordered by a sort field.
 
 Wherever similarity based queries are important I'll look at Solr first. 
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's 
 probably at least an alternative to using docs and CSVs to import the data 
 from Mahout.
 
 
 
 On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Yes.  That would be interesting.
 
 
 
 
 On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
 A little digression: Might a Matrix implementation backed by a Solr index
 and uses SolrJ for querying help at all for the Solr recommendation
 approach?
 
 It supports multiple fields of String, Text, or boolean flags.
 
 Best
 Gokhan
 
 
 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Also a question about user history.
 
 I was planning to write these into separate directories so Solr could
 fetch them from different sources but it occurs to me that it would be
 better to join A and B by user ID and output a doc per user ID with three
 fields, id, A item history, and B item history. Other fields could be
 added
 for users metadata.
 
 Sound correct? This is what I'll do unless someone stops me.
 
 On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Once you have a sample or example of what you think the
 log file version will look like, can you post it? It would be great to
 have example lines for two actions with or without the same item IDs.
 I'll
 make sure we can digest it.
 
 I thought more about the ingest part and I don't think the one-item-space
 is actually a problem. It just means one item dictionary. A and B will
 have
 the right content, all I have to do is make sure the right ranks are
 input
 to the MM,
 Transpose, and RSJ. This in turn is only one extra count of the # of
 items
 in A's item space. This should be a very easy change If my thinking is
 correct.
 
 
 On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do
 it
 in the project. Ted may have some data for this from musicbrainz.
 
 
 I am working on this issue now.
 
 The current state is that I can bring in a bunch of track names and links
 to artist names and so on.  This would provide the basic set of items
 

Re: Setting up a recommender

2013-08-12 Thread Gokhan Capan
A little digression: Might a Matrix implementation backed by a Solr index,
using SolrJ for querying, help at all for the Solr recommendation approach?

It supports multiple fields of String, Text, or boolean flags.

Best
Gokhan


On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Also a question about user history.

 I was planning to write these into separate directories so Solr could
 fetch them from different sources but it occurs to me that it would be
 better to join A and B by user ID and output a doc per user ID with three
 fields, id, A item history, and B item history. Other fields could be added
 for users metadata.

 Sound correct? This is what I'll do unless someone stops me.

 On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:

 Once you have a sample or example of what you think the
 log file version will look like, can you post it? It would be great to
 have example lines for two actions with or without the same item IDs. I'll
 make sure we can digest it.

 I thought more about the ingest part and I don't think the one-item-space
 is actually a problem. It just means one item dictionary. A and B will have
 the right content, all I have to do is make sure the right ranks are input
 to the MM,
 Transpose, and RSJ. This in turn is only one extra count of the # of items
 in A's item space. This should be a very easy change If my thinking is
 correct.


 On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:

  4) To add more metadata to the Solr output will be left to the consumer
  for now. If there is a good data set to use we can illustrate how to do
 it
  in the project. Ted may have some data for this from musicbrainz.


 I am working on this issue now.

 The current state is that I can bring in a bunch of track names and links
 to artist names and so on.  This would provide the basic set of items
 (artists, genres, tracks and tags).

 There is a hitch in bringing in the data needed to generate the logs since
 that part of MB is not Apache compatible.  I am working on that issue.

 Technically, the data is in a massively normalized relational form right
 now, but it isn't terribly hard to denormalize into a form that we need.





Re: Setting up a recommender

2013-08-12 Thread Ted Dunning
Yes.  That would be interesting.




On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:

 A little digression: Might a Matrix implementation backed by a Solr index
 and uses SolrJ for querying help at all for the Solr recommendation
 approach?

 It supports multiple fields of String, Text, or boolean flags.

 Best
 Gokhan


 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  Also a question about user history.
 
  I was planning to write these into separate directories so Solr could
  fetch them from different sources but it occurs to me that it would be
  better to join A and B by user ID and output a doc per user ID with three
  fields, id, A item history, and B item history. Other fields could be
 added
  for users metadata.
 
  Sound correct? This is what I'll do unless someone stops me.
 
  On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  Once you have a sample or example of what you think the
  log file version will look like, can you post it? It would be great to
  have example lines for two actions with or without the same item IDs.
 I'll
  make sure we can digest it.
 
  I thought more about the ingest part and I don't think the one-item-space
  is actually a problem. It just means one item dictionary. A and B will
 have
  the right content, all I have to do is make sure the right ranks are
 input
  to the MM,
  Transpose, and RSJ. This in turn is only one extra count of the # of
 items
  in A's item space. This should be a very easy change If my thinking is
  correct.
 
 
  On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
   4) To add more metadata to the Solr output will be left to the consumer
   for now. If there is a good data set to use we can illustrate how to do
  it
   in the project. Ted may have some data for this from musicbrainz.
 
 
  I am working on this issue now.
 
  The current state is that I can bring in a bunch of track names and links
  to artist names and so on.  This would provide the basic set of items
  (artists, genres, tracks and tags).
 
  There is a hitch in bringing in the data needed to generate the logs
 since
  that part of MB is not Apache compatible.  I am working on that issue.
 
  Technically, the data is in a massively normalized relational form right
  now, but it isn't terribly hard to denormalize into a form that we need.
 
 
 



Re: Setting up a recommender

2013-08-07 Thread Ted Dunning
On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do it
 in the project. Ted may have some data for this from musicbrainz.


I am working on this issue now.

The current state is that I can bring in a bunch of track names and links
to artist names and so on.  This would provide the basic set of items
(artists, genres, tracks and tags).

There is a hitch in bringing in the data needed to generate the logs since
that part of MB is not Apache compatible.  I am working on that issue.

Technically, the data is in a massively normalized relational form right
now, but it isn't terribly hard to denormalize into a form that we need.


Re: Setting up a recommender

2013-08-07 Thread Pat Ferrel
Once you have a sample or example of what you think the 
log file version will look like, can you post it? It would be great to have 
example lines for two actions with or without the same item IDs. I'll make sure 
we can digest it.

I thought more about the ingest part and I don't think the one-item-space is 
actually a problem. It just means one item dictionary. A and B will have the 
right content; all I have to do is make sure the right ranks are input to the 
MM, Transpose, and RSJ. This in turn is only one extra count of the # of items 
in A's item space. This should be a very easy change if my thinking is correct.


On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:

On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do it
 in the project. Ted may have some data for this from musicbrainz.


I am working on this issue now.

The current state is that I can bring in a bunch of track names and links
to artist names and so on.  This would provide the basic set of items
(artists, genres, tracks and tags).

There is a hitch in bringing in the data needed to generate the logs since
that part of MB is not Apache compatible.  I am working on that issue.

Technically, the data is in a massively normalized relational form right
now, but it isn't terribly hard to denormalize into a form that we need.



Re: Setting up a recommender

2013-08-07 Thread Pat Ferrel
Also a question about user history.

I was planning to write these into separate directories so Solr could fetch 
them from different sources, but it occurs to me that it would be better to join 
A and B by user ID and output a doc per user ID with three fields: id, A item 
history, and B item history. Other fields could be added for user metadata.

Sound correct? This is what I'll do unless someone stops me.

On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:

Once you have a sample or example of what you think the 
log file version will look like, can you post it? It would be great to have 
example lines for two actions with or without the same item IDs. I'll make sure 
we can digest it.

I thought more about the ingest part and I don't think the one-item-space is 
actually a problem. It just means one item dictionary. A and B will have the 
right content, all I have to do is make sure the right ranks are input to the 
MM, 
Transpose, and RSJ. This in turn is only one extra count of the # of items in 
A's item space. This should be a very easy change If my thinking is correct.


On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:

On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do it
 in the project. Ted may have some data for this from musicbrainz.


I am working on this issue now.

The current state is that I can bring in a bunch of track names and links
to artist names and so on.  This would provide the basic set of items
(artists, genres, tracks and tags).

There is a hitch in bringing in the data needed to generate the logs since
that part of MB is not Apache compatible.  I am working on that issue.

Technically, the data is in a massively normalized relational form right
now, but it isn't terribly hard to denormalize into a form that we need.




Re: Setting up a recommender

2013-08-06 Thread Pat Ferrel
A note about todays hangout regarding the cross-recommender.

In general it may be useful to think about the current and proposed system as 
two pipelines:

1) a pipeline that takes preference data, turns it into two preference matrices 
in Mahout DRM form, and creates [B'B] and [B'A], ideally using LLR-based Row and 
CrossRowSimilarityJobs. This generates two DRMs with Mahout keys and 
VectorWritable(s) using internal numerical Mahout IDs. There is one ID space for 
B and one for A. In the github repo these also create recommendations in Mahout 
form via an item-based RecommenderJob and an XRecommenderJob. This last step is 
not needed when using Solr but may be useful for comparison. These jobs are all 
mapreduce and closely match the Mahout code and model of calculation. 

2) a pipeline that processes IDs and other metadata contained in the logs. The 
IDs are user IDs in string form, as are the item IDs. But the items for the A 
action may be completely different from those for B. This cross-recommender ties 
the two together with a generalized notion of significant cooccurrence by 
executing the #1 pipeline and using the results. These log file IDs are what 
gets written out to Solr; which IDs to write is encoded in the two 
Mahout-generated DRMs. The pipeline may need to bring along other metadata mined 
from the logs like item descriptions, tags, categories, etc. Note: this last bit 
is not built in at present but would make Solr queries even better. Also, at 
present A and B are assumed to have the same item IDs. This works for 
purchase+view actions and others, but not for some cross-actions that would be 
useful, like music track listen + tagged category listen -> track recommendation, 
or music tagged category listen + track listen -> category recommendation.
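
To make the ID handling concrete, here is a minimal sketch of the kind of 
dictionary pipeline #2 has to maintain; nothing project-specific, just the idea 
of mapping external string IDs from the logs to Mahout's dense internal ints and 
back when writing the Solr docs.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One dictionary per ID space (users, B items, A items).
public class IdDictionary {
  private final Map<String, Integer> toInternal = new HashMap<String, Integer>();
  private final List<String> toExternal = new ArrayList<String>();

  // Returns the internal index for an external ID, assigning one if new.
  public int intern(String externalId) {
    Integer idx = toInternal.get(externalId);
    if (idx == null) {
      idx = toExternal.size();
      toInternal.put(externalId, idx);
      toExternal.add(externalId);
    }
    return idx;
  }

  // Translates a DRM row/column index back to the log's string ID.
  public String externalId(int internalId) {
    return toExternal.get(internalId);
  }
}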

The current action items are:
1) #1 is running and works but eventually needs to be reintegrated with new 
Mahout trunk code--my action item, with Sebastian's help.
2) #2 needs to write the merged DRMs to Solr as one doc per row and 3 fields 
per doc (id, B'B, B'A)--I'm working on this now.
3) To generalize further we need to account for different ID spaces in #2 and 
I'll take that as an action item.
4) To add more metadata to the Solr output will be left to the consumer for 
now. If there is a good data set to use we can illustrate how to do it in the 
project. Ted may have some data for this from musicbrainz.

Re: Setting up a recommender

2013-08-05 Thread Pat Ferrel
In writing the similarity matrices to Solr there is a bit of a problem. The 
matrices exist in two DRMs. The rows correspond to the doc IDs. As far as I 
know there is no guarantee that the IDs of both matrices are in the same 
descending order. 

The easiest solution is to have an index for [B'B] and one for [B'A]. That 
means two or perhaps three queries for cross-recommendations, which is not 
ideal.

First I'm going to create two collections of docs with different field 
ids--this should work and we can merge them later.

Next we can do some m/r to group the docs by id so there is one collection 
(csv) with one line per doc. 

Alternatively it is possible that the DRMs can be iterated simultaneously, 
which would also solve the problem. It assumes the order in both DRMs is the 
same, descending by key = item ID. Even if a row is missing in one or the other 
this would work.

Does anyone know if the DRMs are guaranteed to have row ordering by key? RSJ 
creates [B'B] and matrix multiply creates [B'A].


On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Yes.  We need two different sets of documents if the row space of the
cross/co-occurrence matrices are different as is the case with A'B and B'B.

This could mean two indexes.

Or a single index with a special field to indicate what type of record you
have.


On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Thanks, well put.
 
 In order to have the ultimate impl with two id spaces for A and B would we
 have to create different docs for A'B and B'B? Since the docs IDs must come
 from A or B? The fields can contain different sets of IDs but the Doc ID
 must be one or the other, right? Doesn't this imply separate indexes for
 the separate A, B item IDs spaces? This is not a question for this first
 cut impl but is a generalization question.
 
 On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 So there is a lot of good discussion here and there were some key ideas.
 
 The first idea is that the *input* to a recommender is on the right in the
 matrix notation.  This refers inherently to the id's on the columns of the
 recommender product (either B'B or B'A).  The columns are defined by the
 right hand element of the product (either B or A in the B'B and B'A
 respectively).
 
 The results are in the row space and are defined by the left hand operand
 of the product.  IN the case of B'A and B'B, the left hand operand is B in
 both cases so the row space is consistent.
 
 In order to implement this in a search engine, we need documents that
 correspond to rows of B'A or B'B.  These are the same as the columns of B.
 The fields of the documents will necessarily include the following:
 
 id: the column id from B corresponding to this item
 description: presentation info ... yada yada
 b-a-links: contents of this row of B'A expressed as id's from the column
 space of A where this row  of llr-filter(B'A) contains a
 non-zero value.
 b-b-links: contents of this row of B'B expressed as id's from the column
 space of B ...
 
 
 The following operations are now single queries:
 
 get an item where id = x
  query is [id:x]
 
 recommend based on behavior with regard to A items and actions h_a
  query is [b-a-links: h_a]
 
 recommend based on behavior with regard to B items and actions h_b
  query is [b-b-links: h_b]
 
 recommend based on a single item with id = x
   query is [b-b-links: x]
 
 recommend based on composite behavior composed of h_a and h_b
   query is [b-a-links: h_a b-b-links: h_b]
 
 Does this make sense by being more explicit?
 
 Now, it is pretty clear that we could have an index of A objects as well
 but the link fields would have to be a-a-links and a-b-links, of course.
 
 
 
 
 On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Assuming Ted needs to call it, not sure if an invite has gone out, I
 haven't seen one.
 
 On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
 
 I am planning on sitting in as flaky connection allows.
 On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 We doing a hangout at 2 on the Solr recommender?
 
 
 
 
 



Re: Setting up a recommender

2013-08-05 Thread Ted Dunning
A quick map-reduce program should be able to join these matrices and
produce documents ready to index.


On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel p...@occamsmachete.com wrote:

 In writing the similarity matrices to Solr there is a bit of a problem.
 The Matrices exist in two DRMs. The rows correspond to the doc IDs. As far
 as I know there is no guarantee that the ids of both matrices are in the
 same descending order.

 The easiest solution is to have an index for [B'B] and one for [B'A]. That
 means two or perhaps three queries for cross-recommendations, which is not
 ideal.

 First I'm going to create two collections of docs with different field
 ids--this should work and we can merge them later.

 Next we can do some m/r to group the docs by id so there is one collection
 (csv) with one line per doc.

 Alternatively it is a possible that the DRMs can be iterated
 simultaneously, which would also solve the problem. It assumes the order in
 both DRMs is the same, descending by Key = item ID. Even if a row is
 missing in one or the other this would work.

 Does anyone know if the DRMs are guaranteed to have row ordering by Key?
 RSJ creates [B'B] and matrix multiply creates [B'A]


 On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Yes.  We need two different sets of documents if the row space of the
 cross/co-occurrence matrices are different as is the case with A'B and B'B.

 This could mean two indexes.

 Or a single index with a special field to indicate what type of record you
 have.


 On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com wrote:

  Thanks, well put.
 
  In order to have the ultimate impl with two id spaces for A and B would
 we
  have to create different docs for A'B and B'B? Since the docs IDs must
 come
  from A or B? The fields can contain different sets of IDs but the Doc ID
  must be one or the other, right? Doesn't this imply separate indexes for
  the separate A, B item IDs spaces? This is not a question for this first
  cut impl but is a generalization question.
 
  On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  So there is a lot of good discussion here and there were some key ideas.
 
  The first idea is that the *input* to a recommender is on the right in
 the
  matrix notation.  This refers inherently to the id's on the columns of
 the
  recommender product (either B'B or B'A).  The columns are defined by the
  right hand element of the product (either B or A in the B'B and B'A
  respectively).
 
  The results are in the row space and are defined by the left hand operand
  of the product.  IN the case of B'A and B'B, the left hand operand is B
 in
  both cases so the row space is consistent.
 
  In order to implement this in a search engine, we need documents that
  correspond to rows of B'A or B'B.  These are the same as the columns of
 B.
  The fields of the documents will necessarily include the following:
 
  id: the column id from B corresponding to this item
  description: presentation info ... yada yada
  b-a-links: contents of this row of B'A expressed as id's from the column
  space of A where this row  of llr-filter(B'A) contains a
  non-zero value.
  b-b-links: contents of this row of B'B expressed as id's from the column
  space of B ...
 
 
  The following operations are now single queries:
 
  get an item where id = x
   query is [id:x]
 
  recommend based on behavior with regard to A items and actions h_a
   query is [b-a-links: h_a]
 
  recommend based on behavior with regard to B items and actions h_b
   query is [b-b-links: h_b]
 
  recommend based on a single item with id = x
query is [b-b-links: x]
 
  recommend based on composite behavior composed of h_a and h_b
query is [b-a-links: h_a b-b-links: h_b]
 
  Does this make sense by being more explicit?
 
  Now, it is pretty clear that we could have an index of A objects as well
  but the link fields would have to be a-a-links and a-b-links, of course.
 
 
 
 
  On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
  Assuming Ted needs to call it, not sure if an invite has gone out, I
  haven't seen one.
 
  On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
 
  I am planning on sitting in as flaky connection allows.
  On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
  We doing a hangout at 2 on the Solr recommender?
 
 
 
 
 




Re: Setting up a recommender

2013-08-05 Thread Sebastian Schelter
If you use the same partitioning and number of reducers for creating the
outputs, the output should have the same number of sequence files and each
sequence file should have the same keys in descending order. I don't
understand why the ordering is a problem, can we not store the row index as
a field in solr?

2013/8/5 Ted Dunning ted.dunn...@gmail.com

 A quick map-reduce program should be able to join these matrices and
 produce documents ready to index.


 On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel p...@occamsmachete.com wrote:

  In writing the similarity matrices to Solr there is a bit of a problem.
  The Matrices exist in two DRMs. The rows correspond to the doc IDs. As
 far
  as I know there is no guarantee that the ids of both matrices are in the
  same descending order.
 
  The easiest solution is to have an index for [B'B] and one for [B'A].
 That
  means two or perhaps three queries for cross-recommendations, which is
 not
  ideal.
 
  First I'm going to create two collections of docs with different field
  ids--this should work and we can merge them later.
 
  Next we can do some m/r to group the docs by id so there is one
 collection
  (csv) with one line per doc.
 
  Alternatively it is a possible that the DRMs can be iterated
  simultaneously, which would also solve the problem. It assumes the order
 in
  both DRMs is the same, descending by Key = item ID. Even if a row is
  missing in one or the other this would work.
 
  Does anyone know if the DRMs are guaranteed to have row ordering by Key?
  RSJ creates [B'B] and matrix multiply creates [B'A]
 
 
  On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  Yes.  We need two different sets of documents if the row space of the
  cross/co-occurrence matrices are different as is the case with A'B and
 B'B.
 
  This could mean two indexes.
 
  Or a single index with a special field to indicate what type of record
 you
  have.
 
 
  On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
   Thanks, well put.
  
   In order to have the ultimate impl with two id spaces for A and B would
  we
   have to create different docs for A'B and B'B? Since the docs IDs must
  come
   from A or B? The fields can contain different sets of IDs but the Doc
 ID
   must be one or the other, right? Doesn't this imply separate indexes
 for
   the separate A, B item IDs spaces? This is not a question for this
 first
   cut impl but is a generalization question.
  
   On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
  
   So there is a lot of good discussion here and there were some key
 ideas.
  
   The first idea is that the *input* to a recommender is on the right in
  the
   matrix notation.  This refers inherently to the id's on the columns of
  the
   recommender product (either B'B or B'A).  The columns are defined by
 the
   right hand element of the product (either B or A in the B'B and B'A
   respectively).
  
   The results are in the row space and are defined by the left hand
 operand
   of the product.  IN the case of B'A and B'B, the left hand operand is B
  in
   both cases so the row space is consistent.
  
   In order to implement this in a search engine, we need documents that
   correspond to rows of B'A or B'B.  These are the same as the columns of
  B.
   The fields of the documents will necessarily include the following:
  
   id: the column id from B corresponding to this item
   description: presentation info ... yada yada
   b-a-links: contents of this row of B'A expressed as id's from the
 column
   space of A where this row  of llr-filter(B'A) contains
 a
   non-zero value.
   b-b-links: contents of this row of B'B expressed as id's from the
 column
   space of B ...
  
  
   The following operations are now single queries:
  
   get an item where id = x
query is [id:x]
  
   recommend based on behavior with regard to A items and actions h_a
query is [b-a-links: h_a]
  
   recommend based on behavior with regard to B items and actions h_b
query is [b-b-links: h_b]
  
   recommend based on a single item with id = x
 query is [b-b-links: x]
  
   recommend based on composite behavior composed of h_a and h_b
 query is [b-a-links: h_a b-b-links: h_b]
  
   Does this make sense by being more explicit?
  
   Now, it is pretty clear that we could have an index of A objects as
 well
   but the link fields would have to be a-a-links and a-b-links, of
 course.
  
  
  
  
   On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com
 wrote:
  
   Assuming Ted needs to call it, not sure if an invite has gone out, I
   haven't seen one.
  
   On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
  
   I am planning on sitting in as flaky connection allows.
   On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
  
   We doing a hangout at 2 on the Solr recommender?
  
  
  
  
  
 
 



Re: Setting up a recommender

2013-08-05 Thread Pat Ferrel
I think an m/r join is the best solution; too many assumptions otherwise. I 
thought Ted wanted a non-m/r implementation, but oh well, mostly non-m/r. Is 
there a good example to start from in Mahout? (A rough sketch of such a join is below.)

Yes, one id field per doc. The problem is not storing, it is joining rows from 
two DRMs by simple iteration.
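
Not a Mahout example, but a bare-bones sketch of what such a reduce-side m/r join 
could look like, assuming (hypothetically) that both matrices have first been dumped 
to text as itemId<TAB>space-delimited-ids; the paths, the "bb"/"ba" tags and the 
output format are made up:

// Hedged sketch of a reduce-side join of the [B'B] and [B'A] text dumps by item key.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinSimilarityRows {

  // tags each [B'B] row with "bb" so the reducer can tell the sources apart
  public static class BbMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t", 2);
      if (parts.length == 2) {
        ctx.write(new Text(parts[0]), new Text("bb\t" + parts[1]));
      }
    }
  }

  // same for [B'A] rows, tagged with "ba"
  public static class BaMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t", 2);
      if (parts.length == 2) {
        ctx.write(new Text(parts[0]), new Text("ba\t" + parts[1]));
      }
    }
  }

  // emits one CSV line per item: id,b_b_links,b_a_links (empty field if a row is missing)
  public static class JoinReducer extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String bb = "";
      String ba = "";
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if ("bb".equals(parts[0])) {
          bb = parts[1];
        } else {
          ba = parts[1];
        }
      }
      ctx.write(NullWritable.get(), new Text(key + "," + bb + "," + ba));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "join similarity rows");
    job.setJarByClass(JoinSimilarityRows.class);
    job.setReducerClass(JoinReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path("hdfs:///recs/bb-text"),
        TextInputFormat.class, BbMapper.class);
    MultipleInputs.addInputPath(job, new Path("hdfs:///recs/ba-text"),
        TextInputFormat.class, BaMapper.class);
    FileOutputFormat.setOutputPath(job, new Path("hdfs:///recs/solr-csv"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}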

On Aug 5, 2013, at 10:27 AM, Sebastian Schelter s...@apache.org wrote:

If you use the same partitioning and number of reducers for creating the
outputs, the output should have the same number of sequence files and each
sequence file should have the same keys in descending order. I don't
understand why the ordering is a problem, can we not store the row index as
a field in solr?

2013/8/5 Ted Dunning ted.dunn...@gmail.com

 A quick map-reduce program should be able to join these matrices and
 produce documents ready to index.
 
 
 On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 In writing the similarity matrices to Solr there is a bit of a problem.
 The Matrices exist in two DRMs. The rows correspond to the doc IDs. As
 far
 as I know there is no guarantee that the ids of both matrices are in the
 same descending order.
 
 The easiest solution is to have an index for [B'B] and one for [B'A].
 That
 means two or perhaps three queries for cross-recommendations, which is
 not
 ideal.
 
 First I'm going to create two collections of docs with different field
 ids--this should work and we can merge them later.
 
 Next we can do some m/r to group the docs by id so there is one
 collection
 (csv) with one line per doc.
 
 Alternatively it is a possible that the DRMs can be iterated
 simultaneously, which would also solve the problem. It assumes the order
 in
 both DRMs is the same, descending by Key = item ID. Even if a row is
 missing in one or the other this would work.
 
 Does anyone know if the DRMs are guaranteed to have row ordering by Key?
 RSJ creates [B'B] and matrix multiply creates [B'A]
 
 
 On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Yes.  We need two different sets of documents if the row space of the
 cross/co-occurrence matrices are different as is the case with A'B and
 B'B.
 
 This could mean two indexes.
 
 Or a single index with a special field to indicate what type of record
 you
 have.
 
 
 On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
 Thanks, well put.
 
 In order to have the ultimate impl with two id spaces for A and B would
 we
 have to create different docs for A'B and B'B? Since the docs IDs must
 come
 from A or B? The fields can contain different sets of IDs but the Doc
 ID
 must be one or the other, right? Doesn't this imply separate indexes
 for
 the separate A, B item IDs spaces? This is not a question for this
 first
 cut impl but is a generalization question.
 
 On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 So there is a lot of good discussion here and there were some key
 ideas.
 
 The first idea is that the *input* to a recommender is on the right in
 the
 matrix notation.  This refers inherently to the id's on the columns of
 the
 recommender product (either B'B or B'A).  The columns are defined by
 the
 right hand element of the product (either B or A in the B'B and B'A
 respectively).
 
 The results are in the row space and are defined by the left hand
 operand
 of the product.  IN the case of B'A and B'B, the left hand operand is B
 in
 both cases so the row space is consistent.
 
 In order to implement this in a search engine, we need documents that
 correspond to rows of B'A or B'B.  These are the same as the columns of
 B.
 The fields of the documents will necessarily include the following:
 
 id: the column id from B corresponding to this item
 description: presentation info ... yada yada
 b-a-links: contents of this row of B'A expressed as id's from the
 column
 space of A where this row  of llr-filter(B'A) contains
 a
 non-zero value.
 b-b-links: contents of this row of B'B expressed as id's from the
 column
 space of B ...
 
 
 The following operations are now single queries:
 
 get an item where id = x
 query is [id:x]
 
 recommend based on behavior with regard to A items and actions h_a
 query is [b-a-links: h_a]
 
 recommend based on behavior with regard to B items and actions h_b
 query is [b-b-links: h_b]
 
 recommend based on a single item with id = x
  query is [b-b-links: x]
 
 recommend based on composite behavior composed of h_a and h_b
  query is [b-a-links: h_a b-b-links: h_b]
 
 Does this make sense by being more explicit?
 
 Now, it is pretty clear that we could have an index of A objects as
 well
 but the link fields would have to be a-a-links and a-b-links, of
 course.
 
 
 
 
 On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com
 wrote:
 
 Assuming Ted needs to call it, not sure if an invite has gone out, I
 haven't seen one.
 
 On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com 

Re: Setting up a recommender

2013-08-05 Thread Sebastian Schelter
I still don't understand why we need to rely on docids. If we simply index
that row A is similar to rows B, C and D that should be fine, or am I wrong?

2013/8/5 Pat Ferrel p...@occamsmachete.com

 I think m/r join is the best solution, too many assumptions otherwise. I
 thought Ted wanted a non-m/r implementation, but oh, well, mostly non-m/r.
 Is there a good example to start from in Mahout?

 Yes, one id field per doc. The problem is not storing, it is joining rows
 from two DRMs by simple iteration.

 On Aug 5, 2013, at 10:27 AM, Sebastian Schelter s...@apache.org wrote:

 If you use the same partitioning and number of reducers for creating the
 outputs, the output should have the same number of sequence files and each
 sequence file should have the same keys in descending order. I don't
 understand why the ordering is a problem, can we not store the row index as
 a field in solr?

 2013/8/5 Ted Dunning ted.dunn...@gmail.com

  A quick map-reduce program should be able to join these matrices and
  produce documents ready to index.
 
 
  On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
  In writing the similarity matrices to Solr there is a bit of a problem.
  The Matrices exist in two DRMs. The rows correspond to the doc IDs. As
  far
  as I know there is no guarantee that the ids of both matrices are in the
  same descending order.
 
  The easiest solution is to have an index for [B'B] and one for [B'A].
  That
  means two or perhaps three queries for cross-recommendations, which is
  not
  ideal.
 
  First I'm going to create two collections of docs with different field
  ids--this should work and we can merge them later.
 
  Next we can do some m/r to group the docs by id so there is one
  collection
  (csv) with one line per doc.
 
  Alternatively it is a possible that the DRMs can be iterated
  simultaneously, which would also solve the problem. It assumes the order
  in
  both DRMs is the same, descending by Key = item ID. Even if a row is
  missing in one or the other this would work.
 
  Does anyone know if the DRMs are guaranteed to have row ordering by Key?
  RSJ creates [B'B] and matrix multiply creates [B'A]
 
 
  On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  Yes.  We need two different sets of documents if the row space of the
  cross/co-occurrence matrices are different as is the case with A'B and
  B'B.
 
  This could mean two indexes.
 
  Or a single index with a special field to indicate what type of record
  you
  have.
 
 
  On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com
  wrote:
 
  Thanks, well put.
 
  In order to have the ultimate impl with two id spaces for A and B would
  we
  have to create different docs for A'B and B'B? Since the docs IDs must
  come
  from A or B? The fields can contain different sets of IDs but the Doc
  ID
  must be one or the other, right? Doesn't this imply separate indexes
  for
  the separate A, B item IDs spaces? This is not a question for this
  first
  cut impl but is a generalization question.
 
  On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  So there is a lot of good discussion here and there were some key
  ideas.
 
  The first idea is that the *input* to a recommender is on the right in
  the
  matrix notation.  This refers inherently to the id's on the columns of
  the
  recommender product (either B'B or B'A).  The columns are defined by
  the
  right hand element of the product (either B or A in the B'B and B'A
  respectively).
 
  The results are in the row space and are defined by the left hand
  operand
  of the product.  IN the case of B'A and B'B, the left hand operand is B
  in
  both cases so the row space is consistent.
 
  In order to implement this in a search engine, we need documents that
  correspond to rows of B'A or B'B.  These are the same as the columns of
  B.
  The fields of the documents will necessarily include the following:
 
  id: the column id from B corresponding to this item
  description: presentation info ... yada yada
  b-a-links: contents of this row of B'A expressed as id's from the
  column
  space of A where this row  of llr-filter(B'A) contains
  a
  non-zero value.
  b-b-links: contents of this row of B'B expressed as id's from the
  column
  space of B ...
 
 
  The following operations are now single queries:
 
  get an item where id = x
  query is [id:x]
 
  recommend based on behavior with regard to A items and actions h_a
  query is [b-a-links: h_a]
 
  recommend based on behavior with regard to B items and actions h_b
  query is [b-b-links: h_b]
 
  recommend based on a single item with id = x
   query is [b-b-links: x]
 
  recommend based on composite behavior composed of h_a and h_b
   query is [b-a-links: h_a b-b-links: h_b]
 
  Does this make sense by being more explicit?
 
  Now, it is pretty clear that we could have an index of A objects as
  well
  but the 

Re: Setting up a recommender

2013-08-05 Thread Ted Dunning
Sebastian,

There needs to be a join of the two row similarity matrices to form
documents.

Pat,

What about just updating the document with the fields?  Have three passes.
 Pass 1 puts the normal meta-data for the item in place.  Pass 2 updates
with data from B'B.  Pass 3 updates with data from B'A.

This will cause the entire index to be rewritten more than necessary, but
it should be fast enough to be a non-issue.
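
As a sketch of one such pass (not from the thread), a SolrJ 4.x atomic "set" update 
on an existing item document might look like this, assuming the schema stores the 
fields, has a uniqueKey and the update log enabled; URL, core and values are 
placeholders:

// Hedged sketch of one "pass 2" update: atomically set b_b_links on an existing doc,
// leaving the metadata and b_a_links fields alone.
import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class UpdatePassTwo {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "iphone");
    // atomic update: the map's "set" key tells Solr to replace just this field
    doc.addField("b_b_links", Collections.singletonMap("set", "ipad galaxy"));

    solr.add(doc);
    solr.commit();
  }
}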

On other fronts, I got musicbrainz downloaded over the weekend and have
figured out the schema enough so that I think I can produce recording,
artist and tag information.  From that, I can simulate user behavior and
produce logs to push into the demo system.  That will allow realistic scale
and will allow users to explore the system in terms that they understand.

There is still a question of whether we can redistribute the musicbrainz
data, but I think I can arrange it so that anybody who wants to run the
demo will just download the necessary data themselves.  I may host a
derived data product myself to simplify that process.



On Mon, Aug 5, 2013 at 10:59 AM, Sebastian Schelter s...@apache.org wrote:

 I still don't understand why we need to rely on docids. If we simply index
 that row A is similar to rows B, C and D that should be fine, or am I
 wrong?

 2013/8/5 Pat Ferrel p...@occamsmachete.com

  I think m/r join is the best solution, too many assumptions otherwise. I
  thought Ted wanted a non-m/r implementation, but oh, well, mostly
 non-m/r.
  Is there a good example to start from in Mahout?
 
  Yes, one id field per doc. The problem is not storing, it is joining rows
  from two DRMs by simple iteration.
 
  On Aug 5, 2013, at 10:27 AM, Sebastian Schelter s...@apache.org wrote:
 
  If you use the same partitioning and number of reducers for creating the
  outputs, the output should have the same number of sequence files and
 each
  sequence file should have the same keys in descending order. I don't
  understand why the ordering is a problem, can we not store the row index
 as
  a field in solr?
 
  2013/8/5 Ted Dunning ted.dunn...@gmail.com
 
   A quick map-reduce program should be able to join these matrices and
   produce documents ready to index.
  
  
   On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel p...@occamsmachete.com
  wrote:
  
   In writing the similarity matrices to Solr there is a bit of a
 problem.
   The Matrices exist in two DRMs. The rows correspond to the doc IDs. As
   far
   as I know there is no guarantee that the ids of both matrices are in
 the
   same descending order.
  
   The easiest solution is to have an index for [B'B] and one for [B'A].
   That
   means two or perhaps three queries for cross-recommendations, which is
   not
   ideal.
  
   First I'm going to create two collections of docs with different field
   ids--this should work and we can merge them later.
  
   Next we can do some m/r to group the docs by id so there is one
   collection
   (csv) with one line per doc.
  
   Alternatively it is a possible that the DRMs can be iterated
   simultaneously, which would also solve the problem. It assumes the
 order
   in
   both DRMs is the same, descending by Key = item ID. Even if a row is
   missing in one or the other this would work.
  
   Does anyone know if the DRMs are guaranteed to have row ordering by
 Key?
   RSJ creates [B'B] and matrix multiply creates [B'A]
  
  
   On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  
   Yes.  We need two different sets of documents if the row space of the
   cross/co-occurrence matrices are different as is the case with A'B and
   B'B.
  
   This could mean two indexes.
  
   Or a single index with a special field to indicate what type of record
   you
   have.
  
  
   On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com
   wrote:
  
   Thanks, well put.
  
   In order to have the ultimate impl with two id spaces for A and B
 would
   we
   have to create different docs for A'B and B'B? Since the docs IDs
 must
   come
   from A or B? The fields can contain different sets of IDs but the Doc
   ID
   must be one or the other, right? Doesn't this imply separate indexes
   for
   the separate A, B item IDs spaces? This is not a question for this
   first
   cut impl but is a generalization question.
  
   On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  
   So there is a lot of good discussion here and there were some key
   ideas.
  
   The first idea is that the *input* to a recommender is on the right
 in
   the
   matrix notation.  This refers inherently to the id's on the columns
 of
   the
   recommender product (either B'B or B'A).  The columns are defined by
   the
   right hand element of the product (either B or A in the B'B and B'A
   respectively).
  
   The results are in the row space and are defined by the left hand
   operand
   of the product.  IN the case of B'A and B'B, the left hand operand
 is B
   in
   both cases so the row 

Re: Setting up a recommender

2013-08-05 Thread Pat Ferrel
Yeah, thought of that one too, but it still requires each to be ordered by Key, in 
which case simultaneous iteration works in one pass I think.

If the DRMs are always sorted by Key you can iterate through each at the same 
time, writing only when you have both fields or know a field is missing from one 
DRM. If you get the same key you write a combined doc; if you get different ones, 
write out one-sided docs until one iterator catches up to the other (sketched below).
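
For illustration, a sketch of that one-pass merge over two key-ordered streams. 
Plain iterators over a simple Row type stand in for the real SequenceFileDirIterator 
over the two DRMs, with each row assumed already rendered as space-delimited item 
IDs. The comparisons are written for ascending key order; flip them if the rows 
really come out descending as discussed here:

// Hedged sketch of the one-pass ordered merge of [B'B] and [B'A] rows into CSV lines.
import java.util.Iterator;

public class OrderedMerge {

  static class Row {
    final int key;      // the DRM row key (item ID)
    final String ids;   // the row's non-zero columns as space-delimited item IDs
    Row(int key, String ids) { this.key = key; this.ids = ids; }
  }

  // writes one CSV line per item: id,b_b_links,b_a_links (a field is empty if that
  // row is missing from one of the two matrices)
  static void merge(Iterator<Row> bb, Iterator<Row> ba) {
    Row b = bb.hasNext() ? bb.next() : null;
    Row a = ba.hasNext() ? ba.next() : null;
    while (b != null || a != null) {
      if (a == null || (b != null && b.key < a.key)) {
        System.out.println(b.key + "," + b.ids + ",");      // row only in [B'B]
        b = bb.hasNext() ? bb.next() : null;
      } else if (b == null || a.key < b.key) {
        System.out.println(a.key + ",," + a.ids);           // row only in [B'A]
        a = ba.hasNext() ? ba.next() : null;
      } else {                                              // same key: combined doc
        System.out.println(b.key + "," + b.ids + "," + a.ids);
        b = bb.hasNext() ? bb.next() : null;
        a = ba.hasNext() ? ba.next() : null;
      }
    }
  }
}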

Every DRM I've examined seems to be ordered by key and I assume that is not an 
artifact of seqdumper. I'm using SequenceFileDirIterator so the part file 
splits aren't a problem.

A m/r join is pretty simple too but I'll go with non-m/r unless there is a 
problem above.

BTW the schema for the Solr csv is:
id,b_b_links,b_a_links
item1,itemX itemY,itemZ

am I missing some normal metadata?

 On Aug 5, 2013, at 11:05 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 What about just updating the document with the fields?  Have three passes.
 Pass 1 puts the normal meta-data for the item in place.  Pass2 updates
 with data from B'B.  Pass 3 udpates with data from B'A.
 
 This will cause the entire index to be rewritten more than necessary, but
 it should be fast enough to be a non-issue.
 



Re: Setting up a recommender

2013-08-05 Thread Ted Dunning
On Mon, Aug 5, 2013 at 11:50 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 Yeah thought of that one too but it still requires each be ordered by Key,
 in which case simultaneous iteration works in one pass I think.


Multipass does not require ordering by key.  Solr documents can be updated
in any order.


 If the DRMs are always sorted by Key you can iterate through each at the
 same time, writing only when you have both fields or know there is a field
 missing from one DRM. If you get the same key you write a combined doc, if
 you have different ones, write out one sided until it catches up to the
 other.


Yes.  Merge will work when files are ordered and split consistently.  I
don't think we should be making that assumption.


 Every DRM I've examined seems to be ordered by key and I assume that is
 not an artifact of seqdumper. I'm using SequenceFileDirIterator so the part
 file splits aren't a problem.


But with the co- and cross- occurrence stuff, file splits could be a
problem.


 A m/r join is pretty simple too but I'll go with non-m/r unless there is a
 problem above.


The simplest join is to use Solr updates.  This would require a minimal
amount of programming, and less than writing a merge program.


 BTW the schema for the Solr csv is:
 id,b_b_links,b_a_links
 item1,itemX itemY,itemZ

 am I missing some normal metadata?


An item description is nice.


Re: Setting up a recommender

2013-08-05 Thread Johannes Schulte
we have a cross recommender in production for about 3 months now, with the
difference that we use lucene to build indices from map reduce directly,
plus we do the same thing for 30+ customers, most of them with different
input data structures (field names, values).

we had something similar before (lucene, multiple cross relations) that also
used the similarity score (llr) with a custom similarity and payloads, but we
switched to pure tedism after some helpful comments here. therefore i
read this thread with a lot of interest.

what i can add from my experiences:

1. i find it way easier to not talk about this in matrix multiplication
language but with contingency tables (a and b, a and not b, not a and b,
not a and not b), and i also find the usage of the classical mahout
similarity jobs hard. this is probably because of my basic matrix math
skills, but also because using matrices leads to id usage, and often the
extracted items are text (search term, country, page section). thinking of
this as related terms automatically gives a document view on the item to
be recommended (the lucene doc) where description, name and everything is
also just a field.

2. when doing a simple table it's just cooccurrences, marginals and totals.
since the dimension of marginals is often not too big (items, browsers,
terms), we right now accumulate the counts in memory. maybe the
RowSimilarityJob is working the same way. this can be changed to a
different implementation like an on-disk hash table or even a count-min
sketch if the number of items is too large. the main point is that the
counting of marginals can be done on the fly when emitting all
cooccurrences (an llr sketch from these counts follows after this list).

3. above in the thread there was a tip on approaching similarity scores
with repeating terms. payloads are a better way for this, and with lucene
4's doc values capability there shouldn't be any mahout similarity not
expressible by a lucene similarity. maybe it would be helpful to provide a
lucene delivery system also for the classic mahout recommender package.
it adds so many possibilities for filtering and takes away a lot of pain
like caching etc.

4. a big question is the frequency of rebuilding. while the relations can
often stay untouched for a day, the item data may change way more often
(item churn, new items). it is therefore beneficial to separate those and
have the possibility to rebuild the final index without calculating all
similarities again (for very critical things this often leads to querying
some external source to build up a lucene filter that restricts the index).
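
For illustration, a self-contained sketch (not from this thread) of the llr score 
computed from such a 2x2 contingency table; the counts in main are made up:

// Hedged sketch of llr from a 2x2 contingency table.
// k11 = a&b, k12 = a&!b, k21 = !a&b, k22 = !a&!b.
public class Llr {

  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // unnormalized entropy over raw counts
  static double entropy(long... counts) {
    long total = 0;
    double sum = 0.0;
    for (long c : counts) {
      total += c;
      sum += xLogX(c);
    }
    return xLogX(total) - sum;
  }

  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    double llr = 2.0 * (rowEntropy + colEntropy - matEntropy);
    return llr < 0.0 ? 0.0 : llr;   // guard against round-off going slightly negative
  }

  public static void main(String[] args) {
    // e.g. 13 users did both actions, 1000 did only one or the other, 100000 did neither
    System.out.println(logLikelihoodRatio(13, 1000, 1000, 100000));
  }
}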

besides that, i am very happy to see the ongoing effort on this topic and
hope that i can contribute something someday.

Cheers,
Johannes




On Mon, Aug 5, 2013 at 10:27 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Mon, Aug 5, 2013 at 11:50 AM, Pat Ferrel pat.fer...@gmail.com wrote:

  Yeah thought of that one too but it still requires each be ordered by
 Key,
  in which case simultaneous iteration works in one pass I think.
 

 Multipass does not require ordering by key.  Solr documents can be updated
 in any order.


  If the DRMs are always sorted by Key you can iterate through each at the
  same time, writing only when you have both fields or know there is a
 field
  missing from one DRM. If you get the same key you write a combined doc,
 if
  you have different ones, write out one sided until it catches up to the
  other.
 

 Yes.  Merge will work when files are ordered and split consistently.  I
 don't think we should be making that assumption.


  Every DRM I've examined seems to be ordered by key and I assume that is
  not an artifact of seqdumper. I'm using SequenceFileDirIterator so the
 part
  file splits aren't a problem.
 

 But with the co- and cross- occurrence stuff, file splits could be a
 problem.


  A m/r join is pretty simple too but I'll go with non-m/r unless there is
 a
  problem above.
 

 The simplest join is to use Solr updates.  This would require a minimal
 amount of programming, but less than writing a merge program.


  BTW the schema for the Solr csv is:
  id,b_b_links,b_a_links
  item1,itemX itemY,itemZ
 
  am I missing some normal metadata?
 

 An item description is nice.



Re: Setting up a recommender

2013-08-03 Thread Ted Dunning
Yes.  We need two different sets of documents if the row space of the
cross/co-occurrence matrices are different as is the case with A'B and B'B.

This could mean two indexes.

Or a single index with a special field to indicate what type of record you
have.


On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Thanks, well put.

 In order to have the ultimate impl with two id spaces for A and B would we
 have to create different docs for A'B and B'B? Since the docs IDs must come
 from A or B? The fields can contain different sets of IDs but the Doc ID
 must be one or the other, right? Doesn't this imply separate indexes for
 the separate A, B item IDs spaces? This is not a question for this first
 cut impl but is a generalization question.

 On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 So there is a lot of good discussion here and there were some key ideas.

 The first idea is that the *input* to a recommender is on the right in the
 matrix notation.  This refers inherently to the id's on the columns of the
 recommender product (either B'B or B'A).  The columns are defined by the
 right hand element of the product (either B or A in the B'B and B'A
 respectively).

 The results are in the row space and are defined by the left hand operand
 of the product.  IN the case of B'A and B'B, the left hand operand is B in
 both cases so the row space is consistent.

 In order to implement this in a search engine, we need documents that
 correspond to rows of B'A or B'B.  These are the same as the columns of B.
 The fields of the documents will necessarily include the following:

 id: the column id from B corresponding to this item
 description: presentation info ... yada yada
 b-a-links: contents of this row of B'A expressed as id's from the column
 space of A where this row  of llr-filter(B'A) contains a
 non-zero value.
 b-b-links: contents of this row of B'B expressed as id's from the column
 space of B ...


 The following operations are now single queries:

 get an item where id = x
   query is [id:x]

 recommend based on behavior with regard to A items and actions h_a
   query is [b-a-links: h_a]

 recommend based on behavior with regard to B items and actions h_b
   query is [b-b-links: h_b]

 recommend based on a single item with id = x
query is [b-b-links: x]

 recommend based on composite behavior composed of h_a and h_b
query is [b-a-links: h_a b-b-links: h_b]

 Does this make sense by being more explicit?

 Now, it is pretty clear that we could have an index of A objects as well
 but the link fields would have to be a-a-links and a-b-links, of course.




 On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  Assuming Ted needs to call it, not sure if an invite has gone out, I
  haven't seen one.
 
  On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
 
  I am planning on sitting in as flaky connection allows.
  On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
  We doing a hangout at 2 on the Solr recommender?
 
 
 




Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
I put some thought into this (actually I slept on it) and I think the answer is 
in the math.

-- A = matrix of action2 by user, used for cross-action recommendations, for 
instance action2 = views.
-- B = matrix of action1 by user, these are the primary recommender's actions, 
for instance action1 = purchases.
-- H_a1 = all user history of action1 in column vectors. This may be all 
action1's recorded and so may = B' or it may have truncated history to get more 
recent activity in recs.
-- H_a2 = all user history of action2 in column vectors. This may be all 
action2's recorded and so may = A' or it may have truncated history to get more 
recent activity in recs.
-- [B'B]H_a1 = R_a1, recommendations from action1. Recommendations are for 
action1.
-- [B'A]H_a2 = R_a2, recommendations calculated from action2 where there was 
also an action1. Recommendations are for action1. 
-- R_a1 + R_a2 = R, assumes a non-weighted linear combination; ideally they are 
weighted to optimize results. (A toy numeric sketch follows below.)
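
As a toy illustration of the two terms above (not from the repo), with tiny made-up 
matrices standing in for [B'B] and [B'A] and plain Java arrays instead of DRMs:

// Toy illustration of R_a1 = [B'B] h_a1, R_a2 = [B'A] h_a2 and R = R_a1 + R_a2.
import java.util.Arrays;

public class ToyCrossRecs {

  static double[] times(double[][] m, double[] v) {
    double[] r = new double[m.length];
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < v.length; j++) {
        r[i] += m[i][j] * v[j];
      }
    }
    return r;
  }

  public static void main(String[] args) {
    // hypothetical items: 0=iphone, 1=ipad, 2=galaxy; values are made up
    double[][] bTb = { {2, 1, 0}, {1, 2, 1}, {0, 1, 1} };   // stands in for [B'B]
    double[][] bTa = { {1, 1, 0}, {0, 1, 1}, {0, 0, 1} };   // stands in for [B'A]
    double[] hA1 = {1, 0, 0};   // one user's action1 (e.g. purchase) history
    double[] hA2 = {0, 1, 1};   // the same user's action2 (e.g. view) history

    double[] rA1 = times(bTb, hA1);
    double[] rA2 = times(bTa, hA2);

    double[] r = new double[rA1.length];   // unweighted linear combination
    for (int i = 0; i < r.length; i++) {
      r[i] = rA1[i] + rA2[i];
    }
    System.out.println(Arrays.toString(r)); // ranked by value, these are the action1 recs
  }
}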

The query on [B'A] will be column vectors from H_a2. Each is a user's history 
of action2 on A items. That is, if there were different items in A than in B, the 
query would be composed of those items and run against the field that contains 
those items. This brings up a bunch of other questions but for now we do not 
have separate items.

It illustrates the fact that the query is user history of action2 so the items 
(though they have the same ID space in this case) should be from A or there 
would be no hits.

Therefore we need the columns of [B'A], and [B'B]. [B'B] is symmetric so rows 
are the same as columns.

The confusion may come from the fact that Ted's mental model does not have the 
same items for both A and B. So the document ID cannot = item ID since the docs 
contain items from both item ID spaces. In which case I don't know why they 
would be in the same doc at all but that is another discussion. This model does 
not allow us to fetch a doc by ID.

But in our case, since we have the same IDs in A and B, we can put them in a doc 
with ID = item ID; the field similar_items can contain items from the B 
similarityMatrix rows, since they are the same as columns, and the 
cross_action_similar_items field will contain columns from [B'A].

This may just be mental looping--sleep only works about 50% of the time for me so 
maybe someone else can check this reasoning. Have a look at the data here 
https://github.com/pferrel/solr-recommender/blob/master/src/test/resources/Recommender%20Math.xlsx


On Aug 1, 2013, at 6:00 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Yes, storing the similar_items in a field, cross_action_similar_items in 
another field all on the same doc ided by item ID. Agree that there may be 
other fields.

Storing the rows of [B'B] is ok because it's symmetric. However we did talk 
about the [B'A] case and I thought we agreed to store the rows there too 
because they were from Bs items. This was the discussion about having different 
items for cross actions. The excerpt below is Ted responding to my question. So 
do we want the columns of [B'A]? It's only a transpose away.


 On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:
 [B'A] =
         iphone  ipad    nexus   galaxy  surface
 iphone  2       2       2       1       0
 ipad    2       2       2       1       0
 nexus   1       1       1       1       0
 galaxy  1       1       1       1       0
 surface 0       0       0       0       1
 
 The rows are what we want from [B'A] since the row items are from B, right?
 
 Yes.
 
 It is easier to understand if you have different kinds of items as well as 
 different actions.  For instance, suppose that you have user x query terms 
 (A) and user x device (B).  B'A is then device x term so that there is a row 
 per device and the row contains terms.  This is good when searching for 
 devices using terms.


Talking about getting the actual doc field values, which will include the 
similar_items field and other metadata. The actual ids in the similar_items 
field work well for anonymous/no-history recs but maybe there is a second query 
or fetch that I'm missing? I assumed that a fetch of the doc and it's fields  
by item ID was as fast a way to do this as possible. If there is some way to 
get the same result by doing a query that is faster, I'm all for it?

Can do tomorrow at 2.



Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
Apologies for thrashing--definitely doing some mental looping but look at the 
cross-similarities on the Template sheet of the Excel file. The rows of [B'A] 
intuitively look best.

Specifically there was a user who viewed the Surface and Nexus but the columns 
do not account for that, the rows do.

Going from rows to columns is the trivial addition of a transpose so I'm going 
to go ahead with rows for now. This affects the cross_action_similar_items and 
so only the cross-recommender part of the whole.

On Aug 2, 2013, at 8:00 AM, Pat Ferrel pat.fer...@gmail.com wrote:

I put some thought into this (actually I slept on it) and I think the answer is 
in the math.

-- A = matrix of action2 by user, used for cross-action recommendations, for 
instance action2 = views.
-- B = matrix of action1 by user, these are the primary recommenders actions, 
for instance action1 = purchases.
-- H_a1 = all user history of action1 in column vectors. This may be all 
action1's recorded and so may = B' or it may have truncated history to get more 
recent activity in recs.
-- H_a2 = all user history of action2 in column vectors. This may be all 
action2's recorded and so may = A' or it may have truncated history to get more 
recent activity in recs.
-- [B'B]H_a1 = R_a1, recommendations from action1. Recommendation are for 
action1.
-- [B'A]H_a2 = R_a2, recommendations calculated from action2 where there was 
also an action1. recommendation are for action1. 
-- R_a1+ R_a2 = R, assumes a non-weighted linear combination, ideally they are 
weighted to optimize results.

The query on [B'A] will be column vectors from  H_a2. Each is a user's  history 
of action2 on A items. That is if there were different items in A than B then 
the query would be comprised of those items and against the field that contains 
those items. This brings up a bunch of other questions but for now we do not 
have separate items.

It illustrates the fact that the query is user history of action2 so the items 
(though they have the same ID space in this case) should be from A or there 
would be no hits.

Therefore we need the columns of [B'A], and [B'B]. [B'B] is symmetric so rows 
are the same as columns.

The confusion may come from the fact that Ted's mental model does not have the 
same items for both A and B. So the document ID cannot = item ID since the docs 
contain items from both item ID spaces. In which case I don't know why they 
would be in the same doc at all but that is another discussion. This model does 
not allow us to fetch a doc by ID.

But in our case since we have the same IDs in A and B we can put them in a doc 
of ID=item ID, the field similair_items can contain items from B 
similarityMatrix rows since they are the same as columns, the 
cross_action_similar_items field will contain columns from [B'A]

This may just be mental looping--sleep only work about 50% of the time for me 
so maybe someone else can check this reasoning. Have a look at the data here 
https://github.com/pferrel/solr-recommender/blob/master/src/test/resources/Recommender%20Math.xlsx


On Aug 1, 2013, at 6:00 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Yes, storing the similar_items in a field, cross_action_similar_items in 
another field all on the same doc ided by item ID. Agree that there may be 
other fields.

Storing the rows of [B'B] is ok because it's symmetric. However we did talk 
about the [B'A] case and I thought we agreed to store the rows there too 
because they were from Bs items. This was the discussion about having different 
items for cross actions. The excerpt below is Ted responding to my question. So 
do we want the columns of [B'A]? It's only a transpose away.


 On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:
 [B'A] =
         iphone  ipad    nexus   galaxy  surface
 iphone  2       2       2       1       0
 ipad    2       2       2       1       0
 nexus   1       1       1       1       0
 galaxy  1       1       1       1       0
 surface 0       0       0       0       1
 
 The rows are what we want from [B'A] since the row items are from B, right?
 
 Yes.
 
 It is easier to understand if you have different kinds of items as well as 
 different actions.  For instance, suppose that you have user x query terms 
 (A) and user x device (B).  B'A is then device x term so that there is a row 
 per device and the row contains terms.  This is good when searching for 
 devices using terms.


Talking about getting the actual doc field values, which will include the 
similar_items field and other metadata. The actual ids in the similar_items 
field work well for anonymous/no-history recs but maybe there is a second query 
or fetch that I'm missing? I assumed that a fetch of the doc and it's fields  
by item ID was as fast a way to do this as possible. If there is some way to 
get the same result by doing a query that is faster, I'm all for it?

Can do tomorrow at 2.




Re: Setting up a recommender

2013-08-02 Thread B Lyon
I think the sheet is very helpful.

 I was wondering about having at least one of the examples be where the
actions deal with completely different things to maybe make it easier for
newbies like me to grok the main points: purchases of items of type blah
and views of videos, say.  I think the input file has the same setup etc.

I don't get the issue/questions that come up when we do have separate
items.  And I thought Ted mentioned at one point that the weighting of
recommendation vectors might not be necessary based on some kind of solr
magic, but I have no idea what that is.

Btw, i was already thinking of doing something for my own
clarification/edification that is similar to your spreadsheet, but would be
a web page where a mouseover on one piece highlights the other pieces that
generated it... E.g. The way the links in this pagerank explorer highlight
the relevant portions of the google matrix (
https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/). There are lots
of other different pieces here of course, but it would show connections
soup-to-nuts as much as possible.



On Friday, August 2, 2013, Pat Ferrel wrote:

 I put some thought into this (actually I slept on it) and I think the
 answer is in the math.

 -- A = matrix of action2 by user, used for cross-action recommendations,
 for instance action2 = views.
 -- B = matrix of action1 by user, these are the primary recommenders
 actions, for instance action1 = purchases.
 -- H_a1 = all user history of action1 in column vectors. This may be all
 action1's recorded and so may = B' or it may have truncated history to get
 more recent activity in recs.
 -- H_a2 = all user history of action2 in column vectors. This may be all
 action2's recorded and so may = A' or it may have truncated history to get
 more recent activity in recs.
 -- [B'B]H_a1 = R_a1, recommendations from action1. Recommendation are for
 action1.
 -- [B'A]H_a2 = R_a2, recommendations calculated from action2 where there
 was also an action1. recommendation are for action1.
 -- R_a1+ R_a2 = R, assumes a non-weighted linear combination, ideally they
 are weighted to optimize results.

 The query on [B'A] will be column vectors from  H_a2. Each is a user's
  history of action2 on A items. That is if there were different items in A
 than B then the query would be comprised of those items and against the
 field that contains those items. This brings up a bunch of other questions
 but for now we do not have separate items.

 It illustrates the fact that the query is user history of action2 so the
 items (though they have the same ID space in this case) should be from A or
 there would be no hits.

 Therefore we need the columns of [B'A], and [B'B]. [B'B] is symmetric so
 rows are the same as columns.

 The confusion may come from the fact that Ted's mental model does not have
 the same items for both A and B. So the document ID cannot = item ID since
 the docs contain items from both item ID spaces. In which case I don't know
 why they would be in the same doc at all but that is another discussion.
 This model does not allow us to fetch a doc by ID.

 But in our case since we have the same IDs in A and B we can put them in a
 doc of ID=item ID, the field similair_items can contain items from B
 similarityMatrix rows since they are the same as columns, the
 cross_action_similar_items field will contain columns from [B'A]

 This may just be mental looping--sleep only work about 50% of the time for
 me so maybe someone else can check this reasoning. Have a look at the data
 here
 https://github.com/pferrel/solr-recommender/blob/master/src/test/resources/Recommender%20Math.xlsx


 On Aug 1, 2013, at 6:00 PM, Pat Ferrel pat.fer...@gmail.com
 wrote:

 Yes, storing the similar_items in a field, cross_action_similar_items in
 another field all on the same doc ided by item ID. Agree that there may be
 other fields.

 Storing the rows of [B'B] is ok because it's symmetric. However we did
 talk about the [B'A] case and I thought we agreed to store the rows there
 too because they were from Bs items. This was the discussion about having
 different items for cross actions. The excerpt below is Ted responding to
 my question. So do we want the columns of [B'A]? It's only a transpose away.


  On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel 
  p...@occamsmachete.com
 wrote:
  [B'A] =
          iphone  ipad    nexus   galaxy  surface
  iphone  2       2       2       1       0
  ipad    2       2       2       1       0
  nexus   1       1       1       1       0
  galaxy  1       1       1       1       0
  surface 0       0       0       0       1
 
  The rows are what we want from [B'A] since the row items are from B,
 right?
 
  Yes.
 
  It is easier to understand if you have different kinds of items as well
 as different actions.  For instance, suppose that you have user x query
 terms (A) and user x device (B).  B'A is then device x term so that there
 is a row per device and the 

Re: Setting up a recommender

2013-08-02 Thread Andrew Psaltis
On 8/2/13 12:13 PM, B Lyon bradfl...@gmail.com wrote:


The way the links in this pagerank explorer highlight
the relevant portions of the google matrix (
https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/). There are
lots

That is pretty darn cool, great job!



Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
This first cut project explicitly assumes a unified user and item space. This 
works well for many action pairs, not for others. The reason I did this to 
begin with was for using multiple actions for ecom recs. Views were not very 
predictive of purchases alone and needed the cross-recommender treatment. We 
did this using Mahout matrix math so the issue of what to write to Solr did not 
come up. It worked fine, but now we find the need for an online method that will 
make use of realtime-generated preferences, i.e. ones not in the batch training 
data.

The math still works for multiple item spaces but users must be in common. More 
generally the rank and ID space currently associated with users must be the 
same.

Feel free to create examples if you want. Ted has some ideas for using multiple 
item spaces in presos that are on Slideshare I think.


On Aug 2, 2013, at 10:13 AM, B Lyon bradfl...@gmail.com wrote:

I think the sheet is very helpful.

I was wondering about having at least one of the examples be where the
actions deal with completely different things to maybe make it easier for
newbies like me to grok the main points: purchases of items of type blah
and views of videos, say.  I think the input file has the same setup etc.

I don't get the issue/questions that come up when we do have separate
items.  And I thought Ted mentioned at one point that the weighting of
recommendation vectors might not be necessary based on some kind of solr
magic, but I have no idea what that is.

Btw, i was already thinking of doing something for my own
clarification/edification that is similar to your spreadsheet, but would be
a web page where a mouseover on one piece highlights the other pieces that
generated it... E.g. The way the links in this pagerank explorer highlight
the relevant portions of the google matrix (
https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/). There are lots
of other different pieces here of course,
but show connections soup-to-nuts as much as possible.



On Friday, August 2, 2013, Pat Ferrel wrote:

 I put some thought into this (actually I slept on it) and I think the
 answer is in the math.
 
 -- A = matrix of action2 by user, used for cross-action recommendations,
 for instance action2 = views.
 -- B = matrix of action1 by user, these are the primary recommenders
 actions, for instance action1 = purchases.
 -- H_a1 = all user history of action1 in column vectors. This may be all
 action1's recorded and so may = B' or it may have truncated history to get
 more recent activity in recs.
 -- H_a2 = all user history of action2 in column vectors. This may be all
 action2's recorded and so may = A' or it may have truncated history to get
 more recent activity in recs.
 -- [B'B]H_a1 = R_a1, recommendations from action1. Recommendation are for
 action1.
 -- [B'A]H_a2 = R_a2, recommendations calculated from action2 where there
 was also an action1. recommendation are for action1.
 -- R_a1+ R_a2 = R, assumes a non-weighted linear combination, ideally they
 are weighted to optimize results.
 
 The query on [B'A] will be column vectors from  H_a2. Each is a user's
 history of action2 on A items. That is if there were different items in A
 than B then the query would be comprised of those items and against the
 field that contains those items. This brings up a bunch of other questions
 but for now we do not have separate items.
 
 It illustrates the fact that the query is user history of action2 so the
 items (though they have the same ID space in this case) should be from A or
 there would be no hits.
 
 Therefore we need the columns of [B'A], and [B'B]. [B'B] is symmetric so
 rows are the same as columns.
 
 The confusion may come from the fact that Ted's mental model does not have
 the same items for both A and B. So the document ID cannot = item ID since
 the docs contain items from both item ID spaces. In which case I don't know
 why they would be in the same doc at all but that is another discussion.
 This model does not allow us to fetch a doc by ID.
 
 But in our case since we have the same IDs in A and B we can put them in a
 doc of ID=item ID, the field similair_items can contain items from B
 similarityMatrix rows since they are the same as columns, the
 cross_action_similar_items field will contain columns from [B'A]
 
 This may just be mental looping--sleep only work about 50% of the time for
 me so maybe someone else can check this reasoning. Have a look at the data
 here
 https://github.com/pferrel/solr-recommender/blob/master/src/test/resources/Recommender%20Math.xlsx
 
 
 On Aug 1, 2013, at 6:00 PM, Pat Ferrel pat.fer...@gmail.com
 wrote:
 
 Yes, storing the similar_items in a field, cross_action_similar_items in
 another field all on the same doc ided by item ID. Agree that there may be
 other fields.
 
 Storing the rows of [B'B] is ok because it's symmetric. However we did
 talk about the [B'A] case and I thought we agreed to store the rows there
 too because 

Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
We doing a hangout at 2 on the Solr recommender?


Re: Setting up a recommender

2013-08-02 Thread B Lyon
I am planning on sitting in as flaky connection allows.
On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 We doing a hangout at 2 on the Solr recommender?



Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
Assuming Ted needs to call it, not sure if an invite has gone out, I haven't 
seen one.

On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:

I am planning on sitting in as flaky connection allows.
On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 We doing a hangout at 2 on the Solr recommender?
 



Re: Setting up a recommender

2013-08-02 Thread Ted Dunning
So there is a lot of good discussion here and there were some key ideas.

The first idea is that the *input* to a recommender is on the right in the
matrix notation.  This refers inherently to the id's on the columns of the
recommender product (either B'B or B'A).  The columns are defined by the
right hand element of the product (either B or A in the B'B and B'A
respectively).

The results are in the row space and are defined by the left hand operand
of the product.  In the case of B'A and B'B, the left hand operand is B in
both cases so the row space is consistent.

In order to implement this in a search engine, we need documents that
correspond to rows of B'A or B'B.  These are the same as the columns of B.
 The fields of the documents will necessarily include the following:

id: the column id from B corresponding to this item
description: presentation info ... yada yada
b-a-links: contents of this row of B'A expressed as id's from the column
space of A where this row  of llr-filter(B'A) contains a
non-zero value.
b-b-links: contents of this row of B'B expressed as id's from the column
space of B ...


The following operations are now single queries:

get an item where id = x
   query is [id:x]

recommend based on behavior with regard to A items and actions h_a
   query is [b-a-links: h_a]

recommend based on behavior with regard to B items and actions h_b
   query is [b-b-links: h_b]

recommend based on a single item with id = x
query is [b-b-links: x]

recommend based on composite behavior composed of h_a and h_b
query is [b-a-links: h_a b-b-links: h_b]

Does this make sense by being more explicit?

Now, it is pretty clear that we could have an index of A objects as well
but the link fields would have to be a-a-links and a-b-links, of course.
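
For concreteness, a minimal SolrJ sketch of the composite-behavior query above
(this uses the SolrJ 4.x-era client; the URL is illustrative, and underscores
replace the hyphens in the field names so the default query parser stays happy):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CompositeQuerySketch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // h_a = "iphone ipad" (A item history), h_b = "ipad" (B item history)
    SolrQuery q = new SolrQuery("b_a_links:(iphone ipad) b_b_links:(ipad)");
    q.setRows(10);
    QueryResponse rsp = solr.query(q);
    for (SolrDocument doc : rsp.getResults()) {
      // hits come back ranked by relevance, i.e. best recommendations first
      System.out.println(doc.getFieldValue("id"));
    }
  }
}

The single-item and single-action cases are just the same query with fewer
terms or only one field.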




On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Assuming Ted needs to call it, not sure if an invite has gone out, I
 haven't seen one.

 On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:

 I am planning on sitting in as flaky connection allows.
 On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  We doing a hangout at 2 on the Solr recommender?
 




Re: Setting up a recommender

2013-08-02 Thread Sebastian Schelter
I really like this approach, especially as it makes it possible to
individually recompute and update certain similarity matrices. Furthermore
it should enable rapid experimentation as it's super easy to retrieve
recommendations based on different behaviors.

2013/8/2 Ted Dunning ted.dunn...@gmail.com

 So there is a lot of good discussion here and there were some key ideas.

 The first idea is that the *input* to a recommender is on the right in the
 matrix notation.  This refers inherently to the id's on the columns of the
 recommender product (either B'B or B'A).  The columns are defined by the
 right hand element of the product (either B or A in the B'B and B'A
 respectively).

 The results are in the row space and are defined by the left hand operand
 of the product.  IN the case of B'A and B'B, the left hand operand is B in
 both cases so the row space is consistent.

 In order to implement this in a search engine, we need documents that
 correspond to rows of B'A or B'B.  These are the same as the columns of B.
  The fields of the documents will necessarily include the following:

 id: the column id from B corresponding to this item
 description: presentation info ... yada yada
 b-a-links: contents of this row of B'A expressed as id's from the column
 space of A where this row  of llr-filter(B'A) contains a
 non-zero value.
 b-b-links: contents of this row of B'B expressed as id's from the column
 space of B ...


 The following operations are now single queries:

 get an item where id = x
query is [id:x]

 recommend based on behavior with regard to A items and actions h_a
query is [b-a-links: h_a]

 recommend based on behavior with regard to B items and actions h_b
query is [b-b-links: h_b]

 recommend based on a single item with id = x
 query is [b-b-links: x]

 recommend based on composite behavior composed of h_a and h_b
 query is [b-a-links: h_a b-b-links: h_b]

 Does this make sense by being more explicit?

 Now, it is pretty clear that we could have an index of A objects as well
 but the link fields would have to be a-a-links and a-b-links, of course.




 On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  Assuming Ted needs to call it, not sure if an invite has gone out, I
  haven't seen one.
 
  On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
 
  I am planning on sitting in as flaky connection allows.
  On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
   We doing a hangout at 2 on the Solr recommender?
  
 
 



Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
Thanks, well put.

In order to have the ultimate impl with two id spaces for A and B, would we have 
to create different docs for A'B and B'B, since the doc IDs must come from A 
or B? The fields can contain different sets of IDs but the doc ID must be one 
or the other, right? Doesn't this imply separate indexes for the separate A and B 
item ID spaces? This is not a question for this first cut impl but is a 
generalization question.

On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:

So there is a lot of good discussion here and there were some key ideas.

The first idea is that the *input* to a recommender is on the right in the
matrix notation.  This refers inherently to the id's on the columns of the
recommender product (either B'B or B'A).  The columns are defined by the
right hand element of the product (either B or A in the B'B and B'A
respectively).

The results are in the row space and are defined by the left hand operand
of the product.  IN the case of B'A and B'B, the left hand operand is B in
both cases so the row space is consistent.

In order to implement this in a search engine, we need documents that
correspond to rows of B'A or B'B.  These are the same as the columns of B.
The fields of the documents will necessarily include the following:

id: the column id from B corresponding to this item
description: presentation info ... yada yada
b-a-links: contents of this row of B'A expressed as id's from the column
space of A where this row  of llr-filter(B'A) contains a
non-zero value.
b-b-links: contents of this row of B'B expressed as id's from the column
space of B ...


The following operations are now single queries:

get an item where id = x
  query is [id:x]

recommend based on behavior with regard to A items and actions h_a
  query is [b-a-links: h_a]

recommend based on behavior with regard to B items and actions h_b
  query is [b-b-links: h_b]

recommend based on a single item with id = x
   query is [b-b-links: x]

recommend based on composite behavior composed of h_a and h_b
   query is [b-a-links: h_a b-b-links: h_b]

Does this make sense by being more explicit?

Now, it is pretty clear that we could have an index of A objects as well
but the link fields would have to be a-a-links and a-b-links, of course.




On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Assuming Ted needs to call it, not sure if an invite has gone out, I
 haven't seen one.
 
 On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
 
 I am planning on sitting in as flaky connection allows.
 On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 We doing a hangout at 2 on the Solr recommender?
 
 
 



Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
Got away with that stupid comment. All doc ids will be from B items even in the 
general case.

On Aug 2, 2013, at 2:39 PM, Pat Ferrel p...@occamsmachete.com wrote:

Thanks, well put.

In order to have the ultimate impl with two id spaces for A and B would we have 
to create different docs for A'B and B'B? Since the docs IDs must come from A 
or B? The fields can contain different sets of IDs but the Doc ID must be one 
or the other, right? Doesn't this imply separate indexes for the separate A, B 
item IDs spaces? This is not a question for this first cut impl but is a 
generalization question.

On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:

So there is a lot of good discussion here and there were some key ideas.

The first idea is that the *input* to a recommender is on the right in the
matrix notation.  This refers inherently to the id's on the columns of the
recommender product (either B'B or B'A).  The columns are defined by the
right hand element of the product (either B or A in the B'B and B'A
respectively).

The results are in the row space and are defined by the left hand operand
of the product.  IN the case of B'A and B'B, the left hand operand is B in
both cases so the row space is consistent.

In order to implement this in a search engine, we need documents that
correspond to rows of B'A or B'B.  These are the same as the columns of B.
The fields of the documents will necessarily include the following:

id: the column id from B corresponding to this item
description: presentation info ... yada yada
b-a-links: contents of this row of B'A expressed as id's from the column
space of A where this row  of llr-filter(B'A) contains a
non-zero value.
b-b-links: contents of this row of B'B expressed as id's from the column
space of B ...


The following operations are now single queries:

get an item where id = x
 query is [id:x]

recommend based on behavior with regard to A items and actions h_a
 query is [b-a-links: h_a]

recommend based on behavior with regard to B items and actions h_b
 query is [b-b-links: h_b]

recommend based on a single item with id = x
  query is [b-b-links: x]

recommend based on composite behavior composed of h_a and h_b
  query is [b-a-links: h_a b-b-links: h_b]

Does this make sense by being more explicit?

Now, it is pretty clear that we could have an index of A objects as well
but the link fields would have to be a-a-links and a-b-links, of course.




On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Assuming Ted needs to call it, not sure if an invite has gone out, I
 haven't seen one.
 
 On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
 
 I am planning on sitting in as flaky connection allows.
 On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 We doing a hangout at 2 on the Solr recommender?
 
 
 




Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Not following so…

Here is what I've done, in probably too much detail:

1) ingest raw log files and split them up by action
2) turn these into Mahout preference files using Mahout type IDs, keeping a map 
of IDs
3) run the Mahout Item-based recommender using LLR for similarity
4) create a Mahout-style cross-recommender using cooccurrence similarity computed 
with matrix math
5) given two similarity matrices and a user history matrix, I am writing them to 
csv files with the Mahout IDs replaced by the original string external IDs for users 
and items

input log file before splitting:
u1  purchase  iphone
u1  purchase  ipad
u2  purchase  nexus-tablet
u2  purchase  galaxy
u3  purchase  surface
u4  purchase  iphone
u4  purchase  ipad
u1  view  iphone
u1  view  ipad
u1  view  nexus-tablet
u1  view  galaxy
u2  view  iphone
u2  view  ipad
u2  view  nexus-tablet
u2  view  galaxy
u3  view  surface
u4  view  iphone
u4  view  ipad
u4  view  nexus-tablet


Input user history DRM after ID translation to Mahout IDs and splitting for 
the purchase action

B   user/item   iphone  ipad  nexus-tablet  galaxy  surface
u1  1   1   0   0   0
u2  0   0   1   1   0
u3  0   0   0   0   1
u4  1   1   0   0   0

Map of IDs Mahout to Original/External
0 - iphone
1 - ipad
2 - nexus-tablet
3 - galaxy
4 - surface

To be specific the DRM from the RecommenderJob with item-item similarities 
using LLR looks like this:
Input Path: out/p-recs/sims/part-r-0
Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
org.apache.mahout.math.VectorWritable
Key: 0: Value: {1:0.8472157541208549}
Key: 1: Value: {0:0.8472157541208549}
Key: 2: Value: {3:0.8181382096075936}
Key: 3: Value: {2:0.8181382096075936}
Key: 4: Value: {}

This will be written to a directory for later Solr indexing as a csv of the 
form:
item_id,similar_items,cross_action_similar_items
iphone,ipad,
ipad,iphone,
nexus-tablet,galaxy,
galaxy,nexus-tablet,
surface,,
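
For reference, a rough sketch of how such a csv could be written from one
similarity DRM (class name, method signature, and the external ID map are
illustrative, not the project's actual code; the cross_action_similar_items
column would be filled by joining the [B'A] rows the same way):

import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SimsToCsv {
  // simsPart: a part file of the similarity DRM; itemIndex: Mahout int id -> external id
  public static void write(Path simsPart, Map<Integer, String> itemIndex, PrintWriter out)
      throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, simsPart, conf);
    IntWritable key = new IntWritable();
    VectorWritable value = new VectorWritable();
    out.println("item_id,similar_items");
    while (reader.next(key, value)) {
      Vector row = value.get();
      // copy out (strength, column) pairs; the iterator may reuse its Element object
      List<double[]> pairs = new ArrayList<double[]>();
      for (Iterator<Vector.Element> it = row.iterateNonZero(); it.hasNext(); ) {
        Vector.Element e = it.next();
        pairs.add(new double[] {e.get(), e.index()});
      }
      // order the items in the field by similarity strength, strongest first
      Collections.sort(pairs, new Comparator<double[]>() {
        public int compare(double[] a, double[] b) { return Double.compare(b[0], a[0]); }
      });
      StringBuilder field = new StringBuilder();
      for (double[] p : pairs) {
        if (field.length() > 0) field.append(' ');
        field.append(itemIndex.get((int) p[1]));
      }
      out.println(itemIndex.get(key.get()) + "," + field);
    }
    reader.close();
  }
}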

By using a user's history vector as a query you get results = recommendations
So if the user is u1, the history vector is:
iphone ipad

The Solr results for query iphone ipad using field similar_items will be 
1. Doc ID, ipad
2. Doc ID, iphone

If you want item similarities, for instance when a user is anonymous with no 
history and is looking at an iphone product page, you would fetch the doc for 
id = iphone and get:
ipad

Perhaps a bad example for ordering, since there is only one ID in the doc but 
the items in the similar_items field would be ordered by similarity strength. 

Likewise for the cross-action similarities though the matrix will have 
cooccurrence [B'A] values in the DRM.

For item similarities there is no need to do more than fetch one doc that 
contains the similarities, right? I've successfully used this method with the 
Mahout recommender but please correct me if something above is wrong. 


On Jul 31, 2013, at 4:52 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Pat,

See inline


On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel p...@occamsmachete.com wrote:

 So the XML as CSV would be:
 item_id,similar_items,cross_action_similar_items
 ipad,iphone,iphone nexus
 iphone,ipad,ipad galaxy
 

Right.  Doesn't matter what format.  Might want quotes around space
delimited lists, but anything will do.


 
 Note: As I mentioned before the order of the items in the field will
 encode rank of the similarity strength. This is for cases where you want to
 find similar items to a context item. You would fetch the doc for the
 context item by it's item ID and show the top k items in the doc. Ted's
 caveat would probably be to dither them.
 

I always say dither so that is an easy one.

But fetching similar items of a center item by fetching the center item and
then fetching each of the referenced items is typically slower by about 2x
than running the search for mentions of the center item.


 Sounds like Ted is generating data. Andrew or M Lyon do either of you want
 to set the demo system up? If so you'll need to find a system--free tier
 AWS, Ted's box, etc. Then install all the needed stuff.
 
 I'll get the output working to csv.
 
 On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 OK and yes. The docs will look like:
 
 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>
 
 
 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:
 
 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the 

Re: Setting up a recommender

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote:


 For item similarities there is no need to do more than fetch one doc that
 contains the similarities, right? I've successfully used this method with
 the Mahout recommender but please correct me if something above is wrong.


No.

First, you need to retrieve all the other documents that are referenced to
get their display meta-data. So this isn't just a one document fetch.

Second, the similar items point inwards, not outwards.  Thus, the query you
want has the id of the current item and searches the similar_items field.
 The result of that search is all of the similar items.

The confusion here may stem from the name of the field.  A name like
linked-from-items or some such might help here.


Another way to look at this is that there should be no procedural
difference if you have 10 items or 20 in your history.  Either way, your
history is a query against the appropriate link fields.  Likewise, there
should be no difference between having 10 items or 2 items in your history.
 There shouldn't even be any difference if you have even just 1 item in
your history.

Finding items similar to a single item is exactly like having 1 item in
your history.  So that should be done by searching with that one item in
the appropriate link fields.
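
In SolrJ terms, the difference between the two styles is roughly this (client,
URL and field names follow the earlier examples and are illustrative only):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SimilarItemsSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // Fetch-the-doc style: one doc comes back, and every id listed in its
    // similar_items field still needs another lookup for display meta-data.
    QueryResponse oneDoc = solr.query(new SolrQuery("item_id:iphone"));

    // Search style: the hits are the similar items themselves, meta-data
    // included, ranked by the engine.
    QueryResponse similar = solr.query(new SolrQuery("similar_items:iphone"));
    for (SolrDocument doc : similar.getResults()) {
      System.out.println(doc.getFieldValue("item_id"));
    }
  }
}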


Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Sorry to be dense but I think there is some miscommunication. The most 
important question is: am I writing the item-item similarity matrix DRM out to 
Solr, one row = one Solr doc? For the mapreduce Mahout Item-based recommender 
this is in tmp/similarityMatrix. If not then please stop me. If I'm off base 
here, maybe a skype or im session will straighten me out. pat.fer...@gmail.com 
or p...@occamsmachete.com


To be clear below I'm not talking about history based recs, which is the 
primary use case. I am talking about a query that does not use history, that 
only finds similar items based on training data. The item-item similarity 
matrix DRM contains Key = item ID, Value = list of item IDs with similarity 
strengths.

This is equivalent to the list returned by ItemBasedRecommender's
public List<RecommendedItem> mostSimilarItems(long itemID, int howMany) throws 
TasteException

Specified by:
mostSimilarItems in interface ItemBasedRecommender

Parameters:
itemID - ID of item for which to find most similar other items
howMany - desired number of most similar items to find

Returns:
items most similar to the given item, ordered from most similar to least

To get the list from Solr you would fetch the doc associated with itemID, no? 

When using the Mahout mapreduce item-based recommender we get the similarity 
matrix and do just that. We get the row associated with the Mahout itemID and 
recommend the top k items from the vector. This performs well in 
cross-validation tests.
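
For comparison, the in-memory (non-mapreduce) Taste API exposes the same
operation directly; a minimal sketch, assuming a prefs.csv in Taste's
userID,itemID[,pref] format and an illustrative item id:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class MostSimilarDemo {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("prefs.csv"));
    GenericItemBasedRecommender rec =
        new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));
    // items most similar to item 42, ordered from most similar to least
    List<RecommendedItem> similar = rec.mostSimilarItems(42L, 10);
    for (RecommendedItem item : similar) {
      System.out.println(item.getItemID() + "\t" + item.getValue());
    }
  }
}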



On Aug 1, 2013, at 9:49 AM, Ted Dunning ted.dunn...@gmail.com wrote:

On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote:

 
 For item similarities there is no need to do more than fetch one doc that
 contains the similarities, right? I've successfully used this method with
 the Mahout recommender but please correct me if something above is wrong.


No.

First, you need to retrieve all the other documents that are referenced to
get their display meta-data. So this isn't just a one document fetch.

Second, the similar items point inwards, not outwards.  Thus, the query you
want has the id of the current item and searches the similar_items field.
The result of that search is all of the similar items.

The confusion here may stem from the name of the field.  A name like
linked-from-items or some such might help here.


Another way to look at this is that there should be no procedural
difference if you have 10 items or 20 in your history.  Either way, your
history is a query against the appropriate link fields.  Likewise, there
should be no difference between having 10 items or 2 items in your history.
There shouldn't even be any difference if you have even just 1 item in
your history.

Finding items similar to a single item is exactly like having 1 item in
your history.  So that should be done by searching with that one item in
the appropriate link fields.



Re: Setting up a recommender

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 11:58 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 Sorry to be dense but I think there is some miscommunication. The most
 important question is: am I writing the item-item similarity matrix DRM out
 to Solr, one row = one Solr doc?


Each row = one *field* in a Solr doc.  Different DRM's produce different
fields in the same docs.

There will also be item meta-data in the field.


 For the mapreduce Mahout Item-based recommender this is in
 tmp/similarityMatrix. If not then please stop me. If I'm off base here,
  maybe a skype or im session will straighten me out. pat.ferrel@gmail.com or
 p...@occamsmachete.com


Actually, that is a grand idea.  Let's do a hangout.

From the who-is-free-when survey
(https://docs.google.com/forms/d/1skIaqe0CBWO4qemTyHCZwS40YjXJ9FeLCqwV8cw4Gno/viewform),
it looks like lots of people are available tomorrow at 2PM PDT.

Would that work?

To be clear below I'm not talking about history based recs, which is the
 primary use case. I am talking about a query that does not use history,
 that only finds similar items based on training data. The item-item
 similarity matrix DRM contains Key = item ID, Value = list of item IDs with
 similarity strengths.


Yes.  I absolutely agree that you can do this.

These should, strictly speaking, be columns in the item-item matrix.  The
item-item matrix may or may not be symmetric.  If it is symmetric, then
column or row doesn't matter.


 This is equivalent to the list returned by ItemBasedRecommender's
  public List<RecommendedItem> mostSimilarItems(long itemID, int howMany)
 throws TasteException


Yes.


 Specified by:
 mostSimilarItems in interface ItemBasedRecommender

 Parameters:
 itemID - ID of item for which to find most similar other items
 howMany - desired number of most similar items to find

 Returns:
 items most similar to the given item, ordered from most similar to least

 To get the list from Solr you would fetch the doc associated with
 itemID, no?


If you store the column, then yes.

If you store the row, then using a query on the field containing the
similar items is the right answer.

The key difference that I have is what happens in the next step.

When using the Mahout mapreduce item-based recommender we get the
 similarity matrix and do just that. We get the row associated with the
 Mahout itemID and recommend the top k items from the vector. This performs
 well in cross-validation tests.


Good.

I think that there is a row/column confusion here, but they are probably
nearly identical in your application.

The key point is what happens *after* you do the query that you are
suggesting.

In your case, you have to retrieve the meta-data associated with each of
related items.  I like to store this meta-data in a Solr field (or three)
so this involves at least one additional query.  You can automatically
chain this second query by using the join operation that Solr provides,
but the second query still happens.

If you do the query the way that I suggest, this second query doesn't need
to happen.  You get the meta-data directly.
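
For reference, the chained form can be written with Solr's join query parser
(available since Solr 4.0); field names here follow the csv examples in this
thread:

q={!join from=similar_items to=item_id}item_id:iphone   (chained: resolve the listed ids in one request)
q=similar_items:iphone                                  (direct: the hits are the similar items)

The first matches item_id:iphone, collects that doc's similar_items values, and
returns the docs whose item_id matches them; the second skips the indirection.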








 On Aug 1, 2013, at 9:49 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote:

 
  For item similarities there is no need to do more than fetch one doc that
  contains the similarities, right? I've successfully used this method with
  the Mahout recommender but please correct me if something above is wrong.


 No.

 First, you need to retrieve all the other documents that are referenced to
 get their display meta-data. So this isn't just a one document fetch.

 Second, the similar items point inwards, not outwards.  Thus, the query you
 want has the id of the current item and searches the similar_items field.
 The result of that search is all of the similar items.

 The confusion here may stem from the name of the field.  A name like
 linked-from-items or some such might help here.


 Another way to look at this is that there should be no procedural
 difference if you have 10 items or 20 in your history.  Either way, your
 history is a query against the appropriate link fields.  Likewise, there
 should be no difference between having 10 items or 2 items in your history.
 There shouldn't even be any difference if you have even just 1 item in
 your history.

 Finding items similar to a single item is exactly like having 1 item in
 your history.  So that should be done by searching with that one item in
 the appropriate link fields.




Re: Setting up a recommender

2013-08-01 Thread B Lyon
I am wondering about row/column confusion as well - fleshing out the
doc/design with more specifics (which Pat is kind of doing, basically)
should make things obvious eventually, imo.

The way Pat had phrased it got me to wondering what rationale you use to
rank the results when you are querying the columns (similar column,
similar via action 2 column, etc.).

He had mentioned the auxiliary case of simply getting most similar items to
a given docid by just going to the row for that docid and using the
pre-sorted values in the similar column, and I thought Ted might have
hinted that you could just as well do a solr query of the column with that
single docid as the query; however, in the latter case I wonder if the
order and list itself could be weird, as some items may show up simply
because they are not similar to many things: lower LLR values that got
filtered out of the list for the docid itself won't get filtered when you're
looking at the other "not similar to very many items" things when
generating their list for the solr field.  I guess using an absolute
cutoff for LLR in the filtering could deal with some of this issue.  All
hypothetical at the moment (for me, anyway), as real data might trivially
dismiss some of these concerns as irrelevant.

I think the hangout is a good idea, too, btw, and hope to be able to sit in
if it happens.  Very excited about this approach.

On Thu, Aug 1, 2013 at 6:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Thu, Aug 1, 2013 at 11:58 AM, Pat Ferrel pat.fer...@gmail.com wrote:

  Sorry to be dense but I think there is some miscommunication. The most
  important question is: am I writing the item-item similarity matrix DRM
 out
  to Solr, one row = one Solr doc?


 Each row = one *field* in a Solr doc.  Different DRM's produce different
 fields in the same docs.

 There will also be item meta-data in the field.


  For the mapreduce Mahout Item-based recommender this is in
  tmp/similarityMatrix. If not then please stop me. If I'm off base here,
  maybe a skype or im session will straighten me out.
 pat.ferrel@gmail.com or
  p...@occamsmachete.com


 Actually, that is a grand idea.  Let's do a hangout.

 From the who-is-free-when
 https://docs.google.com/forms/d/1skIaqe0CBWO4qemTyHCZwS40YjXJ9FeLCqwV8cw4Gno/viewform
 survey,
 it looks like lots of people are available tomorrow at 2PM PDT.

 Would that work?

 To be clear below I'm not talking about history based recs, which is the
  primary use case. I am talking about a query that does not use history,
  that only finds similar items based on training data. The item-item
  similarity matrix DRM contains Key = item ID, Value = list of item IDs
 with
  similarity strengths.
 

 Yes.  I absolutely agree that you can do this.

 These should, strictly speaking, be columns in the item-item matrix.  The
 item-item matrix may or may not be symmetric.  If it is symmetric, then
 column or row doesn't matter.


  This is equivalent to the list returned by ItemBasedRecommender's
  public List<RecommendedItem> mostSimilarItems(long itemID, int howMany)
  throws TasteException
 

 Yes.


  Specified by:
  mostSimilarItems in interface ItemBasedRecommender
 
  Parameters:
  itemID - ID of item for which to find most similar other items
  howMany - desired number of most similar items to find
 
  Returns:
  items most similar to the given item, ordered from most similar to least
 
  To get the list from Solr you would fetch the doc associated with
  itemID, no?
 

 If you store the column, then yes.

 If you store the row, then using a query on the field containing the
 similar items is the right answer.

 The key difference that I have is what happens in the next step.

 When using the Mahout mapreduce item-based recommender we get the
  similarity matrix and do just that. We get the row associated with the
  Mahout itemID and recommend the top k items from the vector. This
 performs
  well in cross-validation tests.
 

 Good.

 I think that there is a row/column confusion here, but they are probably
 nearly identical in your application.

 The key point is what happens *after* you do the query that you are
 suggesting.

 In your case, you have to retrieve the meta-data associated with each of
 related items.  I like to store this meta-data in a Solr field (or three)
 so this involves at least one additional query.  You can automatically
 chain this second query by using the join operation that Solr provides,
 but the second query still happens.

 If you do the query the way that I suggest, this second query doesn't need
 to happen.  You get the meta-data directly.





 
 
 
  On Aug 1, 2013, at 9:49 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
  
   For item similarities there is no need to do more than fetch one doc
 that
   contains the similarities, right? I've successfully used this method
 with
   the Mahout recommender but please correct me if something 

Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Yes, storing the similar_items in a field, cross_action_similar_items in 
another field all on the same doc ided by item ID. Agree that there may be 
other fields.

Storing the rows of [B'B] is ok because it's symmetric. However we did talk 
about the [B'A] case and I thought we agreed to store the rows there too 
because they were from Bs items. This was the discussion about having different 
items for cross actions. The excerpt below is Ted responding to my question. So 
do we want the columns of [B'A]? It's only a transpose away.


 On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:
 [B'A] =
          iphone  ipad  nexus  galaxy  surface
 iphone   2       2     2      1       0
 ipad     2       2     2      1       0
 nexus    1       1     1      1       0
 galaxy   1       1     1      1       0
 surface  0       0     0      0       1
 
 The rows are what we want from [B'A] since the row items are from B, right?
 
 Yes.
 
 It is easier to understand if you have different kinds of items as well as 
 different actions.  For instance, suppose that you have user x query terms 
 (A) and user x device (B).  B'A is then device x term so that there is a row 
 per device and the row contains terms.  This is good when searching for 
 devices using terms.


Talking about getting the actual doc field values, which will include the 
similar_items field and other metadata. The actual ids in the similar_items 
field work well for anonymous/no-history recs but maybe there is a second query 
or fetch that I'm missing? I assumed that a fetch of the doc and its fields 
by item ID was as fast a way to do this as possible. If there is some way to 
get the same result by doing a query that is faster, I'm all for it.

Can do tomorrow at 2.

Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
A few architectural questions: http://bit.ly/18vbbaT

I created a local instance of the LucidWorks Search on my dev machine. I can 
quite easily save the similarity vectors from the DRMs into docs at special 
locations and index them with LucidWorks. But to ingest the docs and put them 
in separate fields of the same index we need some new code (unless I've missed 
some Lucid config magic) that does the indexing and integrates with LucidWorks. 

I imagine two indexes. One index for the similarity matrix and optionally the 
cross-similarity matrix in two fields of type 'string'. Another index for 
users' history--we could put the docs there for retrieval by user ID. The user 
history docs then become the query on the similarity index and would return 
recommendations. Or any realtime collected or generated history could be used 
too.

Is this what you imagined Ted? Especially WRT Lucid integration?
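
For the similarity index, the schema.xml fields might look something like the
following sketch (not the actual LucidWorks config; a whitespace-tokenized type
is assumed here so that a multi-item history query matches the individual ids
inside a field):

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="item_id" type="string" indexed="true" stored="true" required="true"/>
<field name="similar_items" type="text_ws" indexed="true" stored="true"/>
<field name="cross_action_similar_items" type="text_ws" indexed="true" stored="true"/>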

Someone could probably donate their free tier EC2 instance and set this up 
pretty easily. Not sure if this would fit given free tier memory but maybe for 
small data sets.

To get this available for actual use we'd need:
1-- An instance with an IP address somewhere to run the ingestion and 
customized LucidWorks Search.
2-- Synthetic data created using Ted's tool.
3-- Customized Solr indexing code for integration with LucidWorks? Not sure how 
this is done. I can do the Solr part but have not looked into Lucid integration 
yet.
4-- Flesh out the rest of Ted's outline but 1-3 will give a minimally running 
example.

Assuming I've got this right, does someone want to help with these?

Another way to approach this is to create a stand alone codebase that requires 
Mahout and Solr and supplies an API something like the proposed Mahout SGD 
online recommender or Myrrix. This would be easier to consume but would lack 
all the UI and inspection code of LucidWorks. 






Re: Setting up a recommender

2013-07-31 Thread Andrew Psaltis
Assuming I've got this right, does someone want to help with these?
Pat -- I would be interested in helping in any way needed. I believe Ted's
tool is a start, but does not handle all the cases envisioned in the design
doc, although I could be wrong on this. Anyway I'm pretty open to helping
wherever needed.

Thanks,
Andrew





On 7/31/13 12:20 PM, Pat Ferrel pat.fer...@gmail.com wrote:

A few architectural questions: http://bit.ly/18vbbaT

I created a local instance of the LucidWorks Search on my dev machine. I
can quite easily save the similarity vectors from the DRMs into docs at
special locations and index them with LucidWorks. But to ingest the docs
and put them in separate fields of the same index we need some new code
(unless I've missed some Lucid config magic) that does the indexing and
integrates with LucidWorks.

I imagine two indexes. One index for the similarity matrix and optionally
the cross-similairty matrix in two fields of type 'string'. Another index
for users' history--we could put the docs there for retrieval by user ID.
The user history docs then become the query on the similarity index and
would return recommendations. Or any realtime collected or generated
history could be used too.

Is this what you imagined Ted? Especially WRT Lucid integration?

Someone could probably donate their free tier EC2 instance and set this
up pretty easily. Not sure if this would fit given free tier memory but
maybe for small data sets.

To get this available for actual use we'd need:
1-- An instance with an IP address somewhere to run the ingestion and
customized LucidWorks Search.
2-- Synthetic data created using Ted's tool.
3-- Customized Solr indexing code for integration with LucidWorks? Not
sure how this is done. I can do the Solr part but have not looked into
Lucid integration yet.
4-- Flesh out the rest of Ted's outline but 1-3 will give a minimally
running example.

Assuming I've got this right, does someone want to help with these?

Another way to approach this is to create a stand alone codebase that
requires Mahout and Solr and supplies an API something like the proposed
Mahout SGD online recommender or Myrrix. This would be easier to consume
but would lack all the UI and inspection code of LucidWorks.







Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
OK, looks like there *is* some magic in the Lucid config. I believe all I need 
to do is  write out the docs using Solr XML defining fields for each similarity 
type and the doc name. The rest can be done by standard Lucid hand 
configuration. I believe this will minimally handle #3 below.


On Jul 31, 2013, at 11:20 AM, Pat Ferrel pat.fer...@gmail.com wrote:

A few architectural questions: http://bit.ly/18vbbaT

I created a local instance of the LucidWorks Search on my dev machine. I can 
quite easily save the similarity vectors from the DRMs into docs at special 
locations and index them with LucidWorks. But to ingest the docs and put them 
in separate fields of the same index we need some new code (unless I've missed 
some Lucid config magic) that does the indexing and integrates with LucidWorks. 

I imagine two indexes. One index for the similarity matrix and optionally the 
cross-similairty matrix in two fields of type 'string'. Another index for 
users' history--we could put the docs there for retrieval by user ID. The user 
history docs then become the query on the similarity index and would return 
recommendations. Or any realtime collected or generated history could be used 
too.

Is this what you imagined Ted? Especially WRT Lucid integration?

Someone could probably donate their free tier EC2 instance and set this up 
pretty easily. Not sure if this would fit given free tier memory but maybe for 
small data sets.

To get this available for actual use we'd need:
1-- An instance with an IP address somewhere to run the ingestion and 
customized LucidWorks Search.
2-- Synthetic data created using Ted's tool.
3-- Customized Solr indexing code for integration with LucidWorks? Not sure how 
this is done. I can do the Solr part but have not looked into Lucid integration 
yet.
4-- Flesh out the rest of Ted's outline but 1-3 will give a minimally running 
example.

Assuming I've got this right, does someone want to help with these?

Another way to approach this is to create a stand alone codebase that requires 
Mahout and Solr and supplies an API something like the proposed Mahout SGD 
online recommender or Myrrix. This would be easier to consume but would lack 
all the UI and inspection code of LucidWorks. 







Re: Setting up a recommender

2013-07-31 Thread B Lyon
I'm interested in helping as well.
Btw I thought that what was stored in the solr fields were the llr-filtered
items (ids I guess) for the could-be-recommended things.
 On Jul 31, 2013 2:31 PM, Andrew Psaltis andrew.psal...@webtrends.com
wrote:

 Assuming I've got this right, does someone want to help with these?
 Pat -- I would be interested in helping in anyway needed. I believe Ted's
 tool is a start, but does not handle all the case envisioned in the design
 doc, although I could be wrong on this. Anyway I'm pretty open to helping
 wherever needed.

 Thanks,
 Andrew





 On 7/31/13 12:20 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 A few architectural questions: http://bit.ly/18vbbaT
 
 I created a local instance of the LucidWorks Search on my dev machine. I
 can quite easily save the similarity vectors from the DRMs into docs at
 special locations and index them with LucidWorks. But to ingest the docs
 and put them in separate fields of the same index we need some new code
 (unless I've missed some Lucid config magic) that does the indexing and
 integrates with LucidWorks.
 
 I imagine two indexes. One index for the similarity matrix and optionally
 the cross-similairty matrix in two fields of type 'string'. Another index
 for users' history--we could put the docs there for retrieval by user ID.
 The user history docs then become the query on the similarity index and
 would return recommendations. Or any realtime collected or generated
 history could be used too.
 
 Is this what you imagined Ted? Especially WRT Lucid integration?
 
 Someone could probably donate their free tier EC2 instance and set this
 up pretty easily. Not sure if this would fit given free tier memory but
 maybe for small data sets.
 
 To get this available for actual use we'd need:
 1-- An instance with an IP address somewhere to run the ingestion and
 customized LucidWorks Search.
 2-- Synthetic data created using Ted's tool.
 3-- Customized Solr indexing code for integration with LucidWorks? Not
 sure how this is done. I can do the Solr part but have not looked into
 Lucid integration yet.
 4-- Flesh out the rest of Ted's outline but 1-3 will give a minimally
 running example.
 
 Assuming I've got this right, does someone want to help with these?
 
 Another way to approach this is to create a stand alone codebase that
 requires Mahout and Solr and supplies an API something like the proposed
 Mahout SGD online recommender or Myrrix. This would be easier to consume
 but would lack all the UI and inspection code of LucidWorks.
 
 
 
 




Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
OK and yes. The docs will look like:

<add>
  <doc>
    <field name='item_id'>ipad</field>
    <field name='similar_items'>iphone</field>
    <field name='cross_action_similar_items'>iphone nexus</field>
  </doc>
  <doc>
    <field name='item_id'>iphone</field>
    <field name='similar_items'>ipad</field>
    <field name='cross_action_similar_items'>ipad galaxy</field>
  </doc>
</add>


On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:

I'm interested in helping as well.
Btw I thought that what was stored in the solr fields were the llr-filtered
items (ids I guess) for the could-be-recommended things.


Re: Setting up a recommender

2013-07-31 Thread Ted Dunning
On Wed, Jul 31, 2013 at 11:20 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 A few architectural questions: http://bit.ly/18vbbaT

 I created a local instance of the LucidWorks Search on my dev machine. I
 can quite easily save the similarity vectors from the DRMs into docs at
 special locations and index them with LucidWorks. But to ingest the docs
 and put them in separate fields of the same index we need some new code
 (unless I've missed some Lucid config magic) that does the indexing and
 integrates with LucidWorks.

 I imagine two indexes. One index for the similarity matrix and optionally
 the cross-similairty matrix in two fields of type 'string'. Another index
 for users' history--we could put the docs there for retrieval by user ID.
 The user history docs then become the query on the similarity index and
 would return recommendations. Or any realtime collected or generated
 history could be used too.

 Is this what you imagined Ted? Especially WRT Lucid integration?


Yes.  And I note in a later email that you discovered how Lucid provides
lots of connectors for different formats.  XML is fine.  I have also used
CSV.


 Someone could probably donate their free tier EC2 instance and set this up
 pretty easily. Not sure if this would fit given free tier memory but maybe
 for small data sets.


It should fit, actually.

I can donate a real-ish machine as well.



 To get this available for actual use we'd need:
 1-- An instance with an IP address somewhere to run the ingestion and
 customized LucidWorks Search.
 2-- Synthetic data created using Ted's tool.
 3-- Customized Solr indexing code for integration with LucidWorks? Not
 sure how this is done. I can do the Solr part but have not looked into
 Lucid integration yet.
 4-- Flesh out the rest of Ted's outline but 1-3 will give a minimally
 running example.

 Assuming I've got this right, does someone want to help with these?


I will work on synthetic data later today.  I have a tool that does this
for drill.  I plan to pull down musicBrainz and use the tags on artists as
hidden variables to drive synthetic user behavior.  Should produce
reasonable looking recommendations.

Another way to approach this is to create a stand alone codebase that
 requires Mahout and Solr and supplies an API something like the proposed
 Mahout SGD online recommender or Myrrix. This would be easier to consume
 but would lack all the UI and inspection code of LucidWorks.


I think that for a demo, the inspection is crucial.

Adding the API is easy and can even be done in the same instance as LW is
running.


Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
The input, which we need synthesized, is a log file (tsv or csv) that looks like 
this:

u1  purchase  iphone
u1  purchase  ipad
u2  purchase  nexus-tablet
u2  purchase  galaxy
u3  purchase  surface
u4  purchase  iphone
u4  purchase  ipad
u1  view  iphone
u1  view  ipad
u1  view  nexus-tablet
u1  view  galaxy
u2  view  iphone
u2  view  ipad
u2  view  nexus-tablet
u2  view  galaxy
u3  view  surface
u4  view  iphone
u4  view  ipad
u4  view  nexus-tablet

This is the example in the github project 
solr-recommender/src/test/resources/logged-preferences/*

The columns can be in any order and can have other columns interspersed.

For testing it would be nice to have one action, two, and several. This 
implementation maps ids in memory, so nothing huge as far as how many ids 
are generated. 

Ted can talk about the distribution of actions.
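
Until the real generator is ready, here is a trivial stand-in that emits lines
in the format above (the counts, the view/purchase split, and the file name are
made-up parameters, and a uniform random draw has none of the structure Ted's
tool would add):

import java.io.PrintWriter;
import java.util.Random;

public class SyntheticLog {
  public static void main(String[] args) throws Exception {
    String[] items = {"iphone", "ipad", "nexus-tablet", "galaxy", "surface"};
    Random rand = new Random(1234);
    PrintWriter out = new PrintWriter("synthetic-log.tsv");
    for (int u = 1; u <= 100; u++) {
      for (int i = 0; i < 10; i++) {
        String item = items[rand.nextInt(items.length)];
        out.println("u" + u + "\tview\t" + item);
        if (rand.nextDouble() < 0.2) {          // a fifth of views lead to a purchase
          out.println("u" + u + "\tpurchase\t" + item);
        }
      }
    }
    out.close();
  }
}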

On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:

I'm interested in helping as well.
Btw I thought that what was stored in the solr fields were the llr-filtered
items (ids I guess) for the could-be-recommended things.
On Jul 31, 2013 2:31 PM, Andrew Psaltis andrew.psal...@webtrends.com
wrote:

 Assuming I've got this right, does someone want to help with these?
 Pat -- I would be interested in helping in anyway needed. I believe Ted's
 tool is a start, but does not handle all the case envisioned in the design
 doc, although I could be wrong on this. Anyway I'm pretty open to helping
 wherever needed.
 
 Thanks,
 Andrew
 
 
 
 
 
 On 7/31/13 12:20 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 A few architectural questions: http://bit.ly/18vbbaT
 
 I created a local instance of the LucidWorks Search on my dev machine. I
 can quite easily save the similarity vectors from the DRMs into docs at
 special locations and index them with LucidWorks. But to ingest the docs
 and put them in separate fields of the same index we need some new code
 (unless I've missed some Lucid config magic) that does the indexing and
 integrates with LucidWorks.
 
 I imagine two indexes. One index for the similarity matrix and optionally
 the cross-similairty matrix in two fields of type 'string'. Another index
 for users' history--we could put the docs there for retrieval by user ID.
 The user history docs then become the query on the similarity index and
 would return recommendations. Or any realtime collected or generated
 history could be used too.
 
 Is this what you imagined Ted? Especially WRT Lucid integration?
 
 Someone could probably donate their free tier EC2 instance and set this
 up pretty easily. Not sure if this would fit given free tier memory but
 maybe for small data sets.
 
 To get this available for actual use we'd need:
 1-- An instance with an IP address somewhere to run the ingestion and
 customized LucidWorks Search.
 2-- Synthetic data created using Ted's tool.
 3-- Customized Solr indexing code for integration with LucidWorks? Not
 sure how this is done. I can do the Solr part but have not looked into
 Lucid integration yet.
 4-- Flesh out the rest of Ted's outline but 1-3 will give a minimally
 running example.
 
 Assuming I've got this right, does someone want to help with these?
 
 Another way to approach this is to create a stand alone codebase that
 requires Mahout and Solr and supplies an API something like the proposed
 Mahout SGD online recommender or Myrrix. This would be easier to consume
 but would lack all the UI and inspection code of LucidWorks.
 
 
 
 
 
 



Re: Setting up a recommender

2013-07-31 Thread Ted Dunning
The fields actually point the other direction.  They contain items which,
if they appear in a history, indicate that the current document is a good
recommendation.

This reversal of roles is what makes search work.

Going the other way works for a single doc, but that only gives a list of
id's which then have to be retrieved.  Better to have the tags for the
single doc on all the related docs so that a single retrieval will pull
them all in with their details.


On Wed, Jul 31, 2013 at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 OK and yes. The docs will look like:

 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>


 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:

 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the could-be-recommended things.



Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
I'd vote for csv then.

On Jul 31, 2013, at 12:00 PM, Ted Dunning ted.dunn...@gmail.com wrote:




On Wed, Jul 31, 2013 at 11:20 AM, Pat Ferrel pat.fer...@gmail.com wrote:
A few architectural questions: http://bit.ly/18vbbaT

I created a local instance of the LucidWorks Search on my dev machine. I can 
quite easily save the similarity vectors from the DRMs into docs at special 
locations and index them with LucidWorks. But to ingest the docs and put them 
in separate fields of the same index we need some new code (unless I've missed 
some Lucid config magic) that does the indexing and integrates with LucidWorks.

I imagine two indexes. One index for the similarity matrix and optionally the 
cross-similairty matrix in two fields of type 'string'. Another index for 
users' history--we could put the docs there for retrieval by user ID. The user 
history docs then become the query on the similarity index and would return 
recommendations. Or any realtime collected or generated history could be used 
too.

Is this what you imagined Ted? Especially WRT Lucid integration?

Yes.  And I note in a later email that you discovered how Lucid provides lots 
of connectors for different formats.  XML is fine.  I have also used CSV.
 



Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
Sorry not sure what you are saying.

If the LLR created DRM has a row:

Key: 0, Value { 1:1.0,}

where 0 - iphone and 1 - ipad then wouldn't the doc look like

<doc>
  <field name='item_id'>ipad</field>
  <field name='similar_items'>iphone</field>
</doc>

or rather the csv equivalent?

On Jul 31, 2013, at 12:04 PM, Ted Dunning ted.dunn...@gmail.com wrote:

The fields actually point the other direction.  They contain items which,
if they appear in a history, indicate that the current document is a good
recommendation.

This reversal of roles is what makes search work.

Going the other way works for a single doc, but that only gives a list of
id's which then have to be retrieved.  Better to have the tags for the
single doc on all the related docs so that a single retrieval will pull
them all in with their details.


On Wed, Jul 31, 2013 at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 OK and yes. The docs will look like:
 
 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>
 
 
 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:
 
 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the could-be-recommended things.
 



Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
oops, mistyped…

If the LLR created DRM has a row:

Key: 1, Value { 0:1.0,}

where 0 - iphone and 1 - ipad then wouldn't the doc look like

<doc>
  <field name='item_id'>ipad</field>
  <field name='similar_items'>iphone</field>
</doc>


On Jul 31, 2013, at 12:14 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Sorry not sure what you are saying.

If the LLR created DRM has a row:

Key: 0, Value { 1:1.0,}

where 0 - iphone and 1 - ipad then wouldn't the doc look like

<doc>
  <field name='item_id'>ipad</field>
  <field name='similar_items'>iphone</field>
</doc>

or rather the csv equivalent?

On Jul 31, 2013, at 12:04 PM, Ted Dunning ted.dunn...@gmail.com wrote:

The fields actually point the other direction.  They contain items which,
if they appear in a history, indicate that the current document is a good
recommendation.

This reversal of roles is what makes search work.

Going the other way works for a single doc, but that only gives a list of
id's which then have to be retrieved.  Better to have the tags for the
single doc on all the related docs so that a single retrieval will pull
them all in with their details.


On Wed, Jul 31, 2013 at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 OK and yes. The docs will look like:
 
 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>
 
 
 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:
 
 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the could-be-recommended things.
 




Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
So the XML as CSV would be:
item_id,similar_items,cross_action_similar_items
ipad,iphone,iphone nexus
iphone,ipad,ipad galaxy

Note: As I mentioned before the order of the items in the field will encode 
rank of the similarity strength. This is for cases where you want to find 
similar items to a context item. You would fetch the doc for the context item 
by its item ID and show the top k items in the doc. Ted's caveat would 
probably be to dither them.

Sounds like Ted is generating data. Andrew or M Lyon do either of you want to 
set the demo system up? If so you'll need to find a system--free tier AWS, 
Ted's box, etc. Then install all the needed stuff. 

I'll get the output working to csv.

On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:

OK and yes. The docs will look like:

<add>
  <doc>
    <field name='item_id'>ipad</field>
    <field name='similar_items'>iphone</field>
    <field name='cross_action_similar_items'>iphone nexus</field>
  </doc>
  <doc>
    <field name='item_id'>iphone</field>
    <field name='similar_items'>ipad</field>
    <field name='cross_action_similar_items'>ipad galaxy</field>
  </doc>
</add>


On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:

I'm interested in helping as well.
Btw I thought that what was stored in the solr fields were the llr-filtered
items (ids I guess) for the could-be-recommended things.



Re: Setting up a recommender

2013-07-31 Thread B Lyon
Slick idea IMO on the ordering in the field.

Fyi, to answer your question, I am new to a lot of these pieces (and
without sustained access to a non-tablet pc for the next four days) and cannot
at the moment be relied on for the demo setup given this apparent pace, but
would like to help as much as possible with grunt/doc stuff if someone more
familiar with the relevant pieces can use it.

On Wednesday, July 31, 2013, Pat Ferrel wrote:

 So the XML as CSV would be:
 item_id,similar_items,cross_action_similar_items
 ipad,iphone,iphone nexus
 iphone,ipad,ipad galaxy

 Note: As I mentioned before the order of the items in the field will
 encode rank of the similarity strength. This is for cases where you want to
 find similar items to a context item. You would fetch the doc for the
 context item by it's item ID and show the top k items in the doc. Ted's
 caveat would probably be to dither them.

 Sounds like Ted is generating data. Andrew or M Lyon do either of you want
 to set the demo system up? If so you'll need to find a system--free tier
 AWS, Ted's box, etc. Then install all the needed stuff.

 I'll get the output working to csv.

 On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.comjavascript:;
 wrote:

 OK and yes. The docs will look like:

 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>


 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com javascript:;
 wrote:

 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the could-be-recommended things.



-- 
BF Lyon
http://www.nowherenearithaca.com


Re: Setting up a recommender

2013-07-31 Thread Ted Dunning
Pat,

See inline


On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel p...@occamsmachete.com wrote:

 So the XML as CSV would be:
 item_id,similar_items,cross_action_similar_items
 ipad,iphone,iphone nexus
 iphone,ipad,ipad galaxy


Right.  Doesn't matter what format.  Might want quotes around space
delimited lists, but anything will do.



 Note: As I mentioned before the order of the items in the field will
 encode rank of the similarity strength. This is for cases where you want to
 find similar items to a context item. You would fetch the doc for the
 context item by its item ID and show the top k items in the doc. Ted's
 caveat would probably be to dither them.


I always say dither so that is an easy one.

But fetching similar items of a center item by fetching the center item and
then fetching each of the referenced items is typically slower by about 2x
than running the search for mentions of the center item.
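
As a sketch of that faster path, a single query against the indicator field for mentions of the center item might look like this in SolrJ (field names taken from the docs above; the URL, class name, and rows value are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SimilarToCenterItem {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // One search for docs that mention the center item in their indicator field,
    // instead of fetching the center item's doc and then each referenced doc.
    SolrQuery q = new SolrQuery("similar_items:ipad");
    q.setRows(10); // top k

    QueryResponse rsp = solr.query(q);
    for (SolrDocument d : rsp.getResults()) {
      System.out.println(d.getFieldValue("item_id"));
    }
  }
}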


 Sounds like Ted is generating data. Andrew or M Lyon do either of you want
 to set the demo system up? If so you'll need to find a system--free tier
 AWS, Ted's box, etc. Then install all the needed stuff.

 I'll get the output working to csv.

 On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 OK and yes. The docs will look like:

 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>


 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:

 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the could-be-recommended things.




Re: Setting up a recommender

2013-07-30 Thread Pat Ferrel
Well it's a work in progress but you can see it here: 
https://github.com/pferrel/solr-recommender

There is no Solr integration yet, it is just ingest, create id indexes, run 
RecommenderJob, and XRecommenderJob. These create the item similarity matrixes, 
which will be put into Solr. They also create all recommendations for all users.

The code is quite, er..., fresh. If you are actually going to work on the 
project or test it, I can fix things as they come up but not all options are 
supported or needed to get the overall system running. Put bugs in github.

The happy path works with my trivial sample data so I'll proceed to moving the 
sim matrixes to Solr.

I'll revisit robustifying the project later if it proves useful. 



Re: Setting up a recommender

2013-07-30 Thread Pat Ferrel
Actually I'm not sure the downsampling is best put in RowSimilarityJob since 
that doesn't work for the XRecommender. The similarity matrix there is 
calculated by [B'A] matrix multiply. RSJ would be great if it could work on two 
DRMs, then we could use other similarity measures (LLR please).

Also I'm not sure if it's needed in RSJ since I use PreparePreferenceMatrixJob 
for the RecommenderJob, which calculates the main action item similarity matrix 
(using RSJ in any case).

But for the XRecommender I modified PreparePreferenceMatrixJob to create two 
DRMs and called it PreparePreferenceMatrixesJob. It has downsampling in it, if 
you mean limiting the number of prefs per user. Check if I'm wrong.
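
In case it helps pin down what is meant by downsampling here, a minimal in-memory sketch of limiting the number of prefs per user by random sampling (the cap of 500 and the item IDs are made up; this is not the code in PreparePreferenceMatrixesJob):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DownsamplePrefs {
  // Keep at most maxPrefsPerUser preferences for a user, chosen at random.
  static List<String> downsample(List<String> userItemIds, int maxPrefsPerUser, Random rng) {
    if (userItemIds.size() <= maxPrefsPerUser) {
      return userItemIds;
    }
    List<String> copy = new ArrayList<String>(userItemIds);
    Collections.shuffle(copy, rng);
    return copy.subList(0, maxPrefsPerUser);
  }

  public static void main(String[] args) {
    List<String> prefs = new ArrayList<String>();
    for (int i = 0; i < 2000; i++) {
      prefs.add("item" + i);
    }
    System.out.println(downsample(prefs, 500, new Random(42)).size()); // prints 500
  }
}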


On Jul 29, 2013, at 10:17 PM, Sebastian Schelter s...@apache.org wrote:

Downsampling is now moved directly into RowSimilarityJob. I'll have a
look at Pat's code later this week.

On 23.07.2013 19:38, Ted Dunning wrote:
 On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 This pipeline lacks downsampling since I had to replace
 PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume
 Sebastian is the person to talk to about these bits?
 
 
 I think that is a good source.  If you post your code, he may be able to
 comment on how to integrate the down-sampling in a general way.
 




Re: Setting up a recommender

2013-07-30 Thread Pat Ferrel
In the cross-recommender the similarity matrix is calculated by doing [B'A]. We 
want the rows to be stored as the item-item similarities in Solr, right? [B'B] 
is symmetric so I just want to make sure I have it straight for [B'A].

B = purchases
        iphone  ipad  nexus  galaxy  surface
u1      1       1     0      0       0
u2      0       0     1      1       0
u3      0       0     0      0       1
u4      1       1     0      0       0

B' =
        u1  u2  u3  u4
iphone  1   0   0   1
ipad    1   0   0   1
nexus   0   1   0   0
galaxy  0   1   0   0
surface 0   0   1   0

A = views
        iphone  ipad  nexus  galaxy  surface
u1      1       1     1      1       0
u2      1       1     1      1       0
u3      0       0     0      0       1
u4      1       1     1      0       0


[B'A] =
        iphone  ipad  nexus  galaxy  surface
iphone  2       2     2      1       0
ipad    2       2     2      1       0
nexus   1       1     1      1       0
galaxy  1       1     1      1       0
surface 0       0     0      0       1

The rows are what we want from [B'A] since the row items are from B, right?
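
For anyone who wants to reproduce the arithmetic above, a small sketch with Mahout's in-memory math classes (mahout-math on the classpath; class and variable names are just for illustration):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

public class CrossCooccurrenceCheck {
  public static void main(String[] args) {
    // Rows are u1..u4, columns are iphone, ipad, nexus, galaxy, surface.
    Matrix b = new DenseMatrix(new double[][] {   // purchases
        {1, 1, 0, 0, 0},
        {0, 0, 1, 1, 0},
        {0, 0, 0, 0, 1},
        {1, 1, 0, 0, 0}});
    Matrix a = new DenseMatrix(new double[][] {   // views
        {1, 1, 1, 1, 0},
        {1, 1, 1, 1, 0},
        {0, 0, 0, 0, 1},
        {1, 1, 1, 0, 0}});

    Matrix bTransposeA = b.transpose().times(a);  // rows: items from B, columns: items from A

    // Prints the iphone row: 2 2 2 1 0, matching the table above.
    for (int col = 0; col < bTransposeA.columnSize(); col++) {
      System.out.print((int) bTransposeA.get(0, col) + " ");
    }
  }
}
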

Re: Setting up a recommender

2013-07-30 Thread Ted Dunning
On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:

 [B'A] =
         iphone  ipad  nexus  galaxy  surface
 iphone  2       2     2      1       0
 ipad    2       2     2      1       0
 nexus   1       1     1      1       0
 galaxy  1       1     1      1       0
 surface 0       0     0      0       1

 The rows are what we want from [B'A] since the row items are from B, right?


Yes.

It is easier to understand if you have different kinds of items as well as
different actions.  For instance, suppose that you have user x query terms
(A) and user x device (B).  B'A is then device x term so that there is a
row per device and the row contains terms.  This is good when searching for
devices using terms.


Re: Setting up a recommender

2013-07-29 Thread Sebastian Schelter
Downsampling is now moved directly into RowSimilarityJob. I'll have a
look at Pat's code later this week.

On 23.07.2013 19:38, Ted Dunning wrote:
 On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 This pipeline lacks downsampling since I had to replace
 PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume
 Sebastian is the person to talk to about these bits?

 
 I think that is a good source.  If you post your code, he may be able to
 comment on how to integrate the down-sampling in a general way.
 



Re: Setting up a recommender

2013-07-27 Thread Pat Ferrel
I've got a new configurable action splitter working with my old Mahout based 
recommender and cross-recommender. Need more cleanup and testing before 
integrating Solr or handing off. 

I think I'll leave the old recommenders in the code with an option to replace 
the last 'make recommendations' step with moving the similarity matrixes into 
Solr. Might be useful for results comparison. 

We still need more help with retrieving user history vectors, and making Solr 
queries. Not to mention setting up the inspection UI mentioned in Ted's paper.

http://bit.ly/18vbbaT

On Jul 24, 2013, at 8:32 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Understood, catalog categories, tags, etc will make good metadata to be 
included in the query and putting in separate fields allows us to separately 
boost each in the query. UserIDs that have interacted with the item is an 
interesting idea.

However the specific case I'm describing is not about content similarity. 
Talking here about item-item similarity exactly as encoded in the similarity 
matrix. The order or rank of these item-item similarities should be preserved 
and I was proposing doing so with the order of the itemID terms in the document.

The query will return history based recs ranked by the order Solr applies. The 
doc itself for any item contains similar items ordered by their similarity 
magnitude, precalculated in Mahout RowSimilarityJob.


On Jul 24, 2013, at 7:19 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Content based item similarity is a fine thing to include in a separate field.  

In addition, it is reasonable to describe a person's history in terms of the 
meta-data on the items they have interacted with.  That allows you to build a 
set of socially driven meta-data indicators as well.  This can be useful in the 
restaurant example where you might find that elegant or home-style might be 
good indicators for different restaurants even if those terms don't appear in a 
restaurant description.  

Sent from my iPhone

On Jul 23, 2013, at 18:26, Pat Ferrel pat.fer...@gmail.com wrote:

 Honestly not trying to make this more complicated but…
 
 In the purely Mahout cross-recommender we got a ranked list of similar items 
 for any item so we could combine personal history-based recs with 
 non-personalized item similarity-based recs wherever we had an item context. 
 In a past ecom case the item similarity recs were quite useful when a user 
 was looking at an item already. In that case even if the user was unknown we 
 could make item similarity-based recs.
 
 How about if we order the items in the doc by rank in the existing fields 
 since they are just text? Then we would do user-history-based queries on the 
 fields for recs and docs[itemID].field to get the ordered list of items out 
 of any doc. Doing an ensemble would require weights though. Unless someone 
 knows a rank based method for combining results. I guess you could vote or 
 add rank numbers of like items or the log thereof...
 
 I assume the combination of results from [B'B] and [B'A] will be a query over 
 both fields with some boost or other to handle ensemble weighting. But if you 
 want to add item similarity recs another method must be employed, no?
 
 From past experience I strongly suspect item similarity rank is not something 
 we want to lose so unless someone has a better idea I'll just order the IDs 
 in the fields and call it good for now.
 




Re: Setting up a recommender

2013-07-24 Thread Michael Sokolov

On 7/23/13 7:26 PM, Pat Ferrel wrote:

Honestly not trying to make this more complicated but…



 From past experience I strongly suspect item similarity rank is not something 
we want to lose so unless someone has a better idea I'll just order the IDs in 
the fields and call it good for now.


If I understand you correctly, you are concerned about just throwing all 
the items in without regard to order, or weight.  I think Ted's 
suggestion was not to worry about that, but if you do have time and want 
to tackle this, one thing you can do is to add an item multiple times.  
For example, suppose you have items A, B, C, ... with A ranked highest.  
Then index a document in Solr like this:


A A A B B C

this will end up giving A a higher frequency count in the index.

The number of repeats would be kind of arbitrary.  You might want to 
make it a linear function of rank or a quantized version of the 
similarity score.
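
A tiny sketch of building such a field value from a rank-ordered list, using a simple linear repeat count as suggested (the helper name and counts are illustrative only):

public class RepeatedTermField {
  // Build field text like "A A A B B C" from a rank-ordered list of similar items,
  // repeating the highest-ranked item the most so termFreq reflects rank.
  static String buildFieldText(String[] rankedItems) {
    StringBuilder sb = new StringBuilder();
    int n = rankedItems.length;
    for (int rank = 0; rank < n; rank++) {
      int repeats = n - rank; // linear function of rank
      for (int i = 0; i < repeats; i++) {
        sb.append(rankedItems[rank]).append(' ');
      }
    }
    return sb.toString().trim();
  }

  public static void main(String[] args) {
    System.out.println(buildFieldText(new String[] {"A", "B", "C"})); // A A A B B C
  }
}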


But this might end up being a noise-level effect ... it's probably not 
worth losing sleep over.  On the other hand, it's probably less useful 
to order the IDs since once they get put in the index the token order 
is stored as a position which isn't (usually) used for scoring, 
although I suppose some custom scorer could do that, too.


-Mike


Re: Setting up a recommender

2013-07-24 Thread Pat Ferrel
I'm most worried about losing ordering and I think I can just order the items A 
B C by convention.

Using Mahout to do clustering we used to double or triple add the title to get 
artificial boosting without fields. The technique works and may be worth an 
experiment later, thanks.

BTW it looks like similarity and TFIDF are pluggable in Solr and seem pretty 
easy to change. Planning to use cosine for the first cut since it's the default.

On Jul 24, 2013, at 4:10 AM, Michael Sokolov msoko...@safaribooksonline.com 
wrote:

On 7/23/13 7:26 PM, Pat Ferrel wrote:
 Honestly not trying to make this more complicated but…
 
 
 
 From past experience I strongly suspect item similarity rank is not something 
 we want to lose so unless someone has a better idea I'll just order the IDs 
 in the fields and call it good for now.
 
 
If I understand you correctly, you are concerned about just throwing all the 
items in without regard to order, or weight.  I think Ted's suggestion was not 
to worry about that, but if you do have time and want to tackle this, one thing 
you can do is to add an item multiple times.  For example, suppose you have 
items A, B, C, ... with A ranked highest.  Then index a document in Solr like 
this:

A A A B B C

this will end up giving A a higher frequency count in the index.

The number of repeats would be kind of arbitrary.  You might want to make it a 
linear function of rank or a quantized version of the similarity score.

But this might end up being a noise-level effect ... it's probably not worth 
losing sleep over.  On the other hand, it's probably less useful to order the 
IDs since once they get put in the index the token order is stored as a 
position which isn't (usually) used for scoring, although I suppose some 
custom scorer could do that, too.

-Mike



Re: Setting up a recommender

2013-07-24 Thread Ted Dunning
Content based item similarity is a fine thing to include in a separate field.  

In addition, it is reasonable to describe a person's history in terms of the 
meta-data on the items they have interacted with.  That allows you to build a 
set of socially driven meta-data indicators as well.  This can be useful in the 
restaurant example where you might find that elegant or home-style might be 
good indicators for different restaurants even if those terms don't appear in a 
restaurant description.  

Sent from my iPhone

On Jul 23, 2013, at 18:26, Pat Ferrel pat.fer...@gmail.com wrote:

 Honestly not trying to make this more complicated but…
 
 In the purely Mahout cross-recommender we got a ranked list of similar items 
 for any item so we could combine personal history-based recs with 
 non-personalized item similarity-based recs wherever we had an item context. 
 In a past ecom case the item similarity recs were quite useful when a user 
 was looking at an item already. In that case even if the user was unknown we 
 could make item similarity-based recs.
 
 How about if we order the items in the doc by rank in the existing fields 
 since they are just text? Then we would do user-history-based queries on the 
 fields for recs and docs[itemID].field to get the ordered list of items out 
 of any doc. Doing an ensemble would require weights though. Unless someone 
 knows a rank based method for combining results. I guess you could vote or 
 add rank numbers of like items or the log thereof...
 
 I assume the combination of results from [B'B] and [B'A] will be a query over 
 both fields with some boost or other to handle ensemble weighting. But if you 
 want to add item similarity recs another method must be employed, no?
 
 From past experience I strongly suspect item similarity rank is not something 
 we want to lose so unless someone has a better idea I'll just order the IDs 
 in the fields and call it good for now.
 
 
 On Jul 23, 2013, at 12:03 PM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Will do.
 
 For what it's worth…
 
 The project I'm working on is an online recommender for video content. You go 
 to a site I'm creating, make some picks and get recommendations immediately 
 online. The training data comes from mining rotten tomatoes for critics 
 reviews. There are two actions, rotten & fresh. Was planning to toss the 
 'rotten' except for filtering them out of any recs but maybe they would work 
 as A with an ensemble weight of -1? New thumbs up or down data would be put 
 into the training set periodically--not online--using the process outlined 
 below.
 
 On Jul 23, 2013, at 10:37 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 
 This sounds great.  Go for it.  Put a comment on the design doc with a 
 pointer to text that I should import.
 
 
 
 
 On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:
 I can supply:
 
 1) a Maven based project in a public github repo as a baseline that creates 
 the following
 2) ingest and split actions, in-memory, single process, from text file, one 
 line per preference
 3) create DistributedRowMatrixes one per action (max of 3) with unified item 
 and user space
 4) create the 'similarity matrix' for [B'B] using LLR and [B'A] using matrix 
 multiply/cooccurrence.
 5) can take a stab at loading Solr.  I know the Mahout side and the internal 
 to external ID translation. The Solr side sounds pretty simple for this case.
 
 This pipeline lacks downsampling since I had to replace 
 PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume Sebastian 
 is the person to talk to about these bits?
 
 The job this creates uses the hadoop script to launch. Each job extends 
 AbstractJob so runs locally or using HDFS or mapreduce (at least for the 
 Mahout parts).
 
 I have some obligations coming up so if you want this I'll need to get 
 moving. I can have the project ready on github in a day or two. May take 
 longer to do the Solr integration and if someone has a passion for taking 
 that bit on--great. This work is in my personal plans for the next couple 
 weeks as it happens anyway.
 
 Let me know if you want me to proceed.
 
 On Jul 22, 2013, at 3:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 On Mon, Jul 22, 2013 at 12:40 PM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Yes.  And the combined recommender would query on both at the same time.
 
 Pat-- doesn't it need ensemble type weighting for each recommender
 component? Probably a wishlist item for later?
 
 
 Yes.  Weighting different fields differently is a very nice (and very easy
 feature).
 
 
 
 


Re: Setting up a recommender

2013-07-24 Thread Pat Ferrel
Understood, catalog categories, tags, etc will make good metadata to be 
included in the query and putting in separate fields allows us to separately 
boost each in the query. UserIDs that have interacted with the item is an 
interesting idea.

However the specific case I'm describing is not about content similarity. 
Talking here about item-item similarity exactly as encoded in the similarity 
matrix. The order or rank of these item-item similarities should be preserved 
and I was proposing doing so with the order of the itemID terms in the document.

The query will return history based recs ranked by the order Solr applies. The 
doc itself for any item contains similar items ordered by their similarity 
magnitude, precalculated in Mahout RowSimilarityJob.
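
For illustration, such a history-based query might look like this in SolrJ (the history string, field name, and URL are placeholders; the ranking is just Solr's own relevance score):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class HistoryBasedRecs {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // The user's item history, space delimited like the indexed fields.
    String history = "iphone ipad";

    // Query the indicator field with the history; Solr returns item docs
    // ranked by its own scoring.
    SolrQuery q = new SolrQuery("similar_items:(" + history + ")");
    q.setRows(10);

    QueryResponse rsp = solr.query(q);
    for (SolrDocument d : rsp.getResults()) {
      System.out.println(d.getFieldValue("item_id"));
    }
  }
}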

 
On Jul 24, 2013, at 7:19 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Content based item similarity is a fine thing to include in a separate field.  

In addition, it is reasonable to describe a person's history in terms of the 
meta-data on the items they have interacted with.  That allows you to build a 
set of socially driven meta-data indicators as well.  This can be useful in the 
restaurant example where you might find that elegant or home-style might be 
good indicators for different restaurants even if those terms don't appear in a 
restaurant description.  

Sent from my iPhone

On Jul 23, 2013, at 18:26, Pat Ferrel pat.fer...@gmail.com wrote:

 Honestly not trying to make this more complicated but…
 
 In the purely Mahout cross-recommender we got a ranked list of similar items 
 for any item so we could combine personal history-based recs with 
 non-personalized item similarity-based recs wherever we had an item context. 
 In a past ecom case the item similarity recs were quite useful when a user 
 was looking at an item already. In that case even if the user was unknown we 
 could make item similarity-based recs.
 
 How about if we order the items in the doc by rank in the existing fields 
 since they are just text? Then we would do user-history-based queries on the 
 fields for recs and docs[itemID].field to get the ordered list of items out 
 of any doc. Doing an ensemble would require weights though. Unless someone 
 knows a rank based method for combining results. I guess you could vote or 
 add rank numbers of like items or the log thereof...
 
 I assume the combination of results from [B'B] and [B'A] will be a query over 
 both fields with some boost or other to handle ensemble weighting. But if you 
 want to add item similarity recs another method must be employed, no?
 
 From past experience I strongly suspect item similarity rank is not something 
 we want to lose so unless someone has a better idea I'll just order the IDs 
 in the fields and call it good for now.
 



Re: Setting up a recommender

2013-07-23 Thread Ted Dunning
This sounds great.  Go for it.  Put a comment on the design doc with a
pointer to text that I should import.




On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:

 I can supply:

 1) a Maven based project in a public github repo as a baseline that
 creates the following
 2) ingest and split actions, in-memory, single process, from text file,
 one line per preference
 3) create DistributedRowMatrixes one per action (max of 3) with unified
 item and user space
 4) create the 'similarity matrix' for [B'B] using LLR and [B'A] using
 matrix multiply/cooccurrence.
 5) can take a stab at loading Solr.  I know the Mahout side and the
 internal to external ID translation. The Solr side sounds pretty simple for
 this case.

 This pipeline lacks downsampling since I had to replace
 PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume
 Sebastian is the person to talk to about these bits?

 The job this creates uses the hadoop script to launch. Each job extends
 AbstractJob so runs locally or using HDFS or mapreduce (at least for the
 Mahout parts).

 I have some obligations coming up so if you want this I'll need to get
 moving. I can have the project ready on github in a day or two. May take
 longer to do the Solr integration and if someone has a passion for taking
 that bit on--great. This work is in my personal plans for the next couple
 weeks as it happens anyway.

 Let me know if you want me to proceed.

 On Jul 22, 2013, at 3:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Mon, Jul 22, 2013 at 12:40 PM, Pat Ferrel p...@occamsmachete.com
 wrote:

  Yes.  And the combined recommender would query on both at the same time.
 
  Pat-- doesn't it need ensemble type weighting for each recommender
  component? Probably a wishlist item for later?


 Yes.  Weighting different fields differently is a very nice (and very easy
 feature).




Re: Setting up a recommender

2013-07-23 Thread Ted Dunning
On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:

 This pipeline lacks downsampling since I had to replace
 PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume
 Sebastian is the person to talk to about these bits?


I think that is a good source.  If you post your code, he may be able to
comment on how to integrate the down-sampling in a general way.


Re: Setting up a recommender

2013-07-23 Thread Pat Ferrel
Will do.

For what it's worth…

The project I'm working on is an online recommender for video content. You go 
to a site I'm creating, make some picks and get recommendations immediately 
online. The training data comes from mining rotten tomatoes for critics 
reviews. There are two actions, rotten & fresh. Was planning to toss the 
'rotten' except for filtering them out of any recs but maybe they would work as 
A with an ensemble weight of -1? New thumbs up or down data would be put into 
the training set periodically--not online--using the process outlined below.

On Jul 23, 2013, at 10:37 AM, Ted Dunning ted.dunn...@gmail.com wrote:


This sounds great.  Go for it.  Put a comment on the design doc with a pointer 
to text that I should import.




On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:
I can supply:

1) a Maven based project in a public github repo as a baseline that creates the 
following
2) ingest and split actions, in-memory, single process, from text file, one 
line per preference
3) create DistributedRowMatrixes one per action (max of 3) with unified item 
and user space
4) create the 'similarity matrix' for [B'B] using LLR and [B'A] using matrix 
multiply/cooccurrence.
5) can take a stab at loading Solr.  I know the Mahout side and the internal to 
external ID translation. The Solr side sounds pretty simple for this case.

This pipeline lacks downsampling since I had to replace 
PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume Sebastian is 
the person to talk to about these bits?

The job this creates uses the hadoop script to launch. Each job extends 
AbstractJob so runs locally or using HDFS or mapreduce (at least for the Mahout 
parts).

I have some obligations coming up so if you want this I'll need to get moving. 
I can have the project ready on github in a day or two. May take longer to do 
the Solr integration and if someone has a passion for taking that bit 
on--great. This work is in my personal plans for the next couple weeks as it 
happens anyway.

Let me know if you want me to proceed.

On Jul 22, 2013, at 3:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:

On Mon, Jul 22, 2013 at 12:40 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Yes.  And the combined recommender would query on both at the same time.

 Pat-- doesn't it need ensemble type weighting for each recommender
 component? Probably a wishlist item for later?


Yes.  Weighting different fields differently is a very nice (and very easy
feature).





Re: Setting up a recommender

2013-07-22 Thread Pat Ferrel
+10

Love the academics but I agree with this. Recently saw a VP from Netflix plead 
with the audience (mostly academics) to move past RMSE--focus on maximizing 
correct ranking, not rating prediction. 

Anyway I have a pipeline that does the following:
1) ingests logs, either TSV or CSV of arbitrary column ordering--it will pick out the 
actions by position and string
2) replaces PreparePreferenceMatrixJob to create n matrixes depending on the 
number of actions you are splitting out. This job also creates external - 
internal item and user id BiHashMaps for going back and forth between the log's 
IDs and Mahout internal IDs. It guarantees a uniform item and user ID space and 
sparse matrix ranks by creating one from all actions. Not completely scalable 
since it is not done in m/r though it uses HDFS--I have a plan to m/r the 
process and get rid of the hashmap.
3) performs the RowSimilarityJob on the primary matrix B and does B'A to create 
a cooccurrence matrix for primary and secondary actions.
4) uses the rest of the Mahout pipeline on B to get recs and 
does a [B'A]H_v to calculate all cross-recommendations.
5) stores all recs from all models in a NoSQL DB.
6) at rec request time it does a linear combination of rec and cross-rec to return 
the highest scored ones. The stored IDs were external so all ready for display.
Does 1-3 fit the first part of 'offline to Solr'? The IDs can be written to 
Solr as the original external IDs from the log files, which were strings. This 
allows them to be treated as terms by Solr.

My understanding of the Solr proposal puts B's row similarity matrix in a 
vector per item. That means each row is turned into terms = external IDs--not 
sure how the weights of each term are encoded.  So the cross-recommender would 
just put the cross-action similarity matrix  in other field(s) on the same 
itemID/docID, right?

Then the straight out recommender queries on the B'B field(s) and the 
cross-recommender queries on the B'A field(s). I suppose to keep it simple the 
cross-action similarity matrix could be put in a separate index.  Is this about 
right?

On Jul 21, 2013, at 5:30 PM, Sebastian Schelter s...@apache.org wrote:

At the moment, the down sampling is done by PreparePreferenceMatrixJob
for the collaborative filtering functionality. We just want to move it
down to RowSimilarityJob to enable standalone usage.

I think that the CrossRecommender should be the next thing on our
agenda, after we have the deployment infrastructure.  I especially like
that it's capable of including different kinds of interactions, as opposed
to most other (academically motivated) recommenders that focus on a
single interaction type like a rating.

--sebastian

On 22.07.2013 02:14, Ted Dunning wrote:
 The row similarity downsampling is just a matter of dropping elements at
 random from rows that have more data than we want.
 
 If the join that puts the row together can handle two kinds of input, then
 RowSimilarity can be easily modified to be CrossRowSimilarity.  Likewise,
 if we have two DRM's with the same row id's in the same order, we can do a
 map-side merge.  Such a merge can be very efficient on a system like MapR
 where you can control files to live on the same nodes.
 
 
 On Sun, Jul 21, 2013 at 4:43 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 RowSimilarity downsampling? Are you referring to the a mod of the matrix
 multiply to do cross similarity with LLR for the cross recommendations? So
 similarity of rows of B with rows of A?
 
 Sounds like you are proposing not only putting a recommender in Solr but
 also a cross-recommender? This is why getting a real data set is
 problematic?
 
 On Jul 21, 2013, at 3:40 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Pat,
 
 Yes.  The first part probably just is the RowSimilarity job, especially
 after Sebastian puts in the down-sampling.
 
 The new part is exactly as you say, storing the DRM into Solr indexes.
 
 There is no reason to not use a real data set.  There is a strong reason to
 use a synthetic dataset, however, in that it can be trivially scaled up and
 down both in items and users.  Also, the synthetic dataset doesn't require
 that the real data be found and downloaded.
 
 
 
 On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Read the paper, and the preso.
 
 As to the 'offline to Solr' part. It sounds like you are suggesting an
 item item similarity matrix be stored and indexed in Solr. One would have
 to create the action matrix from user profile data (preference history),
 do
 a rowsimiarity job on it (using LLR similarity) and move the result to
 Solr. The first part of this is nearly identical to the current
 recommender
 job workflow and could pretty easily be created from it I think. The new
 part is taking the DistributedRowMatrix and storing it in a particular
 way
 in Solr, right?
 
 BTW Is there some reason not to use an existing real data set?
 
 On Jul 19, 2013, at 3:45 PM, Ted Dunning 

Re: Setting up a recommender

2013-07-22 Thread Michael Sokolov

On 07/22/2013 12:20 PM, Pat Ferrel wrote:


My understanding of the Solr proposal puts B's row similarity matrix in a vector per 
item. That means each row is turned into terms = external IDs--not sure how 
the weights of each term are encoded.
This is the key question for me. The best idea I've had is to use 
termFreq as a proxy for weight.  It's only an integer, so there are 
scaling issues to consider, but you can apply a per-field weight to 
manage that.  Also, Lucene (and Solr) doesn't provide an obvious way to 
load term frequencies directly: probably the simplest thing to do is 
just to repeat the cross-term N times and let the text analysis take 
care of counting them.  Inefficient, but probably the quickest way to 
get going.  Alternatively, there are some lower level Lucene indexing 
APIs (DocFieldConsumer et al) which I haven't really plumbed entirely, 
but would allow for more direct loading of fields.


Then one probably wants to override the scoring in some way (unless 
TFIDF is the way to go somehow??)




Re: Setting up a recommender

2013-07-22 Thread Gokhan Capan
Just to make sure if I understood correctly, Ted, could you please correct
me?:)


1. Using a search engine, I will treat items as documents, where each
document vector consists of other items (similar to words of documents)
with co-occurrence (LLR) weights (instead of tf-idf in a search engine
analogy).
So for each item I will have a sparse vector that represents the relation
of that item to other items, if there is an indicator that makes the
item-to-item similarity (co-occurrence) non-zero. (I will only use positive
feedback, I think, since I am counting co-occurrences)

2. To present recommendations, the system formulates a query, with a
history of items --the session history for task based recommendation, or a
long term history. And the search engine will find top-N items, based on
the cosine similarities of the item (document) vectors and history (query)
vectors.

3. For example, if that was a restaurant recommendation, and we knew that
the restaurant was famous for its sushi, I would index this in another
field, famous_for.
Now if, as a user, I asked for sushi restaurants that I would enjoy, the
system would add this to query along with my history, and the famous sushi
restaurant would rank higher in results, even if chances are equal that I
would like a steakhouse according to the computation in 2.

4. Since this is a search engine, and a search engine can boost a
particular field, the system would let the famous_for overweigh the
collaborative activity, or the opposite (According to the use case, or for
example, number of items in the history) So I can define a weighting
(voting, or mixture of experts) scheme to blend different recommenders.


Are those correct?


Gokhan


On Mon, Jul 22, 2013 at 9:07 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 On 07/22/2013 12:20 PM, Pat Ferrel wrote:


 My understanding of the Solr proposal puts B's row similarity matrix in a
 vector per item. That means each row is turned into terms = external
 IDs--not sure how the weights of each term are encoded.

 This is the key question for me. The best idea I've had is to use termFreq
 as a proxy for weight.  It's only an integer, so there are scaling issues
 to consider, but you can apply a per-field weight to manage that.  Also,
 Lucene (and Solr) doesn't provide an obvious way to load term frequencies
 directly: probably the simplest thing to do is just to repeat the
 cross-term N times and let the text analysis take care of counting them.
  Inefficient, but probably the quickest way to get going.  Alternatively,
 there are some lower level Lucene indexing APIs (DocFieldConsumer et al)
 which I haven't really plumbed entirely, but would allow for more direct
 loading of fields.

 Then one probably wants to override the scoring in some way (unless TFIDF
 is the way to go somehow??)




Re: Setting up a recommender

2013-07-22 Thread Ted Dunning
My experience is that TFIDF works just fine, especially as first cut.

Adding different kinds of data, building out backend A/B testing, tuning
the UI, and weighting the query all come before the next round of weighting changes.
 Typically, the priority stack never empties enough for that task to rise
to the top.


On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 On 07/22/2013 12:20 PM, Pat Ferrel wrote:


 My understanding of the Solr proposal puts B's row similarity matrix in a
 vector per item. That means each row is turned into terms = external
 IDs--not sure how the weights of each term are encoded.

 This is the key question for me. The best idea I've had is to use termFreq
 as a proxy for weight.  It's only an integer, so there are scaling issues
 to consider, but you can apply a per-field weight to manage that.  Also,
 Lucene (and Solr) doesn't provide an obvious way to load term frequencies
 directly: probably the simplest thing to do is just to repeat the
 cross-term N times and let the text analysis take care of counting them.
  Inefficient, but probably the quickest way to get going.  Alternatively,
 there are some lower level Lucene indexing APIs (DocFieldConsumer et al)
 which I haven't really plumbed entirely, but would allow for more direct
 loading of fields.

 Then one probably wants to override the scoring in some way (unless TFIDF
 is the way to go somehow??)




Re: Setting up a recommender

2013-07-22 Thread Ted Dunning
Inline ... slightly redundant relative to other answers, but that shouldn't
be a problem.


On Mon, Jul 22, 2013 at 11:56 AM, Gokhan Capan gkhn...@gmail.com wrote:

 Just to make sure if I understood correctly, Ted, could you please correct
 me?:)


 1. Using a search engine, I will treat items as documents, where each
 document vector consists of other items (similar to words of documents)
 with co-occurrence (LLR) weights (instead of tf-idf in a search engine
 analogy).


LLR will just select indicators.  Weighting can be done using native TF-IDF
stuff that Solr already does.
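
As a sketch of what "select indicators" means in code, mahout-math's LogLikelihood can score one item pair from its cooccurrence counts; the counts and the threshold below are made-up numbers:

import org.apache.mahout.math.stats.LogLikelihood;

public class LlrIndicatorSketch {
  public static void main(String[] args) {
    // Contingency counts for a pair of items (invented for illustration):
    long k11 = 13;      // users who interacted with both items
    long k12 = 1000;    // users who interacted with item A but not item B
    long k21 = 1000;    // users who interacted with item B but not item A
    long k22 = 100000;  // users who interacted with neither

    double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);

    // Keep the pair as an indicator only if the score clears some threshold.
    boolean indicator = llr > 20.0;
    System.out.println("LLR = " + llr + ", indicator = " + indicator);
  }
}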


 So for each item I will have a sparse vector that represents the relation
 of that item to other items, if there is an indicator that makes the
 item-to-item similarity (co-occurrence) non-zero. (I will only use positive
 feedback, I think, since I am counting co-occurrences)


Yes.

Moreover, there will ultimately be multiple fields with different sets of
indicators.  This is how cross recommendation can be integrated.



 2. To present recommendations, the system formulates a query, with a
 history of items --the session history for task based recommendation, or a
 long term history. And the search engine will find top-N items, based on
 the cosine similarities of the item (document) vectors and history (query)
 vectors.


Yes.  Cosine-ish ... the search engine has its own similarity calculation.
 That can be tuned ... later.



 3. For example, if that was a restaurant recommendation, and we knew that
 the restaurant was famous for its sushi, I would index this in another
 field, famous_for.
 Now if, as a user, I asked for sushi restaurants that I would enjoy, the
 system would add this to query along with my history, and the famous sushi
 restaurant would rank higher in results, even if chances are equal that I
 would like a steakhouse according to the computation in 2.


Yes.

Moreover, we might put all the words in the descriptions of restaurants you
have been to lately into a different history field.  Each restaurant would
also have an indicator word field against which we could query using your
history words.

Similarly, we could use cuisine classifiers.

And we can compute a local favorite feature that is essentially a
recommendation indicator from people in a particular area to restaurants.

Recommendation queries can include any or all of these.  Specialized pages
might have a cuisine specific recommendation set for you.



 4. Since this is a search engine, and a search engine can boost a
 particular field, the system would let the famous_for overweigh the
 collaborative activity, or the opposite (According to the use case, or for
 example, number of items in the history) So I can define a weighting
 (voting, or mixture of experts) scheme to blend different recommenders.


yes.  I would recommend doing the blending in the search engine query
itself.


 Are those correct?


Pretty much!


Re: Setting up a recommender

2013-07-22 Thread Ted Dunning
On Mon, Jul 22, 2013 at 9:20 AM, Pat Ferrel p...@occamsmachete.com wrote:

 +10

 Love the academics but I agree with this. Recently saw a VP from Netflix
 plead with the audience (mostly academics) to move past RMSE--focus on
 maximizing correct ranking, not rating prediction.

 Anyway I have a pipeline that does *[ingest, prepare, row-similarity, not
 in m/r]*


Is this available?

replaces PreparePreferenceMatrixJob to create n matrixes depending on the
 number of actions you are splitting out. This job also creates external -
 internal item and user id BiHashMaps for going back and forth between the
 log's IDs and Mahout internal IDs. It guarantees a uniform item and user ID
 space and sparse matrix ranks by creating one from all actions. Not
 completely scalable since it is not done in m/r though it uses HDFS--I have
 a plan to m/r the process and get rid of the hashmap.


Frankly, doing it outside of map-reduce is good for a start and should be
preserved for later.  It makes on-boarding new folks much easier.


 performs the RowSimilarityJob on the primary matrix B and does B'A to
 create a cooccurrence matrix for primary and secondary actions.


What code do you use for B'A?


 Stores all recs from all models in a NoSQL DB.


I recommend not doing this for the demo, but rather storing rows of B'A and
B'B as fields in Solr.


 At rec request time it does a linear combination of req and cross-rec to
 return the highest scored ones.


Should be integrated into the query.


 Does 1-3 fit the first part of 'offline to Solr'? The IDs can be written
 to Solr as the original external IDs from the log files, which were
 strings. This allows them to be treated as terms by Solr.


Yes.  These early steps are very much what I was aiming for.


 My understanding of the Solr proposal puts B's row similarity matrix in a
 vector per item.


For a particular item document, the corresponding row of B'A and the
corresponding row of B'B go into separate fields.  I think you mean B'B
when you say B's row similarity matrix.  Just checking.



 That means each row is turned into terms = external IDs--not sure how
 the weights of each term are encoded.


Again, I just use native Solr weighting.


 So the cross-recommender would just put the cross-action similarity matrix
  in other field(s) on the same itemID/docID, right?


Yes.  Exactly.



 Then the straight out recommender queries on the B'B field(s) and the
 cross-recommender queries on the B'A field(s). I suppose to keep it simple
 the cross-action similarity matrix could be put in a separate index.  Is
 this about right?


Yes.  And the combined recommender would query on both at the same time.
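
A possible shape for that combined query, extending the single-field version above with per-field boosts as the ensemble weights (boost values, field names, and histories are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CombinedRecQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    String primaryHistory = "iphone ipad";            // history of the B action
    String secondaryHistory = "iphone ipad galaxy";   // history of the A action

    // One request over both indicator fields; the ^ boosts act as ensemble weights.
    SolrQuery q = new SolrQuery(
        "similar_items:(" + primaryHistory + ")^2.0"
        + " OR cross_action_similar_items:(" + secondaryHistory + ")^1.0");
    q.setRows(10);

    QueryResponse rsp = solr.query(q);
    for (SolrDocument d : rsp.getResults()) {
      System.out.println(d.getFieldValue("item_id"));
    }
  }
}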


Re: Setting up a recommender

2013-07-22 Thread Michael Sokolov
So you are proposing just grabbing the top N scoring related items and 
indexing listing them without regard to weight?  Effectively quantizing 
the weights to = 1, and 0 for everything else?  I guess LLR tends to do 
that anyway


-Mike

On 07/22/2013 02:57 PM, Ted Dunning wrote:

My experience is that TFIDF works just fine, especially as first cut.

Adding different kinds of data, building out backend A/B testing, tuning
the UI, weighting the query all come the next round of weighting changes.
  Typically, the priority stack never empties enough for that task to rise
to the top.


On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:


On 07/22/2013 12:20 PM, Pat Ferrel wrote:


My understanding of the Solr proposal puts B's row similarity matrix in a
vector per item. That means each row is turned into terms = external
IDs--not sure how the weights of each term are encoded.


This is the key question for me. The best idea I've had is to use termFreq
as a proxy for weight.  It's only an integer, so there are scaling issues
to consider, but you can apply a per-field weight to manage that.  Also,
Lucene (and Solr) doesn't provide an obvious way to load term frequencies
directly: probably the simplest thing to do is just to repeat the
cross-term N times and let the text analysis take care of counting them.
  Inefficient, but probably the quickest way to get going.  Alternatively,
there are some lower level Lucene indexing APIs (DocFieldConsumer et al)
which I haven't really plumbed entirely, but would allow for more direct
loading of fields.

Then one probably wants to override the scoring in some way (unless TFIDF
is the way to go somehow??)






Re: Setting up a recommender

2013-07-22 Thread Pat Ferrel
inline

BTW if there is an LLR cross-similarity job (replacing [B'A]) it is easy to 
integrate.


On Jul 22, 2013, at 12:09 PM, Ted Dunning ted.dunn...@gmail.com wrote:

On Mon, Jul 22, 2013 at 9:20 AM, Pat Ferrel p...@occamsmachete.com wrote:

 +10
 
 Love the academics but I agree with this. Recently saw a VP from Netflix
 plead with the audience (mostly academics) to move past RMSE--focus on
 maximizing correct ranking, not rating prediction.
 
 Anyway I have a pipeline that does *[ingest, prepare, row-similarity, not
 in m/r]*
 

Is this available?

Pat-- Can quickly be. In Github. I'd have to clean up a bit.


replaces PreparePreferenceMatrixJob to create n matrixes depending on the
 number of actions you are splitting out. This job also creates external -
 internal item and user id BiHashMaps for going back and forth between the
 log's IDs and Mahout internal IDs. It guarantees a uniform item and user ID
 space and sparse matrix ranks by creating one from all actions. Not
 completely scalable since it is not done in m/r though it uses HDFS--I have
 a plan to m/r the process and get rid of the hashmap.
 

Frankly, doing it outside of map-reduce is good for a start and should be
preserved for later.  It makes on-boarding new folks much easier.

Pat-- It uses the hadoop version of the matrix mult and RowSimilarityJob in 
later steps but they work without a cluster in local mode.


 performs the RowSimilarityJob on the primary matrix B and does B'A to
 create a cooccurrence matrix for primary and secondary actions.
 

What code do you use for B'A?

Pat-- matrix transposes and multiply from Mahout.

 Stores all recs from all models in a NoSQL DB.
 

I recommend not doing this for the demo, but rather storing rows of B'A and
B'B as fields in Solr.

Pat-- yes, just explaining for completeness

 At rec request time it does a linear combination of req and cross-rec to
 return the highest scored ones.


Should be integrated into the query.

Pat-- yes, just explaining for completeness

 Does 1-3 fit the first part of 'offline to Solr'? The IDs can be written
 to Solr as the original external IDs from the log files, which were
 strings. This allows them to be treated as terms by Solr.
 

Yes.  These early steps are very much what I was aiming for.

Pat-- OK, happy to contribute if possible; let me know who to coordinate with.

 My understanding of the Solr proposal puts B's row similarity matrix in a
 vector per item.


For a particular item document, the corresponding row of B'A and the
corresponding row of B'B go into separate fields.  I think you mean B'B
when you say B's row similarity matrix.  Just checking.

Pat-- yes, exactly

 That means each row is turned into terms = external IDs--not sure how
 the weights of each term are encoded.


Again, I just use native Solr weighting.

Pat-- good, that makes this fairly simple I expect. Just fields with bags of 
term strings.


 So the cross-recommender would just put the cross-action similarity matrix
 in other field(s) on the same itemID/docID, right?
 

Yes.  Exactly.


 
 Then the straight out recommender queries on the B'B field(s) and the
 cross-recommender queries on the B'A field(s). I suppose to keep it simple
 the cross-action similarity matrix could be put in a separate index.  Is
 this about right?
 

Yes.  And the combined recommender would query on both at the same time.

Pat-- doesn't it need ensemble type weighting for each recommender component? 
Probably a wishlist item for later?



Re: Setting up a recommender

2013-07-22 Thread Ted Dunning
On Mon, Jul 22, 2013 at 12:40 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Yes.  And the combined recommender would query on both at the same time.

 Pat-- doesn't it need ensemble type weighting for each recommender
 component? Probably a wishlist item for later?


Yes.  Weighting different fields differently is a very nice (and very easy
feature).


Re: Setting up a recommender

2013-07-22 Thread Ted Dunning
Not entirely without regard to weight.  Just without regard to designing
weights specific to this application.  The weights that Solr uses natively
are intuitively what we want (rare indicators have higher weights in a
log-ish kind of way).

Frankly, I doubt the effectiveness here of mathematical reasoning for
getting a better weighting.  The deviations from optimal relative to the
Solr defaults are probably as large as the deviations from the assumptions
that the mathematically motivated weightings are based on.  Fixing this is
spending a lot for small potatoes.   Fixing the data flow and getting
access to more data is far higher value.



On Mon, Jul 22, 2013 at 12:18 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 So you are proposing just grabbing the top N scoring related items and
 indexing listing them without regard to weight?  Effectively quantizing the
 weights to = 1, and 0 for everything else?  I guess LLR tends to do that
 anyway

 -Mike


 On 07/22/2013 02:57 PM, Ted Dunning wrote:

 My experience is that TFIDF works just fine, especially as first cut.

 Adding different kinds of data, building out backend A/B testing, tuning
 the UI, weighting the query all come the next round of weighting changes.
   Typically, the priority stack never empties enough for that task to rise
 to the top.


 On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov 
 msoko...@safaribooksonline.com wrote:

  On 07/22/2013 12:20 PM, Pat Ferrel wrote:

  My understanding of the Solr proposal puts B's row similarity matrix in
 a
 vector per item. That means each row is turned into terms = external
 IDs--not sure how the weights of each term are encoded.

  This is the key question for me. The best idea I've had is to use
 termFreq
 as a proxy for weight.  It's only an integer, so there are scaling issues
 to consider, but you can apply a per-field weight to manage that.  Also,
 Lucene (and Solr) doesn't provide an obvious way to load term frequencies
 directly: probably the simplest thing to do is just to repeat the
 cross-term N times and let the text analysis take care of counting them.
   Inefficient, but probably the quickest way to get going.
  Alternatively,
 there are some lower level Lucene indexing APIs (DocFieldConsumer et al)
 which I haven't really plumbed entirely, but would allow for more direct
 loading of fields.

 Then one probably wants to override the scoring in some way (unless TFIDF
 is the way to go somehow??)






Re: Setting up a recommender

2013-07-22 Thread Michael Sokolov
Fair enough - thanks for clarifying. I wondered whether that would be 
worth the trouble, also.  Maybe one of the academics Pat mentioned will 
test and find out for us :)



On 7/22/13 6:45 PM, Ted Dunning wrote:


Not entirely without regard to weight.  Just without regard to 
designing weights specific to this application.  The weights that Solr 
uses natively are intuitively what we want (rare indicators have 
higher weights in a log-ish kind of way).


Frankly, I doubt the effectiveness here of mathematical reasoning for 
getting a better weighting.  The deviations from optimal relative to 
the Solr defaults are probably as large as the deviations from the 
assumptions that the mathematically motivated weightings are based on. 
 Fixing this is spending a lot for small potatoes.   Fixing the data 
flow and getting access to more data is far higher value.



On Mon, Jul 22, 2013 at 12:18 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:


So you are proposing just grabbing the top N scoring related items
and indexing listing them without regard to weight?  Effectively
quantizing the weights to = 1, and 0 for everything else?  I guess
LLR tends to do that anyway

-Mike


On 07/22/2013 02:57 PM, Ted Dunning wrote:

My experience is that TFIDF works just fine, especially as
first cut.

Adding different kinds of data, building out backend A/B
testing, tuning
the UI, weighting the query all come the next round of
weighting changes.
  Typically, the priority stack never empties enough for that
task to rise
to the top.


On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

On 07/22/2013 12:20 PM, Pat Ferrel wrote:

My understanding of the Solr proposal puts B's row
similarity matrix in a
vector per item. That means each row is turned into
terms = external
IDs--not sure how the weights of each term are encoded.

This is the key question for me. The best idea I've had is
to use termFreq
as a proxy for weight.  It's only an integer, so there are
scaling issues
to consider, but you can apply a per-field weight to
manage that.  Also,
Lucene (and Solr) doesn't provide an obvious way to load
term frequencies
directly: probably the simplest thing to do is just to
repeat the
cross-term N times and let the text analysis take care of
counting them.
  Inefficient, but probably the quickest way to get going.
 Alternatively,
there are some lower level Lucene indexing APIs
(DocFieldConsumer et al)
which I haven't really plumbed entirely, but would allow
for more direct
loading of fields.

Then one probably wants to override the scoring in some
way (unless TFIDF
is the way to go somehow??)








Re: Setting up a recommender

2013-07-21 Thread Stevo Slavić
I see Ted created JIRA ticket for this already:
https://issues.apache.org/jira/browse/MAHOUT-1288
We should consider changing issue type (currently - bug).

One might find this Berlin Buzzwords 2013
recording http://www.youtube.com/watch?v=fWR1T2pY08Y and
slides http://www.slideshare.net/tdunning/buzz-wordsdunningmultimodalrecommendation of
Ted's talk on the subject helpful to understand the terms used and
idea.

I guess we could start with single kind of interaction/behavior, and
consider adding more later.

Shall we make it a separate subproject (so on the level of mahout and site, but
still under mahout svn), or make a new mahout submodule, or change mahout
examples from a single module to a multimodule structure and add the
recommender demo as a submodule there?

I'm fine with Maven tasks, to some extent Solr too (not the most recent
versions, but I see it as nice opportunity to update).

Kind regards,
Stevo Slavic.


On Sun, Jul 21, 2013 at 12:15 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 To kick this off, I have created a design document that is open for
 comments.  Much detail is needed here.  I will create a JIRA as well, but
 the google doc is much easier for collating lots of input into a coherent
 document.

 The directory that the document is stored in is accessible at

 http://bit.ly/18vbbaT

 Once we get going, we can talk about how to coordinate tasks between
 hangouts.  One option is a public Trello project: https://trello.com/ or
 we
 can use JIRA sub-tasks.


 On Sat, Jul 20, 2013 at 11:25 AM, Andrew Psaltis 
 andrew.psal...@webtrends.com wrote:

  I am very interested in collaborating on the off-line to Solr part. Just
  let me know how we want to get going.
 
  Thanks,
  Andrew
 
 
 
 
 
  On 7/19/13 4:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  OK.  I think the crux here is the off-line to Solr part so let's see who
  else pops up.
  
  Having a solr maven could be very helpful.
  
  
  On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo 
  lcguerreroc...@gmail.com wrote:
  
   I'm currently working for a portal that has a similar use case and I
 was
   thinking of implementing this in a similar way. I'm generating
   recommendations using python scripts based on similarity measures
  (content
   based recommendation) only using euclidean distance and some weights
 for
   each attribute. I want to use mahout's GenericItemBasedRecommender to
   generate these same recommendations without user data (no tracking
 right
   now of user to item relationship). I was thinking of pushing the
  generated
   recommendations to solr using atomic updates since my fields are all
  stored
   right now. Since this is very similar to what I'm trying to
 accomplish,
  I
   would sign up to collaborate in any way I can since I'm fairly
 familiar
   with solr and I'm starting to learn my way around mahout.
  
  
   On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org
   wrote:
  
I would also be willing to provide guidance and advice for anyone
  taking
this on, I can especially help with the offline analysis part.
   
--sebastian
   
   
2013/7/19 Ted Dunning ted.dunn...@gmail.com
   
 I would be happy to supervise a project to implement a demo of
 this
  if
 anybody is willing to do the grunt work of gluing things together.

 Sooo, if you would like to work on this, here is a suggested
  project.

 This project would entail:

 a) build a synthetic data source

 b) write scripts to do the off-line analysis

 c) write scripts to export to Solr

 d) write a very quick web facade over Solr to make it look like a
 recommendation engine.  This would include

   d.1) a most popular page that does combined popularity rise
 and
 recommendation

   d.2) a personal recommendation page that does just
  recommendation
with
 dithering

   d.3) item pages with related items at the bottom

 e) work with others to provide high quality system walk-through
 and
install
 directions

 If you want to bite on this, we should arrange a weekly video
  hangout.
I
 am willing to commit to guiding and providing detailed technical
 approaches.  You should be willing to commit to actually doing
  stuff.

 The goal would be to provide a fully worked out scaffolding of a
practical
 recommendation system that presumably would become an example
  module in
 Mahout.


 On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com
  wrote:

  +1 as well.  Sounds fun.
 
  On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner 
   cont...@dhuebner.com
  wrote:
 
   +1 for getting something like that in a future release of
 Mahout
  
   On Jul 19, 2013, at 10:02 PM, Sebastian Schelter
  s...@apache.org
 wrote:
  
It would be awesome if we could get a nice, easily
 deployable

Re: Setting up a recommender

2013-07-21 Thread Iker Huerga
Hi,

First of all, Ted, very inspiring video, I really enjoyed the concept of
cross-occurrences.

Secondly, I'd be very interested in collaborating on this project and here
is why. I've been recently working for my employer on a very similar
project that is currently deployed into our production environment.

We built a recommender system that takes as input ontology instances
identified in documents by an NLP process, and generates document
recommendations as output. We used a big training set with positive and
false-positive matches to improve the accuracy of the output. All these
documents are indexed in Solr, for which we built a recommender
RequestHandler that makes use of a RecommenderQParsePlugin we also built for
Solr.

With this we can provide recommendations to a user who is reading a
document, but in the next iterations we are working towards providing
recommendations based on multiple kinds of inputs, not only annotations.

This said, I would like to collaborate with you guys on the development
part of this project, just let me know how/where we can organize the user
stories and tasks.

I think a conference call, maybe a hangout, to kick off the project would
be useful. Who should schedule it?

Thanks
Iker




2013/7/20 Ted Dunning ted.dunn...@gmail.com

 To kick this off, I have created a design document that is open for
 comments.  Much detail is needed here.  I will create a JIRA as well, but
 the google doc is much easier for collating lots of input into a coherent
 document.

 The directory that the document is stored in is accessible at

 http://bit.ly/18vbbaT

 Once we get going, we can talk about how to coordinate tasks between
 hangouts.  One option is a public Trello project: https://trello.com/ or
 we
 can use JIRA sub-tasks.


 On Sat, Jul 20, 2013 at 11:25 AM, Andrew Psaltis 
 andrew.psal...@webtrends.com wrote:

  I am very interested in collaborating on the off-line to Solr part. Just
  let me know how we want to get going.
 
  Thanks,
  Andrew
 
 
 
 
 
  On 7/19/13 4:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  OK.  I think the crux here is the off-line to Solr part so let's see who
  else pops up.
  
  Having a solr maven could be very helpful.
  
  
  On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo 
  lcguerreroc...@gmail.com wrote:
  
   I'm currently working for a portal that has a similar use case and I
 was
   thinking of implementing this in a similar way. I'm generating
   recommendations using python scripts based on similarity measures
  (content
   based recommendation) only using euclidean distance and some weights
 for
   each attribute. I want to use mahout's GenericItemBasedRecommender to
   generate these same recommendations without user data (no tracking
 right
   now of user to item relationship). I was thinking of pushing the
  generated
   recommendations to solr using atomic updates since my fields are all
  stored
   right now. Since this is very similar to what I'm trying to
 accomplish,
  I
   would sign up to collaborate in any way I can since I'm fairly
 familiar
   with solr and I'm starting to learn my way around mahout.
  
  
   On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org
   wrote:
  
I would also be willing to provide guidance and advice for anyone
  taking
this on, I can especially help with the offline analysis part.
   
--sebastian
   
   
2013/7/19 Ted Dunning ted.dunn...@gmail.com
   
 I would be happy to supervise a project to implement a demo of
 this
  if
 anybody is willing to do the grunt work of gluing things together.

 Sooo, if you would like to work on this, here is a suggested
  project.

 This project would entail:

 a) build a synthetic data source

 b) write scripts to do the off-line analysis

 c) write scripts to export to Solr

 d) write a very quick web facade over Solr to make it look like a
 recommendation engine.  This would include

   d.1) a most popular page that does combined popularity rise
 and
 recommendation

   d.2) a personal recommendation page that does just
  recommendation
with
 dithering

   d.3) item pages with related items at the bottom

 e) work with others to provide high quality system walk-through
 and
install
 directions

 If you want to bite on this, we should arrange a weekly video
  hangout.
I
 am willing to commit to guiding and providing detailed technical
 approaches.  You should be willing to commit to actually doing
  stuff.

 The goal would be to provide a fully worked out scaffolding of a
practical
 recommendation system that presumably would become an example
  module in
 Mahout.


 On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com
  wrote:

  +1 as well.  Sounds fun.
 
  On Fri, Jul 19, 2013 at 4:06 

Re: Setting up a recommender

2013-07-21 Thread Pat Ferrel
Read the paper, and the preso.

As to the 'offline to Solr' part: it sounds like you are suggesting that an item-item 
similarity matrix be stored and indexed in Solr. One would have to create 
the action matrix from user profile data (preference history), run a 
RowSimilarity job on it (using LLR similarity), and move the result to Solr. The 
first part of this is nearly identical to the current recommender job workflow 
and could pretty easily be created from it, I think. The new part is taking the 
DistributedRowMatrix and storing it in a particular way in Solr, right?
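
A minimal sketch of what that first part might look like, assuming Mahout's
existing ItemSimilarityJob (which wraps the preference-matrix prep plus
RowSimilarityJob) and purely illustrative HDFS paths; the Solr hand-off is the
genuinely new piece and is sketched further down the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

public class ActionMatrixToIndicators {
  public static void main(String[] args) throws Exception {
    // Run the stock item-item similarity pipeline with LLR over an action log of
    // userID,itemID[,pref] lines. Paths and option values here are made up.
    ToolRunner.run(new Configuration(), new ItemSimilarityJob(), new String[] {
        "--input", "hdfs:///rec/actions.csv",
        "--output", "hdfs:///rec/item-item-llr",
        "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
        "--maxSimilaritiesPerItem", "50",   // keep only the strongest links per item
        "--booleanData", "true"             // treat actions as 0/1 events, not ratings
    });
    // Output is itemA, itemB, score triples; the new step is turning each item's
    // row of similar items into a Solr document.
  }
}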

BTW Is there some reason not to use an existing real data set?

On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

OK.  I think the crux here is the off-line to Solr part so let's see who
else pops up.

Having a solr maven could be very helpful.




Re: Setting up a recommender

2013-07-21 Thread B Lyon
Paper and presentation are very interesting to me as well.  I am fairly new
to this, and coming to terms with some of the terms, etc.  I assume that the
action matrix here is just the raw matrix of how each user has
interacted with the items/types-of-items.  I didn't quite get the
incorporation into Solr (not familiar with that much, either), in
particular the indexing related to the generated (root-LLR-based?)
co-occurrence matrices for the different types of things so that it can be
used in searches - so, a real newbie question: how can the co-occurrence
matrix be implemented as a search index in Solr?  Just pointing me at the RTFM
docs is fine :)
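
Not an authoritative answer, but one plausible shape of it, sketched with SolrJ:
each item becomes a Solr document whose "indicators" field holds the ids of its
LLR-significant co-occurring items (one thresholded row of the co-occurrence
matrix), and recommending is then an ordinary text query of the user's recent
item ids against that field, with Solr's relevance scoring doing the ranking.
The core and field names below are assumptions, not a fixed schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class IndicatorIndexSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

    // Index one row of the indicator matrix: "ipad" is linked (by LLR) to these items.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "ipad");
    doc.addField("indicators", "iphone macbook galaxy");
    solr.add(doc);
    solr.commit();

    // Query with the user's recent history OR'ed against the indicator field.
    SolrQuery query = new SolrQuery("indicators:(iphone macbook)");
    query.setRows(10);
    QueryResponse response = solr.query(query);
    for (SolrDocument d : response.getResults()) {
      System.out.println(d.getFieldValue("id"));   // items, most similar first
    }
  }
}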

On Sun, Jul 21, 2013 at 5:17 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Read the paper, and the preso.

 As to the 'offline to Solr' part. It sounds like you are suggesting an
 item item similarity matrix be stored and indexed in Solr. One would have
 to create the action matrix from user profile data (preference history), do
 a rowsimiarity job on it (using LLR similarity) and move the result to
 Solr. The first part of this is nearly identical to the current recommender
 job workflow and could pretty easily be created from it I think. The new
 part is taking the DistributedRowMatrix and storing it in a particular way
 in Solr, right?

 BTW Is there some reason not to use an existing real data set?

 On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 OK.  I think the crux here is the off-line to Solr part so let's see who
 else pops up.

 Having a solr maven could be very helpful.





-- 
BF Lyon
http://www.nowherenearithaca.com


Re: Setting up a recommender

2013-07-21 Thread Ted Dunning
Pat,

Yes.  The first part probably just is the RowSimilarity job, especially
after Sebastian puts in the down-sampling.

The new part is exactly as you say, storing the DRM into Solr indexes.

There is no reason to not use a real data set.  There is a strong reason to
use a synthetic dataset, however, in that it can be trivially scaled up and
down both in items and users.  Also, the synthetic dataset doesn't require
that the real data be found and downloaded.
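
For concreteness, a toy sketch of the kind of synthetic source being described:
user and item counts are parameters, item popularity is roughly Zipf-distributed,
and the output is userID,itemID lines that the off-line jobs can consume. Purely
illustrative, not a proposed design:

import java.util.Random;

public class SyntheticActions {
  public static void main(String[] args) {
    int users = 10000, items = 1000, actionsPerUser = 20;
    Random rnd = new Random(42);

    // Cumulative Zipf-like weights so a few popular items get most of the traffic.
    double[] cdf = new double[items];
    double sum = 0;
    for (int i = 0; i < items; i++) { sum += 1.0 / (i + 1); cdf[i] = sum; }

    for (int u = 0; u < users; u++) {
      for (int a = 0; a < actionsPerUser; a++) {
        double r = rnd.nextDouble() * sum;
        int item = 0;
        while (cdf[item] < r) item++;        // inverse-CDF sampling
        System.out.println("user" + u + ",item" + item);
      }
    }
  }
}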



On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Read the paper, and the preso.

 As to the 'offline to Solr' part. It sounds like you are suggesting an
 item item similarity matrix be stored and indexed in Solr. One would have
 to create the action matrix from user profile data (preference history), do
 a rowsimiarity job on it (using LLR similarity) and move the result to
 Solr. The first part of this is nearly identical to the current recommender
 job workflow and could pretty easily be created from it I think. The new
 part is taking the DistributedRowMatrix and storing it in a particular way
 in Solr, right?

 BTW Is there some reason not to use an existing real data set?

 On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 OK.  I think the crux here is the off-line to Solr part so let's see who
 else pops up.

 Having a solr maven could be very helpful.





Re: Setting up a recommender

2013-07-21 Thread Ted Dunning
On Sun, Jul 21, 2013 at 8:10 AM, Iker Huerga iker.hue...@gmail.com wrote:

 I think a conference call, maybe a hangout, to kick off the project would
 be useful, who should schedule it?


I will shortly do that.

I think that I will need more than one kickoff to deal with timezones.  I
will coordinate these ahead of time on the mailing list.  Due to the
limitations[1] of Google hangouts with regard to saving and scheduling
ahead of time, I will only be able to get the actual URL just shortly
before the scheduled time.  I will mail that to the mailing list and also
put the URL into the shared design directory, probably in a spreadsheet.
 The meetings will be visible on Youtube afterwards.


[1] The problem here is that I have been able to schedule a hangout, but
not to save that hangout to YouTube.  I have also been able to save an
unscheduled meetup, but was unable to figure out how to get a URL for such
a hangout ahead of time.  This may have changed, but I will still work
around it this time to be sure we succeed.


Re: Setting up a recommender

2013-07-21 Thread Pat Ferrel
RowSimilarity downsampling? Are you referring to a mod of the matrix 
multiply to do cross-similarity with LLR for the cross-recommendations? So 
similarity of rows of B with rows of A?

Sounds like you are proposing not only putting a recommender in Solr but also a 
cross-recommender? Is this why getting a real data set is problematic?

On Jul 21, 2013, at 3:40 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Pat,

Yes.  The first part probably just is the RowSimilarity job, especially
after Sebastian puts in the down-sampling.

The new part is exactly as you say, storing the DRM into Solr indexes.

There is no reason to not use a real data set.  There is a strong reason to
use a synthetic dataset, however, in that it can be trivially scaled up and
down both in items and users.  Also, the synthetic dataset doesn't require
that the real data be found and downloaded.



On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Read the paper, and the preso.
 
 As to the 'offline to Solr' part. It sounds like you are suggesting an
 item item similarity matrix be stored and indexed in Solr. One would have
 to create the action matrix from user profile data (preference history), do
 a rowsimiarity job on it (using LLR similarity) and move the result to
 Solr. The first part of this is nearly identical to the current recommender
 job workflow and could pretty easily be created from it I think. The new
 part is taking the DistributedRowMatrix and storing it in a particular way
 in Solr, right?
 
 BTW Is there some reason not to use an existing real data set?
 
 On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 OK.  I think the crux here is the off-line to Solr part so let's see who
 else pops up.
 
 Having a solr maven could be very helpful.
 
 
 



Re: Setting up a recommender

2013-07-21 Thread Ted Dunning
The row similarity downsampling is just a matter of dropping elements at
random from rows that have more data than we want.

If the join that puts the row together can handle two kinds of input, then
RowSimilarity can be easily modified to be CrossRowSimilarity.  Likewise,
if we have two DRMs with the same row ids in the same order, we can do a
map-side merge.  Such a merge can be very efficient on a system like MapR,
where you can control which nodes files live on.
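
A sketch of that down-sampling step, shown on a plain list of item ids for one
user rather than on actual DRM rows (which is where it would really happen):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RowDownSampling {
  // If a row has more non-zeros than we want, drop elements at random until it fits.
  static List<Long> downSample(List<Long> itemIds, int maxPerRow, Random rnd) {
    if (itemIds.size() <= maxPerRow) {
      return itemIds;                        // small rows pass through untouched
    }
    List<Long> copy = new ArrayList<Long>(itemIds);
    Collections.shuffle(copy, rnd);          // random order == random drops
    return new ArrayList<Long>(copy.subList(0, maxPerRow));
  }
}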


On Sun, Jul 21, 2013 at 4:43 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 RowSimilarity downsampling? Are you referring to the a mod of the matrix
 multiply to do cross similarity with LLR for the cross recommendations? So
 similarity of rows of B with rows of A?

 Sounds like you are proposing not only putting a recommender in Solr but
 also a cross-recommender? This is why getting a real data set is
 problematic?

 On Jul 21, 2013, at 3:40 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Pat,

 Yes.  The first part probably just is the RowSimilarity job, especially
 after Sebastian puts in the down-sampling.

 The new part is exactly as you say, storing the DRM into Solr indexes.

 There is no reason to not use a real data set.  There is a strong reason to
 use a synthetic dataset, however, in that it can be trivially scaled up and
 down both in items and users.  Also, the synthetic dataset doesn't require
 that the real data be found and downloaded.



 On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  Read the paper, and the preso.
 
  As to the 'offline to Solr' part. It sounds like you are suggesting an
  item item similarity matrix be stored and indexed in Solr. One would have
  to create the action matrix from user profile data (preference history),
 do
  a rowsimiarity job on it (using LLR similarity) and move the result to
  Solr. The first part of this is nearly identical to the current
 recommender
  job workflow and could pretty easily be created from it I think. The new
  part is taking the DistributedRowMatrix and storing it in a particular
 way
  in Solr, right?
 
  BTW Is there some reason not to use an existing real data set?
 
  On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  OK.  I think the crux here is the off-line to Solr part so let's see who
  else pops up.
 
  Having a solr maven could be very helpful.
 
 
 




Re: Setting up a recommender

2013-07-21 Thread Sebastian Schelter
At the moment, the down sampling is done by PreparePreferenceMatrixJob
for the collaborative filtering functionality. We just want to move it
down to RowSimilarityJob to enable standalone usage.

I think that the CrossRecommender should be the next thing on our
agenda, after we have the deployment infrastructure.  I especially like
that it's capable of including different kinds of interactions, as opposed
to most other (academically motivated) recommenders that focus on a
single interaction type like a rating.

--sebastian

On 22.07.2013 02:14, Ted Dunning wrote:
 The row similarity downsampling is just a matter of dropping elements at
 random from rows that have more data than we want.
 
 If the join that puts the row together can handle two kinds of input, then
 RowSimilarity can be easily modified to be CrossRowSimilarity.  Likewise,
 if we have two DRM's with the same row id's in the same order, we can do a
 map-side merge.  Such a merge can be very efficient on a system like MapR
 where you can control files to live on the same nodes.
 
 
 On Sun, Jul 21, 2013 at 4:43 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 RowSimilarity downsampling? Are you referring to the a mod of the matrix
 multiply to do cross similarity with LLR for the cross recommendations? So
 similarity of rows of B with rows of A?

 Sounds like you are proposing not only putting a recommender in Solr but
 also a cross-recommender? This is why getting a real data set is
 problematic?

 On Jul 21, 2013, at 3:40 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Pat,

 Yes.  The first part probably just is the RowSimilarity job, especially
 after Sebastian puts in the down-sampling.

 The new part is exactly as you say, storing the DRM into Solr indexes.

 There is no reason to not use a real data set.  There is a strong reason to
 use a synthetic dataset, however, in that it can be trivially scaled up and
 down both in items and users.  Also, the synthetic dataset doesn't require
 that the real data be found and downloaded.



 On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Read the paper, and the preso.

 As to the 'offline to Solr' part. It sounds like you are suggesting an
 item item similarity matrix be stored and indexed in Solr. One would have
 to create the action matrix from user profile data (preference history),
 do
 a rowsimiarity job on it (using LLR similarity) and move the result to
 Solr. The first part of this is nearly identical to the current
 recommender
 job workflow and could pretty easily be created from it I think. The new
 part is taking the DistributedRowMatrix and storing it in a particular
 way
 in Solr, right?

 BTW Is there some reason not to use an existing real data set?

 On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 OK.  I think the crux here is the off-line to Solr part so let's see who
 else pops up.

 Having a solr maven could be very helpful.





 



Re: Setting up a recommender

2013-07-20 Thread Manuel Blechschmidt
Hello,
if there is high demand for this functionality, my company 
(http://www.apaxo.de/us/recitems.html) could implement it. Nevertheless, we 
can't do it for free, so if it is possible to get a shared budget from 
everybody who is interested in this, then it would be possible to write it.

The codehaus JIRA has an incentive functionality:
https://secure.donay.com/site/index

Perhaps this might also be useful for the Mahout (a.k.a. Apache) JIRA.

/Manuel

Am 20.07.2013 um 00:45 schrieb Ted Dunning:

 OK.  I think the crux here is the off-line to Solr part so let's see who
 else pops up.
 
 Having a solr maven could be very helpful.
 
 
 On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo 
 lcguerreroc...@gmail.com wrote:
 
 I'm currently working for a portal that has a similar use case and I was
 thinking of implementing this in a similar way. I'm generating
 recommendations using python scripts based on similarity measures (content
 based recommendation) only using euclidean distance and some weights for
 each attribute. I want to use mahout's GenericItemBasedRecommender to
 generate these same recommendations without user data (no tracking right
 now of user to item relationship). I was thinking of pushing the generated
 recommendations to solr using atomic updates since my fields are all stored
 right now. Since this is very similar to what I'm trying to accomplish, I
 would sign up to collaborate in any way I can since I'm fairly familiar
 with solr and I'm starting to learn my way around mahout.
 
 
 On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org
 wrote:
 
 I would also be willing to provide guidance and advice for anyone taking
 this on, I can especially help with the offline analysis part.
 
 --sebastian
 
 
 2013/7/19 Ted Dunning ted.dunn...@gmail.com
 
 I would be happy to supervise a project to implement a demo of this if
 anybody is willing to do the grunt work of gluing things together.
 
 Sooo, if you would like to work on this, here is a suggested project.
 
 This project would entail:
 
 a) build a synthetic data source
 
 b) write scripts to do the off-line analysis
 
 c) write scripts to export to Solr
 
 d) write a very quick web facade over Solr to make it look like a
 recommendation engine.  This would include
 
  d.1) a most popular page that does combined popularity rise and
 recommendation
 
  d.2) a personal recommendation page that does just recommendation
 with
 dithering
 
  d.3) item pages with related items at the bottom
 
 e) work with others to provide high quality system walk-through and
 install
 directions
 
 If you want to bite on this, we should arrange a weekly video hangout.
 I
 am willing to commit to guiding and providing detailed technical
 approaches.  You should be willing to commit to actually doing stuff.
 
 The goal would be to provide a fully worked out scaffolding of a
 practical
 recommendation system that presumably would become an example module in
 Mahout.
 
 
 On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com wrote:
 
 +1 as well.  Sounds fun.
 
 On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner 
 cont...@dhuebner.com
 wrote:
 
 +1 for getting something like that in a future release of Mahout
 
 On Jul 19, 2013, at 10:02 PM, Sebastian Schelter s...@apache.org
 wrote:
 
 It would be awesome if we could get a nice, easily deployable
 implementation of that approach into Mahout before 1.0
 
 
 2013/7/19 Ted Dunning ted.dunn...@gmail.com
 
 My current advice is to use Hadoop (if necessary) to build a
 sparse
 item-item matrix based on each kind of behavior you have and
 then
 drop
 those similarities into a search engine to deliver the actual
 recommendations.  This allows lots of flexibility in terms of
 which
 kinds
 of inputs you use for the recommendation and lets you blend
 recommendations
 with search and geo-location.
 
 
 On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
 helder.ga...@corp.terra.com.br wrote:
 
 Hi,
 I'm a dev working for a web portal in Brazil and I'm
 particularly
 interested in building a item-based collaborative filtering
 recommender
 for our database of news articles.
 After some coding, I was able to get some recommendations
 using a
 GenericItemBasedRecommender, a CassandraDataModel and some
 custom
 classes that store item similarities and migrated item IDs into
 Cassandra. But know I'm in doubt of what is normally done with
 this
 recommender: Should I run this as a daemon, cache the
 recommendations
 into memory and set up a web service to consult it online?
 Should I
 pre
 process these recommendations for each recent user and store it
 somewhere? My first idea was storing all these recs back into
 Cassandra,
 but looking into some classes it seems to me that the norm is
 to
 read
 the input data and store the output always using files. Is
 this a
 common
 practice that benefits from HDFS?
 My use case here is something around 70k recommendations
 requests
 per
 second.
 
 Thanks in 

Re: Setting up a recommender

2013-07-20 Thread Andrew Psaltis
I am very interested in collaborating on the off-line to Solr part. Just
let me know how we want to get going.

Thanks,
Andrew





On 7/19/13 4:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

OK.  I think the crux here is the off-line to Solr part so let's see who
else pops up.

Having a solr maven could be very helpful.


On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo 
lcguerreroc...@gmail.com wrote:

 I'm currently working for a portal that has a similar use case and I was
 thinking of implementing this in a similar way. I'm generating
 recommendations using python scripts based on similarity measures
(content
 based recommendation) only using euclidean distance and some weights for
 each attribute. I want to use mahout's GenericItemBasedRecommender to
 generate these same recommendations without user data (no tracking right
 now of user to item relationship). I was thinking of pushing the
generated
 recommendations to solr using atomic updates since my fields are all
stored
 right now. Since this is very similar to what I'm trying to accomplish,
I
 would sign up to collaborate in any way I can since I'm fairly familiar
 with solr and I'm starting to learn my way around mahout.


 On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org
 wrote:

  I would also be willing to provide guidance and advice for anyone
taking
  this on, I can especially help with the offline analysis part.
 
  --sebastian
 
 
  2013/7/19 Ted Dunning ted.dunn...@gmail.com
 
   I would be happy to supervise a project to implement a demo of this
if
   anybody is willing to do the grunt work of gluing things together.
  
   Sooo, if you would like to work on this, here is a suggested
project.
  
   This project would entail:
  
   a) build a synthetic data source
  
   b) write scripts to do the off-line analysis
  
   c) write scripts to export to Solr
  
   d) write a very quick web facade over Solr to make it look like a
   recommendation engine.  This would include
  
 d.1) a most popular page that does combined popularity rise and
   recommendation
  
 d.2) a personal recommendation page that does just
recommendation
  with
   dithering
  
 d.3) item pages with related items at the bottom
  
   e) work with others to provide high quality system walk-through and
  install
   directions
  
   If you want to bite on this, we should arrange a weekly video
hangout.
  I
   am willing to commit to guiding and providing detailed technical
   approaches.  You should be willing to commit to actually doing
stuff.
  
   The goal would be to provide a fully worked out scaffolding of a
  practical
   recommendation system that presumably would become an example
module in
   Mahout.
  
  
   On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com wrote:
  
+1 as well.  Sounds fun.
   
On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner 
 cont...@dhuebner.com
wrote:
   
 +1 for getting something like that in a future release of Mahout

 On Jul 19, 2013, at 10:02 PM, Sebastian Schelter
s...@apache.org
   wrote:

  It would be awesome if we could get a nice, easily deployable
  implementation of that approach into Mahout before 1.0
 
 
  2013/7/19 Ted Dunning ted.dunn...@gmail.com
 
  My current advice is to use Hadoop (if necessary) to build a
  sparse
  item-item matrix based on each kind of behavior you have and
 then
   drop
  those similarities into a search engine to deliver the actual
  recommendations.  This allows lots of flexibility in terms of
  which
 kinds
  of inputs you use for the recommendation and lets you blend
 recommendations
  with search and geo-location.
 
 
  On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
  helder.ga...@corp.terra.com.br wrote:
 
  Hi,
  I'm a dev working for a web portal in Brazil and I'm
 particularly
  interested in building a item-based collaborative filtering
recommender
  for our database of news articles.
  After some coding, I was able to get some recommendations
 using a
  GenericItemBasedRecommender, a CassandraDataModel and some
 custom
  classes that store item similarities and migrated item IDs
into
  Cassandra. But know I'm in doubt of what is normally done
with
  this
  recommender: Should I run this as a daemon, cache the
   recommendations
  into memory and set up a web service to consult it online?
  Should I
pre
  process these recommendations for each recent user and
store it
  somewhere? My first idea was storing all these recs back
into
 Cassandra,
  but looking into some classes it seems to me that the norm
is
 to
   read
  the input data and store the output always using files. Is
 this a
 common
  practice that benefits from HDFS?
  My use case here is something around 70k recommendations
 requests
   per
  second.
 
  Thanks in advance,
 
   

Re: Setting up a recommender

2013-07-20 Thread Ted Dunning
To kick this off, I have created a design document that is open for
comments.  Much detail is needed here.  I will create a JIRA as well, but
the google doc is much easier for collating lots of input into a coherent
document.

The directory that the document is stored in is accessible at

http://bit.ly/18vbbaT

Once we get going, we can talk about how to coordinate tasks between
hangouts.  One option is a public Trello project: https://trello.com/ or we
can use JIRA sub-tasks.


On Sat, Jul 20, 2013 at 11:25 AM, Andrew Psaltis 
andrew.psal...@webtrends.com wrote:

 I am very interested in collaborating on the off-line to Solr part. Just
 let me know how we want to get going.

 Thanks,
 Andrew





 On 7/19/13 4:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 OK.  I think the crux here is the off-line to Solr part so let's see who
 else pops up.
 
 Having a solr maven could be very helpful.
 
 
 On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo 
 lcguerreroc...@gmail.com wrote:
 
  I'm currently working for a portal that has a similar use case and I was
  thinking of implementing this in a similar way. I'm generating
  recommendations using python scripts based on similarity measures
 (content
  based recommendation) only using euclidean distance and some weights for
  each attribute. I want to use mahout's GenericItemBasedRecommender to
  generate these same recommendations without user data (no tracking right
  now of user to item relationship). I was thinking of pushing the
 generated
  recommendations to solr using atomic updates since my fields are all
 stored
  right now. Since this is very similar to what I'm trying to accomplish,
 I
  would sign up to collaborate in any way I can since I'm fairly familiar
  with solr and I'm starting to learn my way around mahout.
 
 
  On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org
  wrote:
 
   I would also be willing to provide guidance and advice for anyone
 taking
   this on, I can especially help with the offline analysis part.
  
   --sebastian
  
  
   2013/7/19 Ted Dunning ted.dunn...@gmail.com
  
I would be happy to supervise a project to implement a demo of this
 if
anybody is willing to do the grunt work of gluing things together.
   
Sooo, if you would like to work on this, here is a suggested
 project.
   
This project would entail:
   
a) build a synthetic data source
   
b) write scripts to do the off-line analysis
   
c) write scripts to export to Solr
   
d) write a very quick web facade over Solr to make it look like a
recommendation engine.  This would include
   
  d.1) a most popular page that does combined popularity rise and
recommendation
   
  d.2) a personal recommendation page that does just
 recommendation
   with
dithering
   
  d.3) item pages with related items at the bottom
   
e) work with others to provide high quality system walk-through and
   install
directions
   
If you want to bite on this, we should arrange a weekly video
 hangout.
   I
am willing to commit to guiding and providing detailed technical
approaches.  You should be willing to commit to actually doing
 stuff.
   
The goal would be to provide a fully worked out scaffolding of a
   practical
recommendation system that presumably would become an example
 module in
Mahout.
   
   
On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com
 wrote:
   
 +1 as well.  Sounds fun.

 On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner 
  cont...@dhuebner.com
 wrote:

  +1 for getting something like that in a future release of Mahout
 
  On Jul 19, 2013, at 10:02 PM, Sebastian Schelter
 s...@apache.org
wrote:
 
   It would be awesome if we could get a nice, easily deployable
   implementation of that approach into Mahout before 1.0
  
  
   2013/7/19 Ted Dunning ted.dunn...@gmail.com
  
   My current advice is to use Hadoop (if necessary) to build a
   sparse
   item-item matrix based on each kind of behavior you have and
  then
drop
   those similarities into a search engine to deliver the actual
   recommendations.  This allows lots of flexibility in terms of
   which
  kinds
   of inputs you use for the recommendation and lets you blend
  recommendations
   with search and geo-location.
  
  
   On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
   helder.ga...@corp.terra.com.br wrote:
  
   Hi,
   I'm a dev working for a web portal in Brazil and I'm
  particularly
   interested in building a item-based collaborative filtering
 recommender
   for our database of news articles.
   After some coding, I was able to get some recommendations
  using a
   GenericItemBasedRecommender, a CassandraDataModel and some
  custom
   classes that store item similarities and migrated item IDs

Re: Setting up a recommender

2013-07-19 Thread Ted Dunning
My current advice is to use Hadoop (if necessary) to build a sparse
item-item matrix based on each kind of behavior you have and then drop
those similarities into a search engine to deliver the actual
recommendations.  This allows lots of flexibility in terms of which kinds
of inputs you use for the recommendation and lets you blend recommendations
with search and geo-location.
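
A hedged illustration of that blending, using SolrJ: the recommendation part is a
query of the user's recent item ids against an indicator field, and search and geo
constraints ride along as ordinary query and filter clauses. The field names
("indicators", "text", "location") are assumptions, not a fixed schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class BlendedRecommendationQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

    SolrQuery q = new SolrQuery();
    // Recommendation: items whose indicators overlap the user's history,
    // optionally mixed with a free-text search term.
    q.setQuery("indicators:(item12 item90 item77) OR text:camera");
    // Geo blend: only items within 10 km of the user (standard Solr spatial filter).
    q.addFilterQuery("{!geofilt sfield=location pt=45.15,-93.85 d=10}");
    q.setRows(20);

    System.out.println(solr.query(q).getResults());
  }
}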


On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
helder.ga...@corp.terra.com.br wrote:

 Hi,
 I'm a dev working for a web portal in Brazil and I'm particularly
 interested in building a item-based collaborative filtering recommender
 for our database of news articles.
 After some coding, I was able to get some recommendations using a
 GenericItemBasedRecommender, a CassandraDataModel and some custom
 classes that store item similarities and migrated item IDs into
 Cassandra. But know I'm in doubt of what is normally done with this
 recommender: Should I run this as a daemon, cache the recommendations
 into memory and set up a web service to consult it online? Should I pre
 process these recommendations for each recent user and store it
 somewhere? My first idea was storing all these recs back into Cassandra,
 but looking into some classes it seems to me that the norm is to read
 the input data and store the output always using files. Is this a common
 practice that benefits from HDFS?
 My use case here is something around 70k recommendations requests per
 second.

 Thanks in advance,

 --

 Atenciosamente
 Helder Martins
 Arquitetura do Portal e Sistemas de Backend
 +55 (51) 3284-4475
 Terra





Re: Setting up a recommender

2013-07-19 Thread Sebastian Schelter
It would be awesome if we could get a nice, easily deployable
implementation of that approach into Mahout before 1.0


2013/7/19 Ted Dunning ted.dunn...@gmail.com

 My current advice is to use Hadoop (if necessary) to build a sparse
 item-item matrix based on each kind of behavior you have and then drop
 those similarities into a search engine to deliver the actual
 recommendations.  This allows lots of flexibility in terms of which kinds
 of inputs you use for the recommendation and lets you blend recommendations
 with search and geo-location.


 On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
 helder.ga...@corp.terra.com.br wrote:

  Hi,
  I'm a dev working for a web portal in Brazil and I'm particularly
  interested in building a item-based collaborative filtering recommender
  for our database of news articles.
  After some coding, I was able to get some recommendations using a
  GenericItemBasedRecommender, a CassandraDataModel and some custom
  classes that store item similarities and migrated item IDs into
  Cassandra. But know I'm in doubt of what is normally done with this
  recommender: Should I run this as a daemon, cache the recommendations
  into memory and set up a web service to consult it online? Should I pre
  process these recommendations for each recent user and store it
  somewhere? My first idea was storing all these recs back into Cassandra,
  but looking into some classes it seems to me that the norm is to read
  the input data and store the output always using files. Is this a common
  practice that benefits from HDFS?
  My use case here is something around 70k recommendations requests per
  second.
 
  Thanks in advance,
 
  --
 
  Atenciosamente
  Helder Martins
  Arquitetura do Portal e Sistemas de Backend
  +55 (51) 3284-4475
  Terra
 
 
 



Re: Setting up a recommender

2013-07-19 Thread B Lyon
+1 as well.  Sounds fun.

On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner cont...@dhuebner.comwrote:

 +1 for getting something like that in a future release of Mahout

 On Jul 19, 2013, at 10:02 PM, Sebastian Schelter s...@apache.org wrote:

  It would be awesome if we could get a nice, easily deployable
  implementation of that approach into Mahout before 1.0
 
 
  2013/7/19 Ted Dunning ted.dunn...@gmail.com
 
  My current advice is to use Hadoop (if necessary) to build a sparse
  item-item matrix based on each kind of behavior you have and then drop
  those similarities into a search engine to deliver the actual
  recommendations.  This allows lots of flexibility in terms of which
 kinds
  of inputs you use for the recommendation and lets you blend
 recommendations
  with search and geo-location.
 
 
  On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
  helder.ga...@corp.terra.com.br wrote:
 
  Hi,
  I'm a dev working for a web portal in Brazil and I'm particularly
  interested in building a item-based collaborative filtering recommender
  for our database of news articles.
  After some coding, I was able to get some recommendations using a
  GenericItemBasedRecommender, a CassandraDataModel and some custom
  classes that store item similarities and migrated item IDs into
  Cassandra. But know I'm in doubt of what is normally done with this
  recommender: Should I run this as a daemon, cache the recommendations
  into memory and set up a web service to consult it online? Should I pre
  process these recommendations for each recent user and store it
  somewhere? My first idea was storing all these recs back into
 Cassandra,
  but looking into some classes it seems to me that the norm is to read
  the input data and store the output always using files. Is this a
 common
  practice that benefits from HDFS?
  My use case here is something around 70k recommendations requests per
  second.
 
  Thanks in advance,
 
  --
 
  Atenciosamente
  Helder Martins
  Arquitetura do Portal e Sistemas de Backend
  +55 (51) 3284-4475
  Terra
 
 
 
 




-- 
BF Lyon
http://www.nowherenearithaca.com


Re: Setting up a recommender

2013-07-19 Thread Dmitriy Lyubimov
On Fri, Jul 19, 2013 at 12:59 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 My current advice is to use Hadoop (if necessary) to build a sparse
 item-item matrix based on each kind of behavior you have and then drop
 those similarities into a search engine

you mean like Lucene / Katta?


 to deliver the actual
 recommendations.  This allows lots of flexibility in terms of which kinds
 of inputs you use for the recommendation and lets you blend recommendations
 with search and geo-location.


 On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
 helder.ga...@corp.terra.com.br wrote:

  Hi,
  I'm a dev working for a web portal in Brazil and I'm particularly
  interested in building a item-based collaborative filtering recommender
  for our database of news articles.
  After some coding, I was able to get some recommendations using a
  GenericItemBasedRecommender, a CassandraDataModel and some custom
  classes that store item similarities and migrated item IDs into
  Cassandra. But know I'm in doubt of what is normally done with this
  recommender: Should I run this as a daemon, cache the recommendations
  into memory and set up a web service to consult it online? Should I pre
  process these recommendations for each recent user and store it
  somewhere? My first idea was storing all these recs back into Cassandra,
  but looking into some classes it seems to me that the norm is to read
  the input data and store the output always using files. Is this a common
  practice that benefits from HDFS?
  My use case here is something around 70k recommendations requests per
  second.
 
  Thanks in advance,
 
  --
 
  Atenciosamente
  Helder Martins
  Arquitetura do Portal e Sistemas de Backend
  +55 (51) 3284-4475
  Terra
 
 
 



Re: Setting up a recommender

2013-07-19 Thread Ted Dunning
I would be happy to supervise a project to implement a demo of this if
anybody is willing to do the grunt work of gluing things together.

Sooo, if you would like to work on this, here is a suggested project.

This project would entail:

a) build a synthetic data source

b) write scripts to do the off-line analysis

c) write scripts to export to Solr

d) write a very quick web facade over Solr to make it look like a
recommendation engine.  This would include

  d.1) a most popular page that does combined popularity rise and
recommendation

  d.2) a personal recommendation page that does just recommendation with
dithering

  d.3) item pages with related items at the bottom

e) work with others to provide high quality system walk-through and install
directions

If you want to bite on this, we should arrange a weekly video hangout.  I
am willing to commit to guiding and providing detailed technical
approaches.  You should be willing to commit to actually doing stuff.

The goal would be to provide a fully worked out scaffolding of a practical
recommendation system that presumably would become an example module in
Mahout.
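
The dithering in d.2 isn't spelled out here; one common, simple recipe (assumed
for illustration, not prescribed by this thread) is to re-sort the top-N results by
log(rank) plus a little Gaussian noise, so the page order varies between views while
the best results still tend to stay near the top:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class Dithering {
  public static <T> List<T> dither(List<T> ranked, double epsilon, Random rnd) {
    final double[] keys = new double[ranked.size()];
    List<Integer> order = new ArrayList<Integer>();
    for (int i = 0; i < ranked.size(); i++) {
      keys[i] = Math.log(i + 1) + epsilon * rnd.nextGaussian();  // noisy pseudo-rank
      order.add(i);
    }
    // Sort positions by their noisy rank; small epsilon = mild shuffling near the top.
    Collections.sort(order, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) { return Double.compare(keys[a], keys[b]); }
    });
    List<T> result = new ArrayList<T>();
    for (int idx : order) {
      result.add(ranked.get(idx));
    }
    return result;
  }
}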


On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com wrote:

 +1 as well.  Sounds fun.

 On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner cont...@dhuebner.com
 wrote:

  +1 for getting something like that in a future release of Mahout
 
  On Jul 19, 2013, at 10:02 PM, Sebastian Schelter s...@apache.org wrote:
 
   It would be awesome if we could get a nice, easily deployable
   implementation of that approach into Mahout before 1.0
  
  
   2013/7/19 Ted Dunning ted.dunn...@gmail.com
  
   My current advice is to use Hadoop (if necessary) to build a sparse
   item-item matrix based on each kind of behavior you have and then drop
   those similarities into a search engine to deliver the actual
   recommendations.  This allows lots of flexibility in terms of which
  kinds
   of inputs you use for the recommendation and lets you blend
  recommendations
   with search and geo-location.
  
  
   On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
   helder.ga...@corp.terra.com.br wrote:
  
   Hi,
   I'm a dev working for a web portal in Brazil and I'm particularly
   interested in building a item-based collaborative filtering
 recommender
   for our database of news articles.
   After some coding, I was able to get some recommendations using a
   GenericItemBasedRecommender, a CassandraDataModel and some custom
   classes that store item similarities and migrated item IDs into
   Cassandra. But know I'm in doubt of what is normally done with this
   recommender: Should I run this as a daemon, cache the recommendations
   into memory and set up a web service to consult it online? Should I
 pre
   process these recommendations for each recent user and store it
   somewhere? My first idea was storing all these recs back into
  Cassandra,
   but looking into some classes it seems to me that the norm is to read
   the input data and store the output always using files. Is this a
  common
   practice that benefits from HDFS?
   My use case here is something around 70k recommendations requests per
   second.
  
   Thanks in advance,
  
   --
  
   Atenciosamente
   Helder Martins
   Arquitetura do Portal e Sistemas de Backend
   +55 (51) 3284-4475
   Terra
  
  
  
  
 
 


 --
 BF Lyon
 

Re: Setting up a recommender

2013-07-19 Thread Sebastian Schelter
I would also be willing to provide guidance and advice for anyone taking
this on, I can especially help with the offline analysis part.

--sebastian


2013/7/19 Ted Dunning ted.dunn...@gmail.com

 I would be happy to supervise a project to implement a demo of this if
 anybody is willing to do the grunt work of gluing things together.

 Sooo, if you would like to work on this, here is a suggested project.

 This project would entail:

 a) build a synthetic data source

 b) write scripts to do the off-line analysis

 c) write scripts to export to Solr

 d) write a very quick web facade over Solr to make it look like a
 recommendation engine.  This would include

   d.1) a most popular page that does combined popularity rise and
 recommendation

   d.2) a personal recommendation page that does just recommendation with
 dithering

   d.3) item pages with related items at the bottom

 e) work with others to provide high quality system walk-through and install
 directions

 If you want to bite on this, we should arrange a weekly video hangout.  I
 am willing to commit to guiding and providing detailed technical
 approaches.  You should be willing to commit to actually doing stuff.

 The goal would be to provide a fully worked out scaffolding of a practical
 recommendation system that presumably would become an example module in
 Mahout.


 On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com wrote:

  +1 as well.  Sounds fun.
 
  On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner cont...@dhuebner.com
  wrote:
 
   +1 for getting something like that in a future release of Mahout
  
   On Jul 19, 2013, at 10:02 PM, Sebastian Schelter s...@apache.org
 wrote:
  
It would be awesome if we could get a nice, easily deployable
implementation of that approach into Mahout before 1.0
   
   
2013/7/19 Ted Dunning ted.dunn...@gmail.com
   
My current advice is to use Hadoop (if necessary) to build a sparse
item-item matrix based on each kind of behavior you have and then
 drop
those similarities into a search engine to deliver the actual
recommendations.  This allows lots of flexibility in terms of which
   kinds
of inputs you use for the recommendation and lets you blend
   recommendations
with search and geo-location.
   
   
On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
helder.ga...@corp.terra.com.br wrote:
   
Hi,
I'm a dev working for a web portal in Brazil and I'm particularly
interested in building a item-based collaborative filtering
  recommender
for our database of news articles.
After some coding, I was able to get some recommendations using a
GenericItemBasedRecommender, a CassandraDataModel and some custom
classes that store item similarities and migrated item IDs into
Cassandra. But know I'm in doubt of what is normally done with this
recommender: Should I run this as a daemon, cache the
 recommendations
into memory and set up a web service to consult it online? Should I
  pre
process these recommendations for each recent user and store it
somewhere? My first idea was storing all these recs back into
   Cassandra,
but looking into some classes it seems to me that the norm is to
 read
the input data and store the output always using files. Is this a
   common
practice that benefits from HDFS?
My use case here is something around 70k recommendations requests
 per
second.
   
Thanks in advance,
   
--
   
Atenciosamente
Helder Martins
Arquitetura do Portal e Sistemas de Backend
+55 (51) 3284-4475
Terra
   
   
  

Re: Setting up a recommender

2013-07-19 Thread Ted Dunning
OK.  I think the crux here is the offline-to-Solr part, so let's see who
else pops up.

Having a Solr maven could be very helpful.


On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo 
lcguerreroc...@gmail.com wrote:

 I'm currently working for a portal that has a similar use case and I was
 thinking of implementing this in a similar way. I'm generating
 recommendations with Python scripts based on similarity measures
 (content-based recommendation), using only Euclidean distance and some
 weights for each attribute. I want to use Mahout's
 GenericItemBasedRecommender to generate these same recommendations
 without user data (we don't track the user-to-item relationship right
 now). I was thinking of pushing the generated recommendations to Solr
 with atomic updates, since all my fields are stored right now. Since this
 is very similar to what I'm trying to accomplish, I would sign up to
 collaborate in any way I can; I'm fairly familiar with Solr and I'm
 starting to learn my way around Mahout.
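
 For the atomic-update step, a minimal SolrJ sketch (the "recommended_items"
 field and the document IDs are placeholders made up for illustration)
 could look like this:

     import java.util.Collections;

     import org.apache.solr.client.solrj.impl.HttpSolrServer;
     import org.apache.solr.common.SolrInputDocument;

     public class AtomicUpdateExample {
       public static void main(String[] args) throws Exception {
         HttpSolrServer solr =
             new HttpSolrServer("http://localhost:8983/solr/items");

         // Atomic update: only the recommendation field is replaced
         // ("set"); Solr rebuilds the rest of the document from its
         // stored fields, which is why all fields need to be stored.
         SolrInputDocument doc = new SolrInputDocument();
         doc.addField("id", "item-123");
         doc.addField("recommended_items",
             Collections.singletonMap("set", "item-456 item-789 item-42"));

         solr.add(doc);
         solr.commit();
       }
     }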


 On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org
 wrote:

  I would also be willing to provide guidance and advice for anyone taking
  this on; I can especially help with the offline analysis part.
 
  --sebastian
 
 
  2013/7/19 Ted Dunning ted.dunn...@gmail.com
 
    I would be happy to supervise a project to implement a demo of this if
    anybody is willing to do the grunt work of gluing things together.

    Sooo, if you would like to work on this, here is a suggested project.

    This project would entail:

    a) build a synthetic data source (a rough sketch of this step follows
    below)

    b) write scripts to do the off-line analysis

    c) write scripts to export to Solr

    d) write a very quick web facade over Solr to make it look like a
    recommendation engine.  This would include

      d.1) a most-popular page that combines popularity rise and
      recommendation

      d.2) a personal recommendation page that does just recommendation
      with dithering

      d.3) item pages with related items at the bottom

    e) work with others to provide a high quality system walk-through and
    install directions

    If you want to bite on this, we should arrange a weekly video hangout.
    I am willing to commit to guiding and providing detailed technical
    approaches.  You should be willing to commit to actually doing stuff.

    The goal would be to provide a fully worked out scaffolding of a
    practical recommendation system that presumably would become an
    example module in Mahout.
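
    For step (a), a bare-bones synthetic data source could be as small as
    the following (the CSV layout and the popularity skew are arbitrary
    choices for illustration, not something Mahout prescribes):

        import java.util.Random;

        public class SyntheticLogGenerator {
          public static void main(String[] args) {
            Random random = new Random(42);
            int numUsers = 1000;
            int numItems = 200;

            // Emit "userId,itemId,action" lines. Squaring a uniform draw
            // skews item popularity so a few items dominate, roughly as
            // in real interaction logs.
            for (int i = 0; i < 10000; i++) {
              int user = random.nextInt(numUsers);
              double u = random.nextDouble();
              int item = (int) (u * u * numItems);
              String action = random.nextDouble() < 0.1 ? "purchase" : "view";
              System.out.println("u" + user + ",item" + item + "," + action);
            }
          }
        }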
  
  
