Re: Setting up a recommender

2014-04-21 Thread Frank Scholten
Pat and Ted: I am late to the party but this is very interesting!

I am not sure I understand all the steps, though. Do you still create a
cooccurrence matrix and compute LLR scores during this process or do you
only compute matrix multiplication times the history vector: B'B * h and
B'A * h?

Cheers,

Frank


On Tue, Aug 13, 2013 at 7:49 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 I finally got some time to work on this and have a first cut at output to
 Solr working on the github repo. It only works on 2-action input but I'll
 have that cleaned up soon so it will work with one action. Solr indexing
 has not been tested yet and the field names and/or types may need tweaking.

 It takes the result of the previous drop:
 1) DRMs for B (user history or B items action1) and A (user history of A
 items action2)
 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence

 There are two final outputs created using mapreduce but requiring 2
 in-memory hashmaps. I think this will work on a cluster (the hashmaps are
 instantiated on each node) but haven't tried yet. It orders items in #2
 fields by strength of link, which is the similarity value used in [B'B]
 or [B'A]. It would be nice to order #1 by recency but there is no provision
 for passing through timestamps at present so they are ordered by the
 strength of preference. This is probably not useful and so can be ignored.
 Ordering by recency might be useful for truncating queries by recency while
 leaving the training data containing 100% of available history.

 1) It joins #1 DRMs to produce a single set of docs in CSV form, which
 looks like this:
 id,history_b,history_a
 user1,iphone ipad,iphone ipad galaxy
 ...

 2) it joins #2 DRMs to produce a single set of docs in CSV form, which
 looks like this:
 id,b_b_links,b_a_links
 u1,iphone ipad,iphone ipad galaxy
 …

 It may work on a cluster, I haven't tried yet. As soon as someone has some
 large-ish sample log files I'll give them a try. Check the sample input
 files in the resources dir for format.

 https://github.com/pferrel/solr-recommender


 On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:

 When I started looking at this I was a bit skeptical. As a Search engine
 Solr may be peerless, but as yet another NoSQL db?

 However getting further into this I see one very large benefit. It has one
 feature that sets it completely apart from the typical NoSQL db. The type
 of queries you do return fuzzy results--in the very best sense of that
 word. The most interesting queries are based on similarity to some
 exemplar. Results are returned in order of similarity strength, not ordered
 by a sort field.

 Wherever similarity based queries are important I'll look at Solr first.
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's
 probably at least an alternative to using docs and CSVs to import the data
 from Mahout.



 On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Yes.  That would be interesting.




 On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:

  A little digression: Might a Matrix implementation backed by a Solr index
  and uses SolrJ for querying help at all for the Solr recommendation
  approach?
 
  It supports multiple fields of String, Text, or boolean flags.
 
  Best
  Gokhan
 
 
  On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
  Also a question about user history.
 
  I was planning to write these into separate directories so Solr could
  fetch them from different sources but it occurs to me that it would be
  better to join A and B by user ID and output a doc per user ID with
 three
  fields, id, A item history, and B item history. Other fields could be
  added
  for users metadata.
 
  Sound correct? This is what I'll do unless someone stops me.
 
  On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  Once you have a sample or example of what you think the
  log file version will look like, can you post it? It would be great to
  have example lines for two actions with or without the same item IDs.
  I'll
  make sure we can digest it.
 
  I thought more about the ingest part and I don't think the
 one-item-space
  is actually a problem. It just means one item dictionary. A and B will
  have
  the right content, all I have to do is make sure the right ranks are
  input
  to the MM,
  Transpose, and RSJ. This in turn is only one extra count of the # of
  items
  in A's item space. This should be a very easy change If my thinking is
  correct.
 
 
  On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com
 wrote:
 
  4) To add more metadata to the Solr output will be left to the consumer
  for now. If there is a good data set to use we can illustrate how to do
  it
  in the project. Ted may have some data for this from musicbrainz.
 
 
  I am working on this issue now.
 
  The 

Re: Setting up a recommender

2014-04-21 Thread Pat Ferrel
Yes, the cooccurrence item similarity matrix is calculated with LLR using 
Mahout’s RowSimilarityJob. I guess we are calling this an indicator matrix 
these days. 
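
For anyone who wants to see what “using LLR” means concretely, here is a minimal, 
self-contained sketch of the log-likelihood ratio score computed from the 2x2 
cooccurrence counts; Mahout ships essentially the same math in 
org.apache.mahout.math.stats.LogLikelihood.

// k11 = users who did both items, k12/k21 = users who did only one of them,
// k22 = users who did neither. A large LLR means the cooccurrence is unlikely
// to be chance, which is what makes an item a useful "indicator".
public final class Llr {
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }
  // Unnormalized Shannon entropy over raw counts.
  private static double entropy(long... counts) {
    long sum = 0;
    double result = 0.0;
    for (long c : counts) {
      result += xLogX(c);
      sum += c;
    }
    return xLogX(sum) - result;
  }
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Clamp tiny negative values caused by floating point round-off.
    return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy));
  }
  public static void main(String[] args) {
    // e.g. 13 users did both, 1000 did only one or the other, 100000 did neither
    System.out.println(logLikelihoodRatio(13, 1000, 1000, 100000));
  }
}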

The indicator matrix is then translated from a SequenceFile into a CSV (or 
other text delimited file) which looks like a list of itemIDs—tokens or terms 
in Solr parlance—for each item. These documents are indexed by Solr and the 
query is the user history.

[B’B] is pre-calculated by RowSimilarityJob in Mahout. The user history is 
“multiplied” by the indicator matrix by using the history as the Solr query 
against the indexed indicator matrix, which actually produces a cosine-similarity 
ranked list of items.

You have to squint a little to see the math. Any matrix product can be 
substituted with a row-to-column similarity metric assuming the dimensionality 
is correct, so the product in all the equations should be interpreted as such. 
Getting recs for a user, [B’B]h, is therefore done in two phases: one calculates 
[B’B] and the other is a Solr query that adds the ‘h’ to the equation.
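
To make that second phase concrete, here is a minimal SolrJ 4.x-style sketch; 
the core name and the b_b_links field name are hypothetical and would have to 
match however the indicator docs were actually indexed.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class QueryRecs {
  public static void main(String[] args) throws Exception {
    // One doc per item, with its [B'B] indicator row in the b_b_links field.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
    // h = the user's history of item IDs, used directly as query terms.
    String h = "iphone ipad";
    SolrQuery q = new SolrQuery("b_b_links:(" + h + ")");
    q.setFields("id", "score");
    q.setRows(10); // top 10 recs, ranked by Solr's similarity score
    QueryResponse rsp = solr.query(q);
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("id") + "\t" + doc.getFieldValue("score"));
    }
  }
}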

In this project https://github.com/pferrel/solr-recommender both [B’B] and 
[A’B] are calculated; the latter uses an actual matrix multiply, since we did not 
have a cross-RSJ at the time. Now that we have cross-cooccurrence in the 
Spark Scala Mahout 2 stuff I’ll rewrite the code to use it.

The cross indicator matrix allows you to use two different actions to predict a 
target action. So, for example, views that are similar to purchases can be used 
to recommend purchases. Take a look at the readme on github; it has a quick 
review of the theory.
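
As a sketch with hypothetical field names (matching the CSV examples quoted 
below), the cross-recommendation case only changes the query: the view history 
goes against the cross-indicator field, and both clauses can be combined using 
Solr's default OR.

import org.apache.solr.client.solrj.SolrQuery;

public class CrossQueries {
  public static void main(String[] args) {
    String purchaseHistory = "iphone ipad";    // user's purchase (B-action) history
    String viewHistory = "iphone ipad galaxy"; // user's view (A-action) history
    // Views query the cross field to recommend purchases...
    SolrQuery crossRecs = new SolrQuery("b_a_links:(" + viewHistory + ")");
    // ...or blend both actions in one query; Solr ORs the clauses and ranks by score.
    SolrQuery blendedRecs = new SolrQuery(
        "b_b_links:(" + purchaseHistory + ") b_a_links:(" + viewHistory + ")");
    System.out.println(crossRecs.getQuery());
    System.out.println(blendedRecs.getQuery());
  }
}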

BTW there is a video recommender site that demos some interesting uses of Solr 
to blend collaborative filtering recs with metadata. It even makes recs based 
off of your most recent detail views on the site. That last doesn’t work all 
that well because it is really a cross-recommendation and that isn’t built into 
the site yet. https://guide.finderbots.com


On Apr 21, 2014, at 12:11 PM, Frank Scholten fr...@frankscholten.nl wrote:

Pat and Ted: I am late to the party but this is very interesting!

I am not sure I understand all the steps, though. Do you still create a
cooccurrence matrix and compute LLR scores during this process or do you
only compute matrix multiplication times the history vector: B'B * h and
B'A * h?

Cheers,

Frank


On Tue, Aug 13, 2013 at 7:49 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 I finally got some time to work on this and have a first cut at output to
 Solr working on the github repo. It only works on 2-action input but I'll
 have that cleaned up soon so it will work with one action. Solr indexing
 has not been tested yet and the field names and/or types may need tweaking.
 
 It takes the result of the previous drop:
 1) DRMs for B (user history or B items action1) and A (user history of A
 items action2)
 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
 There are two final outputs created using mapreduce but requiring 2
 in-memory hashmaps. I think this will work on a cluster (the hashmaps are
 instantiated on each node) but haven't tried yet. It orders items in #2
 fields by strength of link, which is the similarity value used in [B'B]
 or [B'A]. It would be nice to order #1 by recency but there is no provision
 for passing through timestamps at present so they are ordered by the
 strength of preference. This is probably not useful and so can be ignored.
 Ordering by recency might be useful for truncating queries by recency while
 leaving the training data containing 100% of available history.
 
 1) It joins #1 DRMs to produce a single set of docs in CSV form, which
 looks like this:
 id,history_b,history_a
 user1,iphone ipad,iphone ipad galaxy
 ...
 
 2) it joins #2 DRMs to produce a single set of docs in CSV form, which
 looks like this:
 id,b_b_links,b_a_links
 u1,iphone ipad,iphone ipad galaxy
 …
 
 It may work on a cluster, I haven't tried yet. As soon as someone has some
 large-ish sample log files I'll give them a try. Check the sample input
 files in the resources dir for format.
 
 https://github.com/pferrel/solr-recommender
 
 
 On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 When I started looking at this I was a bit skeptical. As a Search engine
 Solr may be peerless, but as yet another NoSQL db?
 
 However getting further into this I see one very large benefit. It has one
 feature that sets it completely apart from the typical NoSQL db. The type
 of queries you do return fuzzy results--in the very best sense of that
 word. The most interesting queries are based on similarity to some
 exemplar. Results are returned in order of similarity strength, not ordered
 by a sort field.
 
 Wherever similarity based queries are important I'll look at Solr first.
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's
 probably at least an alternative to using docs and CSVs to import the data
 from Mahout.
 
 
 
 On Aug 12, 2013, at 2:32 PM, Ted Dunning 

Re: Setting up a recommender

2014-04-21 Thread Ted Dunning
RowSimilarityJob is the guts of the work, but ItemSimilarityJob is usually
easier packaging for users.
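
For anyone following along, here is a minimal sketch of driving ItemSimilarityJob 
programmatically with LLR; the paths are placeholders and the flags are the 
standard Mahout 0.8/0.9-era options.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

public class RunItemSimilarity {
  public static void main(String[] args) throws Exception {
    // Input: userID,itemID[,pref] lines; output: itemA,itemB,LLR-score triples.
    ToolRunner.run(new Configuration(), new ItemSimilarityJob(), new String[] {
        "--input", "/tmp/prefs",                // placeholder HDFS path
        "--output", "/tmp/item-similarity",     // placeholder HDFS path
        "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
        "--booleanData", "true",                // ignore preference values
        "--maxSimilaritiesPerItem", "50"
    });
  }
}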




On Mon, Apr 21, 2014 at 1:00 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Yes the cooccurrence item similarity matrix is calculated using LLR using
 Mahout’s RowSimilarityJob. I guess we are calling this and indicator matrix
 these days.

 The indicator matrix is then translated from a SequenceFile into a CSV (or
 other text delimited file) which looks like a list of itemIDs—tokens or
 terms in Solr parlance—for each item. These documents are indexed by Solr
 and the query is the user history.

 [B’B] is pre-calculated by RowSimilarityJob in Mahout. The user history is
 “multiplied” by the indicator matrix by using it as the Solr query against
 the indicator matrix, actually producing a cosine similarity ranked list of
 items.

 You have to squint a little to see the math. Any matrix product can be
 substituted with a row to column similarity metric assuming dimensionality
 is correct. So the product in all the equations should be interpreted as
 such. So to get recs for a user [B’B]h is done in two phases, one
 calculates [B’B] and one is a Solr query that adds the ‘h’ to the equation.

 In this project https://github.com/pferrel/solr-recommender both [B’B]
 and [A’B] are calculated, the later uses actual matrix multiply, since we
 did not have a cross-RSJ at the time. Now that we have a cross cooccurrence
 in the Spark Scala Mahout 2 stuff I’ll rewrite the code to use it.

 The cross indicator matrix allows you to use two different actions to
 predict a target action. So for example views that are similar to purchases
 can be used to recommend purchases. Take a look at the readme on github it
 has a quick review of the theory.

 BTW there is a video recommender site that demos some interesting uses of
 Solr to blend collaborative filtering recs with metadata. It even makes
 recs based of of your most recent detail views on the site. That last
 doesn’t work all that well because it is really a cross recommendation and
 that isn’t built into the site yet. https://guide.finderbots.com


 On Apr 21, 2014, at 12:11 PM, Frank Scholten fr...@frankscholten.nl
 wrote:

 Pat and Ted: I am late to the party but this is very interesting!

 I am not sure I understand all the steps, though. Do you still create a
 cooccurrence matrix and compute LLR scores during this process or do you
 only compute matrix multiplication times the history vector: B'B * h and
 B'A * h?

 Cheers,

 Frank


 On Tue, Aug 13, 2013 at 7:49 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  I finally got some time to work on this and have a first cut at output to
  Solr working on the github repo. It only works on 2-action input but I'll
  have that cleaned up soon so it will work with one action. Solr indexing
  has not been tested yet and the field names and/or types may need
 tweaking.
 
  It takes the result of the previous drop:
  1) DRMs for B (user history or B items action1) and A (user history of A
  items action2)
  2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
  There are two final outputs created using mapreduce but requiring 2
  in-memory hashmaps. I think this will work on a cluster (the hashmaps are
  instantiated on each node) but haven't tried yet. It orders items in #2
  fields by strength of link, which is the similarity value used in [B'B]
  or [B'A]. It would be nice to order #1 by recency but there is no
 provision
  for passing through timestamps at present so they are ordered by the
  strength of preference. This is probably not useful and so can be
 ignored.
  Ordering by recency might be useful for truncating queries by recency
 while
  leaving the training data containing 100% of available history.
 
  1) It joins #1 DRMs to produce a single set of docs in CSV form, which
  looks like this:
  id,history_b,history_a
  user1,iphone ipad,iphone ipad galaxy
  ...
 
  2) it joins #2 DRMs to produce a single set of docs in CSV form, which
  looks like this:
  id,b_b_links,b_a_links
  u1,iphone ipad,iphone ipad galaxy
  …
 
  It may work on a cluster, I haven't tried yet. As soon as someone has
 some
  large-ish sample log files I'll give them a try. Check the sample input
  files in the resources dir for format.
 
  https://github.com/pferrel/solr-recommender
 
 
  On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  When I started looking at this I was a bit skeptical. As a Search engine
  Solr may be peerless, but as yet another NoSQL db?
 
  However getting further into this I see one very large benefit. It has
 one
  feature that sets it completely apart from the typical NoSQL db. The type
  of queries you do return fuzzy results--in the very best sense of that
  word. The most interesting queries are based on similarity to some
  exemplar. Results are returned in order of similarity strength, not
 ordered
  by a sort field.
 
  Wherever similarity based queries are important 

Re: Setting up a recommender

2013-08-19 Thread Pat Ferrel
There are three things I could work on in my free time:

1) test this on a bigger data set gathered from rotten tomatoes, which only has 
B data (movie thumbs up) 
2) begin work on the Solr query and service integration, rather than the 
current loose LucidWorks Search integration.
3) make sure everything is set up for different item spaces in B and A.

Planning to tackle them in this order, unless someone speaks up.

 
On Aug 16, 2013, at 1:39 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Works on a cluster but have only tested on the trivial test data set. 

On Aug 13, 2013, at 4:49 PM, Pat Ferrel p...@occamsmachete.com wrote:

OK single action recs are working so output to Solr with only [B'B] and B.

On Aug 13, 2013, at 10:52 AM, Pat Ferrel pat.fer...@gmail.com wrote:

Corrections inline

 On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 I finally got some time to work on this and have a first cut at output to 
 Solr working on the github repo. It only works on 2-action input but I'll 
 have that cleaned up soon so it will work with one action. Solr indexing has 
 not been tested yet and the field names and/or types may need tweaking. 
 
 It takes the result of the previous drop:
 1) DRMs for B (user history or B items action1) and A (user history of A 
 items action2)
 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
 There are two final outputs created using mapreduce but requiring 2 in-memory 
 hashmaps. I think this will work on a cluster (the hashmaps are instantiated 
 on each node) but haven't tried yet. It orders items in #2 fields by strength 
 of link, which is the similarity value used in [B'B] or [B'A]. It would be 
 nice to order #1 by recency but there is no provision for passing through 
 timestamps at present so they are ordered by the strength of preference. This 
 is probably not useful and so can be ignored. Ordering by recency might be 
 useful for truncating queries by recency while leaving the training data 
 containing 100% of available history.
 
 1) It joins #1 DRMs to produce a single set of docs in CSV form, which looks 
 like this:
 id,history_b,history_a
u1,iphone ipad,iphone ipad galaxy
 ...
 
 2) it joins #2 DRMs to produce a single set of docs in CSV form, which looks 
 like this:
 id,b_b_links,b_a_links
iphone,iphone ipad,iphone ipad galaxy
 …
 
 It may work on a cluster, I haven't tried yet. As soon as someone has some 
 large-ish sample log files I'll give them a try. Check the sample input files 
 in the resources dir for format.
 
 https://github.com/pferrel/solr-recommender
 
 
 On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 When I started looking at this I was a bit skeptical. As a Search engine Solr 
 may be peerless, but as yet another NoSQL db?
 
 However getting further into this I see one very large benefit. It has one 
 feature that sets it completely apart from the typical NoSQL db. The type of 
 queries you do return fuzzy results--in the very best sense of that word. The 
 most interesting queries are based on similarity to some exemplar. Results 
 are returned in order of similarity strength, not ordered by a sort field.
 
 Wherever similarity based queries are important I'll look at Solr first. 
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's 
 probably at least an alternative to using docs and CSVs to import the data 
 from Mahout.
 
 
 
 On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Yes.  That would be interesting.
 
 
 
 
 On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
 A little digression: Might a Matrix implementation backed by a Solr index
 and uses SolrJ for querying help at all for the Solr recommendation
 approach?
 
 It supports multiple fields of String, Text, or boolean flags.
 
 Best
 Gokhan
 
 
 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Also a question about user history.
 
 I was planning to write these into separate directories so Solr could
 fetch them from different sources but it occurs to me that it would be
 better to join A and B by user ID and output a doc per user ID with three
 fields, id, A item history, and B item history. Other fields could be
 added
 for users metadata.
 
 Sound correct? This is what I'll do unless someone stops me.
 
 On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Once you have a sample or example of what you think the
 log file version will look like, can you post it? It would be great to
 have example lines for two actions with or without the same item IDs.
 I'll
 make sure we can digest it.
 
 I thought more about the ingest part and I don't think the one-item-space
 is actually a problem. It just means one item dictionary. A and B will
 have
 the right content, all I have to do is make sure the right ranks are
 input
 to the MM,
 Transpose, and RSJ. This in turn is only one extra count of the # of
 items
 in A's item 

Re: Setting up a recommender

2013-08-19 Thread Ted Dunning
Pat,

That really sounds great.

I should find some time (who needs sleep) to generate music logs for you as
well.


On Mon, Aug 19, 2013 at 8:31 AM, Pat Ferrel p...@occamsmachete.com wrote:

 There are three things I could work on my free time:

 1) test this on a bigger data set gathered from rotten tomatoes, which
 only has B data (movie thumbs up)
 2) begin work on the Solr query and service integration, rather than the
 current loose LucidWorks Search integration.
 3) make sure everything is set up for different item spaces in B and A.

 Planning to tackle in this order, unless someone speaks up.


 On Aug 16, 2013, at 1:39 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Works on a cluster but have only tested on the trivial test data set.

 On Aug 13, 2013, at 4:49 PM, Pat Ferrel p...@occamsmachete.com wrote:

 OK single action recs are working so output to Solr with only [B'B] and B.

 On Aug 13, 2013, at 10:52 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 Corrections inline

  On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
  I finally got some time to work on this and have a first cut at output
 to Solr working on the github repo. It only works on 2-action input but
 I'll have that cleaned up soon so it will work with one action. Solr
 indexing has not been tested yet and the field names and/or types may need
 tweaking.
 
  It takes the result of the previous drop:
  1) DRMs for B (user history or B items action1) and A (user history of A
 items action2)
  2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
  There are two final outputs created using mapreduce but requiring 2
 in-memory hashmaps. I think this will work on a cluster (the hashmaps are
 instantiated on each node) but haven't tried yet. It orders items in #2
 fields by strength of link, which is the similarity value used in [B'B]
 or [B'A]. It would be nice to order #1 by recency but there is no provision
 for passing through timestamps at present so they are ordered by the
 strength of preference. This is probably not useful and so can be ignored.
 Ordering by recency might be useful for truncating queries by recency while
 leaving the training data containing 100% of available history.
 
  1) It joins #1 DRMs to produce a single set of docs in CSV form, which
 looks like this:
  id,history_b,history_a
 u1,iphone ipad,iphone ipad galaxy
  ...
 
  2) it joins #2 DRMs to produce a single set of docs in CSV form, which
 looks like this:
  id,b_b_links,b_a_links
 iphone,iphone ipad,iphone ipad galaxy
  …
 
  It may work on a cluster, I haven't tried yet. As soon as someone has
 some large-ish sample log files I'll give them a try. Check the sample
 input files in the resources dir for format.
 
  https://github.com/pferrel/solr-recommender
 
 
  On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  When I started looking at this I was a bit skeptical. As a Search engine
 Solr may be peerless, but as yet another NoSQL db?
 
  However getting further into this I see one very large benefit. It has
 one feature that sets it completely apart from the typical NoSQL db. The
 type of queries you do return fuzzy results--in the very best sense of that
 word. The most interesting queries are based on similarity to some
 exemplar. Results are returned in order of similarity strength, not ordered
 by a sort field.
 
  Wherever similarity based queries are important I'll look at Solr first.
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's
 probably at least an alternative to using docs and CSVs to import the data
 from Mahout.
 
 
 
  On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  Yes.  That would be interesting.
 
 
 
 
  On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
  A little digression: Might a Matrix implementation backed by a Solr
 index
  and uses SolrJ for querying help at all for the Solr recommendation
  approach?
 
  It supports multiple fields of String, Text, or boolean flags.
 
  Best
  Gokhan
 
 
  On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com
 wrote:
 
  Also a question about user history.
 
  I was planning to write these into separate directories so Solr could
  fetch them from different sources but it occurs to me that it would be
  better to join A and B by user ID and output a doc per user ID with
 three
  fields, id, A item history, and B item history. Other fields could be
  added
  for users metadata.
 
  Sound correct? This is what I'll do unless someone stops me.
 
  On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  Once you have a sample or example of what you think the
  log file version will look like, can you post it? It would be great
 to
  have example lines for two actions with or without the same item IDs.
  I'll
  make sure we can digest it.
 
  I thought more about the ingest part and I don't think the
 one-item-space
  is actually a problem. It just 

Re: Setting up a recommender

2013-08-13 Thread Pat Ferrel
When I started looking at this I was a bit skeptical. As a search engine Solr 
may be peerless, but as yet another NoSQL db?

However, getting further into this I see one very large benefit. It has one 
feature that sets it completely apart from the typical NoSQL db. The type of 
query you run returns fuzzy results--in the very best sense of that word. The 
most interesting queries are based on similarity to some exemplar. Results are 
returned in order of similarity strength, not ordered by a sort field.

Wherever similarity-based queries are important I'll look at Solr first. SolrJ 
looks like an interesting way to get Solr queries on POJOs. It's probably at 
least an alternative to using docs and CSVs to import the data from Mahout.



On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Yes.  That would be interesting.




On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:

 A little digression: Might a Matrix implementation backed by a Solr index
 and uses SolrJ for querying help at all for the Solr recommendation
 approach?
 
 It supports multiple fields of String, Text, or boolean flags.
 
 Best
 Gokhan
 
 
 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Also a question about user history.
 
 I was planning to write these into separate directories so Solr could
 fetch them from different sources but it occurs to me that it would be
 better to join A and B by user ID and output a doc per user ID with three
 fields, id, A item history, and B item history. Other fields could be
 added
 for users metadata.
 
 Sound correct? This is what I'll do unless someone stops me.
 
 On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Once you have a sample or example of what you think the
 log file version will look like, can you post it? It would be great to
 have example lines for two actions with or without the same item IDs.
 I'll
 make sure we can digest it.
 
 I thought more about the ingest part and I don't think the one-item-space
 is actually a problem. It just means one item dictionary. A and B will
 have
 the right content, all I have to do is make sure the right ranks are
 input
 to the MM,
 Transpose, and RSJ. This in turn is only one extra count of the # of
 items
 in A's item space. This should be a very easy change If my thinking is
 correct.
 
 
 On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do
 it
 in the project. Ted may have some data for this from musicbrainz.
 
 
 I am working on this issue now.
 
 The current state is that I can bring in a bunch of track names and links
 to artist names and so on.  This would provide the basic set of items
 (artists, genres, tracks and tags).
 
 There is a hitch in bringing in the data needed to generate the logs
 since
 that part of MB is not Apache compatible.  I am working on that issue.
 
 Technically, the data is in a massively normalized relational form right
 now, but it isn't terribly hard to denormalize into a form that we need.
 
 
 
 



Re: Setting up a recommender

2013-08-13 Thread Pat Ferrel
I finally got some time to work on this and have a first cut at output to Solr 
working on the github repo. It only works on 2-action input but I'll have that 
cleaned up soon so it will work with one action. Solr indexing has not been 
tested yet and the field names and/or types may need tweaking. 

It takes the result of the previous drop:
1) DRMs for B (user history or B items action1) and A (user history of A items 
action2)
2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence

There are two final outputs created using mapreduce but requiring 2 in-memory 
hashmaps. I think this will work on a cluster (the hashmaps are instantiated on 
each node) but haven't tried yet. It orders items in #2 fields by strength of 
link, which is the similarity value used in [B'B] or [B'A]. It would be nice 
to order #1 by recency but there is no provision for passing through timestamps 
at present so they are ordered by the strength of preference. This is probably 
not useful and so can be ignored. Ordering by recency might be useful for 
truncating queries by recency while leaving the training data containing 100% 
of available history.

1) It joins #1 DRMs to produce a single set of docs in CSV form, which looks 
like this:
id,history_b,history_a
user1,iphone ipad,iphone ipad galaxy
...

2) It joins #2 DRMs to produce a single set of docs in CSV form, which looks 
like this:
id,b_b_links,b_a_links
u1,iphone ipad,iphone ipad galaxy
…

It may work on a cluster; I haven't tried it yet. As soon as someone has some 
large-ish sample log files I'll give them a try. Check the sample input files 
in the resources dir for format.

https://github.com/pferrel/solr-recommender
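
In case it saves anyone time, here is a minimal SolrJ sketch of turning one of 
the joined CSV lines above into an indexed doc; the core name is made up and the 
field names follow the CSV header.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexJoinedDocs {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/users");
    // One line of the #1 output, e.g. "user1,iphone ipad,iphone ipad galaxy"
    String[] cols = "user1,iphone ipad,iphone ipad galaxy".split(",");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", cols[0]);
    doc.addField("history_b", cols[1]); // analyzed as whitespace-separated item IDs
    doc.addField("history_a", cols[2]);
    solr.add(doc);
    solr.commit();
  }
}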


On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:

When I started looking at this I was a bit skeptical. As a Search engine Solr 
may be peerless, but as yet another NoSQL db?

However getting further into this I see one very large benefit. It has one 
feature that sets it completely apart from the typical NoSQL db. The type of 
queries you do return fuzzy results--in the very best sense of that word. The 
most interesting queries are based on similarity to some exemplar. Results are 
returned in order of similarity strength, not ordered by a sort field.

Wherever similarity based queries are important I'll look at Solr first. SolrJ 
looks like an interesting way to get Solr queries on POJOs. It's probably at 
least an alternative to using docs and CSVs to import the data from Mahout.



On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Yes.  That would be interesting.




On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:

 A little digression: Might a Matrix implementation backed by a Solr index
 and uses SolrJ for querying help at all for the Solr recommendation
 approach?
 
 It supports multiple fields of String, Text, or boolean flags.
 
 Best
 Gokhan
 
 
 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Also a question about user history.
 
 I was planning to write these into separate directories so Solr could
 fetch them from different sources but it occurs to me that it would be
 better to join A and B by user ID and output a doc per user ID with three
 fields, id, A item history, and B item history. Other fields could be
 added
 for users metadata.
 
 Sound correct? This is what I'll do unless someone stops me.
 
 On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Once you have a sample or example of what you think the
 log file version will look like, can you post it? It would be great to
 have example lines for two actions with or without the same item IDs.
 I'll
 make sure we can digest it.
 
 I thought more about the ingest part and I don't think the one-item-space
 is actually a problem. It just means one item dictionary. A and B will
 have
 the right content, all I have to do is make sure the right ranks are
 input
 to the MM,
 Transpose, and RSJ. This in turn is only one extra count of the # of
 items
 in A's item space. This should be a very easy change If my thinking is
 correct.
 
 
 On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do
 it
 in the project. Ted may have some data for this from musicbrainz.
 
 
 I am working on this issue now.
 
 The current state is that I can bring in a bunch of track names and links
 to artist names and so on.  This would provide the basic set of items
 (artists, genres, tracks and tags).
 
 There is a hitch in bringing in the data needed to generate the logs
 since
 that part of MB is not Apache compatible.  I am working on that issue.
 
 Technically, the data is in a massively normalized relational form right
 now, but it isn't terribly hard to denormalize 

Re: Setting up a recommender

2013-08-13 Thread Pat Ferrel
Corrections inline

 On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 I finally got some time to work on this and have a first cut at output to 
 Solr working on the github repo. It only works on 2-action input but I'll 
 have that cleaned up soon so it will work with one action. Solr indexing has 
 not been tested yet and the field names and/or types may need tweaking. 
 
 It takes the result of the previous drop:
 1) DRMs for B (user history or B items action1) and A (user history of A 
 items action2)
 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
 There are two final outputs created using mapreduce but requiring 2 in-memory 
 hashmaps. I think this will work on a cluster (the hashmaps are instantiated 
 on each node) but haven't tried yet. It orders items in #2 fields by strength 
 of link, which is the similarity value used in [B'B] or [B'A]. It would be 
 nice to order #1 by recency but there is no provision for passing through 
 timestamps at present so they are ordered by the strength of preference. This 
 is probably not useful and so can be ignored. Ordering by recency might be 
 useful for truncating queries by recency while leaving the training data 
 containing 100% of available history.
 
 1) It joins #1 DRMs to produce a single set of docs in CSV form, which looks 
 like this:
 id,history_b,history_a
u1,iphone ipad,iphone ipad galaxy
 ...
 
 2) it joins #2 DRMs to produce a single set of docs in CSV form, which looks 
 like this:
 id,b_b_links,b_a_links
iphone,iphone ipad,iphone ipad galaxy
 …
 
 It may work on a cluster, I haven't tried yet. As soon as someone has some 
 large-ish sample log files I'll give them a try. Check the sample input files 
 in the resources dir for format.
 
 https://github.com/pferrel/solr-recommender
 
 
 On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 When I started looking at this I was a bit skeptical. As a Search engine Solr 
 may be peerless, but as yet another NoSQL db?
 
 However getting further into this I see one very large benefit. It has one 
 feature that sets it completely apart from the typical NoSQL db. The type of 
 queries you do return fuzzy results--in the very best sense of that word. The 
 most interesting queries are based on similarity to some exemplar. Results 
 are returned in order of similarity strength, not ordered by a sort field.
 
 Wherever similarity based queries are important I'll look at Solr first. 
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's 
 probably at least an alternative to using docs and CSVs to import the data 
 from Mahout.
 
 
 
 On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Yes.  That would be interesting.
 
 
 
 
 On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
 A little digression: Might a Matrix implementation backed by a Solr index
 and uses SolrJ for querying help at all for the Solr recommendation
 approach?
 
 It supports multiple fields of String, Text, or boolean flags.
 
 Best
 Gokhan
 
 
 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Also a question about user history.
 
 I was planning to write these into separate directories so Solr could
 fetch them from different sources but it occurs to me that it would be
 better to join A and B by user ID and output a doc per user ID with three
 fields, id, A item history, and B item history. Other fields could be
 added
 for users metadata.
 
 Sound correct? This is what I'll do unless someone stops me.
 
 On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Once you have a sample or example of what you think the
 log file version will look like, can you post it? It would be great to
 have example lines for two actions with or without the same item IDs.
 I'll
 make sure we can digest it.
 
 I thought more about the ingest part and I don't think the one-item-space
 is actually a problem. It just means one item dictionary. A and B will
 have
 the right content, all I have to do is make sure the right ranks are
 input
 to the MM,
 Transpose, and RSJ. This in turn is only one extra count of the # of
 items
 in A's item space. This should be a very easy change If my thinking is
 correct.
 
 
 On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do
 it
 in the project. Ted may have some data for this from musicbrainz.
 
 
 I am working on this issue now.
 
 The current state is that I can bring in a bunch of track names and links
 to artist names and so on.  This would provide the basic set of items
 (artists, genres, tracks and tags).
 
 There is a hitch in bringing in the data needed to generate the logs
 since
 that part of MB is not Apache 

Re: Setting up a recommender

2013-08-13 Thread Pat Ferrel
OK, single-action recs are working, so output to Solr works with only [B'B] and B.

On Aug 13, 2013, at 10:52 AM, Pat Ferrel pat.fer...@gmail.com wrote:

Corrections inline

 On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 I finally got some time to work on this and have a first cut at output to 
 Solr working on the github repo. It only works on 2-action input but I'll 
 have that cleaned up soon so it will work with one action. Solr indexing has 
 not been tested yet and the field names and/or types may need tweaking. 
 
 It takes the result of the previous drop:
 1) DRMs for B (user history or B items action1) and A (user history of A 
 items action2)
 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence
 
 There are two final outputs created using mapreduce but requiring 2 in-memory 
 hashmaps. I think this will work on a cluster (the hashmaps are instantiated 
 on each node) but haven't tried yet. It orders items in #2 fields by strength 
 of link, which is the similarity value used in [B'B] or [B'A]. It would be 
 nice to order #1 by recency but there is no provision for passing through 
 timestamps at present so they are ordered by the strength of preference. This 
 is probably not useful and so can be ignored. Ordering by recency might be 
 useful for truncating queries by recency while leaving the training data 
 containing 100% of available history.
 
 1) It joins #1 DRMs to produce a single set of docs in CSV form, which looks 
 like this:
 id,history_b,history_a
u1,iphone ipad,iphone ipad galaxy
 ...
 
 2) it joins #2 DRMs to produce a single set of docs in CSV form, which looks 
 like this:
 id,b_b_links,b_a_links
iphone,iphone ipad,iphone ipad galaxy
 …
 
 It may work on a cluster, I haven't tried yet. As soon as someone has some 
 large-ish sample log files I'll give them a try. Check the sample input files 
 in the resources dir for format.
 
 https://github.com/pferrel/solr-recommender
 
 
 On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 When I started looking at this I was a bit skeptical. As a Search engine Solr 
 may be peerless, but as yet another NoSQL db?
 
 However getting further into this I see one very large benefit. It has one 
 feature that sets it completely apart from the typical NoSQL db. The type of 
 queries you do return fuzzy results--in the very best sense of that word. The 
 most interesting queries are based on similarity to some exemplar. Results 
 are returned in order of similarity strength, not ordered by a sort field.
 
 Wherever similarity based queries are important I'll look at Solr first. 
 SolrJ looks like an interesting way to get Solr queries on POJOs. It's 
 probably at least an alternative to using docs and CSVs to import the data 
 from Mahout.
 
 
 
 On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Yes.  That would be interesting.
 
 
 
 
 On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:
 
 A little digression: Might a Matrix implementation backed by a Solr index
 and uses SolrJ for querying help at all for the Solr recommendation
 approach?
 
 It supports multiple fields of String, Text, or boolean flags.
 
 Best
 Gokhan
 
 
 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Also a question about user history.
 
 I was planning to write these into separate directories so Solr could
 fetch them from different sources but it occurs to me that it would be
 better to join A and B by user ID and output a doc per user ID with three
 fields, id, A item history, and B item history. Other fields could be
 added
 for users metadata.
 
 Sound correct? This is what I'll do unless someone stops me.
 
 On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Once you have a sample or example of what you think the
 log file version will look like, can you post it? It would be great to
 have example lines for two actions with or without the same item IDs.
 I'll
 make sure we can digest it.
 
 I thought more about the ingest part and I don't think the one-item-space
 is actually a problem. It just means one item dictionary. A and B will
 have
 the right content, all I have to do is make sure the right ranks are
 input
 to the MM,
 Transpose, and RSJ. This in turn is only one extra count of the # of
 items
 in A's item space. This should be a very easy change If my thinking is
 correct.
 
 
 On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do
 it
 in the project. Ted may have some data for this from musicbrainz.
 
 
 I am working on this issue now.
 
 The current state is that I can bring in a bunch of track names and links
 to artist names and so on.  This would provide the basic set of items
 

Re: Setting up a recommender

2013-08-12 Thread Gokhan Capan
A little digression: Might a Matrix implementation backed by a Solr index,
using SolrJ for querying, help at all for the Solr recommendation approach?

It supports multiple fields of String, Text, or boolean flags.

Best
Gokhan


On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Also a question about user history.

 I was planning to write these into separate directories so Solr could
 fetch them from different sources but it occurs to me that it would be
 better to join A and B by user ID and output a doc per user ID with three
 fields, id, A item history, and B item history. Other fields could be added
 for users metadata.

 Sound correct? This is what I'll do unless someone stops me.

 On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:

 Once you have a sample or example of what you think the
 log file version will look like, can you post it? It would be great to
 have example lines for two actions with or without the same item IDs. I'll
 make sure we can digest it.

 I thought more about the ingest part and I don't think the one-item-space
 is actually a problem. It just means one item dictionary. A and B will have
 the right content, all I have to do is make sure the right ranks are input
 to the MM,
 Transpose, and RSJ. This in turn is only one extra count of the # of items
 in A's item space. This should be a very easy change If my thinking is
 correct.


 On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:

  4) To add more metadata to the Solr output will be left to the consumer
  for now. If there is a good data set to use we can illustrate how to do
 it
  in the project. Ted may have some data for this from musicbrainz.


 I am working on this issue now.

 The current state is that I can bring in a bunch of track names and links
 to artist names and so on.  This would provide the basic set of items
 (artists, genres, tracks and tags).

 There is a hitch in bringing in the data needed to generate the logs since
 that part of MB is not Apache compatible.  I am working on that issue.

 Technically, the data is in a massively normalized relational form right
 now, but it isn't terribly hard to denormalize into a form that we need.





Re: Setting up a recommender

2013-08-12 Thread Ted Dunning
Yes.  That would be interesting.




On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:

 A little digression: Might a Matrix implementation backed by a Solr index
 and uses SolrJ for querying help at all for the Solr recommendation
 approach?

 It supports multiple fields of String, Text, or boolean flags.

 Best
 Gokhan


 On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  Also a question about user history.
 
  I was planning to write these into separate directories so Solr could
  fetch them from different sources but it occurs to me that it would be
  better to join A and B by user ID and output a doc per user ID with three
  fields, id, A item history, and B item history. Other fields could be
 added
  for users metadata.
 
  Sound correct? This is what I'll do unless someone stops me.
 
  On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
  Once you have a sample or example of what you think the
  log file version will look like, can you post it? It would be great to
  have example lines for two actions with or without the same item IDs.
 I'll
  make sure we can digest it.
 
  I thought more about the ingest part and I don't think the one-item-space
  is actually a problem. It just means one item dictionary. A and B will
 have
  the right content, all I have to do is make sure the right ranks are
 input
  to the MM,
  Transpose, and RSJ. This in turn is only one extra count of the # of
 items
  in A's item space. This should be a very easy change If my thinking is
  correct.
 
 
  On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
   4) To add more metadata to the Solr output will be left to the consumer
   for now. If there is a good data set to use we can illustrate how to do
  it
   in the project. Ted may have some data for this from musicbrainz.
 
 
  I am working on this issue now.
 
  The current state is that I can bring in a bunch of track names and links
  to artist names and so on.  This would provide the basic set of items
  (artists, genres, tracks and tags).
 
  There is a hitch in bringing in the data needed to generate the logs
 since
  that part of MB is not Apache compatible.  I am working on that issue.
 
  Technically, the data is in a massively normalized relational form right
  now, but it isn't terribly hard to denormalize into a form that we need.
 
 
 



Re: Setting up a recommender

2013-08-07 Thread Ted Dunning
On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do it
 in the project. Ted may have some data for this from musicbrainz.


I am working on this issue now.

The current state is that I can bring in a bunch of track names and links
to artist names and so on.  This would provide the basic set of items
(artists, genres, tracks and tags).

There is a hitch in bringing in the data needed to generate the logs since
that part of MB is not Apache compatible.  I am working on that issue.

Technically, the data is in a massively normalized relational form right
now, but it isn't terribly hard to denormalize into a form that we need.


Re: Setting up a recommender

2013-08-07 Thread Pat Ferrel
Once you have a sample or example of what you think the 
log file version will look like, can you post it? It would be great to have 
example lines for two actions with or without the same item IDs. I'll make sure 
we can digest it.

I thought more about the ingest part and I don't think the one-item-space is 
actually a problem. It just means one item dictionary. A and B will have the 
right content; all I have to do is make sure the right ranks are input to the 
MM, Transpose, and RSJ. This in turn is only one extra count of the # of items 
in A's item space. This should be a very easy change if my thinking is correct.


On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:

On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do it
 in the project. Ted may have some data for this from musicbrainz.


I am working on this issue now.

The current state is that I can bring in a bunch of track names and links
to artist names and so on.  This would provide the basic set of items
(artists, genres, tracks and tags).

There is a hitch in bringing in the data needed to generate the logs since
that part of MB is not Apache compatible.  I am working on that issue.

Technically, the data is in a massively normalized relational form right
now, but it isn't terribly hard to denormalize into a form that we need.



Re: Setting up a recommender

2013-08-07 Thread Pat Ferrel
Also a question about user history.

I was planning to write these into separate directories so Solr could fetch 
them from different sources, but it occurs to me that it would be better to join 
A and B by user ID and output a doc per user ID with three fields: id, A item 
history, and B item history. Other fields could be added for user metadata.

Sound correct? This is what I'll do unless someone stops me.

On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:

Once you have a sample or example of what you think the 
log file version will look like, can you post it? It would be great to have 
example lines for two actions with or without the same item IDs. I'll make sure 
we can digest it.

I thought more about the ingest part and I don't think the one-item-space is 
actually a problem. It just means one item dictionary. A and B will have the 
right content, all I have to do is make sure the right ranks are input to the 
MM, 
Transpose, and RSJ. This in turn is only one extra count of the # of items in 
A's item space. This should be a very easy change If my thinking is correct.


On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:

On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 4) To add more metadata to the Solr output will be left to the consumer
 for now. If there is a good data set to use we can illustrate how to do it
 in the project. Ted may have some data for this from musicbrainz.


I am working on this issue now.

The current state is that I can bring in a bunch of track names and links
to artist names and so on.  This would provide the basic set of items
(artists, genres, tracks and tags).

There is a hitch in bringing in the data needed to generate the logs since
that part of MB is not Apache compatible.  I am working on that issue.

Technically, the data is in a massively normalized relational form right
now, but it isn't terribly hard to denormalize into a form that we need.




Re: Setting up a recommender

2013-08-06 Thread Pat Ferrel
A note about todays hangout regarding the cross-recommender.

In general it may be useful to think about the current and proposed system as 
two pipelines:

1) a pipeline that takes preference data, turns it into two preference matrices 
in Mahout DRM form, and creates [B'B] and [B'A], ideally using LLR-based Row and 
CrossRowSimilarityJobs. This generates two DRMs with Mahout keys and 
VectorWritable(s) using internal numerical Mahout IDs. There is one ID space for 
B and one for A. In the github repo these also create recommendations in Mahout 
form via an item-based RecommenderJob and an XRecommenderJob. This last step is 
not needed when using Solr but may be useful for comparison. These jobs are all 
mapreduce and closely match the Mahout code and model of calculation. 

2) a pipeline that processes IDs and other metadata contained in the logs. The 
IDs are user IDs in string form, as are the item IDs. But the items for the A 
action may be completely different from those for B. This cross-recommender ties 
the two together with a generalized notion of significant cooccurrence by 
executing the #1 pipeline and using the results. These log file IDs are what 
gets written out to Solr; which IDs to write is encoded in the two 
Mahout-generated DRMs. The pipeline may need to bring along other metadata mined 
from the logs like item descriptions, tags, categories, etc. Note: this last bit 
is not built in at present but would make Solr queries even better. Also, at 
present A and B are assumed to have the same item IDs. This works for 
purchase+view actions and others, but not for some cross-actions that would be 
useful, like music track listen + tagged category listen -> track recommendation, 
or music tagged category listen + track listen -> category recommendation.
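
To make the ID handling concrete, here is a minimal sketch of the kind of 
dictionary pipeline #2 has to maintain; nothing project-specific, just the idea 
of mapping external string IDs from the logs to Mahout's dense internal ints and 
back when writing the Solr docs.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One dictionary per ID space (users, B items, A items).
public class IdDictionary {
  private final Map<String, Integer> toInternal = new HashMap<String, Integer>();
  private final List<String> toExternal = new ArrayList<String>();

  // Returns the internal index for an external ID, assigning one if new.
  public int intern(String externalId) {
    Integer idx = toInternal.get(externalId);
    if (idx == null) {
      idx = toExternal.size();
      toInternal.put(externalId, idx);
      toExternal.add(externalId);
    }
    return idx;
  }

  // Translates a DRM row/column index back to the log's string ID.
  public String externalId(int internalId) {
    return toExternal.get(internalId);
  }
}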

The current action items are:
1) #1 is running and works but eventually needs to be reintegrated with new 
Mahout trunk code--my action item, with Sebastian's help.
2) #2 needs to write the merged DRMs to Solr as one doc per row and 3 fields 
per doc (id, B'B, B'A)--I'm working on this now.
3) To generalize further we need to account for different ID spaces in #2 and 
I'll take that as an action item.
4) To add more metadata to the Solr output will be left to the consumer for 
now. If there is a good data set to use we can illustrate how to do it in the 
project. Ted may have some data for this from musicbrainz.

Re: Setting up a recommender

2013-08-05 Thread Pat Ferrel
In writing the similarity matrices to Solr there is a bit of a problem. The 
matrices exist in two DRMs. The rows correspond to the doc IDs. As far as I 
know there is no guarantee that the IDs of both matrices are in the same 
descending order. 

The easiest solution is to have an index for [B'B] and one for [B'A]. That 
means two or perhaps three queries for cross-recommendations, which is not 
ideal.

First I'm going to create two collections of docs with different field 
ids--this should work and we can merge them later.

Next we can do some m/r to group the docs by id so there is one collection 
(csv) with one line per doc. 

Alternatively it is possible that the DRMs can be iterated simultaneously, 
which would also solve the problem. It assumes the order in both DRMs is the 
same, descending by key = item ID. Even if a row is missing in one or the other 
this would work.

Does anyone know if the DRMs are guaranteed to have row ordering by key? RSJ 
creates [B'B] and matrix multiply creates [B'A].


On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Yes.  We need two different sets of documents if the row space of the
cross/co-occurrence matrices are different as is the case with A'B and B'B.

This could mean two indexes.

Or a single index with a special field to indicate what type of record you
have.


On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Thanks, well put.
 
 In order to have the ultimate impl with two id spaces for A and B would we
 have to create different docs for A'B and B'B? Since the docs IDs must come
 from A or B? The fields can contain different sets of IDs but the Doc ID
 must be one or the other, right? Doesn't this imply separate indexes for
 the separate A, B item IDs spaces? This is not a question for this first
 cut impl but is a generalization question.
 
 On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 So there is a lot of good discussion here and there were some key ideas.
 
 The first idea is that the *input* to a recommender is on the right in the
 matrix notation.  This refers inherently to the id's on the columns of the
 recommender product (either B'B or B'A).  The columns are defined by the
 right hand element of the product (either B or A in the B'B and B'A
 respectively).
 
 The results are in the row space and are defined by the left hand operand
 of the product.  IN the case of B'A and B'B, the left hand operand is B in
 both cases so the row space is consistent.
 
 In order to implement this in a search engine, we need documents that
 correspond to rows of B'A or B'B.  These are the same as the columns of B.
 The fields of the documents will necessarily include the following:
 
 id: the column id from B corresponding to this item
 description: presentation info ... yada yada
 b-a-links: contents of this row of B'A expressed as id's from the column
 space of A where this row  of llr-filter(B'A) contains a
 non-zero value.
 b-b-links: contents of this row of B'B expressed as id's from the column
 space of B ...
 
 
 The following operations are now single queries:
 
 get an item where id = x
  query is [id:x]
 
 recommend based on behavior with regard to A items and actions h_a
  query is [b-a-links: h_a]
 
 recommend based on behavior with regard to B items and actions h_b
  query is [b-b-links: h_b]
 
 recommend based on a single item with id = x
   query is [b-b-links: x]
 
 recommend based on composite behavior composed of h_a and h_b
   query is [b-a-links: h_a b-b-links: h_b]
 
 Does this make sense by being more explicit?
 
 Now, it is pretty clear that we could have an index of A objects as well
 but the link fields would have to be a-a-links and a-b-links, of course.
 
 
 
 
 On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Assuming Ted needs to call it, not sure if an invite has gone out, I
 haven't seen one.
 
 On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
 
 I am planning on sitting in as flaky connection allows.
 On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 We doing a hangout at 2 on the Solr recommender?
 
 
 
 
 



Re: Setting up a recommender

2013-08-05 Thread Ted Dunning
A quick map-reduce program should be able to join these matrices and
produce documents ready to index.


On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel p...@occamsmachete.com wrote:

 In writing the similarity matrices to Solr there is a bit of a problem.
 The Matrices exist in two DRMs. The rows correspond to the doc IDs. As far
 as I know there is no guarantee that the ids of both matrices are in the
 same descending order.

 The easiest solution is to have an index for [B'B] and one for [B'A]. That
 means two or perhaps three queries for cross-recommendations, which is not
 ideal.

 First I'm going to create two collections of docs with different field
 ids--this should work and we can merge them later.

 Next we can do some m/r to group the docs by id so there is one collection
 (csv) with one line per doc.

 Alternatively it is a possible that the DRMs can be iterated
 simultaneously, which would also solve the problem. It assumes the order in
 both DRMs is the same, descending by Key = item ID. Even if a row is
 missing in one or the other this would work.

 Does anyone know if the DRMs are guaranteed to have row ordering by Key?
 RSJ creates [B'B] and matrix multiply creates [B'A]


 On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Yes.  We need two different sets of documents if the row space of the
 cross/co-occurrence matrices are different as is the case with A'B and B'B.

 This could mean two indexes.

 Or a single index with a special field to indicate what type of record you
 have.


 On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com wrote:

  Thanks, well put.
 
  In order to have the ultimate impl with two id spaces for A and B would
 we
  have to create different docs for A'B and B'B? Since the docs IDs must
 come
  from A or B? The fields can contain different sets of IDs but the Doc ID
  must be one or the other, right? Doesn't this imply separate indexes for
  the separate A, B item IDs spaces? This is not a question for this first
  cut impl but is a generalization question.
 
  On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  So there is a lot of good discussion here and there were some key ideas.
 
  The first idea is that the *input* to a recommender is on the right in
 the
  matrix notation.  This refers inherently to the id's on the columns of
 the
  recommender product (either B'B or B'A).  The columns are defined by the
  right hand element of the product (either B or A in the B'B and B'A
  respectively).
 
  The results are in the row space and are defined by the left hand operand
  of the product.  IN the case of B'A and B'B, the left hand operand is B
 in
  both cases so the row space is consistent.
 
  In order to implement this in a search engine, we need documents that
  correspond to rows of B'A or B'B.  These are the same as the columns of
 B.
  The fields of the documents will necessarily include the following:
 
  id: the column id from B corresponding to this item
  description: presentation info ... yada yada
  b-a-links: contents of this row of B'A expressed as id's from the column
  space of A where this row  of llr-filter(B'A) contains a
  non-zero value.
  b-b-links: contents of this row of B'B expressed as id's from the column
  space of B ...
 
 
  The following operations are now single queries:
 
  get an item where id = x
   query is [id:x]
 
  recommend based on behavior with regard to A items and actions h_a
   query is [b-a-links: h_a]
 
  recommend based on behavior with regard to B items and actions h_b
   query is [b-b-links: h_b]
 
  recommend based on a single item with id = x
query is [b-b-links: x]
 
  recommend based on composite behavior composed of h_a and h_b
query is [b-a-links: h_a b-b-links: h_b]
 
  Does this make sense by being more explicit?
 
  Now, it is pretty clear that we could have an index of A objects as well
  but the link fields would have to be a-a-links and a-b-links, of course.
 
 
 
 
  On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
  Assuming Ted needs to call it, not sure if an invite has gone out, I
  haven't seen one.
 
  On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
 
  I am planning on sitting in as flaky connection allows.
  On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
  We doing a hangout at 2 on the Solr recommender?
 
 
 
 
 




Re: Setting up a recommender

2013-08-05 Thread Sebastian Schelter
If you use the same partitioning and number of reducers for creating the
outputs, the output should have the same number of sequence files and each
sequence file should have the same keys in descending order. I don't
understand why the ordering is a problem, can we not store the row index as
a field in solr?

2013/8/5 Ted Dunning ted.dunn...@gmail.com

 A quick map-reduce program should be able to join these matrices and
 produce documents ready to index.


 On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel p...@occamsmachete.com wrote:

  In writing the similarity matrices to Solr there is a bit of a problem.
  The Matrices exist in two DRMs. The rows correspond to the doc IDs. As
 far
  as I know there is no guarantee that the ids of both matrices are in the
  same descending order.
 
  The easiest solution is to have an index for [B'B] and one for [B'A].
 That
  means two or perhaps three queries for cross-recommendations, which is
 not
  ideal.
 
  First I'm going to create two collections of docs with different field
  ids--this should work and we can merge them later.
 
  Next we can do some m/r to group the docs by id so there is one
 collection
  (csv) with one line per doc.
 
  Alternatively it is a possible that the DRMs can be iterated
  simultaneously, which would also solve the problem. It assumes the order
 in
  both DRMs is the same, descending by Key = item ID. Even if a row is
  missing in one or the other this would work.
 
  Does anyone know if the DRMs are guaranteed to have row ordering by Key?
  RSJ creates [B'B] and matrix multiply creates [B'A]
 
 
  On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  Yes.  We need two different sets of documents if the row space of the
  cross/co-occurrence matrices are different as is the case with A'B and
 B'B.
 
  This could mean two indexes.
 
  Or a single index with a special field to indicate what type of record
 you
  have.
 
 
  On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
   Thanks, well put.
  
   In order to have the ultimate impl with two id spaces for A and B would
  we
   have to create different docs for A'B and B'B? Since the docs IDs must
  come
   from A or B? The fields can contain different sets of IDs but the Doc
 ID
   must be one or the other, right? Doesn't this imply separate indexes
 for
   the separate A, B item IDs spaces? This is not a question for this
 first
   cut impl but is a generalization question.
  
   On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
  
   So there is a lot of good discussion here and there were some key
 ideas.
  
   The first idea is that the *input* to a recommender is on the right in
  the
   matrix notation.  This refers inherently to the id's on the columns of
  the
   recommender product (either B'B or B'A).  The columns are defined by
 the
   right hand element of the product (either B or A in the B'B and B'A
   respectively).
  
   The results are in the row space and are defined by the left hand
 operand
   of the product.  IN the case of B'A and B'B, the left hand operand is B
  in
   both cases so the row space is consistent.
  
   In order to implement this in a search engine, we need documents that
   correspond to rows of B'A or B'B.  These are the same as the columns of
  B.
   The fields of the documents will necessarily include the following:
  
   id: the column id from B corresponding to this item
   description: presentation info ... yada yada
   b-a-links: contents of this row of B'A expressed as id's from the
 column
   space of A where this row  of llr-filter(B'A) contains
 a
   non-zero value.
   b-b-links: contents of this row of B'B expressed as id's from the
 column
   space of B ...
  
  
   The following operations are now single queries:
  
   get an item where id = x
query is [id:x]
  
   recommend based on behavior with regard to A items and actions h_a
query is [b-a-links: h_a]
  
   recommend based on behavior with regard to B items and actions h_b
query is [b-b-links: h_b]
  
   recommend based on a single item with id = x
 query is [b-b-links: x]
  
   recommend based on composite behavior composed of h_a and h_b
 query is [b-a-links: h_a b-b-links: h_b]
  
   Does this make sense by being more explicit?
  
   Now, it is pretty clear that we could have an index of A objects as
 well
   but the link fields would have to be a-a-links and a-b-links, of
 course.
  
  
  
  
   On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com
 wrote:
  
   Assuming Ted needs to call it, not sure if an invite has gone out, I
   haven't seen one.
  
   On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
  
   I am planning on sitting in as flaky connection allows.
   On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
  
   We doing a hangout at 2 on the Solr recommender?
  
  
  
  
  
 
 



Re: Setting up a recommender

2013-08-05 Thread Pat Ferrel
I think an m/r join is the best solution; too many assumptions otherwise. I 
thought Ted wanted a non-m/r implementation, but oh well, mostly non-m/r. Is 
there a good example to start from in Mahout? (A rough sketch of such a join is below.)

Yes, one id field per doc. The problem is not storing, it is joining rows from 
two DRMs by simple iteration.
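
Not a Mahout example, but a bare-bones sketch of what such a reduce-side m/r join 
could look like, assuming (hypothetically) that both matrices have first been dumped 
to text as itemId<TAB>space-delimited-ids; the paths, the "bb"/"ba" tags and the 
output format are made up:

// Hedged sketch of a reduce-side join of the [B'B] and [B'A] text dumps by item key.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinSimilarityRows {

  // tags each [B'B] row with "bb" so the reducer can tell the sources apart
  public static class BbMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t", 2);
      if (parts.length == 2) {
        ctx.write(new Text(parts[0]), new Text("bb\t" + parts[1]));
      }
    }
  }

  // same for [B'A] rows, tagged with "ba"
  public static class BaMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t", 2);
      if (parts.length == 2) {
        ctx.write(new Text(parts[0]), new Text("ba\t" + parts[1]));
      }
    }
  }

  // emits one CSV line per item: id,b_b_links,b_a_links (empty field if a row is missing)
  public static class JoinReducer extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String bb = "";
      String ba = "";
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if ("bb".equals(parts[0])) {
          bb = parts[1];
        } else {
          ba = parts[1];
        }
      }
      ctx.write(NullWritable.get(), new Text(key + "," + bb + "," + ba));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "join similarity rows");
    job.setJarByClass(JoinSimilarityRows.class);
    job.setReducerClass(JoinReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(job, new Path("hdfs:///recs/bb-text"),
        TextInputFormat.class, BbMapper.class);
    MultipleInputs.addInputPath(job, new Path("hdfs:///recs/ba-text"),
        TextInputFormat.class, BaMapper.class);
    FileOutputFormat.setOutputPath(job, new Path("hdfs:///recs/solr-csv"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}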

On Aug 5, 2013, at 10:27 AM, Sebastian Schelter s...@apache.org wrote:

If you use the same partitioning and number of reducers for creating the
outputs, the output should have the same number of sequence files and each
sequence file should have the same keys in descending order. I don't
understand why the ordering is a problem, can we not store the row index as
a field in solr?

2013/8/5 Ted Dunning ted.dunn...@gmail.com

 A quick map-reduce program should be able to join these matrices and
 produce documents ready to index.
 
 
 On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 In writing the similarity matrices to Solr there is a bit of a problem.
 The Matrices exist in two DRMs. The rows correspond to the doc IDs. As
 far
 as I know there is no guarantee that the ids of both matrices are in the
 same descending order.
 
 The easiest solution is to have an index for [B'B] and one for [B'A].
 That
 means two or perhaps three queries for cross-recommendations, which is
 not
 ideal.
 
 First I'm going to create two collections of docs with different field
 ids--this should work and we can merge them later.
 
 Next we can do some m/r to group the docs by id so there is one
 collection
 (csv) with one line per doc.
 
 Alternatively it is a possible that the DRMs can be iterated
 simultaneously, which would also solve the problem. It assumes the order
 in
 both DRMs is the same, descending by Key = item ID. Even if a row is
 missing in one or the other this would work.
 
 Does anyone know if the DRMs are guaranteed to have row ordering by Key?
 RSJ creates [B'B] and matrix multiply creates [B'A]
 
 
 On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Yes.  We need two different sets of documents if the row space of the
 cross/co-occurrence matrices are different as is the case with A'B and
 B'B.
 
 This could mean two indexes.
 
 Or a single index with a special field to indicate what type of record
 you
 have.
 
 
 On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
 Thanks, well put.
 
 In order to have the ultimate impl with two id spaces for A and B would
 we
 have to create different docs for A'B and B'B? Since the docs IDs must
 come
 from A or B? The fields can contain different sets of IDs but the Doc
 ID
 must be one or the other, right? Doesn't this imply separate indexes
 for
 the separate A, B item IDs spaces? This is not a question for this
 first
 cut impl but is a generalization question.
 
 On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 So there is a lot of good discussion here and there were some key
 ideas.
 
 The first idea is that the *input* to a recommender is on the right in
 the
 matrix notation.  This refers inherently to the id's on the columns of
 the
 recommender product (either B'B or B'A).  The columns are defined by
 the
 right hand element of the product (either B or A in the B'B and B'A
 respectively).
 
 The results are in the row space and are defined by the left hand
 operand
 of the product.  IN the case of B'A and B'B, the left hand operand is B
 in
 both cases so the row space is consistent.
 
 In order to implement this in a search engine, we need documents that
 correspond to rows of B'A or B'B.  These are the same as the columns of
 B.
 The fields of the documents will necessarily include the following:
 
 id: the column id from B corresponding to this item
 description: presentation info ... yada yada
 b-a-links: contents of this row of B'A expressed as id's from the
 column
 space of A where this row  of llr-filter(B'A) contains
 a
 non-zero value.
 b-b-links: contents of this row of B'B expressed as id's from the
 column
 space of B ...
 
 
 The following operations are now single queries:
 
 get an item where id = x
 query is [id:x]
 
 recommend based on behavior with regard to A items and actions h_a
 query is [b-a-links: h_a]
 
 recommend based on behavior with regard to B items and actions h_b
 query is [b-b-links: h_b]
 
 recommend based on a single item with id = x
  query is [b-b-links: x]
 
 recommend based on composite behavior composed of h_a and h_b
  query is [b-a-links: h_a b-b-links: h_b]
 
 Does this make sense by being more explicit?
 
 Now, it is pretty clear that we could have an index of A objects as
 well
 but the link fields would have to be a-a-links and a-b-links, of
 course.
 
 
 
 
 On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com
 wrote:
 
 Assuming Ted needs to call it, not sure if an invite has gone out, I
 haven't seen one.
 
 On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com 

Re: Setting up a recommender

2013-08-05 Thread Sebastian Schelter
I still don't understand why we need to rely on docids. If we simply index
that row A is similar to rows B, C and D that should be fine, or am I wrong?

2013/8/5 Pat Ferrel p...@occamsmachete.com

 I think m/r join is the best solution, too many assumptions otherwise. I
 thought Ted wanted a non-m/r implementation, but oh, well, mostly non-m/r.
 Is there a good example to start from in Mahout?

 Yes, one id field per doc. The problem is not storing, it is joining rows
 from two DRMs by simple iteration.

 On Aug 5, 2013, at 10:27 AM, Sebastian Schelter s...@apache.org wrote:

 If you use the same partitioning and number of reducers for creating the
 outputs, the output should have the same number of sequence files and each
 sequence file should have the same keys in descending order. I don't
 understand why the ordering is a problem, can we not store the row index as
 a field in solr?

 2013/8/5 Ted Dunning ted.dunn...@gmail.com

  A quick map-reduce program should be able to join these matrices and
  produce documents ready to index.
 
 
  On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
  In writing the similarity matrices to Solr there is a bit of a problem.
  The Matrices exist in two DRMs. The rows correspond to the doc IDs. As
  far
  as I know there is no guarantee that the ids of both matrices are in the
  same descending order.
 
  The easiest solution is to have an index for [B'B] and one for [B'A].
  That
  means two or perhaps three queries for cross-recommendations, which is
  not
  ideal.
 
  First I'm going to create two collections of docs with different field
  ids--this should work and we can merge them later.
 
  Next we can do some m/r to group the docs by id so there is one
  collection
  (csv) with one line per doc.
 
  Alternatively it is a possible that the DRMs can be iterated
  simultaneously, which would also solve the problem. It assumes the order
  in
  both DRMs is the same, descending by Key = item ID. Even if a row is
  missing in one or the other this would work.
 
  Does anyone know if the DRMs are guaranteed to have row ordering by Key?
  RSJ creates [B'B] and matrix multiply creates [B'A]
 
 
  On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  Yes.  We need two different sets of documents if the row space of the
  cross/co-occurrence matrices are different as is the case with A'B and
  B'B.
 
  This could mean two indexes.
 
  Or a single index with a special field to indicate what type of record
  you
  have.
 
 
  On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com
  wrote:
 
  Thanks, well put.
 
  In order to have the ultimate impl with two id spaces for A and B would
  we
  have to create different docs for A'B and B'B? Since the docs IDs must
  come
  from A or B? The fields can contain different sets of IDs but the Doc
  ID
  must be one or the other, right? Doesn't this imply separate indexes
  for
  the separate A, B item IDs spaces? This is not a question for this
  first
  cut impl but is a generalization question.
 
  On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  So there is a lot of good discussion here and there were some key
  ideas.
 
  The first idea is that the *input* to a recommender is on the right in
  the
  matrix notation.  This refers inherently to the id's on the columns of
  the
  recommender product (either B'B or B'A).  The columns are defined by
  the
  right hand element of the product (either B or A in the B'B and B'A
  respectively).
 
  The results are in the row space and are defined by the left hand
  operand
  of the product.  IN the case of B'A and B'B, the left hand operand is B
  in
  both cases so the row space is consistent.
 
  In order to implement this in a search engine, we need documents that
  correspond to rows of B'A or B'B.  These are the same as the columns of
  B.
  The fields of the documents will necessarily include the following:
 
  id: the column id from B corresponding to this item
  description: presentation info ... yada yada
  b-a-links: contents of this row of B'A expressed as id's from the
  column
  space of A where this row  of llr-filter(B'A) contains
  a
  non-zero value.
  b-b-links: contents of this row of B'B expressed as id's from the
  column
  space of B ...
 
 
  The following operations are now single queries:
 
  get an item where id = x
  query is [id:x]
 
  recommend based on behavior with regard to A items and actions h_a
  query is [b-a-links: h_a]
 
  recommend based on behavior with regard to B items and actions h_b
  query is [b-b-links: h_b]
 
  recommend based on a single item with id = x
   query is [b-b-links: x]
 
  recommend based on composite behavior composed of h_a and h_b
   query is [b-a-links: h_a b-b-links: h_b]
 
  Does this make sense by being more explicit?
 
  Now, it is pretty clear that we could have an index of A objects as
  well
  but the 

Re: Setting up a recommender

2013-08-05 Thread Ted Dunning
Sebastian,

There needs to be a join of the two row similarity matrices to form
documents.

Pat,

What about just updating the document with the fields?  Have three passes.
 Pass 1 puts the normal meta-data for the item in place.  Pass 2 updates
with data from B'B.  Pass 3 updates with data from B'A.

This will cause the entire index to be rewritten more than necessary, but
it should be fast enough to be a non-issue.
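
As a sketch of one such pass (not from the thread), a SolrJ 4.x atomic "set" update 
on an existing item document might look like this, assuming the schema stores the 
fields, has a uniqueKey and the update log enabled; URL, core and values are 
placeholders:

// Hedged sketch of one "pass 2" update: atomically set b_b_links on an existing doc,
// leaving the metadata and b_a_links fields alone.
import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class UpdatePassTwo {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "iphone");
    // atomic update: the map's "set" key tells Solr to replace just this field
    doc.addField("b_b_links", Collections.singletonMap("set", "ipad galaxy"));

    solr.add(doc);
    solr.commit();
  }
}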

On other fronts, I got musicbrainz downloaded over the weekend and have
figured out the schema enough so that I think I can produce recording,
artist and tag information.  From that, I can simulate user behavior and
produce logs to push into the demo system.  That will allow realistic scale
and will allow users to explore the system in terms that they understand.

There is still a question of whether we can redistribute the musicbrainz
data, but I think I can arrange it so that anybody who wants to run the
demo will just download the necessary data themselves.  I may host a
derived data product myself to simplify that process.



On Mon, Aug 5, 2013 at 10:59 AM, Sebastian Schelter s...@apache.org wrote:

 I still don't understand why we need to rely on docids. If we simply index
 that row A is similar to rows B, C and D that should be fine, or am I
 wrong?

 2013/8/5 Pat Ferrel p...@occamsmachete.com

  I think m/r join is the best solution, too many assumptions otherwise. I
  thought Ted wanted a non-m/r implementation, but oh, well, mostly
 non-m/r.
  Is there a good example to start from in Mahout?
 
  Yes, one id field per doc. The problem is not storing, it is joining rows
  from two DRMs by simple iteration.
 
  On Aug 5, 2013, at 10:27 AM, Sebastian Schelter s...@apache.org wrote:
 
  If you use the same partitioning and number of reducers for creating the
  outputs, the output should have the same number of sequence files and
 each
  sequence file should have the same keys in descending order. I don't
  understand why the ordering is a problem, can we not store the row index
 as
  a field in solr?
 
  2013/8/5 Ted Dunning ted.dunn...@gmail.com
 
   A quick map-reduce program should be able to join these matrices and
   produce documents ready to index.
  
  
   On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel p...@occamsmachete.com
  wrote:
  
   In writing the similarity matrices to Solr there is a bit of a
 problem.
   The Matrices exist in two DRMs. The rows correspond to the doc IDs. As
   far
   as I know there is no guarantee that the ids of both matrices are in
 the
   same descending order.
  
   The easiest solution is to have an index for [B'B] and one for [B'A].
   That
   means two or perhaps three queries for cross-recommendations, which is
   not
   ideal.
  
   First I'm going to create two collections of docs with different field
   ids--this should work and we can merge them later.
  
   Next we can do some m/r to group the docs by id so there is one
   collection
   (csv) with one line per doc.
  
   Alternatively it is a possible that the DRMs can be iterated
   simultaneously, which would also solve the problem. It assumes the
 order
   in
   both DRMs is the same, descending by Key = item ID. Even if a row is
   missing in one or the other this would work.
  
   Does anyone know if the DRMs are guaranteed to have row ordering by
 Key?
   RSJ creates [B'B] and matrix multiply creates [B'A]
  
  
   On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  
   Yes.  We need two different sets of documents if the row space of the
   cross/co-occurrence matrices are different as is the case with A'B and
   B'B.
  
   This could mean two indexes.
  
   Or a single index with a special field to indicate what type of record
   you
   have.
  
  
   On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com
   wrote:
  
   Thanks, well put.
  
   In order to have the ultimate impl with two id spaces for A and B
 would
   we
   have to create different docs for A'B and B'B? Since the docs IDs
 must
   come
   from A or B? The fields can contain different sets of IDs but the Doc
   ID
   must be one or the other, right? Doesn't this imply separate indexes
   for
   the separate A, B item IDs spaces? This is not a question for this
   first
   cut impl but is a generalization question.
  
   On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  
   So there is a lot of good discussion here and there were some key
   ideas.
  
   The first idea is that the *input* to a recommender is on the right
 in
   the
   matrix notation.  This refers inherently to the id's on the columns
 of
   the
   recommender product (either B'B or B'A).  The columns are defined by
   the
   right hand element of the product (either B or A in the B'B and B'A
   respectively).
  
   The results are in the row space and are defined by the left hand
   operand
   of the product.  IN the case of B'A and B'B, the left hand operand
 is B
   in
   both cases so the row 

Re: Setting up a recommender

2013-08-05 Thread Pat Ferrel
Yeah, thought of that one too, but it still requires each to be ordered by Key, in 
which case simultaneous iteration works in one pass I think.

If the DRMs are always sorted by Key you can iterate through each at the same 
time, writing only when you have both fields or know a field is missing from one 
DRM. If you get the same key you write a combined doc; if you get different ones, 
write out one-sided docs until one iterator catches up to the other (sketched below).
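
For illustration, a sketch of that one-pass merge over two key-ordered streams. 
Plain iterators over a simple Row type stand in for the real SequenceFileDirIterator 
over the two DRMs, with each row assumed already rendered as space-delimited item 
IDs. The comparisons are written for ascending key order; flip them if the rows 
really come out descending as discussed here:

// Hedged sketch of the one-pass ordered merge of [B'B] and [B'A] rows into CSV lines.
import java.util.Iterator;

public class OrderedMerge {

  static class Row {
    final int key;      // the DRM row key (item ID)
    final String ids;   // the row's non-zero columns as space-delimited item IDs
    Row(int key, String ids) { this.key = key; this.ids = ids; }
  }

  // writes one CSV line per item: id,b_b_links,b_a_links (a field is empty if that
  // row is missing from one of the two matrices)
  static void merge(Iterator<Row> bb, Iterator<Row> ba) {
    Row b = bb.hasNext() ? bb.next() : null;
    Row a = ba.hasNext() ? ba.next() : null;
    while (b != null || a != null) {
      if (a == null || (b != null && b.key < a.key)) {
        System.out.println(b.key + "," + b.ids + ",");      // row only in [B'B]
        b = bb.hasNext() ? bb.next() : null;
      } else if (b == null || a.key < b.key) {
        System.out.println(a.key + ",," + a.ids);           // row only in [B'A]
        a = ba.hasNext() ? ba.next() : null;
      } else {                                              // same key: combined doc
        System.out.println(b.key + "," + b.ids + "," + a.ids);
        b = bb.hasNext() ? bb.next() : null;
        a = ba.hasNext() ? ba.next() : null;
      }
    }
  }
}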

Every DRM I've examined seems to be ordered by key and I assume that is not an 
artifact of seqdumper. I'm using SequenceFileDirIterator so the part file 
splits aren't a problem.

A m/r join is pretty simple too but I'll go with non-m/r unless there is a 
problem above.

BTW the schema for the Solr csv is:
id,b_b_links,b_a_links
item1,itemX itemY,itemZ

am I missing some normal metadata?

 On Aug 5, 2013, at 11:05 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 What about just updating the document with the fields?  Have three passes.
 Pass 1 puts the normal meta-data for the item in place.  Pass2 updates
 with data from B'B.  Pass 3 udpates with data from B'A.
 
 This will cause the entire index to be rewritten more than necessary, but
 it should be fast enough to be a non-issue.
 



Re: Setting up a recommender

2013-08-05 Thread Ted Dunning
On Mon, Aug 5, 2013 at 11:50 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 Yeah thought of that one too but it still requires each be ordered by Key,
 in which case simultaneous iteration works in one pass I think.


Multipass does not require ordering by key.  Solr documents can be updated
in any order.


 If the DRMs are always sorted by Key you can iterate through each at the
 same time, writing only when you have both fields or know there is a field
 missing from one DRM. If you get the same key you write a combined doc, if
 you have different ones, write out one sided until it catches up to the
 other.


Yes.  Merge will work when files are ordered and split consistently.  I
don't think we should be making that assumption.


 Every DRM I've examined seems to be ordered by key and I assume that is
 not an artifact of seqdumper. I'm using SequenceFileDirIterator so the part
 file splits aren't a problem.


But with the co- and cross- occurrence stuff, file splits could be a
problem.


 A m/r join is pretty simple too but I'll go with non-m/r unless there is a
 problem above.


The simplest join is to use Solr updates.  This would require a minimal
amount of programming, and less than writing a merge program.


 BTW the schema for the Solr csv is:
 id,b_b_links,b_a_links
 item1,itemX itemY,itemZ

 am I missing some normal metadata?


An item description is nice.


Re: Setting up a recommender

2013-08-05 Thread Johannes Schulte
we have a cross recommender in production for about 3 months now, with the
difference that we use lucene to build indices from map reduce directly,
plus we do the same thing for 30+ customers, most of them with different
input data structures (field names, values).

we had something similar before (lucene, multiple cross relations) that also
used the similarity score (llr) with a custom similarity and payloads, but we
switched to pure tedism after some helpful comments here. therefore i
read this thread with a lot of interest.

what i can add from my experiences:

1. i find it way easier to not talk about this in matrix multiplication
language but with contingency tables (a and b, a and not b, not a and b,
not a and not b), and i also find the usage of the classical mahout
similarity jobs hard. this is probably because of my basic matrix math
skills, but also because using matrices leads to id usage, and often the
extracted items are text (search term, country, page section). thinking of
this as related terms automatically gives a document view on the item to
be recommended (the lucene doc) where description, name and everything is
also just a field.

2. when doing a simple table it's just cooccurrences, marginals and totals.
since the dimension of marginals is often not too big (items, browsers,
terms), we right now accumulate the counts in memory. maybe the
RowSimilarityJob is working the same way. this can be changed to a
different implementation like an on-disk hash table or even a count-min
sketch if the number of items is too large. the main point is that the
counting of marginals can be done on the fly when emitting all
cooccurrences (an llr sketch from these counts follows after this list).

3. above in the thread there was a tip on approaching similarity scores
with repeating terms. payloads are a better way for this, and with lucene
4's doc values capability there shouldn't be any mahout similarity not
expressible by a lucene similarity. maybe it would be helpful to provide a
lucene delivery system also for the classic mahout recommender package.
it adds so many possibilities for filtering and takes away a lot of pain
like caching etc.

4. a big question is the frequency of rebuilding. while the relations can
often stay untouched for a day, the item data may change way more often
(item churn, new items). it is therefore beneficial to separate those and
have the possibility to rebuild the final index without calculating all
similarities again (for very critical things this often leads to querying
some external source to build up a lucene filter that restricts the index).
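
For illustration, a self-contained sketch (not from this thread) of the llr score 
computed from such a 2x2 contingency table; the counts in main are made up:

// Hedged sketch of llr from a 2x2 contingency table.
// k11 = a&b, k12 = a&!b, k21 = !a&b, k22 = !a&!b.
public class Llr {

  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // unnormalized entropy over raw counts
  static double entropy(long... counts) {
    long total = 0;
    double sum = 0.0;
    for (long c : counts) {
      total += c;
      sum += xLogX(c);
    }
    return xLogX(total) - sum;
  }

  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double colEntropy = entropy(k11 + k21, k12 + k22);
    double matEntropy = entropy(k11, k12, k21, k22);
    double llr = 2.0 * (rowEntropy + colEntropy - matEntropy);
    return llr < 0.0 ? 0.0 : llr;   // guard against round-off going slightly negative
  }

  public static void main(String[] args) {
    // e.g. 13 users did both actions, 1000 did only one or the other, 100000 did neither
    System.out.println(logLikelihoodRatio(13, 1000, 1000, 100000));
  }
}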

besides that, i am very happy to see the ongoing effort on this topic and
hope that i can contribute something someday.

Cheers,
Johannes




On Mon, Aug 5, 2013 at 10:27 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Mon, Aug 5, 2013 at 11:50 AM, Pat Ferrel pat.fer...@gmail.com wrote:

  Yeah thought of that one too but it still requires each be ordered by
 Key,
  in which case simultaneous iteration works in one pass I think.
 

 Multipass does not require ordering by key.  Solr documents can be updated
 in any order.


  If the DRMs are always sorted by Key you can iterate through each at the
  same time, writing only when you have both fields or know there is a
 field
  missing from one DRM. If you get the same key you write a combined doc,
 if
  you have different ones, write out one sided until it catches up to the
  other.
 

 Yes.  Merge will work when files are ordered and split consistently.  I
 don't think we should be making that assumption.


  Every DRM I've examined seems to be ordered by key and I assume that is
  not an artifact of seqdumper. I'm using SequenceFileDirIterator so the
 part
  file splits aren't a problem.
 

 But with the co- and cross- occurrence stuff, file splits could be a
 problem.


  A m/r join is pretty simple too but I'll go with non-m/r unless there is
 a
  problem above.
 

 The simplest join is to use Solr updates.  This would require a minimal
 amount of programming, but less than writing a merge program.


  BTW the schema for the Solr csv is:
  id,b_b_links,b_a_links
  item1,itemX itemY,itemZ
 
  am I missing some normal metadata?
 

 An item description is nice.



Re: Setting up a recommender

2013-08-03 Thread Ted Dunning
Yes.  We need two different sets of documents if the row space of the
cross/co-occurrence matrices are different as is the case with A'B and B'B.

This could mean two indexes.

Or a single index with a special field to indicate what type of record you
have.


On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Thanks, well put.

 In order to have the ultimate impl with two id spaces for A and B would we
 have to create different docs for A'B and B'B? Since the docs IDs must come
 from A or B? The fields can contain different sets of IDs but the Doc ID
 must be one or the other, right? Doesn't this imply separate indexes for
 the separate A, B item IDs spaces? This is not a question for this first
 cut impl but is a generalization question.

 On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 So there is a lot of good discussion here and there were some key ideas.

 The first idea is that the *input* to a recommender is on the right in the
 matrix notation.  This refers inherently to the id's on the columns of the
 recommender product (either B'B or B'A).  The columns are defined by the
 right hand element of the product (either B or A in the B'B and B'A
 respectively).

 The results are in the row space and are defined by the left hand operand
 of the product.  IN the case of B'A and B'B, the left hand operand is B in
 both cases so the row space is consistent.

 In order to implement this in a search engine, we need documents that
 correspond to rows of B'A or B'B.  These are the same as the columns of B.
 The fields of the documents will necessarily include the following:

 id: the column id from B corresponding to this item
 description: presentation info ... yada yada
 b-a-links: contents of this row of B'A expressed as id's from the column
 space of A where this row  of llr-filter(B'A) contains a
 non-zero value.
 b-b-links: contents of this row of B'B expressed as id's from the column
 space of B ...


 The following operations are now single queries:

 get an item where id = x
   query is [id:x]

 recommend based on behavior with regard to A items and actions h_a
   query is [b-a-links: h_a]

 recommend based on behavior with regard to B items and actions h_b
   query is [b-b-links: h_b]

 recommend based on a single item with id = x
query is [b-b-links: x]

 recommend based on composite behavior composed of h_a and h_b
query is [b-a-links: h_a b-b-links: h_b]

 Does this make sense by being more explicit?

 Now, it is pretty clear that we could have an index of A objects as well
 but the link fields would have to be a-a-links and a-b-links, of course.




 On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  Assuming Ted needs to call it, not sure if an invite has gone out, I
  haven't seen one.
 
  On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
 
  I am planning on sitting in as flaky connection allows.
  On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
  We doing a hangout at 2 on the Solr recommender?
 
 
 




Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
I put some thought into this (actually I slept on it) and I think the answer is 
in the math.

-- A = matrix of action2 by user, used for cross-action recommendations, for 
instance action2 = views.
-- B = matrix of action1 by user, these are the primary recommender's actions, 
for instance action1 = purchases.
-- H_a1 = all user history of action1 in column vectors. This may be all 
action1's recorded and so may = B' or it may have truncated history to get more 
recent activity in recs.
-- H_a2 = all user history of action2 in column vectors. This may be all 
action2's recorded and so may = A' or it may have truncated history to get more 
recent activity in recs.
-- [B'B]H_a1 = R_a1, recommendations from action1. Recommendations are for 
action1.
-- [B'A]H_a2 = R_a2, recommendations calculated from action2 where there was 
also an action1. Recommendations are for action1. 
-- R_a1 + R_a2 = R, assumes a non-weighted linear combination; ideally they are 
weighted to optimize results. (A toy numeric sketch follows below.)
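
As a toy illustration of the two terms above (not from the repo), with tiny made-up 
matrices standing in for [B'B] and [B'A] and plain Java arrays instead of DRMs:

// Toy illustration of R_a1 = [B'B] h_a1, R_a2 = [B'A] h_a2 and R = R_a1 + R_a2.
import java.util.Arrays;

public class ToyCrossRecs {

  static double[] times(double[][] m, double[] v) {
    double[] r = new double[m.length];
    for (int i = 0; i < m.length; i++) {
      for (int j = 0; j < v.length; j++) {
        r[i] += m[i][j] * v[j];
      }
    }
    return r;
  }

  public static void main(String[] args) {
    // hypothetical items: 0=iphone, 1=ipad, 2=galaxy; values are made up
    double[][] bTb = { {2, 1, 0}, {1, 2, 1}, {0, 1, 1} };   // stands in for [B'B]
    double[][] bTa = { {1, 1, 0}, {0, 1, 1}, {0, 0, 1} };   // stands in for [B'A]
    double[] hA1 = {1, 0, 0};   // one user's action1 (e.g. purchase) history
    double[] hA2 = {0, 1, 1};   // the same user's action2 (e.g. view) history

    double[] rA1 = times(bTb, hA1);
    double[] rA2 = times(bTa, hA2);

    double[] r = new double[rA1.length];   // unweighted linear combination
    for (int i = 0; i < r.length; i++) {
      r[i] = rA1[i] + rA2[i];
    }
    System.out.println(Arrays.toString(r)); // ranked by value, these are the action1 recs
  }
}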

The query on [B'A] will be column vectors from H_a2. Each is a user's history 
of action2 on A items. That is, if there were different items in A than in B, the 
query would be composed of those items and run against the field that contains 
those items. This brings up a bunch of other questions but for now we do not 
have separate items.

It illustrates the fact that the query is user history of action2 so the items 
(though they have the same ID space in this case) should be from A or there 
would be no hits.

Therefore we need the columns of [B'A], and [B'B]. [B'B] is symmetric so rows 
are the same as columns.

The confusion may come from the fact that Ted's mental model does not have the 
same items for both A and B. So the document ID cannot = item ID since the docs 
contain items from both item ID spaces. In which case I don't know why they 
would be in the same doc at all but that is another discussion. This model does 
not allow us to fetch a doc by ID.

But in our case, since we have the same IDs in A and B, we can put them in a doc 
with ID = item ID; the field similar_items can contain items from the B 
similarityMatrix rows, since they are the same as columns, and the 
cross_action_similar_items field will contain columns from [B'A].

This may just be mental looping--sleep only works about 50% of the time for me so 
maybe someone else can check this reasoning. Have a look at the data here 
https://github.com/pferrel/solr-recommender/blob/master/src/test/resources/Recommender%20Math.xlsx


On Aug 1, 2013, at 6:00 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Yes, storing the similar_items in a field, cross_action_similar_items in 
another field all on the same doc ided by item ID. Agree that there may be 
other fields.

Storing the rows of [B'B] is ok because it's symmetric. However we did talk 
about the [B'A] case and I thought we agreed to store the rows there too 
because they were from Bs items. This was the discussion about having different 
items for cross actions. The excerpt below is Ted responding to my question. So 
do we want the columns of [B'A]? It's only a transpose away.


 On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:
 [B'A] =
         iphone  ipad    nexus   galaxy  surface
 iphone  2       2       2       1       0
 ipad    2       2       2       1       0
 nexus   1       1       1       1       0
 galaxy  1       1       1       1       0
 surface 0       0       0       0       1
 
 The rows are what we want from [B'A] since the row items are from B, right?
 
 Yes.
 
 It is easier to understand if you have different kinds of items as well as 
 different actions.  For instance, suppose that you have user x query terms 
 (A) and user x device (B).  B'A is then device x term so that there is a row 
 per device and the row contains terms.  This is good when searching for 
 devices using terms.


Talking about getting the actual doc field values, which will include the 
similar_items field and other metadata. The actual ids in the similar_items 
field work well for anonymous/no-history recs but maybe there is a second query 
or fetch that I'm missing? I assumed that a fetch of the doc and it's fields  
by item ID was as fast a way to do this as possible. If there is some way to 
get the same result by doing a query that is faster, I'm all for it?

Can do tomorrow at 2.



Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
Apologies for thrashing--definitely doing some mental looping but look at the 
cross-similarities on the Template sheet of the Excel file. The rows of [B'A] 
intuitively look best.

Specifically there was a user who viewed the Surface and Nexus but the columns 
do not account for that, the rows do.

Going from rows to columns is the trivial addition of a transpose so I'm going 
to go ahead with rows for now. This affects the cross_action_similar_items and 
so only the cross-recommender part of the whole.

On Aug 2, 2013, at 8:00 AM, Pat Ferrel pat.fer...@gmail.com wrote:

I put some thought into this (actually I slept on it) and I think the answer is 
in the math.

-- A = matrix of action2 by user, used for cross-action recommendations, for 
instance action2 = views.
-- B = matrix of action1 by user, these are the primary recommenders actions, 
for instance action1 = purchases.
-- H_a1 = all user history of action1 in column vectors. This may be all 
action1's recorded and so may = B' or it may have truncated history to get more 
recent activity in recs.
-- H_a2 = all user history of action2 in column vectors. This may be all 
action2's recorded and so may = A' or it may have truncated history to get more 
recent activity in recs.
-- [B'B]H_a1 = R_a1, recommendations from action1. Recommendation are for 
action1.
-- [B'A]H_a2 = R_a2, recommendations calculated from action2 where there was 
also an action1. recommendation are for action1. 
-- R_a1+ R_a2 = R, assumes a non-weighted linear combination, ideally they are 
weighted to optimize results.

The query on [B'A] will be column vectors from  H_a2. Each is a user's  history 
of action2 on A items. That is if there were different items in A than B then 
the query would be comprised of those items and against the field that contains 
those items. This brings up a bunch of other questions but for now we do not 
have separate items.

It illustrates the fact that the query is user history of action2 so the items 
(though they have the same ID space in this case) should be from A or there 
would be no hits.

Therefore we need the columns of [B'A], and [B'B]. [B'B] is symmetric so rows 
are the same as columns.

The confusion may come from the fact that Ted's mental model does not have the 
same items for both A and B. So the document ID cannot = item ID since the docs 
contain items from both item ID spaces. In which case I don't know why they 
would be in the same doc at all but that is another discussion. This model does 
not allow us to fetch a doc by ID.

But in our case since we have the same IDs in A and B we can put them in a doc 
of ID=item ID, the field similair_items can contain items from B 
similarityMatrix rows since they are the same as columns, the 
cross_action_similar_items field will contain columns from [B'A]

This may just be mental looping--sleep only work about 50% of the time for me 
so maybe someone else can check this reasoning. Have a look at the data here 
https://github.com/pferrel/solr-recommender/blob/master/src/test/resources/Recommender%20Math.xlsx


On Aug 1, 2013, at 6:00 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Yes, storing the similar_items in a field, cross_action_similar_items in 
another field all on the same doc ided by item ID. Agree that there may be 
other fields.

Storing the rows of [B'B] is ok because it's symmetric. However we did talk 
about the [B'A] case and I thought we agreed to store the rows there too 
because they were from Bs items. This was the discussion about having different 
items for cross actions. The excerpt below is Ted responding to my question. So 
do we want the columns of [B'A]? It's only a transpose away.


 On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:
 [B'A] =
         iphone  ipad    nexus   galaxy  surface
 iphone  2       2       2       1       0
 ipad    2       2       2       1       0
 nexus   1       1       1       1       0
 galaxy  1       1       1       1       0
 surface 0       0       0       0       1
 
 The rows are what we want from [B'A] since the row items are from B, right?
 
 Yes.
 
 It is easier to understand if you have different kinds of items as well as 
 different actions.  For instance, suppose that you have user x query terms 
 (A) and user x device (B).  B'A is then device x term so that there is a row 
 per device and the row contains terms.  This is good when searching for 
 devices using terms.


Talking about getting the actual doc field values, which will include the 
similar_items field and other metadata. The actual ids in the similar_items 
field work well for anonymous/no-history recs but maybe there is a second query 
or fetch that I'm missing? I assumed that a fetch of the doc and it's fields  
by item ID was as fast a way to do this as possible. If there is some way to 
get the same result by doing a query that is faster, I'm all for it?

Can do tomorrow at 2.




Re: Setting up a recommender

2013-08-02 Thread B Lyon
I think the sheet is very helpful.

 I was wondering about having at least one of the examples be where the
actions deal with completely different things to maybe make it easier for
newbies like me to grok the main points: purchases of items of type blah
and views of videos, say.  I think the input file has the same setup etc.

I don't get the issue/questions that come up when we do have separate
items.  And I thought Ted mentioned at one point that the weighting of
recommendation vectors might not be necessary based on some kind of solr
magic, but I have no idea what that is.

Btw, i was already thinking of doing something for my own
clarification/edification that is similar to your spreadsheet, but would be
a web page where a mouseover on one piece highlights the other pieces that
generated it... E.g. The way the links in this pagerank explorer highlight
the relevant portions of the google matrix (
https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/). There are lots
of other different pieces here of course, but it would show connections
soup-to-nuts as much as possible.



On Friday, August 2, 2013, Pat Ferrel wrote:

 I put some thought into this (actually I slept on it) and I think the
 answer is in the math.

 -- A = matrix of action2 by user, used for cross-action recommendations,
 for instance action2 = views.
 -- B = matrix of action1 by user, these are the primary recommenders
 actions, for instance action1 = purchases.
 -- H_a1 = all user history of action1 in column vectors. This may be all
 action1's recorded and so may = B' or it may have truncated history to get
 more recent activity in recs.
 -- H_a2 = all user history of action2 in column vectors. This may be all
 action2's recorded and so may = A' or it may have truncated history to get
 more recent activity in recs.
 -- [B'B]H_a1 = R_a1, recommendations from action1. Recommendation are for
 action1.
 -- [B'A]H_a2 = R_a2, recommendations calculated from action2 where there
 was also an action1. recommendation are for action1.
 -- R_a1+ R_a2 = R, assumes a non-weighted linear combination, ideally they
 are weighted to optimize results.

 The query on [B'A] will be column vectors from  H_a2. Each is a user's
  history of action2 on A items. That is if there were different items in A
 than B then the query would be comprised of those items and against the
 field that contains those items. This brings up a bunch of other questions
 but for now we do not have separate items.

 It illustrates the fact that the query is user history of action2 so the
 items (though they have the same ID space in this case) should be from A or
 there would be no hits.

 Therefore we need the columns of [B'A], and [B'B]. [B'B] is symmetric so
 rows are the same as columns.

 The confusion may come from the fact that Ted's mental model does not have
 the same items for both A and B. So the document ID cannot = item ID since
 the docs contain items from both item ID spaces. In which case I don't know
 why they would be in the same doc at all but that is another discussion.
 This model does not allow us to fetch a doc by ID.

 But in our case since we have the same IDs in A and B we can put them in a
 doc of ID=item ID, the field similair_items can contain items from B
 similarityMatrix rows since they are the same as columns, the
 cross_action_similar_items field will contain columns from [B'A]

 This may just be mental looping--sleep only work about 50% of the time for
 me so maybe someone else can check this reasoning. Have a look at the data
 here
 https://github.com/pferrel/solr-recommender/blob/master/src/test/resources/Recommender%20Math.xlsx


 On Aug 1, 2013, at 6:00 PM, Pat Ferrel pat.fer...@gmail.com
 wrote:

 Yes, storing the similar_items in a field, cross_action_similar_items in
 another field all on the same doc ided by item ID. Agree that there may be
 other fields.

 Storing the rows of [B'B] is ok because it's symmetric. However we did
 talk about the [B'A] case and I thought we agreed to store the rows there
 too because they were from Bs items. This was the discussion about having
 different items for cross actions. The excerpt below is Ted responding to
 my question. So do we want the columns of [B'A]? It's only a transpose away.


  On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel 
  p...@occamsmachete.com
 wrote:
  [B'A] =
          iphone  ipad    nexus   galaxy  surface
  iphone  2       2       2       1       0
  ipad    2       2       2       1       0
  nexus   1       1       1       1       0
  galaxy  1       1       1       1       0
  surface 0       0       0       0       1
 
  The rows are what we want from [B'A] since the row items are from B,
 right?
 
  Yes.
 
  It is easier to understand if you have different kinds of items as well
 as different actions.  For instance, suppose that you have user x query
 terms (A) and user x device (B).  B'A is then device x term so that there
 is a row per device and the 

Re: Setting up a recommender

2013-08-02 Thread Andrew Psaltis
On 8/2/13 12:13 PM, B Lyon bradfl...@gmail.com wrote:


The way the links in this pagerank explorer highlight
the relevant portions of the google matrix (
https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/). There are
lots

That is pretty darn cool, great job!



Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
This first cut project explicitly assumes a unified user and item space. This 
works well for many action pairs, not for others. The reason I did this to 
begin with was for using multiple actions for ecom recs. Views were not very 
predictive of purchases alone and needed the cross-recommender treatment. We 
did this using Mahout matrix math so the issue of what to write to Solr did not 
come up. It worked fine, but now we find the need for an online method that will 
make use of realtime-generated preferences, i.e. ones not in the batch training 
data.

The math still works for multiple item spaces but users must be in common. More 
generally the rank and ID space currently associated with users must be the 
same.

Feel free to create examples if you want. Ted has some ideas for using multiple 
item spaces in presos that are on Slideshare I think.


On Aug 2, 2013, at 10:13 AM, B Lyon bradfl...@gmail.com wrote:

I think the sheet is very helpful.

I was wondering about having at least one of the examples be where the
actions deal with completely different things to maybe make it easier for
newbies like me to grok the main points: purchases of items of type blah
and views of videos, say.  I think the input file has the same setup etc.

I don't get the issue/questions that come up when we do have separate
items.  And I thought Ted mentioned at one point that the weighting of
recommendation vectors might not be necessary based on some kind of solr
magic, but I have no idea what that is.

Btw, i was already thinking of doing something for my own
clarification/edification that is similar to your spreadsheet, but would be
a web page where a mouseover on one piece highlights the other pieces that
generated it... E.g. The way the links in this pagerank explorer highlight
the relevant portions of the google matrix (
https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/). There are lots
of other different pieces here of course,
but show connections soup-to-nuts as much as possible.



On Friday, August 2, 2013, Pat Ferrel wrote:

 I put some thought into this (actually I slept on it) and I think the
 answer is in the math.
 
 -- A = matrix of action2 by user, used for cross-action recommendations,
 for instance action2 = views.
 -- B = matrix of action1 by user, these are the primary recommenders
 actions, for instance action1 = purchases.
 -- H_a1 = all user history of action1 in column vectors. This may be all
 action1's recorded and so may = B' or it may have truncated history to get
 more recent activity in recs.
 -- H_a2 = all user history of action2 in column vectors. This may be all
 action2's recorded and so may = A' or it may have truncated history to get
 more recent activity in recs.
 -- [B'B]H_a1 = R_a1, recommendations from action1. Recommendation are for
 action1.
 -- [B'A]H_a2 = R_a2, recommendations calculated from action2 where there
 was also an action1. recommendation are for action1.
 -- R_a1+ R_a2 = R, assumes a non-weighted linear combination, ideally they
 are weighted to optimize results.
 
 The query on [B'A] will be column vectors from  H_a2. Each is a user's
 history of action2 on A items. That is if there were different items in A
 than B then the query would be comprised of those items and against the
 field that contains those items. This brings up a bunch of other questions
 but for now we do not have separate items.
 
 It illustrates the fact that the query is user history of action2 so the
 items (though they have the same ID space in this case) should be from A or
 there would be no hits.
 
 Therefore we need the columns of [B'A], and [B'B]. [B'B] is symmetric so
 rows are the same as columns.
 
 The confusion may come from the fact that Ted's mental model does not have
 the same items for both A and B. So the document ID cannot = item ID since
 the docs contain items from both item ID spaces. In which case I don't know
 why they would be in the same doc at all but that is another discussion.
 This model does not allow us to fetch a doc by ID.
 
 But in our case since we have the same IDs in A and B we can put them in a
 doc of ID=item ID, the field similair_items can contain items from B
 similarityMatrix rows since they are the same as columns, the
 cross_action_similar_items field will contain columns from [B'A]
 
 This may just be mental looping--sleep only work about 50% of the time for
 me so maybe someone else can check this reasoning. Have a look at the data
 here
 https://github.com/pferrel/solr-recommender/blob/master/src/test/resources/Recommender%20Math.xlsx
 
 
 On Aug 1, 2013, at 6:00 PM, Pat Ferrel pat.fer...@gmail.com
 wrote:
 
 Yes, storing the similar_items in a field, cross_action_similar_items in
 another field all on the same doc ided by item ID. Agree that there may be
 other fields.
 
 Storing the rows of [B'B] is ok because it's symmetric. However we did
 talk about the [B'A] case and I thought we agreed to store the rows there
 too because 

Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
We doing a hangout at 2 on the Solr recommender?


Re: Setting up a recommender

2013-08-02 Thread B Lyon
I am planning on sitting in as flaky connection allows.
On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 We doing a hangout at 2 on the Solr recommender?



Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
Assuming Ted needs to call it, not sure if an invite has gone out, I haven't 
seen one.

On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:

I am planning on sitting in as flaky connection allows.
On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 We doing a hangout at 2 on the Solr recommender?
 



Re: Setting up a recommender

2013-08-02 Thread Ted Dunning
So there is a lot of good discussion here and there were some key ideas.

The first idea is that the *input* to a recommender is on the right in the
matrix notation.  This refers inherently to the id's on the columns of the
recommender product (either B'B or B'A).  The columns are defined by the
right hand element of the product (either B or A in the B'B and B'A
respectively).

The results are in the row space and are defined by the left hand operand
of the product.  In the case of B'A and B'B, the left hand operand is B in
both cases so the row space is consistent.

In order to implement this in a search engine, we need documents that
correspond to rows of B'A or B'B.  These are the same as the columns of B.
 The fields of the documents will necessarily include the following:

id: the column id from B corresponding to this item
description: presentation info ... yada yada
b-a-links: contents of this row of B'A expressed as id's from the column
space of A where this row  of llr-filter(B'A) contains a
non-zero value.
b-b-links: contents of this row of B'B expressed as id's from the column
space of B ...


The following operations are now single queries:

get an item where id = x
   query is [id:x]

recommend based on behavior with regard to A items and actions h_a
   query is [b-a-links: h_a]

recommend based on behavior with regard to B items and actions h_b
   query is [b-b-links: h_b]

recommend based on a single item with id = x
query is [b-b-links: x]

recommend based on composite behavior composed of h_a and h_b
query is [b-a-links: h_a b-b-links: h_b]

Does this make sense by being more explicit?

Now, it is pretty clear that we could have an index of A objects as well
but the link fields would have to be a-a-links and a-b-links, of course.
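
For concreteness, a minimal SolrJ sketch of the composite-behavior query above
(this uses the SolrJ 4.x-era client; the URL is illustrative, and underscores
replace the hyphens in the field names so the default query parser stays happy):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CompositeQuerySketch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // h_a = "iphone ipad" (A item history), h_b = "ipad" (B item history)
    SolrQuery q = new SolrQuery("b_a_links:(iphone ipad) b_b_links:(ipad)");
    q.setRows(10);
    QueryResponse rsp = solr.query(q);
    for (SolrDocument doc : rsp.getResults()) {
      // hits come back ranked by relevance, i.e. best recommendations first
      System.out.println(doc.getFieldValue("id"));
    }
  }
}

The single-item and single-action cases are just the same query with fewer
terms or only one field.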




On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Assuming Ted needs to call it, not sure if an invite has gone out, I
 haven't seen one.

 On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:

 I am planning on sitting in as flaky connection allows.
 On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  We doing a hangout at 2 on the Solr recommender?
 




Re: Setting up a recommender

2013-08-02 Thread Sebastian Schelter
I really like this approach, especially as it makes it possible to
individually recompute and update certain similarity matrices. Furthermore
it should enable rapid experimentation as it's super easy to retrieve
recommendations based on different behaviors.

2013/8/2 Ted Dunning ted.dunn...@gmail.com

 So there is a lot of good discussion here and there were some key ideas.

 The first idea is that the *input* to a recommender is on the right in the
 matrix notation.  This refers inherently to the id's on the columns of the
 recommender product (either B'B or B'A).  The columns are defined by the
 right hand element of the product (either B or A in the B'B and B'A
 respectively).

 The results are in the row space and are defined by the left hand operand
 of the product.  IN the case of B'A and B'B, the left hand operand is B in
 both cases so the row space is consistent.

 In order to implement this in a search engine, we need documents that
 correspond to rows of B'A or B'B.  These are the same as the columns of B.
  The fields of the documents will necessarily include the following:

 id: the column id from B corresponding to this item
 description: presentation info ... yada yada
 b-a-links: contents of this row of B'A expressed as id's from the column
 space of A where this row  of llr-filter(B'A) contains a
 non-zero value.
 b-b-links: contents of this row of B'B expressed as id's from the column
 space of B ...


 The following operations are now single queries:

 get an item where id = x
query is [id:x]

 recommend based on behavior with regard to A items and actions h_a
query is [b-a-links: h_a]

 recommend based on behavior with regard to B items and actions h_b
query is [b-b-links: h_b]

 recommend based on a single item with id = x
 query is [b-b-links: x]

 recommend based on composite behavior composed of h_a and h_b
 query is [b-a-links: h_a b-b-links: h_b]

 Does this make sense by being more explicit?

 Now, it is pretty clear that we could have an index of A objects as well
 but the link fields would have to be a-a-links and a-b-links, of course.




 On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  Assuming Ted needs to call it, not sure if an invite has gone out, I
  haven't seen one.
 
  On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
 
  I am planning on sitting in as flaky connection allows.
  On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
   We doing a hangout at 2 on the Solr recommender?
  
 
 



Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
Thanks, well put.

In order to have the ultimate impl with two id spaces for A and B, would we have 
to create different docs for A'B and B'B, since the doc IDs must come from A 
or B? The fields can contain different sets of IDs but the doc ID must be one 
or the other, right? Doesn't this imply separate indexes for the separate A and B 
item ID spaces? This is not a question for this first cut impl but is a 
generalization question.

On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:

So there is a lot of good discussion here and there were some key ideas.

The first idea is that the *input* to a recommender is on the right in the
matrix notation.  This refers inherently to the id's on the columns of the
recommender product (either B'B or B'A).  The columns are defined by the
right hand element of the product (either B or A in the B'B and B'A
respectively).

The results are in the row space and are defined by the left hand operand
of the product.  IN the case of B'A and B'B, the left hand operand is B in
both cases so the row space is consistent.

In order to implement this in a search engine, we need documents that
correspond to rows of B'A or B'B.  These are the same as the columns of B.
The fields of the documents will necessarily include the following:

id: the column id from B corresponding to this item
description: presentation info ... yada yada
b-a-links: contents of this row of B'A expressed as id's from the column
space of A where this row  of llr-filter(B'A) contains a
non-zero value.
b-b-links: contents of this row of B'B expressed as id's from the column
space of B ...


The following operations are now single queries:

get an item where id = x
  query is [id:x]

recommend based on behavior with regard to A items and actions h_a
  query is [b-a-links: h_a]

recommend based on behavior with regard to B items and actions h_b
  query is [b-b-links: h_b]

recommend based on a single item with id = x
   query is [b-b-links: x]

recommend based on composite behavior composed of h_a and h_b
   query is [b-a-links: h_a b-b-links: h_b]

Does this make sense by being more explicit?

Now, it is pretty clear that we could have an index of A objects as well
but the link fields would have to be a-a-links and a-b-links, of course.




On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Assuming Ted needs to call it, not sure if an invite has gone out, I
 haven't seen one.
 
 On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
 
 I am planning on sitting in as flaky connection allows.
 On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 We doing a hangout at 2 on the Solr recommender?
 
 
 



Re: Setting up a recommender

2013-08-02 Thread Pat Ferrel
Got away with that stupid comment. All doc ids will be from B items even in the 
general case.

On Aug 2, 2013, at 2:39 PM, Pat Ferrel p...@occamsmachete.com wrote:

Thanks, well put.

In order to have the ultimate impl with two id spaces for A and B would we have 
to create different docs for A'B and B'B? Since the docs IDs must come from A 
or B? The fields can contain different sets of IDs but the Doc ID must be one 
or the other, right? Doesn't this imply separate indexes for the separate A, B 
item IDs spaces? This is not a question for this first cut impl but is a 
generalization question.

On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:

So there is a lot of good discussion here and there were some key ideas.

The first idea is that the *input* to a recommender is on the right in the
matrix notation.  This refers inherently to the id's on the columns of the
recommender product (either B'B or B'A).  The columns are defined by the
right hand element of the product (either B or A in the B'B and B'A
respectively).

The results are in the row space and are defined by the left hand operand
of the product.  IN the case of B'A and B'B, the left hand operand is B in
both cases so the row space is consistent.

In order to implement this in a search engine, we need documents that
correspond to rows of B'A or B'B.  These are the same as the columns of B.
The fields of the documents will necessarily include the following:

id: the column id from B corresponding to this item
description: presentation info ... yada yada
b-a-links: contents of this row of B'A expressed as id's from the column
space of A where this row  of llr-filter(B'A) contains a
non-zero value.
b-b-links: contents of this row of B'B expressed as id's from the column
space of B ...


The following operations are now single queries:

get an item where id = x
 query is [id:x]

recommend based on behavior with regard to A items and actions h_a
 query is [b-a-links: h_a]

recommend based on behavior with regard to B items and actions h_b
 query is [b-b-links: h_b]

recommend based on a single item with id = x
  query is [b-b-links: x]

recommend based on composite behavior composed of h_a and h_b
  query is [b-a-links: h_a b-b-links: h_b]

Does this make sense by being more explicit?

Now, it is pretty clear that we could have an index of A objects as well
but the link fields would have to be a-a-links and a-b-links, of course.




On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Assuming Ted needs to call it, not sure if an invite has gone out, I
 haven't seen one.
 
 On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:
 
 I am planning on sitting in as flaky connection allows.
 On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 We doing a hangout at 2 on the Solr recommender?
 
 
 




Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Not following so…

Here is what I've done, in probably too much detail:

1) ingest raw log files and split them up by action
2) turn these into Mahout preference files using Mahout type IDs, keeping a map 
of IDs
3) run the Mahout Item-based recommender using LLR for similarity
4) create a Mahout-style cross-recommender using cooccurrence similarity computed 
with matrix math
5) given two similarity matrices and a user history matrix, I am writing them to 
csv files with the Mahout IDs replaced by the original string external IDs for users 
and items

input log file before splitting:
u1  purchase  iphone
u1  purchase  ipad
u2  purchase  nexus-tablet
u2  purchase  galaxy
u3  purchase  surface
u4  purchase  iphone
u4  purchase  ipad
u1  view  iphone
u1  view  ipad
u1  view  nexus-tablet
u1  view  galaxy
u2  view  iphone
u2  view  ipad
u2  view  nexus-tablet
u2  view  galaxy
u3  view  surface
u4  view  iphone
u4  view  ipad
u4  view  nexus-tablet


Input user history DRM after ID translation to Mahout IDs and splitting for 
the purchase action

B   user/item   iphone  ipad  nexus-tablet  galaxy  surface
u1  1   1   0   0   0
u2  0   0   1   1   0
u3  0   0   0   0   1
u4  1   1   0   0   0

Map of IDs Mahout to Original/External
0 - iphone
1 - ipad
2 - nexus-tablet
3 - galaxy
4 - surface

To be specific the DRM from the RecommenderJob with item-item similarities 
using LLR looks like this:
Input Path: out/p-recs/sims/part-r-0
Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
org.apache.mahout.math.VectorWritable
Key: 0: Value: {1:0.8472157541208549}
Key: 1: Value: {0:0.8472157541208549}
Key: 2: Value: {3:0.8181382096075936}
Key: 3: Value: {2:0.8181382096075936}
Key: 4: Value: {}

This will be written to a directory for later Solr indexing as a csv of the 
form:
item_id,similar_items,cross_action_similar_items
iphone,ipad,
ipad,iphone,
nexus-tablet,galaxy,
galaxy,nexus-tablet,
surface,,
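
For reference, a rough sketch of how such a csv could be written from one
similarity DRM (class name, method signature, and the external ID map are
illustrative, not the project's actual code; the cross_action_similar_items
column would be filled by joining the [B'A] rows the same way):

import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class SimsToCsv {
  // simsPart: a part file of the similarity DRM; itemIndex: Mahout int id -> external id
  public static void write(Path simsPart, Map<Integer, String> itemIndex, PrintWriter out)
      throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, simsPart, conf);
    IntWritable key = new IntWritable();
    VectorWritable value = new VectorWritable();
    out.println("item_id,similar_items");
    while (reader.next(key, value)) {
      Vector row = value.get();
      // copy out (strength, column) pairs; the iterator may reuse its Element object
      List<double[]> pairs = new ArrayList<double[]>();
      for (Iterator<Vector.Element> it = row.iterateNonZero(); it.hasNext(); ) {
        Vector.Element e = it.next();
        pairs.add(new double[] {e.get(), e.index()});
      }
      // order the items in the field by similarity strength, strongest first
      Collections.sort(pairs, new Comparator<double[]>() {
        public int compare(double[] a, double[] b) { return Double.compare(b[0], a[0]); }
      });
      StringBuilder field = new StringBuilder();
      for (double[] p : pairs) {
        if (field.length() > 0) field.append(' ');
        field.append(itemIndex.get((int) p[1]));
      }
      out.println(itemIndex.get(key.get()) + "," + field);
    }
    reader.close();
  }
}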

By using a user's history vector as a query you get results = recommendations
So if the user is u1, the history vector is:
iphone ipad

The Solr results for query iphone ipad using field similar_items will be 
1. Doc ID, ipad
2. Doc ID, iphone

If you want item similarities, for instance when a user is anonymous with no 
history and is looking at an iphone product page, you would fetch the doc for 
id = iphone and get:
ipad

Perhaps a bad example for ordering, since there is only one ID in the doc but 
the items in the similar_items field would be ordered by similarity strength. 

Likewise for the cross-action similarities though the matrix will have 
cooccurrence [B'A] values in the DRM.

For item similarities there is no need to do more than fetch one doc that 
contains the similarities, right? I've successfully used this method with the 
Mahout recommender but please correct me if something above is wrong. 


On Jul 31, 2013, at 4:52 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Pat,

See inline


On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel p...@occamsmachete.com wrote:

 So the XML as CSV would be:
 item_id,similar_items,cross_action_similar_items
 ipad,iphone,iphone nexus
 iphone,ipad,ipad galaxy
 

Right.  Doesn't matter what format.  Might want quotes around space
delimited lists, but anything will do.


 
 Note: As I mentioned before the order of the items in the field will
 encode rank of the similarity strength. This is for cases where you want to
 find similar items to a context item. You would fetch the doc for the
 context item by it's item ID and show the top k items in the doc. Ted's
 caveat would probably be to dither them.
 

I always say dither so that is an easy one.

But fetching similar items of a center item by fetching the center item and
then fetching each of the referenced items is typically slower by about 2x
than running the search for mentions of the center item.


 Sounds like Ted is generating data. Andrew or M Lyon do either of you want
 to set the demo system up? If so you'll need to find a system--free tier
 AWS, Ted's box, etc. Then install all the needed stuff.
 
 I'll get the output working to csv.
 
 On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 OK and yes. The docs will look like:
 
 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>
 
 
 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:
 
 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the 

Re: Setting up a recommender

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote:


 For item similarities there is no need to do more than fetch one doc that
 contains the similarities, right? I've successfully used this method with
 the Mahout recommender but please correct me if something above is wrong.


No.

First, you need to retrieve all the other documents that are referenced to
get their display meta-data. So this isn't just a one document fetch.

Second, the similar items point inwards, not outwards.  Thus, the query you
want has the id of the current item and searches the similar_items field.
 The result of that search is all of the similar items.

The confusion here may stem from the name of the field.  A name like
linked-from-items or some such might help here.


Another way to look at this is that there should be no procedural
difference if you have 10 items or 20 in your history.  Either way, your
history is a query against the appropriate link fields.  Likewise, there
should be no difference between having 10 items or 2 items in your history.
 There shouldn't even be any difference if you have even just 1 item in
your history.

Finding items similar to a single item is exactly like having 1 item in
your history.  So that should be done by searching with that one item in
the appropriate link fields.
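
In SolrJ terms, the difference between the two styles is roughly this (client,
URL and field names follow the earlier examples and are illustrative only):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SimilarItemsSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // Fetch-the-doc style: one doc comes back, and every id listed in its
    // similar_items field still needs another lookup for display meta-data.
    QueryResponse oneDoc = solr.query(new SolrQuery("item_id:iphone"));

    // Search style: the hits are the similar items themselves, meta-data
    // included, ranked by the engine.
    QueryResponse similar = solr.query(new SolrQuery("similar_items:iphone"));
    for (SolrDocument doc : similar.getResults()) {
      System.out.println(doc.getFieldValue("item_id"));
    }
  }
}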


Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Sorry to be dense but I think there is some miscommunication. The most 
important question is: am I writing the item-item similarity matrix DRM out to 
Solr, one row = one Solr doc? For the mapreduce Mahout Item-based recommender 
this is in tmp/similarityMatrix. If not then please stop me. If I'm off base 
here, maybe a skype or im session will straighten me out. pat.fer...@gmail.com 
or p...@occamsmachete.com


To be clear below I'm not talking about history based recs, which is the 
primary use case. I am talking about a query that does not use history, that 
only finds similar items based on training data. The item-item similarity 
matrix DRM contains Key = item ID, Value = list of item IDs with similarity 
strengths.

This is equivalent to the list returned by ItemBasedRecommender's
public List<RecommendedItem> mostSimilarItems(long itemID, int howMany) throws 
TasteException

Specified by:
mostSimilarItems in interface ItemBasedRecommender

Parameters:
itemID - ID of item for which to find most similar other items
howMany - desired number of most similar items to find

Returns:
items most similar to the given item, ordered from most similar to least

To get the list from Solr you would fetch the doc associated with itemID, no? 

When using the Mahout mapreduce item-based recommender we get the similarity 
matrix and do just that. We get the row associated with the Mahout itemID and 
recommend the top k items from the vector. This performs well in 
cross-validation tests.
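
For comparison, the in-memory (non-mapreduce) Taste API exposes the same
operation directly; a minimal sketch, assuming a prefs.csv in Taste's
userID,itemID[,pref] format and an illustrative item id:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class MostSimilarDemo {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("prefs.csv"));
    GenericItemBasedRecommender rec =
        new GenericItemBasedRecommender(model, new LogLikelihoodSimilarity(model));
    // items most similar to item 42, ordered from most similar to least
    List<RecommendedItem> similar = rec.mostSimilarItems(42L, 10);
    for (RecommendedItem item : similar) {
      System.out.println(item.getItemID() + "\t" + item.getValue());
    }
  }
}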



On Aug 1, 2013, at 9:49 AM, Ted Dunning ted.dunn...@gmail.com wrote:

On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote:

 
 For item similarities there is no need to do more than fetch one doc that
 contains the similarities, right? I've successfully used this method with
 the Mahout recommender but please correct me if something above is wrong.


No.

First, you need to retrieve all the other documents that are referenced to
get their display meta-data. So this isn't just a one document fetch.

Second, the similar items point inwards, not outwards.  Thus, the query you
want has the id of the current item and searches the similar_items field.
The result of that search is all of the similar items.

The confusion here may stem from the name of the field.  A name like
linked-from-items or some such might help here.


Another way to look at this is that there should be no procedural
difference if you have 10 items or 20 in your history.  Either way, your
history is a query against the appropriate link fields.  Likewise, there
should be no difference between having 10 items or 2 items in your history.
There shouldn't even be any difference if you have even just 1 item in
your history.

Finding items similar to a single item is exactly like having 1 item in
your history.  So that should be done by searching with that one item in
the appropriate link fields.



Re: Setting up a recommender

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 11:58 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 Sorry to be dense but I think there is some miscommunication. The most
 important question is: am I writing the item-item similarity matrix DRM out
 to Solr, one row = one Solr doc?


Each row = one *field* in a Solr doc.  Different DRM's produce different
fields in the same docs.

There will also be item meta-data in the field.


 For the mapreduce Mahout Item-based recommender this is in
 tmp/similarityMatrix. If not then please stop me. If I'm off base here,
  maybe a skype or im session will straighten me out. pat.ferrel@gmail.com or
 p...@occamsmachete.com


Actually, that is a grand idea.  Let's do a hangout.

From the who-is-free-when survey
(https://docs.google.com/forms/d/1skIaqe0CBWO4qemTyHCZwS40YjXJ9FeLCqwV8cw4Gno/viewform),
it looks like lots of people are available tomorrow at 2PM PDT.

Would that work?

To be clear below I'm not talking about history based recs, which is the
 primary use case. I am talking about a query that does not use history,
 that only finds similar items based on training data. The item-item
 similarity matrix DRM contains Key = item ID, Value = list of item IDs with
 similarity strengths.


Yes.  I absolutely agree that you can do this.

These should, strictly speaking, be columns in the item-item matrix.  The
item-item matrix may or may not be symmetric.  If it is symmetric, then
column or row doesn't matter.


 This is equivalent to the list returned by ItemBasedRecommender's
  public List<RecommendedItem> mostSimilarItems(long itemID, int howMany)
 throws TasteException


Yes.


 Specified by:
 mostSimilarItems in interface ItemBasedRecommender

 Parameters:
 itemID - ID of item for which to find most similar other items
 howMany - desired number of most similar items to find

 Returns:
 items most similar to the given item, ordered from most similar to least

 To get the list from Solr you would fetch the doc associated with
 itemID, no?


If you store the column, then yes.

If you store the row, then using a query on the field containing the
similar items is the right answer.

The key difference that I have is what happens in the next step.

When using the Mahout mapreduce item-based recommender we get the
 similarity matrix and do just that. We get the row associated with the
 Mahout itemID and recommend the top k items from the vector. This performs
 well in cross-validation tests.


Good.

I think that there is a row/column confusion here, but they are probably
nearly identical in your application.

The key point is what happens *after* you do the query that you are
suggesting.

In your case, you have to retrieve the meta-data associated with each of
related items.  I like to store this meta-data in a Solr field (or three)
so this involves at least one additional query.  You can automatically
chain this second query by using the join operation that Solr provides,
but the second query still happens.

If you do the query the way that I suggest, this second query doesn't need
to happen.  You get the meta-data directly.
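
For reference, the chained form can be written with Solr's join query parser
(available since Solr 4.0); field names here follow the csv examples in this
thread:

q={!join from=similar_items to=item_id}item_id:iphone   (chained: resolve the listed ids in one request)
q=similar_items:iphone                                  (direct: the hits are the similar items)

The first matches item_id:iphone, collects that doc's similar_items values, and
returns the docs whose item_id matches them; the second skips the indirection.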








 On Aug 1, 2013, at 9:49 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote:

 
  For item similarities there is no need to do more than fetch one doc that
  contains the similarities, right? I've successfully used this method with
  the Mahout recommender but please correct me if something above is wrong.


 No.

 First, you need to retrieve all the other documents that are referenced to
 get their display meta-data. So this isn't just a one document fetch.

 Second, the similar items point inwards, not outwards.  Thus, the query you
 want has the id of the current item and searches the similar_items field.
 The result of that search is all of the similar items.

 The confusion here may stem from the name of the field.  A name like
 linked-from-items or some such might help here.


 Another way to look at this is that there should be no procedural
 difference if you have 10 items or 20 in your history.  Either way, your
 history is a query against the appropriate link fields.  Likewise, there
 should be no difference between having 10 items or 2 items in your history.
 There shouldn't even be any difference if you have even just 1 item in
 your history.

 Finding items similar to a single item is exactly like having 1 item in
 your history.  So that should be done by searching with that one item in
 the appropriate link fields.




Re: Setting up a recommender

2013-08-01 Thread B Lyon
I am wondering about row/column confusion as well - fleshing out the
doc/design with more specifics (which Pat is kind of doing, basically)
should make things obvious eventually, imo.

The way Pat had phrased it got me to wondering what rationale you use to
rank the results when you are querying the columns (similar column,
similar via action 2 column, etc.).

He had mentioned the auxiliary case of simply getting most similar items to
a given docid by just going to the row for that docid and using the
pre-sorted values in the similar column, and I thought Ted might have
hinted that you could just as well do a solr query of the column with that
single docid as the query; however, in the latter case I wonder if the
order and list itself could be weird, as some items may show up simply
because they are not similar to many things: lower LLR values that got
filtered out of the list for the docid itself won't get filtered when you're
looking at the other "not similar to very many items" things when
generating their list for the solr field.  I guess using an absolute
cutoff for LLR in the filtering could deal with some of this issue.  All
hypothetical at the moment (for me, anyway), as real data might trivially
dismiss some of these concerns as irrelevant.

I think the hangout is a good idea, too, btw, and hope to be able to sit in
if it happens.  Very excited about this approach.

On Thu, Aug 1, 2013 at 6:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Thu, Aug 1, 2013 at 11:58 AM, Pat Ferrel pat.fer...@gmail.com wrote:

  Sorry to be dense but I think there is some miscommunication. The most
  important question is: am I writing the item-item similarity matrix DRM
 out
  to Solr, one row = one Solr doc?


 Each row = one *field* in a Solr doc.  Different DRM's produce different
 fields in the same docs.

 There will also be item meta-data in the field.


  For the mapreduce Mahout Item-based recommender this is in
  tmp/similarityMatrix. If not then please stop me. If I'm off base here,
  maybe a skype or im session will straighten me out.
 pat.ferrel@gmail.com or
  p...@occamsmachete.com


 Actually, that is a grand idea.  Let's do a hangout.

 From the who-is-free-when
 https://docs.google.com/forms/d/1skIaqe0CBWO4qemTyHCZwS40YjXJ9FeLCqwV8cw4Gno/viewform
 survey,
 it looks like lots of people are available tomorrow at 2PM PDT.

 Would that work?

 To be clear below I'm not talking about history based recs, which is the
  primary use case. I am talking about a query that does not use history,
  that only finds similar items based on training data. The item-item
  similarity matrix DRM contains Key = item ID, Value = list of item IDs
 with
  similarity strengths.
 

 Yes.  I absolutely agree that you can do this.

 These should, strictly speaking, be columns in the item-item matrix.  The
 item-item matrix may or may not be symmetric.  If it is symmetric, then
 column or row doesn't matter.


  This is equivalent to the list returned by ItemBasedRecommender's
  public List<RecommendedItem> mostSimilarItems(long itemID, int howMany)
  throws TasteException
 

 Yes.


  Specified by:
  mostSimilarItems in interface ItemBasedRecommender
 
  Parameters:
  itemID - ID of item for which to find most similar other items
  howMany - desired number of most similar items to find
 
  Returns:
  items most similar to the given item, ordered from most similar to least
 
  To get the list from Solr you would fetch the doc associated with
  itemID, no?
 

 If you store the column, then yes.

 If you store the row, then using a query on the field containing the
 similar items is the right answer.

 The key difference that I have is what happens in the next step.

 When using the Mahout mapreduce item-based recommender we get the
  similarity matrix and do just that. We get the row associated with the
  Mahout itemID and recommend the top k items from the vector. This
 performs
  well in cross-validation tests.
 

 Good.

 I think that there is a row/column confusion here, but they are probably
 nearly identical in your application.

 The key point is what happens *after* you do the query that you are
 suggesting.

 In your case, you have to retrieve the meta-data associated with each of
 related items.  I like to store this meta-data in a Solr field (or three)
 so this involves at least one additional query.  You can automatically
 chain this second query by using the join operation that Solr provides,
 but the second query still happens.

 If you do the query the way that I suggest, this second query doesn't need
 to happen.  You get the meta-data directly.





 
 
 
  On Aug 1, 2013, at 9:49 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
  
   For item similarities there is no need to do more than fetch one doc
 that
   contains the similarities, right? I've successfully used this method
 with
   the Mahout recommender but please correct me if something 

Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Yes, storing the similar_items in a field, cross_action_similar_items in 
another field all on the same doc ided by item ID. Agree that there may be 
other fields.

Storing the rows of [B'B] is ok because it's symmetric. However we did talk 
about the [B'A] case and I thought we agreed to store the rows there too 
because they were from Bs items. This was the discussion about having different 
items for cross actions. The excerpt below is Ted responding to my question. So 
do we want the columns of [B'A]? It's only a transpose away.


 On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:
 [B'A] =
          iphone  ipad  nexus  galaxy  surface
 iphone   2       2     2      1       0
 ipad     2       2     2      1       0
 nexus    1       1     1      1       0
 galaxy   1       1     1      1       0
 surface  0       0     0      0       1
 
 The rows are what we want from [B'A] since the row items are from B, right?
 
 Yes.
 
 It is easier to understand if you have different kinds of items as well as 
 different actions.  For instance, suppose that you have user x query terms 
 (A) and user x device (B).  B'A is then device x term so that there is a row 
 per device and the row contains terms.  This is good when searching for 
 devices using terms.


Talking about getting the actual doc field values, which will include the 
similar_items field and other metadata. The actual ids in the similar_items 
field work well for anonymous/no-history recs but maybe there is a second query 
or fetch that I'm missing? I assumed that a fetch of the doc and its fields 
by item ID was as fast a way to do this as possible. If there is some way to 
get the same result by doing a query that is faster, I'm all for it.

Can do tomorrow at 2.

Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
A few architectural questions: http://bit.ly/18vbbaT

I created a local instance of the LucidWorks Search on my dev machine. I can 
quite easily save the similarity vectors from the DRMs into docs at special 
locations and index them with LucidWorks. But to ingest the docs and put them 
in separate fields of the same index we need some new code (unless I've missed 
some Lucid config magic) that does the indexing and integrates with LucidWorks. 

I imagine two indexes. One index for the similarity matrix and optionally the 
cross-similarity matrix in two fields of type 'string'. Another index for 
users' history--we could put the docs there for retrieval by user ID. The user 
history docs then become the query on the similarity index and would return 
recommendations. Or any realtime collected or generated history could be used 
too.

Is this what you imagined Ted? Especially WRT Lucid integration?
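
For the similarity index, the schema.xml fields might look something like the
following sketch (not the actual LucidWorks config; a whitespace-tokenized type
is assumed here so that a multi-item history query matches the individual ids
inside a field):

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="item_id" type="string" indexed="true" stored="true" required="true"/>
<field name="similar_items" type="text_ws" indexed="true" stored="true"/>
<field name="cross_action_similar_items" type="text_ws" indexed="true" stored="true"/>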

Someone could probably donate their free tier EC2 instance and set this up 
pretty easily. Not sure if this would fit given free tier memory but maybe for 
small data sets.

To get this available for actual use we'd need:
1-- An instance with an IP address somewhere to run the ingestion and 
customized LucidWorks Search.
2-- Synthetic data created using Ted's tool.
3-- Customized Solr indexing code for integration with LucidWorks? Not sure how 
this is done. I can do the Solr part but have not looked into Lucid integration 
yet.
4-- Flesh out the rest of Ted's outline but 1-3 will give a minimally running 
example.

Assuming I've got this right, does someone want to help with these?

Another way to approach this is to create a stand alone codebase that requires 
Mahout and Solr and supplies an API something like the proposed Mahout SGD 
online recommender or Myrrix. This would be easier to consume but would lack 
all the UI and inspection code of LucidWorks. 






Re: Setting up a recommender

2013-07-31 Thread Andrew Psaltis
Assuming I've got this right, does someone want to help with these?
Pat -- I would be interested in helping in any way needed. I believe Ted's
tool is a start, but does not handle all the cases envisioned in the design
doc, although I could be wrong on this. Anyway I'm pretty open to helping
wherever needed.

Thanks,
Andrew





On 7/31/13 12:20 PM, Pat Ferrel pat.fer...@gmail.com wrote:

A few architectural questions: http://bit.ly/18vbbaT

I created a local instance of the LucidWorks Search on my dev machine. I
can quite easily save the similarity vectors from the DRMs into docs at
special locations and index them with LucidWorks. But to ingest the docs
and put them in separate fields of the same index we need some new code
(unless I've missed some Lucid config magic) that does the indexing and
integrates with LucidWorks.

I imagine two indexes. One index for the similarity matrix and optionally
the cross-similairty matrix in two fields of type 'string'. Another index
for users' history--we could put the docs there for retrieval by user ID.
The user history docs then become the query on the similarity index and
would return recommendations. Or any realtime collected or generated
history could be used too.

Is this what you imagined Ted? Especially WRT Lucid integration?

Someone could probably donate their free tier EC2 instance and set this
up pretty easily. Not sure if this would fit given free tier memory but
maybe for small data sets.

To get this available for actual use we'd need:
1-- An instance with an IP address somewhere to run the ingestion and
customized LucidWorks Search.
2-- Synthetic data created using Ted's tool.
3-- Customized Solr indexing code for integration with LucidWorks? Not
sure how this is done. I can do the Solr part but have not looked into
Lucid integration yet.
4-- Flesh out the rest of Ted's outline but 1-3 will give a minimally
running example.

Assuming I've got this right, does someone want to help with these?

Another way to approach this is to create a stand alone codebase that
requires Mahout and Solr and supplies an API something like the proposed
Mahout SGD online recommender or Myrrix. This would be easier to consume
but would lack all the UI and inspection code of LucidWorks.







Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
OK, looks like there *is* some magic in the Lucid config. I believe all I need 
to do is  write out the docs using Solr XML defining fields for each similarity 
type and the doc name. The rest can be done by standard Lucid hand 
configuration. I believe this will minimally handle #3 below.


On Jul 31, 2013, at 11:20 AM, Pat Ferrel pat.fer...@gmail.com wrote:

A few architectural questions: http://bit.ly/18vbbaT

I created a local instance of the LucidWorks Search on my dev machine. I can 
quite easily save the similarity vectors from the DRMs into docs at special 
locations and index them with LucidWorks. But to ingest the docs and put them 
in separate fields of the same index we need some new code (unless I've missed 
some Lucid config magic) that does the indexing and integrates with LucidWorks. 

I imagine two indexes. One index for the similarity matrix and optionally the 
cross-similairty matrix in two fields of type 'string'. Another index for 
users' history--we could put the docs there for retrieval by user ID. The user 
history docs then become the query on the similarity index and would return 
recommendations. Or any realtime collected or generated history could be used 
too.

Is this what you imagined Ted? Especially WRT Lucid integration?

Someone could probably donate their free tier EC2 instance and set this up 
pretty easily. Not sure if this would fit given free tier memory but maybe for 
small data sets.

To get this available for actual use we'd need:
1-- An instance with an IP address somewhere to run the ingestion and 
customized LucidWorks Search.
2-- Synthetic data created using Ted's tool.
3-- Customized Solr indexing code for integration with LucidWorks? Not sure how 
this is done. I can do the Solr part but have not looked into Lucid integration 
yet.
4-- Flesh out the rest of Ted's outline but 1-3 will give a minimally running 
example.

Assuming I've got this right, does someone want to help with these?

Another way to approach this is to create a stand alone codebase that requires 
Mahout and Solr and supplies an API something like the proposed Mahout SGD 
online recommender or Myrrix. This would be easier to consume but would lack 
all the UI and inspection code of LucidWorks. 







Re: Setting up a recommender

2013-07-31 Thread B Lyon
I'm interested in helping as well.
Btw I thought that what was stored in the solr fields were the llr-filtered
items (ids I guess) for the could-be-recommended things.
 On Jul 31, 2013 2:31 PM, Andrew Psaltis andrew.psal...@webtrends.com
wrote:

 Assuming I've got this right, does someone want to help with these?
 Pat -- I would be interested in helping in anyway needed. I believe Ted's
 tool is a start, but does not handle all the case envisioned in the design
 doc, although I could be wrong on this. Anyway I'm pretty open to helping
 wherever needed.

 Thanks,
 Andrew





 On 7/31/13 12:20 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 A few architectural questions: http://bit.ly/18vbbaT
 
 I created a local instance of the LucidWorks Search on my dev machine. I
 can quite easily save the similarity vectors from the DRMs into docs at
 special locations and index them with LucidWorks. But to ingest the docs
 and put them in separate fields of the same index we need some new code
 (unless I've missed some Lucid config magic) that does the indexing and
 integrates with LucidWorks.
 
 I imagine two indexes. One index for the similarity matrix and optionally
 the cross-similairty matrix in two fields of type 'string'. Another index
 for users' history--we could put the docs there for retrieval by user ID.
 The user history docs then become the query on the similarity index and
 would return recommendations. Or any realtime collected or generated
 history could be used too.
 
 Is this what you imagined Ted? Especially WRT Lucid integration?
 
 Someone could probably donate their free tier EC2 instance and set this
 up pretty easily. Not sure if this would fit given free tier memory but
 maybe for small data sets.
 
 To get this available for actual use we'd need:
 1-- An instance with an IP address somewhere to run the ingestion and
 customized LucidWorks Search.
 2-- Synthetic data created using Ted's tool.
 3-- Customized Solr indexing code for integration with LucidWorks? Not
 sure how this is done. I can do the Solr part but have not looked into
 Lucid integration yet.
 4-- Flesh out the rest of Ted's outline but 1-3 will give a minimally
 running example.
 
 Assuming I've got this right, does someone want to help with these?
 
 Another way to approach this is to create a stand alone codebase that
 requires Mahout and Solr and supplies an API something like the proposed
 Mahout SGD online recommender or Myrrix. This would be easier to consume
 but would lack all the UI and inspection code of LucidWorks.
 
 
 
 




Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
OK and yes. The docs will look like:

<add>
  <doc>
    <field name='item_id'>ipad</field>
    <field name='similar_items'>iphone</field>
    <field name='cross_action_similar_items'>iphone nexus</field>
  </doc>
  <doc>
    <field name='item_id'>iphone</field>
    <field name='similar_items'>ipad</field>
    <field name='cross_action_similar_items'>ipad galaxy</field>
  </doc>
</add>


On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:

I'm interested in helping as well.
Btw I thought that what was stored in the solr fields were the llr-filtered
items (ids I guess) for the could-be-recommended things.


Re: Setting up a recommender

2013-07-31 Thread Ted Dunning
On Wed, Jul 31, 2013 at 11:20 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 A few architectural questions: http://bit.ly/18vbbaT

 I created a local instance of the LucidWorks Search on my dev machine. I
 can quite easily save the similarity vectors from the DRMs into docs at
 special locations and index them with LucidWorks. But to ingest the docs
 and put them in separate fields of the same index we need some new code
 (unless I've missed some Lucid config magic) that does the indexing and
 integrates with LucidWorks.

 I imagine two indexes. One index for the similarity matrix and optionally
 the cross-similairty matrix in two fields of type 'string'. Another index
 for users' history--we could put the docs there for retrieval by user ID.
 The user history docs then become the query on the similarity index and
 would return recommendations. Or any realtime collected or generated
 history could be used too.

 Is this what you imagined Ted? Especially WRT Lucid integration?


Yes.  And I note in a later email that you discovered how Lucid provides
lots of connectors for different formats.  XML is fine.  I have also used
CSV.


 Someone could probably donate their free tier EC2 instance and set this up
 pretty easily. Not sure if this would fit given free tier memory but maybe
 for small data sets.


It should fit, actually.

I can donate a real-ish machine as well.



 To get this available for actual use we'd need:
 1-- An instance with an IP address somewhere to run the ingestion and
 customized LucidWorks Search.
 2-- Synthetic data created using Ted's tool.
 3-- Customized Solr indexing code for integration with LucidWorks? Not
 sure how this is done. I can do the Solr part but have not looked into
 Lucid integration yet.
 4-- Flesh out the rest of Ted's outline but 1-3 will give a minimally
 running example.

 Assuming I've got this right, does someone want to help with these?


I will work on synthetic data later today.  I have a tool that does this
for drill.  I plan to pull down musicBrainz and use the tags on artists as
hidden variables to drive synthetic user behavior.  Should produce
reasonable looking recommendations.

Another way to approach this is to create a stand alone codebase that
 requires Mahout and Solr and supplies an API something like the proposed
 Mahout SGD online recommender or Myrrix. This would be easier to consume
 but would lack all the UI and inspection code of LucidWorks.


I think that for a demo, the inspection is crucial.

Adding the API is easy and can even be done in the same instance as LW is
running.


Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
The input, which we need synthesized, is a log file (tsv or csv) that looks like 
this:

u1  purchase  iphone
u1  purchase  ipad
u2  purchase  nexus-tablet
u2  purchase  galaxy
u3  purchase  surface
u4  purchase  iphone
u4  purchase  ipad
u1  view  iphone
u1  view  ipad
u1  view  nexus-tablet
u1  view  galaxy
u2  view  iphone
u2  view  ipad
u2  view  nexus-tablet
u2  view  galaxy
u3  view  surface
u4  view  iphone
u4  view  ipad
u4  view  nexus-tablet

This is the example in the github project 
solr-recommender/src/test/resources/logged-preferences/*

The columns can be in any order and can have other columns interspersed.

For testing it would be nice to have one action, two, and several. This 
implementation maps ids in memory, so nothing huge as far as how many ids 
are generated. 

Ted can talk about the distribution of actions.
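
Until the real generator is ready, here is a trivial stand-in that emits lines
in the format above (the counts, the view/purchase split, and the file name are
made-up parameters, and a uniform random draw has none of the structure Ted's
tool would add):

import java.io.PrintWriter;
import java.util.Random;

public class SyntheticLog {
  public static void main(String[] args) throws Exception {
    String[] items = {"iphone", "ipad", "nexus-tablet", "galaxy", "surface"};
    Random rand = new Random(1234);
    PrintWriter out = new PrintWriter("synthetic-log.tsv");
    for (int u = 1; u <= 100; u++) {
      for (int i = 0; i < 10; i++) {
        String item = items[rand.nextInt(items.length)];
        out.println("u" + u + "\tview\t" + item);
        if (rand.nextDouble() < 0.2) {          // a fifth of views lead to a purchase
          out.println("u" + u + "\tpurchase\t" + item);
        }
      }
    }
    out.close();
  }
}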

On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:

I'm interested in helping as well.
Btw I thought that what was stored in the solr fields were the llr-filtered
items (ids I guess) for the could-be-recommended things.
On Jul 31, 2013 2:31 PM, Andrew Psaltis andrew.psal...@webtrends.com
wrote:

 Assuming I've got this right, does someone want to help with these?
 Pat -- I would be interested in helping in anyway needed. I believe Ted's
 tool is a start, but does not handle all the case envisioned in the design
 doc, although I could be wrong on this. Anyway I'm pretty open to helping
 wherever needed.
 
 Thanks,
 Andrew
 
 
 
 
 
 On 7/31/13 12:20 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 A few architectural questions: http://bit.ly/18vbbaT
 
 I created a local instance of the LucidWorks Search on my dev machine. I
 can quite easily save the similarity vectors from the DRMs into docs at
 special locations and index them with LucidWorks. But to ingest the docs
 and put them in separate fields of the same index we need some new code
 (unless I've missed some Lucid config magic) that does the indexing and
 integrates with LucidWorks.
 
 I imagine two indexes. One index for the similarity matrix and optionally
 the cross-similairty matrix in two fields of type 'string'. Another index
 for users' history--we could put the docs there for retrieval by user ID.
 The user history docs then become the query on the similarity index and
 would return recommendations. Or any realtime collected or generated
 history could be used too.
 
 Is this what you imagined Ted? Especially WRT Lucid integration?
 
 Someone could probably donate their free tier EC2 instance and set this
 up pretty easily. Not sure if this would fit given free tier memory but
 maybe for small data sets.
 
 To get this available for actual use we'd need:
 1-- An instance with an IP address somewhere to run the ingestion and
 customized LucidWorks Search.
 2-- Synthetic data created using Ted's tool.
 3-- Customized Solr indexing code for integration with LucidWorks? Not
 sure how this is done. I can do the Solr part but have not looked into
 Lucid integration yet.
 4-- Flesh out the rest of Ted's outline but 1-3 will give a minimally
 running example.
 
 Assuming I've got this right, does someone want to help with these?
 
 Another way to approach this is to create a stand alone codebase that
 requires Mahout and Solr and supplies an API something like the proposed
 Mahout SGD online recommender or Myrrix. This would be easier to consume
 but would lack all the UI and inspection code of LucidWorks.
 
 
 
 
 
 



Re: Setting up a recommender

2013-07-31 Thread Ted Dunning
The fields actually point the other direction.  They contain items which,
if they appear in a history, indicate that the current document is a good
recommendation.

This reversal of roles is what makes search work.

Going the other way works for a single doc, but that only gives a list of
id's which then have to be retrieved.  Better to have the tags for the
single doc on all the related docs so that a single retrieval will pull
them all in with their details.


On Wed, Jul 31, 2013 at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 OK and yes. The docs will look like:

 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>


 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:

 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the could-be-recommended things.



Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
I'd vote for csv then.

On Jul 31, 2013, at 12:00 PM, Ted Dunning ted.dunn...@gmail.com wrote:




On Wed, Jul 31, 2013 at 11:20 AM, Pat Ferrel pat.fer...@gmail.com wrote:
A few architectural questions: http://bit.ly/18vbbaT

I created a local instance of the LucidWorks Search on my dev machine. I can 
quite easily save the similarity vectors from the DRMs into docs at special 
locations and index them with LucidWorks. But to ingest the docs and put them 
in separate fields of the same index we need some new code (unless I've missed 
some Lucid config magic) that does the indexing and integrates with LucidWorks.

I imagine two indexes. One index for the similarity matrix and optionally the 
cross-similairty matrix in two fields of type 'string'. Another index for 
users' history--we could put the docs there for retrieval by user ID. The user 
history docs then become the query on the similarity index and would return 
recommendations. Or any realtime collected or generated history could be used 
too.

Is this what you imagined Ted? Especially WRT Lucid integration?

Yes.  And I note in a later email that you discovered how Lucid provides lots 
of connectors for different formats.  XML is fine.  I have also used CSV.
 



Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
Sorry not sure what you are saying.

If the LLR created DRM has a row:

Key: 0, Value { 1:1.0,}

where 0 - iphone and 1 - ipad then wouldn't the doc look like

<doc>
  <field name='item_id'>ipad</field>
  <field name='similar_items'>iphone</field>
</doc>

or rather the csv equivalent?

On Jul 31, 2013, at 12:04 PM, Ted Dunning ted.dunn...@gmail.com wrote:

The fields actually point the other direction.  They contain items which,
if they appear in a history, indicate that the current document is a good
recommendation.

This reversal of roles is what makes search work.

Going the other way works for a single doc, but that only gives a list of
id's which then have to be retrieved.  Better to have the tags for the
single doc on all the related docs so that a single retrieval will pull
them all in with their details.


On Wed, Jul 31, 2013 at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 OK and yes. The docs will look like:
 
 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>
 
 
 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:
 
 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the could-be-recommended things.
 



Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
oops, mistyped…

If the LLR created DRM has a row:

Key: 1, Value { 0:1.0,}

where 0 - iphone and 1 - ipad then wouldn't the doc look like

<doc>
  <field name='item_id'>ipad</field>
  <field name='similar_items'>iphone</field>
</doc>


On Jul 31, 2013, at 12:14 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Sorry not sure what you are saying.

If the LLR created DRM has a row:

Key: 0, Value { 1:1.0,}

where 0 - iphone and 1 - ipad then wouldn't the doc look like

<doc>
  <field name='item_id'>ipad</field>
  <field name='similar_items'>iphone</field>
</doc>

or rather the csv equivalent?

On Jul 31, 2013, at 12:04 PM, Ted Dunning ted.dunn...@gmail.com wrote:

The fields actually point the other direction.  They contain items which,
if they appear in a history, indicate that the current document is a good
recommendation.

This reversal of roles is what makes search work.

Going the other way works for a single doc, but that only gives a list of
id's which then have to be retrieved.  Better to have the tags for the
single doc on all the related docs so that a single retrieval will pull
them all in with their details.


On Wed, Jul 31, 2013 at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 OK and yes. The docs will look like:
 
 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>
 
 
 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:
 
 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the could-be-recommended things.
 




Re: Setting up a recommender

2013-07-31 Thread Pat Ferrel
So the XML as CSV would be:
item_id,similar_items,cross_action_similar_items
ipad,iphone,iphone nexus
iphone,ipad,ipad galaxy

Note: As I mentioned before the order of the items in the field will encode 
rank of the similarity strength. This is for cases where you want to find 
similar items to a context item. You would fetch the doc for the context item 
by its item ID and show the top k items in the doc. Ted's caveat would 
probably be to dither them.

Sounds like Ted is generating data. Andrew or M Lyon do either of you want to 
set the demo system up? If so you'll need to find a system--free tier AWS, 
Ted's box, etc. Then install all the needed stuff. 

I'll get the output working to csv.

On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:

OK and yes. The docs will look like:

<add>
  <doc>
    <field name='item_id'>ipad</field>
    <field name='similar_items'>iphone</field>
    <field name='cross_action_similar_items'>iphone nexus</field>
  </doc>
  <doc>
    <field name='item_id'>iphone</field>
    <field name='similar_items'>ipad</field>
    <field name='cross_action_similar_items'>ipad galaxy</field>
  </doc>
</add>


On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:

I'm interested in helping as well.
Btw I thought that what was stored in the solr fields were the llr-filtered
items (ids I guess) for the could-be-recommended things.



Re: Setting up a recommender

2013-07-31 Thread B Lyon
Slick idea IMO on the ordering in the field.

Fyi, to answer your question, I am new to a lot of these pieces (and
without sustained access to a non-tablet pc for the next four days) and cannot
at the moment be relied on for the demo setup given this apparent pace, but
would like to help as much as possible with grunt/doc stuff if someone more
familiar with the relevant pieces can use it.

On Wednesday, July 31, 2013, Pat Ferrel wrote:

 So the XML as CSV would be:
 item_id,similar_items,cross_action_similar_items
 ipad,iphone,iphone nexus
 iphone,ipad,ipad galaxy

 Note: As I mentioned before the order of the items in the field will
 encode rank of the similarity strength. This is for cases where you want to
 find similar items to a context item. You would fetch the doc for the
 context item by it's item ID and show the top k items in the doc. Ted's
 caveat would probably be to dither them.

 Sounds like Ted is generating data. Andrew or M Lyon do either of you want
 to set the demo system up? If so you'll need to find a system--free tier
 AWS, Ted's box, etc. Then install all the needed stuff.

 I'll get the output working to csv.

 On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.comjavascript:;
 wrote:

 OK and yes. The docs will look like:

 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>


 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com javascript:;
 wrote:

 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the could-be-recommended things.



-- 
BF Lyon
http://www.nowherenearithaca.com


Re: Setting up a recommender

2013-07-31 Thread Ted Dunning
Pat,

See inline


On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel p...@occamsmachete.com wrote:

 So the XML as CSV would be:
 item_id,similar_items,cross_action_similar_items
 ipad,iphone,iphone nexus
 iphone,ipad,ipad galaxy


Right.  Doesn't matter what format.  Might want quotes around space
delimited lists, but anything will do.



 Note: As I mentioned before the order of the items in the field will
 encode rank of the similarity strength. This is for cases where you want to
 find similar items to a context item. You would fetch the doc for the
 context item by its item ID and show the top k items in the doc. Ted's
 caveat would probably be to dither them.


I always say dither so that is an easy one.

But fetching similar items of a center item by fetching the center item and
then fetching each of the referenced items is typically slower by about 2x
than running the search for mentions of the center item.
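
As a sketch of that faster path, a single query against the indicator field for mentions of the center item might look like this in SolrJ (field names taken from the docs above; the URL, class name, and rows value are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SimilarToCenterItem {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // One search for docs that mention the center item in their indicator field,
    // instead of fetching the center item's doc and then each referenced doc.
    SolrQuery q = new SolrQuery("similar_items:ipad");
    q.setRows(10); // top k

    QueryResponse rsp = solr.query(q);
    for (SolrDocument d : rsp.getResults()) {
      System.out.println(d.getFieldValue("item_id"));
    }
  }
}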


 Sounds like Ted is generating data. Andrew or M Lyon do either of you want
 to set the demo system up? If so you'll need to find a system--free tier
 AWS, Ted's box, etc. Then install all the needed stuff.

 I'll get the output working to csv.

 On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 OK and yes. The docs will look like:

 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>


 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:

 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the could-be-recommended things.




Re: Setting up a recommender

2013-07-30 Thread Pat Ferrel
Well it's a work in progress but you can see it here: 
https://github.com/pferrel/solr-recommender

There is no Solr integration yet, it is just ingest, create id indexes, run 
RecommenderJob, and XRecommenderJob. These create the item similarity matrixes, 
which will be put into Solr. They also create all recommendations for all users.

The code is quite, er..., fresh. If you are actually going to work on the 
project or test it, I can fix things as they come up but not all options are 
supported or needed to get the overall system running. Put bugs in github.

The happy path works with my trivial sample data so I'll proceed to moving the 
sim matrixes to Solr.

I'll revisit robustifying the project later if it proves useful. 



Re: Setting up a recommender

2013-07-30 Thread Pat Ferrel
Actually I'm not sure the downsampling is best put in RowSimilarityJob since 
that doesn't work for the XRecommender. The similarity matrix there is 
calculated by [B'A] matrix multiply. RSJ would be great if it could work on two 
DRMs, then we could use other similarity measures (LLR please).

Also I'm not sure if it's needed in RSJ since I use PreparePreferenceMatrixJob 
for the RecommenderJob, which calculates the main action item similarity matrix 
(using RSJ in any case).

But for the XRecommender I modified PreparePreferenceMatrixJob to create two 
DRMs and called it PreparePreferenceMatrixesJob. It has downsampling in it, if 
you mean limiting the number of prefs per user. Check if I'm wrong.
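
In case it helps pin down what is meant by downsampling here, a minimal in-memory sketch of limiting the number of prefs per user by random sampling (the cap of 500 and the item IDs are made up; this is not the code in PreparePreferenceMatrixesJob):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DownsamplePrefs {
  // Keep at most maxPrefsPerUser preferences for a user, chosen at random.
  static List<String> downsample(List<String> userItemIds, int maxPrefsPerUser, Random rng) {
    if (userItemIds.size() <= maxPrefsPerUser) {
      return userItemIds;
    }
    List<String> copy = new ArrayList<String>(userItemIds);
    Collections.shuffle(copy, rng);
    return copy.subList(0, maxPrefsPerUser);
  }

  public static void main(String[] args) {
    List<String> prefs = new ArrayList<String>();
    for (int i = 0; i < 2000; i++) {
      prefs.add("item" + i);
    }
    System.out.println(downsample(prefs, 500, new Random(42)).size()); // prints 500
  }
}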


On Jul 29, 2013, at 10:17 PM, Sebastian Schelter s...@apache.org wrote:

Downsampling is now moved directly into RowSimilarityJob. I'll have a
look at Pat's code later this week.

On 23.07.2013 19:38, Ted Dunning wrote:
 On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 This pipeline lacks downsampling since I had to replace
 PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume
 Sebastian is the person to talk to about these bits?
 
 
 I think that is a good source.  If you post your code, he may be able to
 comment on how to integrate the down-sampling in a general way.
 




Re: Setting up a recommender

2013-07-30 Thread Pat Ferrel
In the cross-recommender the similarity matrix is calculated by doing [B'A]. We 
want the rows to be stored as the item-item similarities in Solr, right? [B'B] 
is symmetric so I just want to make sure I have it straight for [B'A].

B = purchases
        iphone  ipad  nexus  galaxy  surface
u1      1       1     0      0       0
u2      0       0     1      1       0
u3      0       0     0      0       1
u4      1       1     0      0       0

B' =
        u1  u2  u3  u4
iphone  1   0   0   1
ipad    1   0   0   1
nexus   0   1   0   0
galaxy  0   1   0   0
surface 0   0   1   0

A = views
        iphone  ipad  nexus  galaxy  surface
u1      1       1     1      1       0
u2      1       1     1      1       0
u3      0       0     0      0       1
u4      1       1     1      0       0


[B'A] =
        iphone  ipad  nexus  galaxy  surface
iphone  2       2     2      1       0
ipad    2       2     2      1       0
nexus   1       1     1      1       0
galaxy  1       1     1      1       0
surface 0       0     0      0       1

The rows are what we want from [B'A] since the row items are from B, right?
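
For anyone who wants to reproduce the arithmetic above, a small sketch with Mahout's in-memory math classes (mahout-math on the classpath; class and variable names are just for illustration):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

public class CrossCooccurrenceCheck {
  public static void main(String[] args) {
    // Rows are u1..u4, columns are iphone, ipad, nexus, galaxy, surface.
    Matrix b = new DenseMatrix(new double[][] {   // purchases
        {1, 1, 0, 0, 0},
        {0, 0, 1, 1, 0},
        {0, 0, 0, 0, 1},
        {1, 1, 0, 0, 0}});
    Matrix a = new DenseMatrix(new double[][] {   // views
        {1, 1, 1, 1, 0},
        {1, 1, 1, 1, 0},
        {0, 0, 0, 0, 1},
        {1, 1, 1, 0, 0}});

    Matrix bTransposeA = b.transpose().times(a);  // rows: items from B, columns: items from A

    // Prints the iphone row: 2 2 2 1 0, matching the table above.
    for (int col = 0; col < bTransposeA.columnSize(); col++) {
      System.out.print((int) bTransposeA.get(0, col) + " ");
    }
  }
}
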

Re: Setting up a recommender

2013-07-30 Thread Ted Dunning
On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:

 [B'A] =
         iphone  ipad  nexus  galaxy  surface
 iphone  2       2     2      1       0
 ipad    2       2     2      1       0
 nexus   1       1     1      1       0
 galaxy  1       1     1      1       0
 surface 0       0     0      0       1

 The rows are what we want from [B'A] since the row items are from B, right?


Yes.

It is easier to understand if you have different kinds of items as well as
different actions.  For instance, suppose that you have user x query terms
(A) and user x device (B).  B'A is then device x term so that there is a
row per device and the row contains terms.  This is good when searching for
devices using terms.


Re: Setting up a recommender

2013-07-29 Thread Sebastian Schelter
Downsampling is now moved directly into RowSimilarityJob. I'll have a
look at Pat's code later this week.

On 23.07.2013 19:38, Ted Dunning wrote:
 On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:
 
 This pipeline lacks downsampling since I had to replace
 PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume
 Sebastian is the person to talk to about these bits?

 
 I think that is a good source.  If you post your code, he may be able to
 comment on how to integrate the down-sampling in a general way.
 



Re: Setting up a recommender

2013-07-27 Thread Pat Ferrel
I've got a new configurable action splitter working with my old Mahout based 
recommender and cross-recommender. Need more cleanup and testing before 
integrating Solr or handing off. 

I think I'll leave the old recommenders in the code with an option to replace 
the last 'make recommendations' step with moving the similarity matrixes into 
Solr. Might be useful for results comparison. 

We still need more help with retrieving user history vectors, and making Solr 
queries. Not to mention setting up the inspection UI mentioned in Ted's paper.

http://bit.ly/18vbbaT

On Jul 24, 2013, at 8:32 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Understood, catalog categories, tags, etc will make good metadata to be 
included in the query and putting in separate fields allows us to separately 
boost each in the query. UserIDs that have interacted with the item is an 
interesting idea.

However the specific case I'm describing is not about content similarity. 
Talking here about item-item similarity exactly as encoded in the similarity 
matrix. The order or rank of these item-item similarities should be preserved 
and I was proposing doing so with the order of the itemID terms in the document.

The query will return history based recs ranked by the order Solr applies. The 
doc itself for any item contains similar items ordered by their similarity 
magnitude, precalculated in Mahout RowSimilarityJob.


On Jul 24, 2013, at 7:19 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Content based item similarity is a fine thing to include in a separate field.  

In addition, it is reasonable to describe a person's history in terms of the 
meta-data on the items they have interacted with.  That allows you to build a 
set of socially driven meta-data indicators as well.  This can be useful in the 
restaurant example where you might find that elegant or home-style might be 
good indicators for different restaurants even if those terms don't appear in a 
restaurant description.  

Sent from my iPhone

On Jul 23, 2013, at 18:26, Pat Ferrel pat.fer...@gmail.com wrote:

 Honestly not trying to make this more complicated but…
 
 In the purely Mahout cross-recommender we got a ranked list of similar items 
 for any item so we could combine personal history-based recs with 
 non-personalized item similarity-based recs wherever we had an item context. 
 In a past ecom case the item similarity recs were quite useful when a user 
 was looking at an item already. In that case even if the user was unknown we 
 could make item similarity-based recs.
 
 How about if we order the items in the doc by rank in the existing fields 
 since they are just text? Then we would do user-history-based queries on the 
 fields for recs and docs[itemID].field to get the ordered list of items out 
 of any doc. Doing an ensemble would require weights though. Unless someone 
 knows a rank based method for combining results. I guess you could vote or 
 add rank numbers of like items or the log thereof...
 
 I assume the combination of results from [B'B] and [B'A] will be a query over 
 both fields with some boost or other to handle ensemble weighting. But if you 
 want to add item similarity recs another method must be employed, no?
 
 From past experience I strongly suspect item similarity rank is not something 
 we want to lose so unless someone has a better idea I'll just order the IDs 
 in the fields and call it good for now.
 




Re: Setting up a recommender

2013-07-24 Thread Michael Sokolov

On 7/23/13 7:26 PM, Pat Ferrel wrote:

Honestly not trying to make this more complicated but…



 From past experience I strongly suspect item similarity rank is not something 
we want to lose so unless someone has a better idea I'll just order the IDs in 
the fields and call it good for now.


If I understand you correctly, you are concerned about just throwing all 
the items in without regard to order, or weight.  I think Ted's 
suggestion was not to worry about that, but if you do have time and want 
to tackle this, one thing you can do is to add an item multiple times.  
For example, suppose you have items A, B, C, ... with A ranked highest.  
Then index a document in Solr like this:


A A A B B C

this will end up giving A a higher frequency count in the index.

The number of repeats would be kind of arbitrary.  You might want to 
make it a linear function of rank or a quantized version of the 
similarity score.
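
A tiny sketch of building such a field value from a rank-ordered list, using a simple linear repeat count as suggested (the helper name and counts are illustrative only):

public class RepeatedTermField {
  // Build field text like "A A A B B C" from a rank-ordered list of similar items,
  // repeating the highest-ranked item the most so termFreq reflects rank.
  static String buildFieldText(String[] rankedItems) {
    StringBuilder sb = new StringBuilder();
    int n = rankedItems.length;
    for (int rank = 0; rank < n; rank++) {
      int repeats = n - rank; // linear function of rank
      for (int i = 0; i < repeats; i++) {
        sb.append(rankedItems[rank]).append(' ');
      }
    }
    return sb.toString().trim();
  }

  public static void main(String[] args) {
    System.out.println(buildFieldText(new String[] {"A", "B", "C"})); // A A A B B C
  }
}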


But this might end up being a noise-level effect ... it's probably not 
worth losing sleep over.  On the other hand, it's probably less useful 
to order the IDs since once they get put in the index the token order 
is stored as a position which isn't (usually) used for scoring, 
although I suppose some custom scorer could do that, too.


-Mike


Re: Setting up a recommender

2013-07-24 Thread Pat Ferrel
I'm most worried about losing ordering and I think I can just order the items A 
B C by convention.

Using Mahout to do clustering we used to double or triple add the title to get 
artificial boosting without fields. The technique works and may be worth an 
experiment later, thanks.

BTW it looks like similarity and TFIDF are pluggable in Solr and seem pretty 
easy to change. Planning to use cosine for the first cut since it's the default.

On Jul 24, 2013, at 4:10 AM, Michael Sokolov msoko...@safaribooksonline.com 
wrote:

On 7/23/13 7:26 PM, Pat Ferrel wrote:
 Honestly not trying to make this more complicated but…
 
 
 
 From past experience I strongly suspect item similarity rank is not something 
 we want to lose so unless someone has a better idea I'll just order the IDs 
 in the fields and call it good for now.
 
 
If I understand you correctly, you are concerned about just throwing all the 
items in without regard to order, or weight.  I think Ted's suggestion was not 
to worry about that, but if you do have time and want to tackle this, one thing 
you can do is to add an item multiple times.  For example, suppose you have 
items A, B, C, ... with A ranked highest.  Then index a document in Solr like 
this:

A A A B B C

this will end up giving A a higher frequency count in the index.

The number of repeats would be kind of arbitrary.  You might want to make it a 
linear function of rank or a quantized version of the similarity score.

But this might end up being a noise-level effect ... it's probably not worth 
losing sleep over.  On the other hand, it's probably less useful to order the 
IDs since once they get put in the index the token order is stored as a 
position which isn't (usually) used for scoring, although I suppose some 
custom scorer could do that, too.

-Mike



Re: Setting up a recommender

2013-07-24 Thread Ted Dunning
Content based item similarity is a fine thing to include in a separate field.  

In addition, it is reasonable to describe a person's history in terms of the 
meta-data on the items they have interacted with.  That allows you to build a 
set of socially driven meta-data indicators as well.  This can be useful in the 
restaurant example where you might find that elegant or home-style might be 
good indicators for different restaurants even if those terms don't appear in a 
restaurant description.  

Sent from my iPhone

On Jul 23, 2013, at 18:26, Pat Ferrel pat.fer...@gmail.com wrote:

 Honestly not trying to make this more complicated but…
 
 In the purely Mahout cross-recommender we got a ranked list of similar items 
 for any item so we could combine personal history-based recs with 
 non-personalized item similarity-based recs wherever we had an item context. 
 In a past ecom case the item similarity recs were quite useful when a user 
 was looking at an item already. In that case even if the user was unknown we 
 could make item similarity-based recs.
 
 How about if we order the items in the doc by rank in the existing fields 
 since they are just text? Then we would do user-history-based queries on the 
 fields for recs and docs[itemID].field to get the ordered list of items out 
 of any doc. Doing an ensemble would require weights though. Unless someone 
 knows a rank based method for combining results. I guess you could vote or 
 add rank numbers of like items or the log thereof...
 
 I assume the combination of results from [B'B] and [B'A] will be a query over 
 both fields with some boost or other to handle ensemble weighting. But if you 
 want to add item similarity recs another method must be employed, no?
 
 From past experience I strongly suspect item similarity rank is not something 
 we want to lose so unless someone has a better idea I'll just order the IDs 
 in the fields and call it good for now.
 
 
 On Jul 23, 2013, at 12:03 PM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Will do.
 
 For what it's worth…
 
 The project I'm working on is an online recommender for video content. You go 
 to a site I'm creating, make some picks and get recommendations immediately 
 online. The training data comes from mining rotten tomatoes for critics 
 reviews. There are two actions, rotten & fresh. Was planning to toss the 
 'rotten' except for filtering them out of any recs but maybe they would work 
 as A with an ensemble weight of -1? New thumbs up or down data would be put 
 into the training set periodically--not online--using the process outlined 
 below.
 
 On Jul 23, 2013, at 10:37 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 
 This sounds great.  Go for it.  Put a comment on the design doc with a 
 pointer to text that I should import.
 
 
 
 
 On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:
 I can supply:
 
 1) a Maven based project in a public github repo as a baseline that creates 
 the following
 2) ingest and split actions, in-memory, single process, from text file, one 
 line per preference
 3) create DistributedRowMatrixes one per action (max of 3) with unified item 
 and user space
 4) create the 'similarity matrix' for [B'B] using LLR and [B'A] using matrix 
 multiply/cooccurrence.
 5) can take a stab at loading Solr.  I know the Mahout side and the internal 
 to external ID translation. The Solr side sounds pretty simple for this case.
 
 This pipeline lacks downsampling since I had to replace 
 PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume Sebastian 
 is the person to talk to about these bits?
 
 The job this creates uses the hadoop script to launch. Each job extends 
 AbstractJob so runs locally or using HDFS or mapreduce (at least for the 
 Mahout parts).
 
 I have some obligations coming up so if you want this I'll need to get 
 moving. I can have the project ready on github in a day or two. May take 
 longer to do the Solr integration and if someone has a passion for taking 
 that bit on--great. This work is in my personal plans for the next couple 
 weeks as it happens anyway.
 
 Let me know if you want me to proceed.
 
 On Jul 22, 2013, at 3:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 On Mon, Jul 22, 2013 at 12:40 PM, Pat Ferrel p...@occamsmachete.com wrote:
 
 Yes.  And the combined recommender would query on both at the same time.
 
 Pat-- doesn't it need ensemble type weighting for each recommender
 component? Probably a wishlist item for later?
 
 
 Yes.  Weighting different fields differently is a very nice (and very easy
 feature).
 
 
 
 


Re: Setting up a recommender

2013-07-24 Thread Pat Ferrel
Understood, catalog categories, tags, etc will make good metadata to be 
included in the query and putting in separate fields allows us to separately 
boost each in the query. UserIDs that have interacted with the item is an 
interesting idea.

However the specific case I'm describing is not about content similarity. 
Talking here about item-item similarity exactly as encoded in the similarity 
matrix. The order or rank of these item-item similarities should be preserved 
and I was proposing doing so with the order of the itemID terms in the document.

The query will return history based recs ranked by the order Solr applies. The 
doc itself for any item contains similar items ordered by their similarity 
magnitude, precalculated in Mahout RowSimilarityJob.
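
For illustration, such a history-based query might look like this in SolrJ (the history string, field name, and URL are placeholders; the ranking is just Solr's own relevance score):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class HistoryBasedRecs {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // The user's item history, space delimited like the indexed fields.
    String history = "iphone ipad";

    // Query the indicator field with the history; Solr returns item docs
    // ranked by its own scoring.
    SolrQuery q = new SolrQuery("similar_items:(" + history + ")");
    q.setRows(10);

    QueryResponse rsp = solr.query(q);
    for (SolrDocument d : rsp.getResults()) {
      System.out.println(d.getFieldValue("item_id"));
    }
  }
}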

 
On Jul 24, 2013, at 7:19 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Content based item similarity is a fine thing to include in a separate field.  

In addition, it is reasonable to describe a person's history in terms of the 
meta-data on the items they have interacted with.  That allows you to build a 
set of socially driven meta-data indicators as well.  This can be useful in the 
restaurant example where you might find that elegant or home-style might be 
good indicators for different restaurants even if those terms don't appear in a 
restaurant description.  

Sent from my iPhone

On Jul 23, 2013, at 18:26, Pat Ferrel pat.fer...@gmail.com wrote:

 Honestly not trying to make this more complicated but…
 
 In the purely Mahout cross-recommender we got a ranked list of similar items 
 for any item so we could combine personal history-based recs with 
 non-personalized item similarity-based recs wherever we had an item context. 
 In a past ecom case the item similarity recs were quite useful when a user 
 was looking at an item already. In that case even if the user was unknown we 
 could make item similarity-based recs.
 
 How about if we order the items in the doc by rank in the existing fields 
 since they are just text? Then we would do user-history-based queries on the 
 fields for recs and docs[itemID].field to get the ordered list of items out 
 of any doc. Doing an ensemble would require weights though. Unless someone 
 knows a rank based method for combining results. I guess you could vote or 
 add rank numbers of like items or the log thereof...
 
 I assume the combination of results from [B'B] and [B'A] will be a query over 
 both fields with some boost or other to handle ensemble weighting. But if you 
 want to add item similarity recs another method must be employed, no?
 
 From past experience I strongly suspect item similarity rank is not something 
 we want to lose so unless someone has a better idea I'll just order the IDs 
 in the fields and call it good for now.
 



Re: Setting up a recommender

2013-07-23 Thread Ted Dunning
This sounds great.  Go for it.  Put a comment on the design doc with a
pointer to text that I should import.




On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:

 I can supply:

 1) a Maven based project in a public github repo as a baseline that
 creates the following
 2) ingest and split actions, in-memory, single process, from text file,
 one line per preference
 3) create DistributedRowMatrixes one per action (max of 3) with unified
 item and user space
 4) create the 'similarity matrix' for [B'B] using LLR and [B'A] using
 matrix multiply/cooccurrence.
 5) can take a stab at loading Solr.  I know the Mahout side and the
 internal to external ID translation. The Solr side sounds pretty simple for
 this case.

 This pipeline lacks downsampling since I had to replace
 PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume
 Sebastian is the person to talk to about these bits?

 The job this creates uses the hadoop script to launch. Each job extends
 AbstractJob so runs locally or using HDFS or mapreduce (at least for the
 Mahout parts).

 I have some obligations coming up so if you want this I'll need to get
 moving. I can have the project ready on github in a day or two. May take
 longer to do the Solr integration and if someone has a passion for taking
 that bit on--great. This work is in my personal plans for the next couple
 weeks as it happens anyway.

 Let me know if you want me to proceed.

 On Jul 22, 2013, at 3:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Mon, Jul 22, 2013 at 12:40 PM, Pat Ferrel p...@occamsmachete.com
 wrote:

  Yes.  And the combined recommender would query on both at the same time.
 
  Pat-- doesn't it need ensemble type weighting for each recommender
  component? Probably a wishlist item for later?


 Yes.  Weighting different fields differently is a very nice (and very easy
 feature).




Re: Setting up a recommender

2013-07-23 Thread Ted Dunning
On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:

 This pipeline lacks downsampling since I had to replace
 PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume
 Sebastian is the person to talk to about these bits?


I think that is a good source.  If you post your code, he may be able to
comment on how to integrate the down-sampling in a general way.


Re: Setting up a recommender

2013-07-23 Thread Pat Ferrel
Will do.

For what it's worth…

The project I'm working on is an online recommender for video content. You go 
to a site I'm creating, make some picks and get recommendations immediately 
online. The training data comes from mining rotten tomatoes for critics 
reviews. There are two actions, rotten & fresh. Was planning to toss the 
'rotten' except for filtering them out of any recs but maybe they would work as 
A with an ensemble weight of -1? New thumbs up or down data would be put into 
the training set periodically--not online--using the process outlined below.

On Jul 23, 2013, at 10:37 AM, Ted Dunning ted.dunn...@gmail.com wrote:


This sounds great.  Go for it.  Put a comment on the design doc with a pointer 
to text that I should import.




On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:
I can supply:

1) a Maven based project in a public github repo as a baseline that creates the 
following
2) ingest and split actions, in-memory, single process, from text file, one 
line per preference
3) create DistributedRowMatrixes one per action (max of 3) with unified item 
and user space
4) create the 'similarity matrix' for [B'B] using LLR and [B'A] using matrix 
multiply/cooccurrence.
5) can take a stab at loading Solr.  I know the Mahout side and the internal to 
external ID translation. The Solr side sounds pretty simple for this case.

This pipeline lacks downsampling since I had to replace 
PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume Sebastian is 
the person to talk to about these bits?

The job this creates uses the hadoop script to launch. Each job extends 
AbstractJob so runs locally or using HDFS or mapreduce (at least for the Mahout 
parts).

I have some obligations coming up so if you want this I'll need to get moving. 
I can have the project ready on github in a day or two. May take longer to do 
the Solr integration and if someone has a passion for taking that bit 
on--great. This work is in my personal plans for the next couple weeks as it 
happens anyway.

Let me know if you want me to proceed.

On Jul 22, 2013, at 3:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:

On Mon, Jul 22, 2013 at 12:40 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Yes.  And the combined recommender would query on both at the same time.

 Pat-- doesn't it need ensemble type weighting for each recommender
 component? Probably a wishlist item for later?


Yes.  Weighting different fields differently is a very nice (and very easy
feature).





Re: Setting up a recommender

2013-07-22 Thread Pat Ferrel
+10

Love the academics but I agree with this. Recently saw a VP from Netflix plead 
with the audience (mostly academics) to move past RMSE--focus on maximizing 
correct ranking, not rating prediction. 

Anyway I have a pipeline that does the following:
1) ingests logs, either TSV or CSV of arbitrary column ordering--it will pick out the 
actions by position and string
2) replaces PreparePreferenceMatrixJob to create n matrixes depending on the 
number of actions you are splitting out. This job also creates external - 
internal item and user id BiHashMaps for going back and forth between the log's 
IDs and Mahout internal IDs. It guarantees a uniform item and user ID space and 
sparse matrix ranks by creating one from all actions. Not completely scalable 
since it is not done in m/r though it uses HDFS--I have a plan to m/r the 
process and get rid of the hashmap.
3) performs the RowSimilarityJob on the primary matrix B and does B'A to create 
a cooccurrence matrix for primary and secondary actions.
4) uses the rest of the Mahout pipeline on B to get recs and 
does a [B'A]H_v to calculate all cross-recommendations.
5) stores all recs from all models in a NoSQL DB.
6) at rec request time it does a linear combination of rec and cross-rec to return 
the highest scored ones. The stored IDs were external so all ready for display.
Does 1-3 fit the first part of 'offline to Solr'? The IDs can be written to 
Solr as the original external IDs from the log files, which were strings. This 
allows them to be treated as terms by Solr.

My understanding of the Solr proposal puts B's row similarity matrix in a 
vector per item. That means each row is turned into terms = external IDs--not 
sure how the weights of each term are encoded.  So the cross-recommender would 
just put the cross-action similarity matrix  in other field(s) on the same 
itemID/docID, right?

Then the straight out recommender queries on the B'B field(s) and the 
cross-recommender queries on the B'A field(s). I suppose to keep it simple the 
cross-action similarity matrix could be put in a separate index.  Is this about 
right?

On Jul 21, 2013, at 5:30 PM, Sebastian Schelter s...@apache.org wrote:

At the moment, the down sampling is done by PreparePreferenceMatrixJob
for the collaborative filtering functionality. We just want to move it
down to RowSimilarityJob to enable standalone usage.

I think that the CrossRecommender should be the next thing on our
agenda, after we have the deployment infrastructure.  I especially like
that it's capable of including different kinds of interactions, as opposed
to most other (academically motivated) recommenders that focus on a
single interaction type like a rating.

--sebastian

On 22.07.2013 02:14, Ted Dunning wrote:
 The row similarity downsampling is just a matter of dropping elements at
 random from rows that have more data than we want.
 
 If the join that puts the row together can handle two kinds of input, then
 RowSimilarity can be easily modified to be CrossRowSimilarity.  Likewise,
 if we have two DRM's with the same row id's in the same order, we can do a
 map-side merge.  Such a merge can be very efficient on a system like MapR
 where you can control files to live on the same nodes.
 
 
 On Sun, Jul 21, 2013 at 4:43 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 RowSimilarity downsampling? Are you referring to the a mod of the matrix
 multiply to do cross similarity with LLR for the cross recommendations? So
 similarity of rows of B with rows of A?
 
 Sounds like you are proposing not only putting a recommender in Solr but
 also a cross-recommender? This is why getting a real data set is
 problematic?
 
 On Jul 21, 2013, at 3:40 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Pat,
 
 Yes.  The first part probably just is the RowSimilarity job, especially
 after Sebastian puts in the down-sampling.
 
 The new part is exactly as you say, storing the DRM into Solr indexes.
 
 There is no reason to not use a real data set.  There is a strong reason to
 use a synthetic dataset, however, in that it can be trivially scaled up and
 down both in items and users.  Also, the synthetic dataset doesn't require
 that the real data be found and downloaded.
 
 
 
 On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 Read the paper, and the preso.
 
 As to the 'offline to Solr' part. It sounds like you are suggesting an
 item item similarity matrix be stored and indexed in Solr. One would have
 to create the action matrix from user profile data (preference history),
 do
 a rowsimiarity job on it (using LLR similarity) and move the result to
 Solr. The first part of this is nearly identical to the current
 recommender
 job workflow and could pretty easily be created from it I think. The new
 part is taking the DistributedRowMatrix and storing it in a particular
 way
 in Solr, right?
 
 BTW Is there some reason not to use an existing real data set?
 
 On Jul 19, 2013, at 3:45 PM, Ted Dunning 

Re: Setting up a recommender

2013-07-22 Thread Michael Sokolov

On 07/22/2013 12:20 PM, Pat Ferrel wrote:


My understanding of the Solr proposal puts B's row similarity matrix in a vector per 
item. That means each row is turned into terms = external IDs--not sure how 
the weights of each term are encoded.
This is the key question for me. The best idea I've had is to use 
termFreq as a proxy for weight.  It's only an integer, so there are 
scaling issues to consider, but you can apply a per-field weight to 
manage that.  Also, Lucene (and Solr) doesn't provide an obvious way to 
load term frequencies directly: probably the simplest thing to do is 
just to repeat the cross-term N times and let the text analysis take 
care of counting them.  Inefficient, but probably the quickest way to 
get going.  Alternatively, there are some lower level Lucene indexing 
APIs (DocFieldConsumer et al) which I haven't really plumbed entirely, 
but would allow for more direct loading of fields.


Then one probably wants to override the scoring in some way (unless 
TFIDF is the way to go somehow??)




Re: Setting up a recommender

2013-07-22 Thread Gokhan Capan
Just to make sure if I understood correctly, Ted, could you please correct
me?:)


1. Using a search engine, I will treat items as documents, where each
document vector consists of other items (similar to words of documents)
with co-occurrence (LLR) weights (instead of tf-idf in a search engine
analogy).
So for each item I will have a sparse vector that represents the relation
of that item to other items, if there is an indicator that makes the
item-to-item similarity (co-occurrence) non-zero. (I will only use positive
feedback, I think, since I am counting co-occurrences)

2. To present recommendations, the system formulates a query, with a
history of items --the session history for task based recommendation, or a
long term history. And the search engine will find top-N items, based on
the cosine similarities of the item (document) vectors and history (query)
vectors.

3. For example, if that was a restaurant recommendation, and we knew that
the restaurant was famous for its sushi, I would index this in another
field, famous_for.
Now if, as a user, I asked for sushi restaurants that I would enjoy, the
system would add this to query along with my history, and the famous sushi
restaurant would rank higher in results, even if chances are equal that I
would like a steakhouse according to the computation in 2.

4. Since this is a search engine, and a search engine can boost a
particular field, the system would let the famous_for overweigh the
collaborative activity, or the opposite (According to the use case, or for
example, number of items in the history) So I can define a weighting
(voting, or mixture of experts) scheme to blend different recommenders.


Are those correct?


Gokhan


On Mon, Jul 22, 2013 at 9:07 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 On 07/22/2013 12:20 PM, Pat Ferrel wrote:


 My understanding of the Solr proposal puts B's row similarity matrix in a
 vector per item. That means each row is turned into terms = external
 IDs--not sure how the weights of each term are encoded.

 This is the key question for me. The best idea I've had is to use termFreq
 as a proxy for weight.  It's only an integer, so there are scaling issues
 to consider, but you can apply a per-field weight to manage that.  Also,
 Lucene (and Solr) doesn't provide an obvious way to load term frequencies
 directly: probably the simplest thing to do is just to repeat the
 cross-term N times and let the text analysis take care of counting them.
  Inefficient, but probably the quickest way to get going.  Alternatively,
 there are some lower level Lucene indexing APIs (DocFieldConsumer et al)
 which I haven't really plumbed entirely, but would allow for more direct
 loading of fields.

 Then one probably wants to override the scoring in some way (unless TFIDF
 is the way to go somehow??)




Re: Setting up a recommender

2013-07-22 Thread Ted Dunning
My experience is that TFIDF works just fine, especially as first cut.

Adding different kinds of data, building out backend A/B testing, tuning
the UI, and weighting the query all come before the next round of weighting changes.
 Typically, the priority stack never empties enough for that task to rise
to the top.


On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 On 07/22/2013 12:20 PM, Pat Ferrel wrote:


 My understanding of the Solr proposal puts B's row similarity matrix in a
 vector per item. That means each row is turned into terms = external
 IDs--not sure how the weights of each term are encoded.

 This is the key question for me. The best idea I've had is to use termFreq
 as a proxy for weight.  It's only an integer, so there are scaling issues
 to consider, but you can apply a per-field weight to manage that.  Also,
 Lucene (and Solr) doesn't provide an obvious way to load term frequencies
 directly: probably the simplest thing to do is just to repeat the
 cross-term N times and let the text analysis take care of counting them.
  Inefficient, but probably the quickest way to get going.  Alternatively,
 there are some lower level Lucene indexing APIs (DocFieldConsumer et al)
 which I haven't really plumbed entirely, but would allow for more direct
 loading of fields.

 Then one probably wants to override the scoring in some way (unless TFIDF
 is the way to go somehow??)




Re: Setting up a recommender

2013-07-22 Thread Ted Dunning
Inline ... slightly redundant relative to other answers, but that shouldn't
be a problem.


On Mon, Jul 22, 2013 at 11:56 AM, Gokhan Capan gkhn...@gmail.com wrote:

 Just to make sure if I understood correctly, Ted, could you please correct
 me?:)


 1. Using a search engine, I will treat items as documents, where each
 document vector consists of other items (similar to words of documents)
 with co-occurrence (LLR) weights (instead of tf-idf in a search engine
 analogy).


LLR will just select indicators.  Weighting can be done using native TF-IDF
stuff that Solr already does.
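
As a sketch of what "select indicators" means in code, mahout-math's LogLikelihood can score one item pair from its cooccurrence counts; the counts and the threshold below are made-up numbers:

import org.apache.mahout.math.stats.LogLikelihood;

public class LlrIndicatorSketch {
  public static void main(String[] args) {
    // Contingency counts for a pair of items (invented for illustration):
    long k11 = 13;      // users who interacted with both items
    long k12 = 1000;    // users who interacted with item A but not item B
    long k21 = 1000;    // users who interacted with item B but not item A
    long k22 = 100000;  // users who interacted with neither

    double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);

    // Keep the pair as an indicator only if the score clears some threshold.
    boolean indicator = llr > 20.0;
    System.out.println("LLR = " + llr + ", indicator = " + indicator);
  }
}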


 So for each item I will have a sparse vector that represents the relation
 of that item to other items, if there is an indicator that makes the
 item-to-item similarity (co-occurrence) non-zero. (I will only use positive
 feedback, I think, since I am counting co-occurrences)


Yes.

Moreover, there will ultimately be multiple fields with different sets of
indicators.  This is how cross recommendation can be integrated.



 2. To present recommendations, the system formulates a query, with a
 history of items --the session history for task based recommendation, or a
 long term history. And the search engine will find top-N items, based on
 the cosine similarities of the item (document) vectors and history (query)
 vectors.


Yes.  Cosine-ish ... the search engine has its own similarity calculation.
 That can be tuned ... later.



 3. For example, if that was a restaurant recommendation, and we knew that
 the restaurant was famous for its sushi, I would index this in another
 field, famous_for.
 Now if, as a user, I asked for sushi restaurants that I would enjoy, the
 system would add this to query along with my history, and the famous sushi
 restaurant would rank higher in results, even if chances are equal that I
 would like a steakhouse according to the computation in 2.


Yes.

Moreover, we might put all the words in the descriptions of restaurants you
have been to lately into a different history field.  Each restaurant would
also have an indicator word field against which we could query using your
history words.

Similarly, we could use cuisine classifiers.

And we can compute a local favorite feature that is essentially a
recommendation indicator from people in a particular area to restaurants.

Recommendation queries can include any or all of these.  Specialized pages
might have a cuisine specific recommendation set for you.



 4. Since this is a search engine, and a search engine can boost a
 particular field, the system would let the famous_for overweigh the
 collaborative activity, or the opposite (According to the use case, or for
 example, number of items in the history) So I can define a weighting
 (voting, or mixture of experts) scheme to blend different recommenders.


yes.  I would recommend doing the blending in the search engine query
itself.


 Are those correct?


Pretty much!


Re: Setting up a recommender

2013-07-22 Thread Ted Dunning
On Mon, Jul 22, 2013 at 9:20 AM, Pat Ferrel p...@occamsmachete.com wrote:

 +10

 Love the academics but I agree with this. Recently saw a VP from Netflix
 plead with the audience (mostly academics) to move past RMSE--focus on
 maximizing correct ranking, not rating prediction.

 Anyway I have a pipeline that does *[ingest, prepare, row-similarity, not
 in m/r]*


Is this available?

replaces PreparePreferenceMatrixJob to create n matrixes depending on the
 number of actions you are splitting out. This job also creates external -
 internal item and user id BiHashMaps for going back and forth between the
 log's IDs and Mahout internal IDs. It guarantees a uniform item and user ID
 space and sparse matrix ranks by creating one from all actions. Not
 completely scalable since it is not done in m/r though it uses HDFS--I have
 a plan to m/r the process and get rid of the hashmap.


Frankly, doing it outside of map-reduce is good for a start and should be
preserved for later.  It makes on-boarding new folks much easier.


 performs the RowSimilarityJob on the primary matrix B and does B'A to
 create a cooccurrence matrix for primary and secondary actions.


What code do you use for B'A?


 Stores all recs from all models in a NoSQL DB.


I recommend not doing this for the demo, but rather storing rows of B'A and
B'B as fields in Solr.


 At rec request time it does a linear combination of req and cross-rec to
 return the highest scored ones.


Should be integrated into the query.


 Does 1-3 fit the first part of 'offline to Solr'? The IDs can be written
 to Solr as the original external IDs from the log files, which were
 strings. This allows them to be treated as terms by Solr.


Yes.  These early steps are very much what I was aiming for.


 My understanding of the Solr proposal puts B's row similarity matrix in a
 vector per item.


For a particular item document, the corresponding row of B'A and the
corresponding row of B'B go into separate fields.  I think you mean B'B
when you say B's row similarity matrix.  Just checking.



 That means each row is turned into terms = external IDs--not sure how
 the weights of each term are encoded.


Again, I just use native Solr weighting.


 So the cross-recommender would just put the cross-action similarity matrix
  in other field(s) on the same itemID/docID, right?


Yes.  Exactly.



 Then the straight out recommender queries on the B'B field(s) and the
 cross-recommender queries on the B'A field(s). I suppose to keep it simple
 the cross-action similarity matrix could be put in a separate index.  Is
 this about right?


Yes.  And the combined recommender would query on both at the same time.
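
A possible shape for that combined query, extending the single-field version above with per-field boosts as the ensemble weights (boost values, field names, and histories are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CombinedRecQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    String primaryHistory = "iphone ipad";            // history of the B action
    String secondaryHistory = "iphone ipad galaxy";   // history of the A action

    // One request over both indicator fields; the ^ boosts act as ensemble weights.
    SolrQuery q = new SolrQuery(
        "similar_items:(" + primaryHistory + ")^2.0"
        + " OR cross_action_similar_items:(" + secondaryHistory + ")^1.0");
    q.setRows(10);

    QueryResponse rsp = solr.query(q);
    for (SolrDocument d : rsp.getResults()) {
      System.out.println(d.getFieldValue("item_id"));
    }
  }
}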


Re: Setting up a recommender

2013-07-22 Thread Michael Sokolov
So you are proposing just grabbing the top N scoring related items and 
indexing listing them without regard to weight?  Effectively quantizing 
the weights to = 1, and 0 for everything else?  I guess LLR tends to do 
that anyway


-Mike

On 07/22/2013 02:57 PM, Ted Dunning wrote:

My experience is that TFIDF works just fine, especially as first cut.

Adding different kinds of data, building out backend A/B testing, tuning
the UI, weighting the query all come the next round of weighting changes.
  Typically, the priority stack never empties enough for that task to rise
to the top.


On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:


On 07/22/2013 12:20 PM, Pat Ferrel wrote:


My understanding of the Solr proposal puts B's row similarity matrix in a
vector per item. That means each row is turned into terms = external
IDs--not sure how the weights of each term are encoded.


This is the key question for me. The best idea I've had is to use termFreq
as a proxy for weight.  It's only an integer, so there are scaling issues
to consider, but you can apply a per-field weight to manage that.  Also,
Lucene (and Solr) doesn't provide an obvious way to load term frequencies
directly: probably the simplest thing to do is just to repeat the
cross-term N times and let the text analysis take care of counting them.
  Inefficient, but probably the quickest way to get going.  Alternatively,
there are some lower level Lucene indexing APIs (DocFieldConsumer et al)
which I haven't really plumbed entirely, but would allow for more direct
loading of fields.

Then one probably wants to override the scoring in some way (unless TFIDF
is the way to go somehow??)






Re: Setting up a recommender

2013-07-22 Thread Pat Ferrel
inline

BTW if there is an LLR cross-similarity job (replacing [B'A]) it is easy to 
integrate.


On Jul 22, 2013, at 12:09 PM, Ted Dunning ted.dunn...@gmail.com wrote:

On Mon, Jul 22, 2013 at 9:20 AM, Pat Ferrel p...@occamsmachete.com wrote:

 +10
 
 Love the academics but I agree with this. Recently saw a VP from Netflix
 plead with the audience (mostly academics) to move past RMSE--focus on
 maximizing correct ranking, not rating prediction.
 
 Anyway I have a pipeline that does *[ingest, prepare, row-similarity, not
 in m/r]*
 

Is this available?

Pat-- Can quickly be. In Github. I'd have to clean up a bit.


replaces PreparePreferenceMatrixJob to create n matrixes depending on the
 number of actions you are splitting out. This job also creates external -
 internal item and user id BiHashMaps for going back and forth between the
 log's IDs and Mahout internal IDs. It guarantees a uniform item and user ID
 space and sparse matrix ranks by creating one from all actions. Not
 completely scalable since it is not done in m/r though it uses HDFS--I have
 a plan to m/r the process and get rid of the hashmap.
 

Frankly, doing it outside of map-reduce is good for a start and should be
preserved for later.  It makes on-boarding new folks much easier.

Pat-- It uses the hadoop version of the matrix mult and RowSimilarityJob in 
later steps but they work without a cluster in local mode.


 performs the RowSimilarityJob on the primary matrix B and does B'A to
 create a cooccurrence matrix for primary and secondary actions.
 

What code do you use for B'A?

Pat-- matrix transposes and multiply from Mahout.

 Stores all recs from all models in a NoSQL DB.
 

I recommend not doing this for the demo, but rather storing rows of B'A and
B'B as fields in Solr.

Pat-- yes, just explaining for completeness

 At rec request time it does a linear combination of req and cross-rec to
 return the highest scored ones.


Should be integrated into the query.

Pat-- yes, just explaining for completeness

 Does 1-3 fit the first part of 'offline to Solr'? The IDs can be written
 to Solr as the original external IDs from the log files, which were
 strings. This allows them to be treated as terms by Solr.
 

Yes.  These early steps are very much what I was aiming for.

Pat-- OK, happy to contribute if possible; let me know who to coordinate with.

 My understanding of the Solr proposal puts B's row similarity matrix in a
 vector per item.


For a particular item document, the corresponding row of B'A and the
corresponding row of B'B go into separate fields.  I think you mean B'B
when you say B's row similarity matrix.  Just checking.

Pat-- yes, exactly

 That means each row is turned into terms = external IDs--not sure how
 the weights of each term are encoded.


Again, I just use native Solr weighting.

Pat-- good, that makes this fairly simple I expect. Just fields with bags of 
term strings.


 So the cross-recommender would just put the cross-action similarity matrix
 in other field(s) on the same itemID/docID, right?
 

Yes.  Exactly.


 
 Then the straight out recommender queries on the B'B field(s) and the
 cross-recommender queries on the B'A field(s). I suppose to keep it simple
 the cross-action similarity matrix could be put in a separate index.  Is
 this about right?
 

Yes.  And the combined recommender would query on both at the same time.

Pat-- doesn't it need ensemble type weighting for each recommender component? 
Probably a wishlist item for later?



Re: Setting up a recommender

2013-07-22 Thread Ted Dunning
On Mon, Jul 22, 2013 at 12:40 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Yes.  And the combined recommender would query on both at the same time.

 Pat-- doesn't it need ensemble type weighting for each recommender
 component? Probably a wishlist item for later?


Yes.  Weighting different fields differently is a very nice (and very easy
feature).


Re: Setting up a recommender

2013-07-22 Thread Ted Dunning
Not entirely without regard to weight.  Just without regard to designing
weights specific to this application.  The weights that Solr uses natively
are intuitively what we want (rare indicators have higher weights in a
log-ish kind of way).

Frankly, I doubt the effectiveness here of mathematical reasoning for
getting a better weighting.  The deviations from optimal relative to the
Solr defaults are probably as large as the deviations from the assumptions
that the mathematically motivated weightings are based on.  Fixing this is
spending a lot for small potatoes.   Fixing the data flow and getting
access to more data is far higher value.



On Mon, Jul 22, 2013 at 12:18 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 So you are proposing just grabbing the top N scoring related items and
 indexing listing them without regard to weight?  Effectively quantizing the
 weights to = 1, and 0 for everything else?  I guess LLR tends to do that
 anyway

 -Mike


 On 07/22/2013 02:57 PM, Ted Dunning wrote:

 My experience is that TFIDF works just fine, especially as first cut.

 Adding different kinds of data, building out backend A/B testing, tuning
 the UI, weighting the query all come the next round of weighting changes.
   Typically, the priority stack never empties enough for that task to rise
 to the top.


 On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov 
 msoko...@safaribooksonline.com wrote:

  On 07/22/2013 12:20 PM, Pat Ferrel wrote:

  My understanding of the Solr proposal puts B's row similarity matrix in
 a
 vector per item. That means each row is turned into terms = external
 IDs--not sure how the weights of each term are encoded.

  This is the key question for me. The best idea I've had is to use
 termFreq
 as a proxy for weight.  It's only an integer, so there are scaling issues
 to consider, but you can apply a per-field weight to manage that.  Also,
 Lucene (and Solr) doesn't provide an obvious way to load term frequencies
 directly: probably the simplest thing to do is just to repeat the
 cross-term N times and let the text analysis take care of counting them.
   Inefficient, but probably the quickest way to get going.
  Alternatively,
 there are some lower level Lucene indexing APIs (DocFieldConsumer et al)
 which I haven't really plumbed entirely, but would allow for more direct
 loading of fields.

 Then one probably wants to override the scoring in some way (unless TFIDF
 is the way to go somehow??)






Re: Setting up a recommender

2013-07-22 Thread Michael Sokolov
Fair enough - thanks for clarifying. I wondered whether that would be 
worth the trouble, also.  Maybe one of the academics Pat mentioned will 
test and find out for us :)



On 7/22/13 6:45 PM, Ted Dunning wrote:


Not entirely without regard to weight.  Just without regard to 
designing weights specific to this application.  The weights that Solr 
uses natively are intuitively what we want (rare indicators have 
higher weights in a log-ish kind of way).


Frankly, I doubt the effectiveness here of mathematical reasoning for 
getting a better weighting.  The deviations from optimal relative to 
the Solr defaults are probably as large as the deviations from the 
assumptions that the mathematically motivated weightings are based on. 
 Fixing this is spending a lot for small potatoes.   Fixing the data 
flow and getting access to more data is far higher value.



On Mon, Jul 22, 2013 at 12:18 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:


So you are proposing just grabbing the top N scoring related items
and indexing listing them without regard to weight?  Effectively
quantizing the weights to = 1, and 0 for everything else?  I guess
LLR tends to do that anyway

-Mike


On 07/22/2013 02:57 PM, Ted Dunning wrote:

My experience is that TFIDF works just fine, especially as
first cut.

Adding different kinds of data, building out backend A/B
testing, tuning
the UI, weighting the query all come the next round of
weighting changes.
  Typically, the priority stack never empties enough for that
task to rise
to the top.


On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

On 07/22/2013 12:20 PM, Pat Ferrel wrote:

My understanding of the Solr proposal puts B's row
similarity matrix in a
vector per item. That means each row is turned into
terms = external
IDs--not sure how the weights of each term are encoded.

This is the key question for me. The best idea I've had is
to use termFreq
as a proxy for weight.  It's only an integer, so there are
scaling issues
to consider, but you can apply a per-field weight to
manage that.  Also,
Lucene (and Solr) doesn't provide an obvious way to load
term frequencies
directly: probably the simplest thing to do is just to
repeat the
cross-term N times and let the text analysis take care of
counting them.
  Inefficient, but probably the quickest way to get going.
 Alternatively,
there are some lower level Lucene indexing APIs
(DocFieldConsumer et al)
which I haven't really plumbed entirely, but would allow
for more direct
loading of fields.

Then one probably wants to override the scoring in some
way (unless TFIDF
is the way to go somehow??)








Re: Setting up a recommender

2013-07-21 Thread Stevo Slavić
I see Ted created JIRA ticket for this already:
https://issues.apache.org/jira/browse/MAHOUT-1288
We should consider changing issue type (currently - bug).

One might find this Berlin Buzzwords 2013
recording http://www.youtube.com/watch?v=fWR1T2pY08Y and
slides http://www.slideshare.net/tdunning/buzz-wordsdunningmultimodalrecommendation of
Ted's talk on the subject helpful to understand the terms used and
idea.

I guess we could start with single kind of interaction/behavior, and
consider adding more later.

Shall we make it a separate subproject (so on the level of mahout and site, but
still under mahout svn), or make a new mahout submodule, or change mahout
examples from a single module to a multimodule structure and add the
recommender demo as a submodule there?

I'm fine with Maven tasks, to some extent Solr too (not the most recent
versions, but I see it as nice opportunity to update).

Kind regards,
Stevo Slavic.


On Sun, Jul 21, 2013 at 12:15 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 To kick this off, I have created a design document that is open for
 comments.  Much detail is needed here.  I will create a JIRA as well, but
 the google doc is much easier for collating lots of input into a coherent
 document.

 The directory that the document is stored in is accessible at

 http://bit.ly/18vbbaT

 Once we get going, we can talk about how to coordinate tasks between
 hangouts.  One option is a public Trello project: https://trello.com/ or
 we
 can use JIRA sub-tasks.


 On Sat, Jul 20, 2013 at 11:25 AM, Andrew Psaltis 
 andrew.psal...@webtrends.com wrote:

  I am very interested in collaborating on the off-line to Solr part. Just
  let me know how we want to get going.
 
  Thanks,
  Andrew
 
 
 
 
 
  On 7/19/13 4:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  OK.  I think the crux here is the off-line to Solr part so let's see who
  else pops up.
  
  Having a solr maven could be very helpful.
  
  
  On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo 
  lcguerreroc...@gmail.com wrote:
  
   I'm currently working for a portal that has a similar use case and I
 was
   thinking of implementing this in a similar way. I'm generating
   recommendations using python scripts based on similarity measures
  (content
   based recommendation) only using euclidean distance and some weights
 for
   each attribute. I want to use mahout's GenericItemBasedRecommender to
   generate these same recommendations without user data (no tracking
 right
   now of user to item relationship). I was thinking of pushing the
  generated
   recommendations to solr using atomic updates since my fields are all
  stored
   right now. Since this is very similar to what I'm trying to
 accomplish,
  I
   would sign up to collaborate in any way I can since I'm fairly
 familiar
   with solr and I'm starting to learn my way around mahout.
  
  
   On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org
   wrote:
  
I would also be willing to provide guidance and advice for anyone
  taking
this on, I can especially help with the offline analysis part.
   
--sebastian
   
   
2013/7/19 Ted Dunning ted.dunn...@gmail.com
   
 I would be happy to supervise a project to implement a demo of
 this
  if
 anybody is willing to do the grunt work of gluing things together.

 Sooo, if you would like to work on this, here is a suggested
  project.

 This project would entail:

 a) build a synthetic data source

 b) write scripts to do the off-line analysis

 c) write scripts to export to Solr

 d) write a very quick web facade over Solr to make it look like a
 recommendation engine.  This would include

   d.1) a most popular page that does combined popularity rise
 and
 recommendation

   d.2) a personal recommendation page that does just
  recommendation
with
 dithering

   d.3) item pages with related items at the bottom

 e) work with others to provide high quality system walk-through
 and
install
 directions

 If you want to bite on this, we should arrange a weekly video
  hangout.
I
 am willing to commit to guiding and providing detailed technical
 approaches.  You should be willing to commit to actually doing
  stuff.

 The goal would be to provide a fully worked out scaffolding of a
practical
 recommendation system that presumably would become an example
  module in
 Mahout.


 On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com
  wrote:

  +1 as well.  Sounds fun.
 
  On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner 
   cont...@dhuebner.com
  wrote:
 
   +1 for getting something like that in a future release of
 Mahout
  
   On Jul 19, 2013, at 10:02 PM, Sebastian Schelter
  s...@apache.org
 wrote:
  
It would be awesome if we could get a nice, easily
 deployable

Re: Setting up a recommender

2013-07-21 Thread Iker Huerga
Hi,

First of all, Ted, very inspiring video, I really enjoyed the concept of
cross-occurrences.

Secondly, I'd be very interested in collaborating on this project and here
is why. I've been recently working for my employer on a very similar
project that is currently deployed into our production environment.

We built a recommender system that takes as input ontology instances
identified in documents by an NLP process, and generates document
recommendations as output. We used a big training set with positive and
false-positive matches to improve the accuracy of the output. All these
documents are indexed in Solr, for which we built a recommender
RequestHandler that makes use of a RecommenderQParsePlugin we also built for
Solr.

With this we can provide recommendations to a user who is reading a
document, but in the next iterations we are working towards providing
recommendations based on multiple kinds of inputs, not only annotations.

This said, I would like to collaborate with you guys on the development
part of this project, just let me know how/where we can organize the user
stories and tasks.

I think a conference call, maybe a hangout, to kick off the project would
be useful. Who should schedule it?

Thanks
Iker




2013/7/20 Ted Dunning ted.dunn...@gmail.com

 To kick this off, I have created a design document that is open for
 comments.  Much detail is needed here.  I will create a JIRA as well, but
 the google doc is much easier for collating lots of input into a coherent
 document.

 The directory that the document is stored in is accessible at

 http://bit.ly/18vbbaT

 Once we get going, we can talk about how to coordinate tasks between
 hangouts.  One option is a public Trello project: https://trello.com/ or
 we
 can use JIRA sub-tasks.


 On Sat, Jul 20, 2013 at 11:25 AM, Andrew Psaltis 
 andrew.psal...@webtrends.com wrote:

  I am very interested in collaborating on the off-line to Solr part. Just
  let me know how we want to get going.
 
  Thanks,
  Andrew
 
 
 
 
 
  On 7/19/13 4:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  OK.  I think the crux here is the off-line to Solr part so let's see who
  else pops up.
  
  Having a solr maven could be very helpful.
  
  
  On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo 
  lcguerreroc...@gmail.com wrote:
  
   I'm currently working for a portal that has a similar use case and I
 was
   thinking of implementing this in a similar way. I'm generating
   recommendations using python scripts based on similarity measures
  (content
   based recommendation) only using euclidean distance and some weights
 for
   each attribute. I want to use mahout's GenericItemBasedRecommender to
   generate these same recommendations without user data (no tracking
 right
   now of user to item relationship). I was thinking of pushing the
  generated
   recommendations to solr using atomic updates since my fields are all
  stored
   right now. Since this is very similar to what I'm trying to
 accomplish,
  I
   would sign up to collaborate in any way I can since I'm fairly
 familiar
   with solr and I'm starting to learn my way around mahout.
  
  
   On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org
   wrote:
  
I would also be willing to provide guidance and advice for anyone
  taking
this on, I can especially help with the offline analysis part.
   
--sebastian
   
   
2013/7/19 Ted Dunning ted.dunn...@gmail.com
   
 I would be happy to supervise a project to implement a demo of
 this
  if
 anybody is willing to do the grunt work of gluing things together.

 Sooo, if you would like to work on this, here is a suggested
  project.

 This project would entail:

 a) build a synthetic data source

 b) write scripts to do the off-line analysis

 c) write scripts to export to Solr

 d) write a very quick web facade over Solr to make it look like a
 recommendation engine.  This would include

   d.1) a most popular page that does combined popularity rise
 and
 recommendation

   d.2) a personal recommendation page that does just
  recommendation
with
 dithering

   d.3) item pages with related items at the bottom

 e) work with others to provide high quality system walk-through
 and
install
 directions

 If you want to bite on this, we should arrange a weekly video
  hangout.
I
 am willing to commit to guiding and providing detailed technical
 approaches.  You should be willing to commit to actually doing
  stuff.

 The goal would be to provide a fully worked out scaffolding of a
practical
 recommendation system that presumably would become an example
  module in
 Mahout.


 On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com
  wrote:

  +1 as well.  Sounds fun.
 
  On Fri, Jul 19, 2013 at 4:06 

Re: Setting up a recommender

2013-07-21 Thread Pat Ferrel
Read the paper, and the preso.

As to the 'offline to Solr' part: it sounds like you are suggesting that an item-item 
similarity matrix be stored and indexed in Solr. One would have to create 
the action matrix from user profile data (preference history), run a 
RowSimilarity job on it (using LLR similarity), and move the result to Solr. The 
first part of this is nearly identical to the current recommender job workflow 
and could pretty easily be created from it, I think. The new part is taking the 
DistributedRowMatrix and storing it in a particular way in Solr, right?
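
A minimal sketch of what that first part might look like, assuming Mahout's
existing ItemSimilarityJob (which wraps the preference-matrix prep plus
RowSimilarityJob) and purely illustrative HDFS paths; the Solr hand-off is the
genuinely new piece and is sketched further down the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

public class ActionMatrixToIndicators {
  public static void main(String[] args) throws Exception {
    // Run the stock item-item similarity pipeline with LLR over an action log of
    // userID,itemID[,pref] lines. Paths and option values here are made up.
    ToolRunner.run(new Configuration(), new ItemSimilarityJob(), new String[] {
        "--input", "hdfs:///rec/actions.csv",
        "--output", "hdfs:///rec/item-item-llr",
        "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
        "--maxSimilaritiesPerItem", "50",   // keep only the strongest links per item
        "--booleanData", "true"             // treat actions as 0/1 events, not ratings
    });
    // Output is itemA, itemB, score triples; the new step is turning each item's
    // row of similar items into a Solr document.
  }
}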

BTW Is there some reason not to use an existing real data set?

On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

OK.  I think the crux here is the off-line to Solr part so let's see who
else pops up.

Having a solr maven could be very helpful.




Re: Setting up a recommender

2013-07-21 Thread B Lyon
Paper and presentation are very interesting to me as well.  I am fairly new
to this, and coming to terms with some of the terms, etc.  I assume that the
action matrix here is just the raw matrix of how each user has
interacted with the items/types-of-items.  I didn't quite get the
incorporation into Solr (not familiar with that much, either), in
particular the indexing related to the generated (root-LLR-based?)
co-occurrence matrices for the different types of things so that it can be
used in searches - so, a real newbie question: how can the co-occurrence
matrix be implemented as a search index in Solr?  Just pointing me at the RTFM
docs is fine :)
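
Not an authoritative answer, but one plausible shape of it, sketched with SolrJ:
each item becomes a Solr document whose "indicators" field holds the ids of its
LLR-significant co-occurring items (one thresholded row of the co-occurrence
matrix), and recommending is then an ordinary text query of the user's recent
item ids against that field, with Solr's relevance scoring doing the ranking.
The core and field names below are assumptions, not a fixed schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class IndicatorIndexSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

    // Index one row of the indicator matrix: "ipad" is linked (by LLR) to these items.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "ipad");
    doc.addField("indicators", "iphone macbook galaxy");
    solr.add(doc);
    solr.commit();

    // Query with the user's recent history OR'ed against the indicator field.
    SolrQuery query = new SolrQuery("indicators:(iphone macbook)");
    query.setRows(10);
    QueryResponse response = solr.query(query);
    for (SolrDocument d : response.getResults()) {
      System.out.println(d.getFieldValue("id"));   // items, most similar first
    }
  }
}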

On Sun, Jul 21, 2013 at 5:17 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Read the paper, and the preso.

 As to the 'offline to Solr' part. It sounds like you are suggesting an
 item item similarity matrix be stored and indexed in Solr. One would have
 to create the action matrix from user profile data (preference history), do
 a rowsimiarity job on it (using LLR similarity) and move the result to
 Solr. The first part of this is nearly identical to the current recommender
 job workflow and could pretty easily be created from it I think. The new
 part is taking the DistributedRowMatrix and storing it in a particular way
 in Solr, right?

 BTW Is there some reason not to use an existing real data set?

 On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 OK.  I think the crux here is the off-line to Solr part so let's see who
 else pops up.

 Having a solr maven could be very helpful.





-- 
BF Lyon
http://www.nowherenearithaca.com


Re: Setting up a recommender

2013-07-21 Thread Ted Dunning
Pat,

Yes.  The first part probably just is the RowSimilarity job, especially
after Sebastian puts in the down-sampling.

The new part is exactly as you say, storing the DRM into Solr indexes.

There is no reason to not use a real data set.  There is a strong reason to
use a synthetic dataset, however, in that it can be trivially scaled up and
down both in items and users.  Also, the synthetic dataset doesn't require
that the real data be found and downloaded.
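
For concreteness, a toy sketch of the kind of synthetic source being described:
user and item counts are parameters, item popularity is roughly Zipf-distributed,
and the output is userID,itemID lines that the off-line jobs can consume. Purely
illustrative, not a proposed design:

import java.util.Random;

public class SyntheticActions {
  public static void main(String[] args) {
    int users = 10000, items = 1000, actionsPerUser = 20;
    Random rnd = new Random(42);

    // Cumulative Zipf-like weights so a few popular items get most of the traffic.
    double[] cdf = new double[items];
    double sum = 0;
    for (int i = 0; i < items; i++) { sum += 1.0 / (i + 1); cdf[i] = sum; }

    for (int u = 0; u < users; u++) {
      for (int a = 0; a < actionsPerUser; a++) {
        double r = rnd.nextDouble() * sum;
        int item = 0;
        while (cdf[item] < r) item++;        // inverse-CDF sampling
        System.out.println("user" + u + ",item" + item);
      }
    }
  }
}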



On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Read the paper, and the preso.

 As to the 'offline to Solr' part. It sounds like you are suggesting an
 item item similarity matrix be stored and indexed in Solr. One would have
 to create the action matrix from user profile data (preference history), do
 a rowsimiarity job on it (using LLR similarity) and move the result to
 Solr. The first part of this is nearly identical to the current recommender
 job workflow and could pretty easily be created from it I think. The new
 part is taking the DistributedRowMatrix and storing it in a particular way
 in Solr, right?

 BTW Is there some reason not to use an existing real data set?

 On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 OK.  I think the crux here is the off-line to Solr part so let's see who
 else pops up.

 Having a solr maven could be very helpful.





Re: Setting up a recommender

2013-07-21 Thread Ted Dunning
On Sun, Jul 21, 2013 at 8:10 AM, Iker Huerga iker.hue...@gmail.com wrote:

 I think a conference call, maybe a hangout, to kick off the project would
 be useful, who should schedule it?


I will shortly do that.

I think that I will need more than one kickoff to deal with timezones.  I
will coordinate these ahead of time on the mailing list.  Due to the
limitations[1] of Google hangouts with regard to saving and scheduling
ahead of time, I will only be able to get the actual URL just shortly
before the scheduled time.  I will mail that to the mailing list and also
put the URL into the shared design directory, probably in a spreadsheet.
 The meetings will be visible on Youtube afterwards.


[1] The problem here is that I have been able to schedule a hangout, but
not to save that hangout to YouTube.  I have also been able to save an
unscheduled meetup, but was unable to figure out how to get a URL for such
a hangout ahead of time.  This may have changed, but I will still work
around it this time to be sure we succeed.


Re: Setting up a recommender

2013-07-21 Thread Pat Ferrel
RowSimilarity downsampling? Are you referring to a mod of the matrix 
multiply to do cross-similarity with LLR for the cross-recommendations? So 
similarity of rows of B with rows of A?

Sounds like you are proposing not only putting a recommender in Solr but also a 
cross-recommender? Is this why getting a real data set is problematic?

On Jul 21, 2013, at 3:40 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Pat,

Yes.  The first part probably just is the RowSimilarity job, especially
after Sebastian puts in the down-sampling.

The new part is exactly as you say, storing the DRM into Solr indexes.

There is no reason to not use a real data set.  There is a strong reason to
use a synthetic dataset, however, in that it can be trivially scaled up and
down both in items and users.  Also, the synthetic dataset doesn't require
that the real data be found and downloaded.



On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Read the paper, and the preso.
 
 As to the 'offline to Solr' part. It sounds like you are suggesting an
 item item similarity matrix be stored and indexed in Solr. One would have
 to create the action matrix from user profile data (preference history), do
 a rowsimiarity job on it (using LLR similarity) and move the result to
 Solr. The first part of this is nearly identical to the current recommender
 job workflow and could pretty easily be created from it I think. The new
 part is taking the DistributedRowMatrix and storing it in a particular way
 in Solr, right?
 
 BTW Is there some reason not to use an existing real data set?
 
 On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 OK.  I think the crux here is the off-line to Solr part so let's see who
 else pops up.
 
 Having a solr maven could be very helpful.
 
 
 



Re: Setting up a recommender

2013-07-21 Thread Ted Dunning
The row similarity downsampling is just a matter of dropping elements at
random from rows that have more data than we want.

If the join that puts the row together can handle two kinds of input, then
RowSimilarity can be easily modified to be CrossRowSimilarity.  Likewise,
if we have two DRMs with the same row ids in the same order, we can do a
map-side merge.  Such a merge can be very efficient on a system like MapR,
where you can control which nodes files live on.
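
A sketch of that down-sampling step, shown on a plain list of item ids for one
user rather than on actual DRM rows (which is where it would really happen):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RowDownSampling {
  // If a row has more non-zeros than we want, drop elements at random until it fits.
  static List<Long> downSample(List<Long> itemIds, int maxPerRow, Random rnd) {
    if (itemIds.size() <= maxPerRow) {
      return itemIds;                        // small rows pass through untouched
    }
    List<Long> copy = new ArrayList<Long>(itemIds);
    Collections.shuffle(copy, rnd);          // random order == random drops
    return new ArrayList<Long>(copy.subList(0, maxPerRow));
  }
}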


On Sun, Jul 21, 2013 at 4:43 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 RowSimilarity downsampling? Are you referring to the a mod of the matrix
 multiply to do cross similarity with LLR for the cross recommendations? So
 similarity of rows of B with rows of A?

 Sounds like you are proposing not only putting a recommender in Solr but
 also a cross-recommender? This is why getting a real data set is
 problematic?

 On Jul 21, 2013, at 3:40 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Pat,

 Yes.  The first part probably just is the RowSimilarity job, especially
 after Sebastian puts in the down-sampling.

 The new part is exactly as you say, storing the DRM into Solr indexes.

 There is no reason to not use a real data set.  There is a strong reason to
 use a synthetic dataset, however, in that it can be trivially scaled up and
 down both in items and users.  Also, the synthetic dataset doesn't require
 that the real data be found and downloaded.



 On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel pat.fer...@gmail.com wrote:

  Read the paper, and the preso.
 
  As to the 'offline to Solr' part. It sounds like you are suggesting an
  item item similarity matrix be stored and indexed in Solr. One would have
  to create the action matrix from user profile data (preference history),
 do
  a rowsimiarity job on it (using LLR similarity) and move the result to
  Solr. The first part of this is nearly identical to the current
 recommender
  job workflow and could pretty easily be created from it I think. The new
  part is taking the DistributedRowMatrix and storing it in a particular
 way
  in Solr, right?
 
  BTW Is there some reason not to use an existing real data set?
 
  On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  OK.  I think the crux here is the off-line to Solr part so let's see who
  else pops up.
 
  Having a solr maven could be very helpful.
 
 
 




Re: Setting up a recommender

2013-07-21 Thread Sebastian Schelter
At the moment, the down sampling is done by PreparePreferenceMatrixJob
for the collaborative filtering functionality. We just want to move it
down to RowSimilarityJob to enable standalone usage.

I think that the CrossRecommender should be the next thing on our
agenda, after we have the deployment infrastructure.  I especially like
that it's capable of including different kinds of interactions, as opposed
to most other (academically motivated) recommenders that focus on a
single interaction type like a rating.

--sebastian

On 22.07.2013 02:14, Ted Dunning wrote:
 The row similarity downsampling is just a matter of dropping elements at
 random from rows that have more data than we want.
 
 If the join that puts the row together can handle two kinds of input, then
 RowSimilarity can be easily modified to be CrossRowSimilarity.  Likewise,
 if we have two DRM's with the same row id's in the same order, we can do a
 map-side merge.  Such a merge can be very efficient on a system like MapR
 where you can control files to live on the same nodes.
 
 
 On Sun, Jul 21, 2013 at 4:43 PM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 RowSimilarity downsampling? Are you referring to the a mod of the matrix
 multiply to do cross similarity with LLR for the cross recommendations? So
 similarity of rows of B with rows of A?

 Sounds like you are proposing not only putting a recommender in Solr but
 also a cross-recommender? This is why getting a real data set is
 problematic?

 On Jul 21, 2013, at 3:40 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Pat,

 Yes.  The first part probably just is the RowSimilarity job, especially
 after Sebastian puts in the down-sampling.

 The new part is exactly as you say, storing the DRM into Solr indexes.

 There is no reason to not use a real data set.  There is a strong reason to
 use a synthetic dataset, however, in that it can be trivially scaled up and
 down both in items and users.  Also, the synthetic dataset doesn't require
 that the real data be found and downloaded.



 On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel pat.fer...@gmail.com wrote:

 Read the paper, and the preso.

 As to the 'offline to Solr' part. It sounds like you are suggesting an
 item item similarity matrix be stored and indexed in Solr. One would have
 to create the action matrix from user profile data (preference history),
 do
 a rowsimiarity job on it (using LLR similarity) and move the result to
 Solr. The first part of this is nearly identical to the current
 recommender
 job workflow and could pretty easily be created from it I think. The new
 part is taking the DistributedRowMatrix and storing it in a particular
 way
 in Solr, right?

 BTW Is there some reason not to use an existing real data set?

 On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 OK.  I think the crux here is the off-line to Solr part so let's see who
 else pops up.

 Having a solr maven could be very helpful.





 



Re: Setting up a recommender

2013-07-20 Thread Manuel Blechschmidt
Hello,
if there is high demand for this functionality, my company 
(http://www.apaxo.de/us/recitems.html) could implement it. Nevertheless, we 
can't do it for free, so if it is possible to get a shared budget from 
everybody who is interested in this, then it would be possible to write it.

The codehaus JIRA has an incentive functionality:
https://secure.donay.com/site/index

Perhaps this might also be useful for the Mahout (a.k.a. Apache) JIRA.

/Manuel

Am 20.07.2013 um 00:45 schrieb Ted Dunning:

 OK.  I think the crux here is the off-line to Solr part so let's see who
 else pops up.
 
 Having a solr maven could be very helpful.
 
 
 On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo 
 lcguerreroc...@gmail.com wrote:
 
 I'm currently working for a portal that has a similar use case and I was
 thinking of implementing this in a similar way. I'm generating
 recommendations using python scripts based on similarity measures (content
 based recommendation) only using euclidean distance and some weights for
 each attribute. I want to use mahout's GenericItemBasedRecommender to
 generate these same recommendations without user data (no tracking right
 now of user to item relationship). I was thinking of pushing the generated
 recommendations to solr using atomic updates since my fields are all stored
 right now. Since this is very similar to what I'm trying to accomplish, I
 would sign up to collaborate in any way I can since I'm fairly familiar
 with solr and I'm starting to learn my way around mahout.
 
 
 On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org
 wrote:
 
 I would also be willing to provide guidance and advice for anyone taking
 this on, I can especially help with the offline analysis part.
 
 --sebastian
 
 
 2013/7/19 Ted Dunning ted.dunn...@gmail.com
 
 I would be happy to supervise a project to implement a demo of this if
 anybody is willing to do the grunt work of gluing things together.
 
 Sooo, if you would like to work on this, here is a suggested project.
 
 This project would entail:
 
 a) build a synthetic data source
 
 b) write scripts to do the off-line analysis
 
 c) write scripts to export to Solr
 
 d) write a very quick web facade over Solr to make it look like a
 recommendation engine.  This would include
 
  d.1) a most popular page that does combined popularity rise and
 recommendation
 
  d.2) a personal recommendation page that does just recommendation
 with
 dithering
 
  d.3) item pages with related items at the bottom
 
 e) work with others to provide high quality system walk-through and
 install
 directions
 
 If you want to bite on this, we should arrange a weekly video hangout.
 I
 am willing to commit to guiding and providing detailed technical
 approaches.  You should be willing to commit to actually doing stuff.
 
 The goal would be to provide a fully worked out scaffolding of a
 practical
 recommendation system that presumably would become an example module in
 Mahout.
 
 
 On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com wrote:
 
 +1 as well.  Sounds fun.
 
 On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner 
 cont...@dhuebner.com
 wrote:
 
 +1 for getting something like that in a future release of Mahout
 
 On Jul 19, 2013, at 10:02 PM, Sebastian Schelter s...@apache.org
 wrote:
 
 It would be awesome if we could get a nice, easily deployable
 implementation of that approach into Mahout before 1.0
 
 
 2013/7/19 Ted Dunning ted.dunn...@gmail.com
 
 My current advice is to use Hadoop (if necessary) to build a
 sparse
 item-item matrix based on each kind of behavior you have and
 then
 drop
 those similarities into a search engine to deliver the actual
 recommendations.  This allows lots of flexibility in terms of
 which
 kinds
 of inputs you use for the recommendation and lets you blend
 recommendations
 with search and geo-location.
 
 
 On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
 helder.ga...@corp.terra.com.br wrote:
 
 Hi,
 I'm a dev working for a web portal in Brazil and I'm
 particularly
 interested in building a item-based collaborative filtering
 recommender
 for our database of news articles.
 After some coding, I was able to get some recommendations
 using a
 GenericItemBasedRecommender, a CassandraDataModel and some
 custom
 classes that store item similarities and migrated item IDs into
 Cassandra. But know I'm in doubt of what is normally done with
 this
 recommender: Should I run this as a daemon, cache the
 recommendations
 into memory and set up a web service to consult it online?
 Should I
 pre
 process these recommendations for each recent user and store it
 somewhere? My first idea was storing all these recs back into
 Cassandra,
 but looking into some classes it seems to me that the norm is
 to
 read
 the input data and store the output always using files. Is
 this a
 common
 practice that benefits from HDFS?
 My use case here is something around 70k recommendations
 requests
 per
 second.
 
 Thanks in 

Re: Setting up a recommender

2013-07-20 Thread Andrew Psaltis
I am very interested in collaborating on the off-line to Solr part. Just
let me know how we want to get going.

Thanks,
Andrew





On 7/19/13 4:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

OK.  I think the crux here is the off-line to Solr part so let's see who
else pops up.

Having a solr maven could be very helpful.


On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo 
lcguerreroc...@gmail.com wrote:

 I'm currently working for a portal that has a similar use case and I was
 thinking of implementing this in a similar way. I'm generating
 recommendations using python scripts based on similarity measures
(content
 based recommendation) only using euclidean distance and some weights for
 each attribute. I want to use mahout's GenericItemBasedRecommender to
 generate these same recommendations without user data (no tracking right
 now of user to item relationship). I was thinking of pushing the
generated
 recommendations to solr using atomic updates since my fields are all
stored
 right now. Since this is very similar to what I'm trying to accomplish,
I
 would sign up to collaborate in any way I can since I'm fairly familiar
 with solr and I'm starting to learn my way around mahout.


 On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org
 wrote:

  I would also be willing to provide guidance and advice for anyone
taking
  this on, I can especially help with the offline analysis part.
 
  --sebastian
 
 
  2013/7/19 Ted Dunning ted.dunn...@gmail.com
 
   I would be happy to supervise a project to implement a demo of this
if
   anybody is willing to do the grunt work of gluing things together.
  
   Sooo, if you would like to work on this, here is a suggested
project.
  
   This project would entail:
  
   a) build a synthetic data source
  
   b) write scripts to do the off-line analysis
  
   c) write scripts to export to Solr
  
   d) write a very quick web facade over Solr to make it look like a
   recommendation engine.  This would include
  
 d.1) a most popular page that does combined popularity rise and
   recommendation
  
 d.2) a personal recommendation page that does just
recommendation
  with
   dithering
  
 d.3) item pages with related items at the bottom
  
   e) work with others to provide high quality system walk-through and
  install
   directions
  
   If you want to bite on this, we should arrange a weekly video
hangout.
  I
   am willing to commit to guiding and providing detailed technical
   approaches.  You should be willing to commit to actually doing
stuff.
  
   The goal would be to provide a fully worked out scaffolding of a
  practical
   recommendation system that presumably would become an example
module in
   Mahout.
  
  
   On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com wrote:
  
+1 as well.  Sounds fun.
   
On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner 
 cont...@dhuebner.com
wrote:
   
 +1 for getting something like that in a future release of Mahout

 On Jul 19, 2013, at 10:02 PM, Sebastian Schelter
s...@apache.org
   wrote:

  It would be awesome if we could get a nice, easily deployable
  implementation of that approach into Mahout before 1.0
 
 
  2013/7/19 Ted Dunning ted.dunn...@gmail.com
 
  My current advice is to use Hadoop (if necessary) to build a
  sparse
  item-item matrix based on each kind of behavior you have and
 then
   drop
  those similarities into a search engine to deliver the actual
  recommendations.  This allows lots of flexibility in terms of
  which
 kinds
  of inputs you use for the recommendation and lets you blend
 recommendations
  with search and geo-location.
 
 
  On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
  helder.ga...@corp.terra.com.br wrote:
 
  Hi,
  I'm a dev working for a web portal in Brazil and I'm
 particularly
  interested in building a item-based collaborative filtering
recommender
  for our database of news articles.
  After some coding, I was able to get some recommendations
 using a
  GenericItemBasedRecommender, a CassandraDataModel and some
 custom
  classes that store item similarities and migrated item IDs
into
  Cassandra. But know I'm in doubt of what is normally done
with
  this
  recommender: Should I run this as a daemon, cache the
   recommendations
  into memory and set up a web service to consult it online?
  Should I
pre
  process these recommendations for each recent user and
store it
  somewhere? My first idea was storing all these recs back
into
 Cassandra,
  but looking into some classes it seems to me that the norm
is
 to
   read
  the input data and store the output always using files. Is
 this a
 common
  practice that benefits from HDFS?
  My use case here is something around 70k recommendations
 requests
   per
  second.
 
  Thanks in advance,
 
   

Re: Setting up a recommender

2013-07-20 Thread Ted Dunning
To kick this off, I have created a design document that is open for
comments.  Much detail is needed here.  I will create a JIRA as well, but
the google doc is much easier for collating lots of input into a coherent
document.

The directory that the document is stored in is accessible at

http://bit.ly/18vbbaT

Once we get going, we can talk about how to coordinate tasks between
hangouts.  One option is a public Trello project: https://trello.com/ or we
can use JIRA sub-tasks.


On Sat, Jul 20, 2013 at 11:25 AM, Andrew Psaltis 
andrew.psal...@webtrends.com wrote:

 I am very interested in collaborating on the off-line to Solr part. Just
 let me know how we want to get going.

 Thanks,
 Andrew





 On 7/19/13 4:45 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 OK.  I think the crux here is the off-line to Solr part so let's see who
 else pops up.
 
 Having a solr maven could be very helpful.
 
 
 On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo 
 lcguerreroc...@gmail.com wrote:
 
  I'm currently working for a portal that has a similar use case and I was
  thinking of implementing this in a similar way. I'm generating
  recommendations using python scripts based on similarity measures
 (content
  based recommendation) only using euclidean distance and some weights for
  each attribute. I want to use mahout's GenericItemBasedRecommender to
  generate these same recommendations without user data (no tracking right
  now of user to item relationship). I was thinking of pushing the
 generated
  recommendations to solr using atomic updates since my fields are all
 stored
  right now. Since this is very similar to what I'm trying to accomplish,
 I
  would sign up to collaborate in any way I can since I'm fairly familiar
  with solr and I'm starting to learn my way around mahout.
 
 
  On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org
  wrote:
 
   I would also be willing to provide guidance and advice for anyone
 taking
   this on, I can especially help with the offline analysis part.
  
   --sebastian
  
  
   2013/7/19 Ted Dunning ted.dunn...@gmail.com
  
I would be happy to supervise a project to implement a demo of this
 if
anybody is willing to do the grunt work of gluing things together.
   
Sooo, if you would like to work on this, here is a suggested
 project.
   
This project would entail:
   
a) build a synthetic data source
   
b) write scripts to do the off-line analysis
   
c) write scripts to export to Solr
   
d) write a very quick web facade over Solr to make it look like a
recommendation engine.  This would include
   
  d.1) a most popular page that does combined popularity rise and
recommendation
   
  d.2) a personal recommendation page that does just
 recommendation
   with
dithering
   
  d.3) item pages with related items at the bottom
   
e) work with others to provide high quality system walk-through and
   install
directions
   
If you want to bite on this, we should arrange a weekly video
 hangout.
   I
am willing to commit to guiding and providing detailed technical
approaches.  You should be willing to commit to actually doing
 stuff.
   
The goal would be to provide a fully worked out scaffolding of a
   practical
recommendation system that presumably would become an example
 module in
Mahout.
   
   
On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com
 wrote:
   
 +1 as well.  Sounds fun.

 On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner 
  cont...@dhuebner.com
 wrote:

  +1 for getting something like that in a future release of Mahout
 
  On Jul 19, 2013, at 10:02 PM, Sebastian Schelter
 s...@apache.org
wrote:
 
   It would be awesome if we could get a nice, easily deployable
   implementation of that approach into Mahout before 1.0
  
  
   2013/7/19 Ted Dunning ted.dunn...@gmail.com
  
   My current advice is to use Hadoop (if necessary) to build a
   sparse
   item-item matrix based on each kind of behavior you have and
  then
drop
   those similarities into a search engine to deliver the actual
   recommendations.  This allows lots of flexibility in terms of
   which
  kinds
   of inputs you use for the recommendation and lets you blend
  recommendations
   with search and geo-location.
  
  
   On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
   helder.ga...@corp.terra.com.br wrote:
  
   Hi,
   I'm a dev working for a web portal in Brazil and I'm
  particularly
   interested in building a item-based collaborative filtering
 recommender
   for our database of news articles.
   After some coding, I was able to get some recommendations
  using a
   GenericItemBasedRecommender, a CassandraDataModel and some
  custom
   classes that store item similarities and migrated item IDs

Re: Setting up a recommender

2013-07-19 Thread Ted Dunning
My current advice is to use Hadoop (if necessary) to build a sparse
item-item matrix based on each kind of behavior you have and then drop
those similarities into a search engine to deliver the actual
recommendations.  This allows lots of flexibility in terms of which kinds
of inputs you use for the recommendation and lets you blend recommendations
with search and geo-location.
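
A hedged illustration of that blending, using SolrJ: the recommendation part is a
query of the user's recent item ids against an indicator field, and search and geo
constraints ride along as ordinary query and filter clauses. The field names
("indicators", "text", "location") are assumptions, not a fixed schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class BlendedRecommendationQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");

    SolrQuery q = new SolrQuery();
    // Recommendation: items whose indicators overlap the user's history,
    // optionally mixed with a free-text search term.
    q.setQuery("indicators:(item12 item90 item77) OR text:camera");
    // Geo blend: only items within 10 km of the user (standard Solr spatial filter).
    q.addFilterQuery("{!geofilt sfield=location pt=45.15,-93.85 d=10}");
    q.setRows(20);

    System.out.println(solr.query(q).getResults());
  }
}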


On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
helder.ga...@corp.terra.com.br wrote:

 Hi,
 I'm a dev working for a web portal in Brazil and I'm particularly
 interested in building a item-based collaborative filtering recommender
 for our database of news articles.
 After some coding, I was able to get some recommendations using a
 GenericItemBasedRecommender, a CassandraDataModel and some custom
 classes that store item similarities and migrated item IDs into
 Cassandra. But know I'm in doubt of what is normally done with this
 recommender: Should I run this as a daemon, cache the recommendations
 into memory and set up a web service to consult it online? Should I pre
 process these recommendations for each recent user and store it
 somewhere? My first idea was storing all these recs back into Cassandra,
 but looking into some classes it seems to me that the norm is to read
 the input data and store the output always using files. Is this a common
 practice that benefits from HDFS?
 My use case here is something around 70k recommendations requests per
 second.

 Thanks in advance,

 --

 Atenciosamente
 Helder Martins
 Arquitetura do Portal e Sistemas de Backend
 +55 (51) 3284-4475
 Terra





Re: Setting up a recommender

2013-07-19 Thread Sebastian Schelter
It would be awesome if we could get a nice, easily deployable
implementation of that approach into Mahout before 1.0


2013/7/19 Ted Dunning ted.dunn...@gmail.com

 My current advice is to use Hadoop (if necessary) to build a sparse
 item-item matrix based on each kind of behavior you have and then drop
 those similarities into a search engine to deliver the actual
 recommendations.  This allows lots of flexibility in terms of which kinds
 of inputs you use for the recommendation and lets you blend recommendations
 with search and geo-location.


 On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
 helder.ga...@corp.terra.com.br wrote:

  Hi,
  I'm a dev working for a web portal in Brazil and I'm particularly
  interested in building a item-based collaborative filtering recommender
  for our database of news articles.
  After some coding, I was able to get some recommendations using a
  GenericItemBasedRecommender, a CassandraDataModel and some custom
  classes that store item similarities and migrated item IDs into
  Cassandra. But know I'm in doubt of what is normally done with this
  recommender: Should I run this as a daemon, cache the recommendations
  into memory and set up a web service to consult it online? Should I pre
  process these recommendations for each recent user and store it
  somewhere? My first idea was storing all these recs back into Cassandra,
  but looking into some classes it seems to me that the norm is to read
  the input data and store the output always using files. Is this a common
  practice that benefits from HDFS?
  My use case here is something around 70k recommendations requests per
  second.
 
  Thanks in advance,
 
  --
 
  Atenciosamente
  Helder Martins
  Arquitetura do Portal e Sistemas de Backend
  +55 (51) 3284-4475
  Terra
 
 
 



Re: Setting up a recommender

2013-07-19 Thread B Lyon
+1 as well.  Sounds fun.

On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner cont...@dhuebner.comwrote:

 +1 for getting something like that in a future release of Mahout

 On Jul 19, 2013, at 10:02 PM, Sebastian Schelter s...@apache.org wrote:

  It would be awesome if we could get a nice, easily deployable
  implementation of that approach into Mahout before 1.0
 
 
  2013/7/19 Ted Dunning ted.dunn...@gmail.com
 
  My current advice is to use Hadoop (if necessary) to build a sparse
  item-item matrix based on each kind of behavior you have and then drop
  those similarities into a search engine to deliver the actual
  recommendations.  This allows lots of flexibility in terms of which
 kinds
  of inputs you use for the recommendation and lets you blend
 recommendations
  with search and geo-location.
 
 
  On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
  helder.ga...@corp.terra.com.br wrote:
 
  Hi,
  I'm a dev working for a web portal in Brazil and I'm particularly
  interested in building a item-based collaborative filtering recommender
  for our database of news articles.
  After some coding, I was able to get some recommendations using a
  GenericItemBasedRecommender, a CassandraDataModel and some custom
  classes that store item similarities and migrated item IDs into
  Cassandra. But know I'm in doubt of what is normally done with this
  recommender: Should I run this as a daemon, cache the recommendations
  into memory and set up a web service to consult it online? Should I pre
  process these recommendations for each recent user and store it
  somewhere? My first idea was storing all these recs back into
 Cassandra,
  but looking into some classes it seems to me that the norm is to read
  the input data and store the output always using files. Is this a
 common
  practice that benefits from HDFS?
  My use case here is something around 70k recommendations requests per
  second.
 
  Thanks in advance,
 
  --
 
  Atenciosamente
  Helder Martins
  Arquitetura do Portal e Sistemas de Backend
  +55 (51) 3284-4475
  Terra
 
 
 
 




-- 
BF Lyon
http://www.nowherenearithaca.com


Re: Setting up a recommender

2013-07-19 Thread Dmitriy Lyubimov
On Fri, Jul 19, 2013 at 12:59 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 My current advice is to use Hadoop (if necessary) to build a sparse
 item-item matrix based on each kind of behavior you have and then drop
 those similarities into a search engine

you mean like Lucene / Katta?


 to deliver the actual
 recommendations.  This allows lots of flexibility in terms of which kinds
 of inputs you use for the recommendation and lets you blend recommendations
 with search and geo-location.


 On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
 helder.ga...@corp.terra.com.br wrote:

  Hi,
  I'm a dev working for a web portal in Brazil and I'm particularly
  interested in building a item-based collaborative filtering recommender
  for our database of news articles.
  After some coding, I was able to get some recommendations using a
  GenericItemBasedRecommender, a CassandraDataModel and some custom
  classes that store item similarities and migrated item IDs into
  Cassandra. But know I'm in doubt of what is normally done with this
  recommender: Should I run this as a daemon, cache the recommendations
  into memory and set up a web service to consult it online? Should I pre
  process these recommendations for each recent user and store it
  somewhere? My first idea was storing all these recs back into Cassandra,
  but looking into some classes it seems to me that the norm is to read
  the input data and store the output always using files. Is this a common
  practice that benefits from HDFS?
  My use case here is something around 70k recommendations requests per
  second.
 
  Thanks in advance,
 
  --
 
  Atenciosamente
  Helder Martins
  Arquitetura do Portal e Sistemas de Backend
  +55 (51) 3284-4475
  Terra
 
 
 



Re: Setting up a recommender

2013-07-19 Thread Ted Dunning
I would be happy to supervise a project to implement a demo of this if
anybody is willing to do the grunt work of gluing things together.

Sooo, if you would like to work on this, here is a suggested project.

This project would entail:

a) build a synthetic data source

b) write scripts to do the off-line analysis

c) write scripts to export to Solr

d) write a very quick web facade over Solr to make it look like a
recommendation engine.  This would include

  d.1) a most popular page that does combined popularity rise and
recommendation

  d.2) a personal recommendation page that does just recommendation with
dithering

  d.3) item pages with related items at the bottom

e) work with others to provide high quality system walk-through and install
directions

If you want to bite on this, we should arrange a weekly video hangout.  I
am willing to commit to guiding and providing detailed technical
approaches.  You should be willing to commit to actually doing stuff.

The goal would be to provide a fully worked out scaffolding of a practical
recommendation system that presumably would become an example module in
Mahout.
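
The dithering in d.2 isn't spelled out here; one common, simple recipe (assumed
for illustration, not prescribed by this thread) is to re-sort the top-N results by
log(rank) plus a little Gaussian noise, so the page order varies between views while
the best results still tend to stay near the top:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class Dithering {
  public static <T> List<T> dither(List<T> ranked, double epsilon, Random rnd) {
    final double[] keys = new double[ranked.size()];
    List<Integer> order = new ArrayList<Integer>();
    for (int i = 0; i < ranked.size(); i++) {
      keys[i] = Math.log(i + 1) + epsilon * rnd.nextGaussian();  // noisy pseudo-rank
      order.add(i);
    }
    // Sort positions by their noisy rank; small epsilon = mild shuffling near the top.
    Collections.sort(order, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) { return Double.compare(keys[a], keys[b]); }
    });
    List<T> result = new ArrayList<T>();
    for (int idx : order) {
      result.add(ranked.get(idx));
    }
    return result;
  }
}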


On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com wrote:

 +1 as well.  Sounds fun.

 On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner cont...@dhuebner.com
 wrote:

  +1 for getting something like that in a future release of Mahout
 
  On Jul 19, 2013, at 10:02 PM, Sebastian Schelter s...@apache.org wrote:
 
   It would be awesome if we could get a nice, easily deployable
   implementation of that approach into Mahout before 1.0
  
  
   2013/7/19 Ted Dunning ted.dunn...@gmail.com
  
   My current advice is to use Hadoop (if necessary) to build a sparse
   item-item matrix based on each kind of behavior you have and then drop
   those similarities into a search engine to deliver the actual
   recommendations.  This allows lots of flexibility in terms of which
  kinds
   of inputs you use for the recommendation and lets you blend
  recommendations
   with search and geo-location.
  
  
   On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
   helder.ga...@corp.terra.com.br wrote:
  
   Hi,
   I'm a dev working for a web portal in Brazil and I'm particularly
   interested in building a item-based collaborative filtering
 recommender
   for our database of news articles.
   After some coding, I was able to get some recommendations using a
   GenericItemBasedRecommender, a CassandraDataModel and some custom
   classes that store item similarities and migrated item IDs into
   Cassandra. But know I'm in doubt of what is normally done with this
   recommender: Should I run this as a daemon, cache the recommendations
   into memory and set up a web service to consult it online? Should I
 pre
   process these recommendations for each recent user and store it
   somewhere? My first idea was storing all these recs back into
  Cassandra,
   but looking into some classes it seems to me that the norm is to read
   the input data and store the output always using files. Is this a
  common
   practice that benefits from HDFS?
   My use case here is something around 70k recommendations requests per
   second.
  
   Thanks in advance,
  
   --
  
   Atenciosamente
   Helder Martins
   Arquitetura do Portal e Sistemas de Backend
   +55 (51) 3284-4475
   Terra
  
  
  
  
 
 


 --
 BF Lyon
 

Re: Setting up a recommender

2013-07-19 Thread Sebastian Schelter
I would also be willing to provide guidance and advice for anyone taking
this on, I can especially help with the offline analysis part.

--sebastian


2013/7/19 Ted Dunning ted.dunn...@gmail.com

 I would be happy to supervise a project to implement a demo of this if
 anybody is willing to do the grunt work of gluing things together.

 Sooo, if you would like to work on this, here is a suggested project.

 This project would entail:

 a) build a synthetic data source

 b) write scripts to do the off-line analysis

 c) write scripts to export to Solr

 d) write a very quick web facade over Solr to make it look like a
 recommendation engine.  This would include

   d.1) a most popular page that does combined popularity rise and
 recommendation

   d.2) a personal recommendation page that does just recommendation with
 dithering

   d.3) item pages with related items at the bottom

 e) work with others to provide high quality system walk-through and install
 directions

 If you want to bite on this, we should arrange a weekly video hangout.  I
 am willing to commit to guiding and providing detailed technical
 approaches.  You should be willing to commit to actually doing stuff.

 The goal would be to provide a fully worked out scaffolding of a practical
 recommendation system that presumably would become an example module in
 Mahout.


 On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com wrote:

  +1 as well.  Sounds fun.
 
  On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner cont...@dhuebner.com
  wrote:
 
   +1 for getting something like that in a future release of Mahout
  
   On Jul 19, 2013, at 10:02 PM, Sebastian Schelter s...@apache.org
 wrote:
  
It would be awesome if we could get a nice, easily deployable
implementation of that approach into Mahout before 1.0
   
   
2013/7/19 Ted Dunning ted.dunn...@gmail.com
   
My current advice is to use Hadoop (if necessary) to build a sparse
item-item matrix based on each kind of behavior you have and then
 drop
those similarities into a search engine to deliver the actual
recommendations.  This allows lots of flexibility in terms of which
   kinds
of inputs you use for the recommendation and lets you blend
   recommendations
with search and geo-location.
   
   
On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins 
helder.ga...@corp.terra.com.br wrote:
   
Hi,
I'm a dev working for a web portal in Brazil and I'm particularly
interested in building a item-based collaborative filtering
  recommender
for our database of news articles.
After some coding, I was able to get some recommendations using a
GenericItemBasedRecommender, a CassandraDataModel and some custom
classes that store item similarities and migrated item IDs into
Cassandra. But know I'm in doubt of what is normally done with this
recommender: Should I run this as a daemon, cache the
 recommendations
into memory and set up a web service to consult it online? Should I
  pre
process these recommendations for each recent user and store it
somewhere? My first idea was storing all these recs back into
   Cassandra,
but looking into some classes it seems to me that the norm is to
 read
the input data and store the output always using files. Is this a
   common
practice that benefits from HDFS?
My use case here is something around 70k recommendations requests
 per
second.
   
Thanks in advance,
   
--
   
Atenciosamente
Helder Martins
Arquitetura do Portal e Sistemas de Backend
+55 (51) 3284-4475
Terra
   
   
  

Re: Setting up a recommender

2013-07-19 Thread Ted Dunning
OK.  I think the crux here is the offline-to-Solr part, so let's see who
else pops up.

Having a Solr maven could be very helpful.


On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo 
lcguerreroc...@gmail.com wrote:

 I'm currently working for a portal that has a similar use case and I was
 thinking of implementing this in a similar way. I'm generating
 recommendations with Python scripts based on similarity measures
 (content-based recommendation), using only Euclidean distance and some
 weights for each attribute. I want to use Mahout's
 GenericItemBasedRecommender to generate these same recommendations
 without user data (we don't track the user-to-item relationship right
 now). I was thinking of pushing the generated recommendations to Solr
 with atomic updates, since all my fields are stored right now. Since this
 is very similar to what I'm trying to accomplish, I would sign up to
 collaborate in any way I can; I'm fairly familiar with Solr and I'm
 starting to learn my way around Mahout.
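
 For the atomic-update step, a minimal SolrJ sketch (the "recommended_items"
 field and the document IDs are placeholders made up for illustration)
 could look like this:

     import java.util.Collections;

     import org.apache.solr.client.solrj.impl.HttpSolrServer;
     import org.apache.solr.common.SolrInputDocument;

     public class AtomicUpdateExample {
       public static void main(String[] args) throws Exception {
         HttpSolrServer solr =
             new HttpSolrServer("http://localhost:8983/solr/items");

         // Atomic update: only the recommendation field is replaced
         // ("set"); Solr rebuilds the rest of the document from its
         // stored fields, which is why all fields need to be stored.
         SolrInputDocument doc = new SolrInputDocument();
         doc.addField("id", "item-123");
         doc.addField("recommended_items",
             Collections.singletonMap("set", "item-456 item-789 item-42"));

         solr.add(doc);
         solr.commit();
       }
     }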


 On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org
 wrote:

  I would also be willing to provide guidance and advice for anyone taking
  this on; I can especially help with the offline analysis part.
 
  --sebastian
 
 
  2013/7/19 Ted Dunning ted.dunn...@gmail.com
 
    I would be happy to supervise a project to implement a demo of this if
    anybody is willing to do the grunt work of gluing things together.

    Sooo, if you would like to work on this, here is a suggested project.

    This project would entail:

    a) build a synthetic data source (a rough sketch of this step follows
    below)

    b) write scripts to do the off-line analysis

    c) write scripts to export to Solr

    d) write a very quick web facade over Solr to make it look like a
    recommendation engine.  This would include

      d.1) a most-popular page that combines popularity rise and
      recommendation

      d.2) a personal recommendation page that does just recommendation
      with dithering

      d.3) item pages with related items at the bottom

    e) work with others to provide a high quality system walk-through and
    install directions

    If you want to bite on this, we should arrange a weekly video hangout.
    I am willing to commit to guiding and providing detailed technical
    approaches.  You should be willing to commit to actually doing stuff.

    The goal would be to provide a fully worked out scaffolding of a
    practical recommendation system that presumably would become an
    example module in Mahout.
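
    For step (a), a bare-bones synthetic data source could be as small as
    the following (the CSV layout and the popularity skew are arbitrary
    choices for illustration, not something Mahout prescribes):

        import java.util.Random;

        public class SyntheticLogGenerator {
          public static void main(String[] args) {
            Random random = new Random(42);
            int numUsers = 1000;
            int numItems = 200;

            // Emit "userId,itemId,action" lines. Squaring a uniform draw
            // skews item popularity so a few items dominate, roughly as
            // in real interaction logs.
            for (int i = 0; i < 10000; i++) {
              int user = random.nextInt(numUsers);
              double u = random.nextDouble();
              int item = (int) (u * u * numItems);
              String action = random.nextDouble() < 0.1 ? "purchase" : "view";
              System.out.println("u" + user + ",item" + item + "," + action);
            }
          }
        }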
  
  
