Re: Setting up a recommender
Pat and Ted: I am late to the party but this is very interesting! I am not sure I understand all the steps, though. Do you still create a cooccurrence matrix and compute LLR scores during this process, or do you only compute the matrix multiplications with the history vector: B'B * h and B'A * h?

Cheers, Frank

On Tue, Aug 13, 2013 at 7:49 PM, Pat Ferrel pat.fer...@gmail.com wrote:

I finally got some time to work on this and have a first cut at output to Solr working on the github repo. It only works on 2-action input, but I'll have that cleaned up soon so it will work with one action. Solr indexing has not been tested yet and the field names and/or types may need tweaking. It takes the result of the previous drop:

1) DRMs for B (user history of B items, action1) and A (user history of A items, action2)
2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence

There are two final outputs created using mapreduce but requiring 2 in-memory hashmaps. I think this will work on a cluster (the hashmaps are instantiated on each node) but I haven't tried yet. It orders items in the #2 fields by strength of link, which is the similarity value used in [B'B] or [B'A]. It would be nice to order #1 by recency, but there is no provision for passing through timestamps at present, so they are ordered by strength of preference. This is probably not useful and so can be ignored. Ordering by recency might be useful for truncating queries by recency while leaving the training data containing 100% of available history.

1) It joins the #1 DRMs to produce a single set of docs in CSV form, which looks like this:

id,history_b,history_a
user1,iphone ipad,iphone ipad galaxy
...

2) It joins the #2 DRMs to produce a single set of docs in CSV form, which looks like this:

id,b_b_links,b_a_links
u1,iphone ipad,iphone ipad galaxy
…

It may work on a cluster, I haven't tried yet. As soon as someone has some large-ish sample log files I'll give them a try. Check the sample input files in the resources dir for format.
https://github.com/pferrel/solr-recommender

On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote:

When I started looking at this I was a bit skeptical. As a search engine Solr may be peerless, but as yet another NoSQL db? However, getting further into this I see one very large benefit. It has one feature that sets it completely apart from the typical NoSQL db: the queries you do return fuzzy results--in the very best sense of that word. The most interesting queries are based on similarity to some exemplar. Results are returned in order of similarity strength, not ordered by a sort field. Wherever similarity-based queries are important I'll look at Solr first. SolrJ looks like an interesting way to get Solr queries on POJOs. It's probably at least an alternative to using docs and CSVs to import the data from Mahout.

On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Yes. That would be interesting.

On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote:

A little digression: might a Matrix implementation backed by a Solr index that uses SolrJ for querying help at all for the Solr recommendation approach? It supports multiple fields of String, Text, or boolean flags.

Best, Gokhan

On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Also a question about user history. I was planning to write these into separate directories so Solr could fetch them from different sources, but it occurs to me that it would be better to join A and B by user ID and output a doc per user ID with three fields: id, A item history, and B item history. Other fields could be added for user metadata. Sound correct? This is what I'll do unless someone stops me.

On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote:

Once you have a sample or example of what you think the log file version will look like, can you post it? It would be great to have example lines for two actions with or without the same item IDs.
I'll make sure we can digest it. I thought more about the ingest part and I don't think the one-item-space is actually a problem. It just means one item dictionary. A and B will have the right content; all I have to do is make sure the right ranks are input to the MM, Transpose, and RSJ. This in turn is only one extra count of the # of items in A's item space. This should be a very easy change if my thinking is correct.

On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:

On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:

4) To add more metadata to the Solr output will be left to the consumer for now. If there is a good data set to use we can illustrate how to do it in the project. Ted may have some data for this from musicbrainz.

I am working on this issue now.
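To Frank's question about the LLR step: here is a minimal, hedged sketch (not the project's actual code) of the standard log-likelihood ratio score for a 2x2 cooccurrence contingency table, computed as twice the total count times the mutual information between the two events. Mahout's LLR similarity uses an equivalent formulation:

```python
import math

def entropy(*counts):
    """Shannon entropy (in nats) of an unnormalized count distribution."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: users who did both events; k12/k21: one but not the other;
    k22: neither. Equals 2 * N * mutual_information(rows, cols).
    """
    n = k11 + k12 + k21 + k22
    row_h = entropy(k11 + k12, k21 + k22)   # marginal over rows
    col_h = entropy(k11 + k21, k12 + k22)   # marginal over columns
    cell_h = entropy(k11, k12, k21, k22)    # joint distribution
    return 2.0 * n * (row_h + col_h - cell_h)

# Perfect association scores high; independence scores ~0.
print(round(llr(10, 0, 0, 10), 3))  # strongly associated pair
print(round(llr(5, 5, 5, 5), 3))    # independent pair -> 0.0
```

Item pairs whose LLR clears a threshold are what survive into the indicator matrix; the score itself is usually discarded after filtering.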
Re: Setting up a recommender
Yes, the cooccurrence item similarity matrix is calculated using LLR, using Mahout’s RowSimilarityJob. I guess we are calling this an indicator matrix these days. The indicator matrix is then translated from a SequenceFile into a CSV (or other text-delimited file) which looks like a list of itemIDs—tokens or terms in Solr parlance—for each item. These documents are indexed by Solr and the query is the user history.

[B’B] is pre-calculated by RowSimilarityJob in Mahout. The user history is “multiplied” by the indicator matrix by using it as the Solr query against the indicator matrix, actually producing a cosine-similarity-ranked list of items. You have to squint a little to see the math. Any matrix product can be substituted with a row-to-column similarity metric, assuming dimensionality is correct, so the product in all the equations should be interpreted as such. So to get recs for a user, [B’B]h is done in two phases: one calculates [B’B] and one is a Solr query that adds the ‘h’ to the equation.

In this project https://github.com/pferrel/solr-recommender both [B’B] and [A’B] are calculated; the latter uses an actual matrix multiply, since we did not have a cross-RSJ at the time. Now that we have a cross-cooccurrence in the Spark Scala Mahout 2 stuff I’ll rewrite the code to use it. The cross indicator matrix allows you to use two different actions to predict a target action. So, for example, views that are similar to purchases can be used to recommend purchases. Take a look at the readme on github; it has a quick review of the theory.

BTW there is a video recommender site that demos some interesting uses of Solr to blend collaborative filtering recs with metadata. It even makes recs based off of your most recent detail views on the site. That last doesn’t work all that well because it is really a cross-recommendation and that isn’t built into the site yet.
https://guide.finderbots.com
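The two-phase [B'B]h computation described above can be sketched in a few lines. This is an illustrative toy with made-up data, and it uses a plain dot product in place of Solr's TF-IDF/cosine ranking; phase one computes the cooccurrence matrix [B'B] offline, phase two scores it against a user's history h, which is the role the Solr query plays:

```python
# Toy sketch of the two-phase recommendation: offline [B'B], then score with h.
ITEMS = ["iphone", "ipad", "galaxy"]

# B: users x items interaction matrix (1 = user acted on item).
B = [
    [1, 1, 0],   # u1
    [1, 1, 1],   # u2
    [0, 1, 1],   # u3
]

def transpose_times(b):
    """Phase 1: cooccurrence matrix [B'B] (items x items)."""
    n_items = len(b[0])
    c = [[0] * n_items for _ in range(n_items)]
    for row in b:
        for i in range(n_items):
            for j in range(n_items):
                c[i][j] += row[i] * row[j]
    return c

def recommend(c, h):
    """Phase 2: score = [B'B]h; a Solr query of h against the indicator
    docs plays this role, returning items ranked by similarity."""
    scores = [sum(c_row[j] * h[j] for j in range(len(h))) for c_row in c]
    return sorted(zip(ITEMS, scores), key=lambda t: -t[1])

BtB = transpose_times(B)
h = [1, 0, 0]                 # user history: acted on "iphone" only
print(recommend(BtB, h))      # items ranked by cooccurrence with history
```

In the real system the LLR filter would sparsify [B'B] before indexing, and items already in the user's history would normally be removed from the results.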
As soon as someone has some large-ish sample log files I'll give them a try. Check the sample input files in the resources dir for format. https://github.com/pferrel/solr-recommender On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote: When I started looking at this I was a bit skeptical. As a Search engine Solr may be peerless, but as yet another NoSQL db? However getting further into this I see one very large benefit. It has one feature that sets it completely apart from the typical NoSQL db. The type of queries you do return fuzzy results--in the very best sense of that word. The most interesting queries are based on similarity to some exemplar. Results are returned in order of similarity strength, not ordered by a sort field. Wherever similarity based queries are important I'll look at Solr first. SolrJ looks like an interesting way to get Solr queries on POJOs. It's probably at least an alternative to using docs and CSVs to import the data from Mahout. On Aug 12, 2013, at 2:32 PM, Ted Dunning
Re: Setting up a recommender
RowSimilarityJob is the guts of the work, but ItemSimilarityJob is usually easier packaging for users.
Re: Setting up a recommender
There are three things I could work on in my free time:

1) test this on a bigger data set gathered from rotten tomatoes, which only has B data (movie thumbs up)
2) begin work on the Solr query and service integration, rather than the current loose LucidWorks Search integration
3) make sure everything is set up for different item spaces in B and A

Planning to tackle them in this order, unless someone speaks up.

On Aug 16, 2013, at 1:39 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Works on a cluster but have only tested on the trivial test data set.

On Aug 13, 2013, at 4:49 PM, Pat Ferrel p...@occamsmachete.com wrote:

OK, single-action recs are working, so output to Solr with only [B'B] and B.

On Aug 13, 2013, at 10:52 AM, Pat Ferrel pat.fer...@gmail.com wrote:

Corrections inline
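Work item 3 (different item spaces in B and A) and the "one item dictionary" point both come down to ID bookkeeping: external item IDs must map to contiguous integer indices so the matrix ranks fed to the multiply, transpose, and RowSimilarityJob steps agree. A minimal illustrative sketch, with made-up names rather than the project's actual classes:

```python
class ItemDictionary:
    """Assigns a stable integer index to each external item ID.

    With a single dictionary shared by both actions, A and B items live
    in one item space; the only extra bookkeeping is the count of items
    seen, which fixes the matrix ranks downstream.
    """
    def __init__(self):
        self.index_of = {}
        self.id_of = []

    def intern(self, item_id):
        if item_id not in self.index_of:
            self.index_of[item_id] = len(self.id_of)
            self.id_of.append(item_id)
        return self.index_of[item_id]

# One dictionary shared across both action types (one item space):
d = ItemDictionary()
b_history = [d.intern(i) for i in ["iphone", "ipad"]]            # action1
a_history = [d.intern(i) for i in ["iphone", "ipad", "galaxy"]]  # action2
print(b_history, a_history, len(d.id_of))  # "galaxy" gets a new index
```

Separate item spaces would simply mean two dictionaries, one per action, with each matrix's column rank taken from its own dictionary.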
Re: Setting up a recommender
Pat, That really sounds great. I should find some time (who needs sleep) to generate music logs for you as well.
Re: Setting up a recommender
On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote:

On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:

4) To add more metadata to the Solr output will be left to the consumer for now. If there is a good data set to use we can illustrate how to do it in the project. Ted may have some data for this from musicbrainz.

I am working on this issue now. The current state is that I can bring in a bunch of track names and links to artist names and so on. This would provide the basic set of items (artists, genres, tracks and tags). There is a hitch in bringing in the data needed to generate the logs, since that part of MB is not Apache compatible. I am working on that issue. Technically, the data is in a massively normalized relational form right now, but it isn't terribly hard to denormalize into a form that we need.
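Denormalizing normalized relational data of the kind Ted describes into flat log lines is essentially a join. A hedged sketch with an entirely made-up schema (these tables and field names are illustrative, not MusicBrainz's actual schema):

```python
# Made-up normalized tables: tracks reference artists by ID.
artists = {1: "Radiohead", 2: "Bjork"}
tracks = [
    {"track_id": 10, "name": "Airbag", "artist_id": 1},
    {"track_id": 11, "name": "Joga",   "artist_id": 2},
]
# Made-up listen events referencing track IDs.
listens = [
    {"user": "u1", "track_id": 10},
    {"user": "u1", "track_id": 11},
]

def denormalize(listens, tracks, artists):
    """Join listens -> tracks -> artists into flat log lines
    (user, track name, artist name), the form a recommender ingests."""
    by_track = {t["track_id"]: t for t in tracks}
    lines = []
    for ev in listens:
        t = by_track[ev["track_id"]]
        lines.append((ev["user"], t["name"], artists[t["artist_id"]]))
    return lines

for line in denormalize(listens, tracks, artists):
    print(",".join(line))
```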
Re: Setting up a recommender
Corrections inline On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote: I finally got some time to work on this and have a first cut at output to Solr working on the github repo. It only works on 2-action input but I'll have that cleaned up soon so it will work with one action. Solr indexing has not been tested yet and the field names and/or types may need tweaking. It takes the result of the previous drop: 1) DRMs for B (user history or B items action1) and A (user history of A items action2) 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence There are two final outputs created using mapreduce but requiring 2 in-memory hashmaps. I think this will work on a cluster (the hashmaps are instantiated on each node) but haven't tried yet. It orders items in #2 fields by strength of link, which is the similarity value used in [B'B] or [B'A]. It would be nice to order #1 by recency but there is no provision for passing through timestamps at present so they are ordered by the strength of preference. This is probably not useful and so can be ignored. Ordering by recency might be useful for truncating queries by recency while leaving the training data containing 100% of available history. 1) It joins #1 DRMs to produce a single set of docs in CSV form, which looks like this: id,history_b,history_a u1,iphone ipad,iphone ipad galaxy ... 2) it joins #2 DRMs to produce a single set of docs in CSV form, which looks like this: id,b_b_links,b_a_links iphone,iphone ipad,iphone ipad galaxy … It may work on a cluster, I haven't tried yet. As soon as someone has some large-ish sample log files I'll give them a try. Check the sample input files in the resources dir for format. https://github.com/pferrel/solr-recommender On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote: When I started looking at this I was a bit skeptical. As a Search engine Solr may be peerless, but as yet another NoSQL db? 
However getting further into this I see one very large benefit. It has one feature that sets it completely apart from the typical NoSQL db. The type of query you do returns fuzzy results--in the very best sense of that word. The most interesting queries are based on similarity to some exemplar. Results are returned in order of similarity strength, not ordered by a sort field. Wherever similarity-based queries are important I'll look at Solr first. SolrJ looks like an interesting way to get Solr queries on POJOs. It's probably at least an alternative to using docs and CSVs to import the data from Mahout. On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. That would be interesting. …
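The #2 join described in the message above -- one CSV doc per item, with its [B'B] and [B'A] link lists each ordered by similarity strength -- can be sketched with the two DRMs stood in by plain dicts (field names are taken from the CSV header above; the dicts play the role of the two in-memory hashmaps):

```python
# Each DRM is represented as {item_id: {linked_item: strength}}.
# Items within a field are ordered by descending similarity strength,
# matching the [B'B] / [B'A] output described in the thread.
def join_link_docs(bb, ba):
    def field(row):
        ranked = sorted(row.items(), key=lambda kv: -kv[1])
        return " ".join(linked for linked, _ in ranked)

    lines = ["id,b_b_links,b_a_links"]
    for item in sorted(set(bb) | set(ba)):
        lines.append("%s,%s,%s" % (
            item, field(bb.get(item, {})), field(ba.get(item, {}))))
    return "\n".join(lines)

# Toy strengths, not real LLR / cooccurrence values.
bb = {"iphone": {"ipad": 4.2, "iphone": 8.0}}
ba = {"iphone": {"galaxy": 1.1, "iphone": 3.0, "ipad": 2.0}}
print(join_link_docs(bb, ba))
```
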
Re: Setting up a recommender
OK, single-action recs are working, so output to Solr works with only [B'B] and B. On Aug 13, 2013, at 10:52 AM, Pat Ferrel pat.fer...@gmail.com wrote: Corrections inline …
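The LLR scores used for [B'B] come from Dunning's log-likelihood ratio over the 2x2 cooccurrence counts (k11 = users who did both items, k12/k21 = one but not the other, k22 = neither). A sketch of the entropy form that Mahout's LogLikelihood class uses:

```python
import math

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # xLogX(total) - sum(xLogX(count)) over the given cells
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# Independent events score ~0; strong cooccurrence scores high,
# which is what makes LLR a good filter for [B'B] entries.
print(llr(10, 10, 10, 10))   # ~0.0
print(llr(100, 5, 5, 1000))  # large
```
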
Re: Setting up a recommender
A little digression: Might a Matrix implementation backed by a Solr index that uses SolrJ for querying help at all for the Solr recommendation approach? It supports multiple fields of String, Text, or boolean flags. Best Gokhan On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote: …
Re: Setting up a recommender
Yes. That would be interesting. On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote: A little digression: Might a Matrix implementation backed by a Solr index that uses SolrJ for querying help at all for the Solr recommendation approach? It supports multiple fields of String, Text, or boolean flags. Best Gokhan …
Re: Setting up a recommender
On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote: 4) To add more metadata to the Solr output will be left to the consumer for now. If there is a good data set to use we can illustrate how to do it in the project. Ted may have some data for this from musicbrainz. I am working on this issue now. The current state is that I can bring in a bunch of track names and links to artist names and so on. This would provide the basic set of items (artists, genres, tracks and tags). There is a hitch in bringing in the data needed to generate the logs since that part of MB is not Apache compatible. I am working on that issue. Technically, the data is in a massively normalized relational form right now, but it isn't terribly hard to denormalize into a form that we need.
Re: Setting up a recommender
Once you have a sample or example of what you think the log file version will look like, can you post it? It would be great to have example lines for two actions with or without the same item IDs. I'll make sure we can digest it. I thought more about the ingest part and I don't think the one-item-space is actually a problem. It just means one item dictionary. A and B will have the right content; all I have to do is make sure the right ranks are input to the MM, Transpose, and RSJ. This in turn is only one extra count of the # of items in A's item space. This should be a very easy change if my thinking is correct. On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote: 4) To add more metadata to the Solr output will be left to the consumer for now. If there is a good data set to use we can illustrate how to do it in the project. Ted may have some data for this from musicbrainz. I am working on this issue now. The current state is that I can bring in a bunch of track names and links to artist names and so on. This would provide the basic set of items (artists, genres, tracks and tags). There is a hitch in bringing in the data needed to generate the logs since that part of MB is not Apache compatible. I am working on that issue. Technically, the data is in a massively normalized relational form right now, but it isn't terribly hard to denormalize into a form that we need.
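For the example log lines requested above, something as simple as one action per line would digest easily. Assuming a hypothetical tab-separated "user action item" format (the real sample format lives in the repo's resources dir), ingestion just splits the log into one (user, item) list per action:

```python
# Hypothetical log format: user <tab> action <tab> item.
# The actual sample format is in the repo's resources dir;
# this is only a sketch of the split-by-action step.
def split_by_action(lines):
    actions = {}
    for line in lines:
        user, action, item = line.rstrip("\n").split("\t")
        actions.setdefault(action, []).append((user, item))
    return actions

log = [
    "u1\tpurchase\tiphone",
    "u1\tview\tgalaxy",
    "u2\tpurchase\tipad",
]
by_action = split_by_action(log)
# -> B input from "purchase" pairs, A input from "view" pairs
```
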
Re: Setting up a recommender
Also a question about user history. I was planning to write these into separate directories so Solr could fetch them from different sources, but it occurs to me that it would be better to join A and B by user ID and output a doc per user ID with three fields: id, A item history, and B item history. Other fields could be added for user metadata. Sound correct? This is what I'll do unless someone stops me. On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote: Once you have a sample or example of what you think the log file version will look like, can you post it? It would be great to have example lines for two actions with or without the same item IDs. I'll make sure we can digest it. I thought more about the ingest part and I don't think the one-item-space is actually a problem. It just means one item dictionary. A and B will have the right content; all I have to do is make sure the right ranks are input to the MM, Transpose, and RSJ. This in turn is only one extra count of the # of items in A's item space. This should be a very easy change if my thinking is correct. On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote: 4) To add more metadata to the Solr output will be left to the consumer for now. If there is a good data set to use we can illustrate how to do it in the project. Ted may have some data for this from musicbrainz. I am working on this issue now. The current state is that I can bring in a bunch of track names and links to artist names and so on. This would provide the basic set of items (artists, genres, tracks and tags). There is a hitch in bringing in the data needed to generate the logs since that part of MB is not Apache compatible. I am working on that issue. Technically, the data is in a massively normalized relational form right now, but it isn't terribly hard to denormalize into a form that we need.
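Joining A and B by user ID as proposed gives one doc per user with three fields. A sketch of that outer join over two history maps (field names follow the later CSV drop; the maps stand in for the two user-history DRMs):

```python
# b_hist / a_hist: {user_id: [item, ...]} from the two action matrices.
# Missing users in either map simply get an empty history field.
def join_user_docs(b_hist, a_hist):
    lines = ["id,history_b,history_a"]
    for user in sorted(set(b_hist) | set(a_hist)):
        lines.append("%s,%s,%s" % (
            user,
            " ".join(b_hist.get(user, [])),
            " ".join(a_hist.get(user, [])),
        ))
    return "\n".join(lines)

b_hist = {"u1": ["iphone", "ipad"]}
a_hist = {"u1": ["iphone", "ipad", "galaxy"]}
print(join_user_docs(b_hist, a_hist))
```
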
Re: Setting up a recommender
A note about today's hangout regarding the cross-recommender. In general it may be a good way to think about the current and proposed system as two pipelines: 1) a pipeline that takes preference data, turns it into two preference matrices in Mahout DRM form, and creates [B'B] and [B'A], ideally using LLR in the Row- and CrossRowSimilarityJobs. This generates two DRMs with Mahout keys and VectorWritable(s) with internal numerical Mahout IDs. There is one ID space for B and one for A. In the github repo these also create recommendations in Mahout form via an item-based RecommenderJob and XRecommenderJob. This last step is not needed when using Solr but may be useful for comparison. These jobs are all mapreduce and closely match the Mahout code and model of calculation. 2) a pipeline that processes IDs and other metadata contained in the logs. The IDs are user IDs in string form, as are the item IDs. But the items for the A action may be completely different from B. This cross-recommender ties the two together with a generalized notion of significant cooccurrence by executing the #1 pipeline and using the results. These log file IDs are what gets written out to Solr. Which IDs these are is encoded in the two Mahout-generated DRMs. The pipeline may need to bring along other metadata mined from the logs like item descriptions, tags, categories, etc. Note: this last bit is not built in at present but would make Solr queries even better. Also, at present A and B are assumed to have the same item IDs. This works for purchase+view actions and others, but not for some cross-actions that would be useful, like music track listen + tagged-category listen -> track recommendation, or music tagged-category listen + track listen -> category recommendation. The current action items are: 1) #1 is running and works but eventually needs to be reintegrated with new Mahout trunk code--my action item, with Sebastian's help. 
2) #2 needs to write the merged DRMs to Solr as one doc per row and 3 fields per doc (id, B'B, B'A)--I'm working on this now. 3) To generalize further we need to account for different ID spaces in #2 and I'll take that as an action item. 4) To add more metadata to the Solr output will be left to the consumer for now. If there is a good data set to use we can illustrate how to do it in the project. Ted may have some data for this from musicbrainz.
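The one-doc-per-row layout above works because both recommendation terms land in B's item space: the scores are [B'B] h_b + [B'A] h_a, where h_b and h_a are the user's history vectors in the B and A item spaces. A toy check of the dimensions with made-up (non-LLR) numbers:

```python
# Score = [B'B] h_b + [B'A] h_a : both terms are vectors over B's items,
# which is why one Solr doc per B item can hold both link fields.
def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Tiny example: 2 B-items, 3 A-items (toy strengths).
bb = [[2, 1],
      [1, 2]]          # [B'B], 2x2
ba = [[1, 0, 1],
      [0, 1, 1]]       # [B'A], 2x3
h_b = [1, 0]           # user touched B-item 0
h_a = [0, 0, 1]        # user touched A-item 2
scores = add(matvec(bb, h_b), matvec(ba, h_a))
print(scores)  # one score per B item
```
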
Re: Setting up a recommender
In writing the similarity matrices to Solr there is a bit of a problem. The matrices exist in two DRMs. The rows correspond to the doc IDs. As far as I know there is no guarantee that the IDs of both matrices are in the same descending order. The easiest solution is to have an index for [B'B] and one for [B'A]. That means two or perhaps three queries for cross-recommendations, which is not ideal. First I'm going to create two collections of docs with different field ids--this should work and we can merge them later. Next we can do some m/r to group the docs by id so there is one collection (csv) with one line per doc. Alternatively it is possible that the DRMs can be iterated simultaneously, which would also solve the problem. It assumes the order in both DRMs is the same, descending by Key = item ID. Even if a row is missing in one or the other this would work. Does anyone know if the DRMs are guaranteed to have row ordering by Key? RSJ creates [B'B] and matrix multiply creates [B'A] On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. We need two different sets of documents if the row spaces of the cross/co-occurrence matrices are different, as is the case with A'B and B'B. This could mean two indexes, or a single index with a special field to indicate what type of record you have. On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com wrote: Thanks, well put. In order to have the ultimate impl with two id spaces for A and B, would we have to create different docs for A'B and B'B? Since the doc IDs must come from A or B? The fields can contain different sets of IDs but the doc ID must be one or the other, right? Doesn't this imply separate indexes for the separate A and B item ID spaces? This is not a question for this first cut impl but is a generalization question. On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: So there is a lot of good discussion here and there were some key ideas. 
The first idea is that the *input* to a recommender is on the right in the matrix notation. This refers inherently to the id's on the columns of the recommender product (either B'B or B'A). The columns are defined by the right hand element of the product (either B or A in B'B and B'A respectively). The results are in the row space and are defined by the left hand operand of the product. In the case of B'A and B'B, the left hand operand is B in both cases so the row space is consistent. In order to implement this in a search engine, we need documents that correspond to rows of B'A or B'B. These are the same as the columns of B. The fields of the documents will necessarily include the following:
id: the column id from B corresponding to this item
description: presentation info ... yada yada
b-a-links: contents of this row of B'A expressed as id's from the column space of A where this row of llr-filter(B'A) contains a non-zero value.
b-b-links: contents of this row of B'B expressed as id's from the column space of B
...
The following operations are now single queries:
get an item where id = x: query is [id:x]
recommend based on behavior with regard to A items and actions h_a: query is [b-a-links: h_a]
recommend based on behavior with regard to B items and actions h_b: query is [b-b-links: h_b]
recommend based on a single item with id = x: query is [b-b-links: x]
recommend based on composite behavior composed of h_a and h_b: query is [b-a-links: h_a b-b-links: h_b]
Does this make sense by being more explicit? Now, it is pretty clear that we could have an index of A objects as well, but the link fields would have to be a-a-links and a-b-links, of course. On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote: Assuming Ted needs to call it, not sure if an invite has gone out, I haven't seen one. On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote: I am planning on sitting in as flaky connection allows. 
On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote: We doing a hangout at 2 on the Solr recommender?
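The single-query operations above reduce to ordinary Solr query strings over the link fields. A sketch of building them from a user's histories, assuming the underscored field names from the CSV output (Ted's hyphenated b-b-links / b-a-links in underscore form; escaping and the actual HTTP call to a /select handler are omitted):

```python
# Build the composite-behavior query [b-a-links: h_a b-b-links: h_b]
# as a Solr query string over the two link fields.
def rec_query(h_b=None, h_a=None):
    parts = []
    if h_b:
        parts.append("b_b_links:(%s)" % " ".join(h_b))
    if h_a:
        parts.append("b_a_links:(%s)" % " ".join(h_a))
    return " ".join(parts)

print(rec_query(h_b=["iphone", "ipad"]))          # B-history only
print(rec_query(h_b=["iphone"], h_a=["galaxy"]))  # composite behavior
```

Results come back ranked by Solr's similarity scoring, which is what makes a single query serve as the recommendation step.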
Re: Setting up a recommender
A quick map-reduce program should be able to join these matrices and produce documents ready to index. On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel p...@occamsmachete.com wrote: In writing the similarity matrices to Solr there is a bit of a problem. The Matrices exist in two DRMs. The rows correspond to the doc IDs. As far as I know there is no guarantee that the ids of both matrices are in the same descending order. The easiest solution is to have an index for [B'B] and one for [B'A]. That means two or perhaps three queries for cross-recommendations, which is not ideal. First I'm going to create two collections of docs with different field ids--this should work and we can merge them later. Next we can do some m/r to group the docs by id so there is one collection (csv) with one line per doc. Alternatively it is a possible that the DRMs can be iterated simultaneously, which would also solve the problem. It assumes the order in both DRMs is the same, descending by Key = item ID. Even if a row is missing in one or the other this would work. Does anyone know if the DRMs are guaranteed to have row ordering by Key? RSJ creates [B'B] and matrix multiply creates [B'A] On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. We need two different sets of documents if the row space of the cross/co-occurrence matrices are different as is the case with A'B and B'B. This could mean two indexes. Or a single index with a special field to indicate what type of record you have. On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com wrote: Thanks, well put. In order to have the ultimate impl with two id spaces for A and B would we have to create different docs for A'B and B'B? Since the docs IDs must come from A or B? The fields can contain different sets of IDs but the Doc ID must be one or the other, right? Doesn't this imply separate indexes for the separate A, B item IDs spaces? 
Re: Setting up a recommender
If you use the same partitioning and number of reducers for creating the outputs, the output should have the same number of sequence files and each sequence file should have the same keys in descending order. I don't understand why the ordering is a problem, can we not store the row index as a field in solr? 2013/8/5 Ted Dunning ted.dunn...@gmail.com A quick map-reduce program should be able to join these matrices and produce documents ready to index. On Mon, Aug 5, 2013 at 10:10 AM, Pat Ferrel p...@occamsmachete.com wrote: In writing the similarity matrices to Solr there is a bit of a problem. The Matrices exist in two DRMs. The rows correspond to the doc IDs. As far as I know there is no guarantee that the ids of both matrices are in the same descending order. The easiest solution is to have an index for [B'B] and one for [B'A]. That means two or perhaps three queries for cross-recommendations, which is not ideal. First I'm going to create two collections of docs with different field ids--this should work and we can merge them later. Next we can do some m/r to group the docs by id so there is one collection (csv) with one line per doc. Alternatively it is a possible that the DRMs can be iterated simultaneously, which would also solve the problem. It assumes the order in both DRMs is the same, descending by Key = item ID. Even if a row is missing in one or the other this would work. Does anyone know if the DRMs are guaranteed to have row ordering by Key? RSJ creates [B'B] and matrix multiply creates [B'A] On Aug 2, 2013, at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. We need two different sets of documents if the row space of the cross/co-occurrence matrices are different as is the case with A'B and B'B. This could mean two indexes. Or a single index with a special field to indicate what type of record you have. On Fri, Aug 2, 2013 at 2:39 PM, Pat Ferrel p...@occamsmachete.com wrote: Thanks, well put. 
In order to have the ultimate impl with two id spaces for A and B would we have to create different docs for A'B and B'B? Since the docs IDs must come from A or B? The fields can contain different sets of IDs but the Doc ID must be one or the other, right? Doesn't this imply separate indexes for the separate A, B item IDs spaces? This is not a question for this first cut impl but is a generalization question. On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: So there is a lot of good discussion here and there were some key ideas. The first idea is that the *input* to a recommender is on the right in the matrix notation. This refers inherently to the id's on the columns of the recommender product (either B'B or B'A). The columns are defined by the right hand element of the product (either B or A in the B'B and B'A respectively). The results are in the row space and are defined by the left hand operand of the product. IN the case of B'A and B'B, the left hand operand is B in both cases so the row space is consistent. In order to implement this in a search engine, we need documents that correspond to rows of B'A or B'B. These are the same as the columns of B. The fields of the documents will necessarily include the following: id: the column id from B corresponding to this item description: presentation info ... yada yada b-a-links: contents of this row of B'A expressed as id's from the column space of A where this row of llr-filter(B'A) contains a non-zero value. b-b-links: contents of this row of B'B expressed as id's from the column space of B ... 
The following operations are now single queries:

get an item where id = x: query is [id:x]
recommend based on behavior with regard to A items and actions h_a: query is [b-a-links: h_a]
recommend based on behavior with regard to B items and actions h_b: query is [b-b-links: h_b]
recommend based on a single item with id = x: query is [b-b-links: x]
recommend based on composite behavior composed of h_a and h_b: query is [b-a-links: h_a b-b-links: h_b]

Does this make sense by being more explicit? Now, it is pretty clear that we could have an index of A objects as well, but the link fields would have to be a-a-links and a-b-links, of course.

On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Assuming Ted needs to call it; not sure if an invite has gone out, I haven't seen one.

On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote:

I am planning on sitting in as flaky connection allows.

On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote:

We doing a hangout at 2 on the Solr recommender?
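[Editor's sketch] For concreteness, the query patterns above can be written out as plain Lucene/Solr query strings. The field names b-a-links and b-b-links come from the message; the helper function and the example histories are made up for illustration:

```python
# Sketch of the single-query operations above as Lucene/Solr query strings.
# Field names b-a-links / b-b-links are from the thread; links_query is hypothetical.

def links_query(field, history):
    """OR together the item ids in a user's history against a links field."""
    return " ".join("%s:%s" % (field, item) for item in history)

h_a = ["iphone", "galaxy"]   # user's history of action on A items
h_b = ["ipad"]               # user's history of action on B items

print("id:iphone")                                   # fetch a single item
print(links_query("b-a-links", h_a))                 # recs from A-item behavior
print(links_query("b-b-links", h_b))                 # recs from B-item behavior
# composite behavior: both fields in one query
print(links_query("b-a-links", h_a) + " " + links_query("b-b-links", h_b))
```

Because Solr scores the OR'd terms together, a single such query returns items ranked by how strongly they link to the whole history.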
Re: Setting up a recommender
I think m/r join is the best solution, too many assumptions otherwise. I thought Ted wanted a non-m/r implementation, but oh well, mostly non-m/r. Is there a good example to start from in Mahout? Yes, one id field per doc. The problem is not storing, it is joining rows from two DRMs by simple iteration.

On Aug 5, 2013, at 10:27 AM, Sebastian Schelter s...@apache.org wrote:

If you use the same partitioning and number of reducers for creating the outputs, the output should have the same number of sequence files and each sequence file should have the same keys in descending order. I don't understand why the ordering is a problem, can we not store the row index as a field in solr?
Re: Setting up a recommender
I still don't understand why we need to rely on doc ids. If we simply index that row A is similar to rows B, C and D, that should be fine, or am I wrong?

2013/8/5 Pat Ferrel p...@occamsmachete.com

I think m/r join is the best solution, too many assumptions otherwise. I thought Ted wanted a non-m/r implementation, but oh well, mostly non-m/r. Is there a good example to start from in Mahout? Yes, one id field per doc. The problem is not storing, it is joining rows from two DRMs by simple iteration.
Re: Setting up a recommender
Sebastian, There needs to be a join of the two row-similarity matrices to form documents.

Pat, What about just updating the document with the fields? Have three passes. Pass 1 puts the normal meta-data for the item in place. Pass 2 updates with data from B'B. Pass 3 updates with data from B'A. This will cause the entire index to be rewritten more than necessary, but it should be fast enough to be a non-issue.

On other fronts, I got musicbrainz downloaded over the weekend and have figured out the schema enough that I think I can produce recording, artist and tag information. From that, I can simulate user behavior and produce logs to push into the demo system. That will allow realistic scale and will allow users to explore the system in terms that they understand. There is still a question of whether we can redistribute the musicbrainz data, but I think I can arrange it so that anybody who wants to run the demo will just download the necessary data themselves. I may host a derived data product myself to simplify that process.

On Mon, Aug 5, 2013 at 10:59 AM, Sebastian Schelter s...@apache.org wrote:

I still don't understand why we need to rely on doc ids. If we simply index that row A is similar to rows B, C and D, that should be fine, or am I wrong?
Re: Setting up a recommender
Yeah, thought of that one too, but it still requires each be ordered by key, in which case simultaneous iteration works in one pass, I think. If the DRMs are always sorted by key you can iterate through each at the same time, writing only when you have both fields or know there is a field missing from one DRM. If you get the same key you write a combined doc; if you have different ones, write out one-sided until it catches up to the other. Every DRM I've examined seems to be ordered by key and I assume that is not an artifact of seqdumper. I'm using SequenceFileDirIterator so the part-file splits aren't a problem. A m/r join is pretty simple too but I'll go with non-m/r unless there is a problem above.

BTW the schema for the Solr csv is:

id,b_b_links,b_a_links
item1,itemX itemY,itemZ

am I missing some normal metadata?

On Aug 5, 2013, at 11:05 AM, Ted Dunning ted.dunn...@gmail.com wrote:

What about just updating the document with the fields? Have three passes. Pass 1 puts the normal meta-data for the item in place. Pass 2 updates with data from B'B. Pass 3 updates with data from B'A. This will cause the entire index to be rewritten more than necessary, but it should be fast enough to be a non-issue.
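[Editor's sketch] The simultaneous-iteration idea described above is a sorted merge join. A minimal Python sketch, with in-memory lists standing in for the sequence-file iterators (shown with ascending keys for simplicity; what matters is only that both streams use the same ordering, which is the assumption being debated):

```python
# One-pass merge of two key-sorted (key, field) streams into combined docs:
# matching keys produce a combined doc, a key present in only one stream
# produces a one-sided doc, as Pat describes above.

def merge_drms(bb_rows, ba_rows):
    docs, i, j = [], 0, 0
    while i < len(bb_rows) or j < len(ba_rows):
        if j >= len(ba_rows) or (i < len(bb_rows) and bb_rows[i][0] < ba_rows[j][0]):
            key, bb = bb_rows[i]; docs.append((key, bb, "")); i += 1
        elif i >= len(bb_rows) or ba_rows[j][0] < bb_rows[i][0]:
            key, ba = ba_rows[j]; docs.append((key, "", ba)); j += 1
        else:  # same key in both: write a combined doc
            docs.append((bb_rows[i][0], bb_rows[i][1], ba_rows[j][1])); i += 1; j += 1
    return docs

bb = [("ipad", "iphone"), ("iphone", "ipad")]          # rows of [B'B]
ba = [("iphone", "ipad galaxy"), ("nexus", "galaxy")]  # rows of [B'A]
for key, b_b_links, b_a_links in merge_drms(bb, ba):
    print("%s,%s,%s" % (key, b_b_links, b_a_links))    # the csv schema above
```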
Re: Setting up a recommender
On Mon, Aug 5, 2013 at 11:50 AM, Pat Ferrel pat.fer...@gmail.com wrote:

> Yeah thought of that one too but it still requires each be ordered by key, in which case simultaneous iteration works in one pass I think.

Multipass does not require ordering by key. Solr documents can be updated in any order.

> If the DRMs are always sorted by key you can iterate through each at the same time, writing only when you have both fields or know there is a field missing from one DRM. If you get the same key you write a combined doc, if you have different ones, write out one sided until it catches up to the other.

Yes. Merge will work when files are ordered and split consistently. I don't think we should be making that assumption.

> Every DRM I've examined seems to be ordered by key and I assume that is not an artifact of seqdumper. I'm using SequenceFileDirIterator so the part file splits aren't a problem.

But with the co- and cross-occurrence stuff, file splits could be a problem.

> A m/r join is pretty simple too but I'll go with non-m/r unless there is a problem above.

The simplest join is to use Solr updates. This would require a minimal amount of programming, but less than writing a merge program.

> BTW the schema for the Solr csv is: id,b_b_links,b_a_links item1,itemX itemY,itemZ am I missing some normal metadata?

An item description is nice.
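[Editor's sketch] The multipass join Ted describes can use Solr 4's atomic updates, where a {"set": ...} payload rewrites a single field of an existing document, so no key ordering is needed. A sketch that only builds the per-pass payloads (the item and field names are illustrative; posting the batches to Solr's /update handler is omitted):

```python
import json

# Pass 1 indexes item metadata; passes 2 and 3 "set" the link fields via
# Solr atomic updates, in any order and without any merge step.

def metadata_doc(item_id, description):
    return {"id": item_id, "description": description}

def link_update(item_id, field, links):
    # Atomic update: only the named field is modified on the existing doc.
    return {"id": item_id, field: {"set": " ".join(links)}}

pass1 = [metadata_doc("iphone", "a phone")]
pass2 = [link_update("iphone", "b_b_links", ["ipad", "nexus"])]   # from [B'B]
pass3 = [link_update("iphone", "b_a_links", ["ipad", "galaxy"])]  # from [B'A]

for batch in (pass1, pass2, pass3):
    print(json.dumps(batch))
```

Note that (in Solr 4 of this era) atomic updates require all fields to be stored, since Solr rewrites the whole document internally, which is why Ted says the index gets rewritten more than strictly necessary.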
Re: Setting up a recommender
We have had a cross recommender in production for about 3 months now, with the difference that we use Lucene to build indices from map-reduce directly, plus we do the same thing for 30+ customers, most of them with different input data structures (field names, values). We had something similar before (Lucene, multiple cross relations) that also used the similarity score (LLR) with a custom similarity and payloads, but switched to pure tedism after some helpful comments here. Therefore I read this thread with a lot of interest. What I can add from my experience:

1. I find it way easier to talk about this not in matrix-multiplication language but with contingency tables (a and b, a and not b, not a and b, not a and not b), and I also find the usage of the classical Mahout similarity jobs hard. This is probably because of my basic matrix math skills, but also because using matrices leads to id usage, and often the extracted items are text (search term, country, page section). Thinking of this as related terms automatically gives a document view on the item to be recommended (the Lucene doc), where description, name and everything else is also just a field.

2. When doing a simple table it's just cooccurrences, marginals and totals. Since the dimension of the marginals is often not too big (items, browsers, terms), we right now accumulate the counts in memory. Maybe the RowSimilarityJob is working the same way. This can be changed to a different implementation like an on-disk hash table or even a count-min sketch if the number of items is too large. The main point is that the counting of marginals can be done on the fly while emitting all cooccurrences.

3. Above in the thread there was a tip on approximating similarity scores with repeating terms. Payloads are a better way to do this, and with Lucene 4's doc-values capability there shouldn't be any Mahout similarity not expressible by a Lucene similarity.
Maybe it would be helpful to provide a Lucene delivery system for the classic Mahout recommender package as well. It adds so many possibilities for filtering and takes away a lot of pain, like caching etc.

4. A big question is the frequency of rebuilding. While the relations can often stay untouched for a day, the item data may change way more often (item churn, new items). It is therefore beneficial to separate those and have the possibility to rebuild the final index without calculating all similarities again (for very critical things this often leads to querying some external source to build up a Lucene filter that restricts the index).

Besides that, I am very happy to see the ongoing effort on this topic and hope that I can contribute something someday.

Cheers, Johannes
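[Editor's sketch] The contingency-table view in point 1 connects directly to the LLR scores used throughout this thread. A sketch of Dunning's log-likelihood ratio computed from the four cell counts, using the entropy formulation (the same form as Mahout's LogLikelihood class):

```python
import math

# k11 = a and b, k12 = a and not b, k21 = not a and b, k22 = not a and not b

def entropy(*counts):
    """Unnormalized Shannon entropy: sum of -k*log(k/total)."""
    total = sum(counts)
    return sum(-k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    row = entropy(k11 + k12, k21 + k22)   # entropy of row marginals
    col = entropy(k11 + k21, k12 + k22)   # entropy of column marginals
    mat = entropy(k11, k12, k21, k22)     # entropy of the whole table
    return 2.0 * (row + col - mat)

# Items seen together 13 times, a alone 1000, b alone 1000, neither 100000:
print(llr(13, 1000, 1000, 100000))
```

When the two events are independent the score is 0; large positive scores flag cooccurrences far beyond chance, which is what llr-filter(B'A) keeps.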
Re: Setting up a recommender
I put some thought into this (actually I slept on it) and I think the answer is in the math.

-- A = matrix of action2 by user, used for cross-action recommendations; for instance action2 = views.
-- B = matrix of action1 by user; these are the primary recommender's actions, for instance action1 = purchases.
-- H_a1 = all user history of action1 in column vectors. This may be all action1's recorded, and so may = B', or it may have truncated history to get more recent activity into recs.
-- H_a2 = all user history of action2 in column vectors. This may be all action2's recorded, and so may = A', or it may have truncated history to get more recent activity into recs.
-- [B'B]H_a1 = R_a1, recommendations from action1. Recommendations are for action1.
-- [B'A]H_a2 = R_a2, recommendations calculated from action2 where there was also an action1. Recommendations are for action1.
-- R_a1 + R_a2 = R; this assumes a non-weighted linear combination, and ideally they are weighted to optimize results.

The query on [B'A] will be column vectors from H_a2. Each is a user's history of action2 on A items. That is, if there were different items in A than B, then the query would be comprised of those items and run against the field that contains those items. This brings up a bunch of other questions, but for now we do not have separate items. It illustrates the fact that the query is user history of action2, so the items (though they have the same ID space in this case) should be from A or there would be no hits. Therefore we need the columns of [B'A] and [B'B]. [B'B] is symmetric, so rows are the same as columns. The confusion may come from the fact that Ted's mental model does not have the same items for both A and B. So the document ID cannot = item ID, since the docs contain items from both item ID spaces. In which case I don't know why they would be in the same doc at all, but that is another discussion. This model does not allow us to fetch a doc by ID.
But in our case, since we have the same IDs in A and B, we can put them in a doc with ID = item ID; the field similar_items can contain items from the B similarityMatrix rows, since they are the same as columns, and the cross_action_similar_items field will contain columns from [B'A]. This may just be mental looping--sleep only works about 50% of the time for me, so maybe someone else can check this reasoning. Have a look at the data here: https://github.com/pferrel/solr-recommender/blob/master/src/test/resources/Recommender%20Math.xlsx

On Aug 1, 2013, at 6:00 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Yes, storing the similar_items in a field, cross_action_similar_items in another field, all on the same doc id'ed by item ID. Agree that there may be other fields. Storing the rows of [B'B] is ok because it's symmetric. However we did talk about the [B'A] case and I thought we agreed to store the rows there too, because they were from B's items. This was the discussion about having different items for cross actions. The excerpt below is Ted responding to my question. So do we want the columns of [B'A]? It's only a transpose away.

On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:

[B'A] =
         iphone  ipad  nexus  galaxy  surface
iphone      2      2     2      1       0
ipad        2      2     2      1       0
nexus       1      1     1      1       0
galaxy      1      1     1      1       0
surface     0      0     0      0       1

The rows are what we want from [B'A] since the row items are from B, right?

Yes. It is easier to understand if you have different kinds of items as well as different actions. For instance, suppose that you have user x query terms (A) and user x device (B). B'A is then device x term, so that there is a row per device and the row contains terms. This is good when searching for devices using terms.

Talking about getting the actual doc field values, which will include the similar_items field and other metadata. The actual ids in the similar_items field work well for anonymous/no-history recs, but maybe there is a second query or fetch that I'm missing?
I assumed that a fetch of the doc and its fields by item ID was as fast a way to do this as possible. If there is some way to get the same result by doing a query that is faster, I'm all for it. Can do tomorrow at 2.
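[Editor's sketch] The algebra above can be checked with a toy example. A numpy sketch with made-up matrices (4 users x 3 items; B = purchases/action1, A = views/action2) computing R = [B'B]H_a1 + [B'A]H_a2 for a single user's history vectors:

```python
import numpy as np

# Toy data: rows are users, columns are items, so B'B and B'A are item-by-item.
B = np.array([[1, 0, 1],   # action1 (e.g. purchases)
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
A = np.array([[1, 1, 0],   # action2 (e.g. views)
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

BtB = B.T @ B   # item-item cooccurrence over action1
BtA = B.T @ A   # cross-cooccurrence: action1 items vs action2 items

# One user's histories as column vectors (h_a1 over B's items, h_a2 over A's items).
h_a1 = np.array([1, 0, 0])
h_a2 = np.array([1, 1, 0])

# R = [B'B]h_a1 + [B'A]h_a2, the non-weighted linear combination from the post.
r = BtB @ h_a1 + BtA @ h_a2
print(r)  # one score per action1 item; rank descending for recs
```

In the real pipeline the LLR filter sparsifies B'B and B'A before this multiply; the raw counts here are just to make the shapes and the combination concrete.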
Re: Setting up a recommender
Apologies for thrashing--definitely doing some mental looping but look at the cross-similarities on the Template sheet of the Excel file. The rows of [B'A] intuitively look best. Specifically there was a user who viewed the Surface and Nexus but the columns do not account for that, the rows do. Going from rows to columns is the trivial addition of a transpose so I'm going to go ahead with rows for now. This affects the cross_action_similar_items and so only the cross-recommender part of the whole. On Aug 2, 2013, at 8:00 AM, Pat Ferrel pat.fer...@gmail.com wrote: I put some thought into this (actually I slept on it) and I think the answer is in the math. -- A = matrix of action2 by user, used for cross-action recommendations, for instance action2 = views. -- B = matrix of action1 by user, these are the primary recommenders actions, for instance action1 = purchases. -- H_a1 = all user history of action1 in column vectors. This may be all action1's recorded and so may = B' or it may have truncated history to get more recent activity in recs. -- H_a2 = all user history of action2 in column vectors. This may be all action2's recorded and so may = A' or it may have truncated history to get more recent activity in recs. -- [B'B]H_a1 = R_a1, recommendations from action1. Recommendation are for action1. -- [B'A]H_a2 = R_a2, recommendations calculated from action2 where there was also an action1. recommendation are for action1. -- R_a1+ R_a2 = R, assumes a non-weighted linear combination, ideally they are weighted to optimize results. The query on [B'A] will be column vectors from H_a2. Each is a user's history of action2 on A items. That is if there were different items in A than B then the query would be comprised of those items and against the field that contains those items. This brings up a bunch of other questions but for now we do not have separate items. 
It illustrates the fact that the query is user history of action2 so the items (though they have the same ID space in this case) should be from A or there would be no hits. Therefore we need the columns of [B'A], and [B'B]. [B'B] is symmetric so rows are the same as columns. The confusion may come from the fact that Ted's mental model does not have the same items for both A and B. So the document ID cannot = item ID since the docs contain items from both item ID spaces. In which case I don't know why they would be in the same doc at all but that is another discussion. This model does not allow us to fetch a doc by ID. But in our case since we have the same IDs in A and B we can put them in a doc of ID=item ID, the field similair_items can contain items from B similarityMatrix rows since they are the same as columns, the cross_action_similar_items field will contain columns from [B'A] This may just be mental looping--sleep only work about 50% of the time for me so maybe someone else can check this reasoning. Have a look at the data here https://github.com/pferrel/solr-recommender/blob/master/src/test/resources/Recommender%20Math.xlsx On Aug 1, 2013, at 6:00 PM, Pat Ferrel pat.fer...@gmail.com wrote: Yes, storing the similar_items in a field, cross_action_similar_items in another field all on the same doc ided by item ID. Agree that there may be other fields. Storing the rows of [B'B] is ok because it's symmetric. However we did talk about the [B'A] case and I thought we agreed to store the rows there too because they were from Bs items. This was the discussion about having different items for cross actions. The excerpt below is Ted responding to my question. So do we want the columns of [B'A]? It's only a transpose away. 
On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:

[B'A] =
          iphone  ipad  nexus  galaxy  surface
iphone      2      2      2      1       0
ipad        2      2      2      1       0
nexus       1      1      1      1       0
galaxy      1      1      1      1       0
surface     0      0      0      0       1

The rows are what we want from [B'A] since the row items are from B, right?

Yes. It is easier to understand if you have different kinds of items as well as different actions. For instance, suppose that you have user x query terms (A) and user x device (B). B'A is then device x term, so that there is a row per device and the row contains terms. This is good when searching for devices using terms.

Talking about getting the actual doc field values, which will include the similar_items field and other metadata. The actual IDs in the similar_items field work well for anonymous/no-history recs, but maybe there is a second query or fetch that I'm missing? I assumed that a fetch of the doc and its fields by item ID was as fast a way to do this as possible. If there is some way to get the same result by doing a query that is faster, I'm all for it.

Can do tomorrow at 2.
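The algebra above can be checked with a small runnable sketch. This is plain Python (no Mahout), using the toy purchase/view data that appears later in this thread; it builds [B'B] and [B'A] and computes R = R_a1 + R_a2 for user u1 with the unweighted linear combination described above:

```python
# Plain-Python sketch of the algebra above, using the thread's toy data.
# B = purchases (action1), A = views (action2); rows are users u1..u4.
items = ["iphone", "ipad", "nexus", "galaxy", "surface"]

B = [[1, 1, 0, 0, 0],   # u1 purchased iphone, ipad
     [0, 0, 1, 1, 0],   # u2 purchased nexus, galaxy
     [0, 0, 0, 0, 1],   # u3 purchased surface
     [1, 1, 0, 0, 0]]   # u4 purchased iphone, ipad

A = [[1, 1, 1, 1, 0],   # u1 viewed iphone, ipad, nexus, galaxy
     [1, 1, 1, 1, 0],   # u2 viewed the same four
     [0, 0, 0, 0, 1],   # u3 viewed surface
     [1, 1, 1, 0, 0]]   # u4 viewed iphone, ipad, nexus

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(M, N):
    Nt = transpose(N)
    return [[sum(a * b for a, b in zip(row, col)) for col in Nt] for row in M]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

BtB = matmul(transpose(B), B)   # [B'B]: item-item cooccurrence within action1
BtA = matmul(transpose(B), A)   # [B'A]: cross-action cooccurrence

h_a1 = [1, 1, 0, 0, 0]          # u1's column of H_a1 (purchase history)
h_a2 = [1, 1, 1, 1, 0]          # u1's column of H_a2 (view history)

r_a1 = matvec(BtB, h_a1)        # R_a1 for u1
r_a2 = matvec(BtA, h_a2)        # R_a2 for u1
r = [x + y for x, y in zip(r_a1, r_a2)]  # unweighted R = R_a1 + R_a2

print(dict(zip(items, r)))      # {'iphone': 11, 'ipad': 11, 'nexus': 4, 'galaxy': 4, 'surface': 0}
```

Note that BtA reproduces the [B'A] table quoted above, row for row. In practice the user's own history would be filtered out of R and the two terms weighted, as discussed.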
Re: Setting up a recommender
I think the sheet is very helpful. I was wondering about having at least one of the examples be where the actions deal with completely different things, to maybe make it easier for newbies like me to grok the main points: purchases of items of type blah and views of videos, say. I think the input file has the same setup, etc. I don't get the issues/questions that come up when we do have separate items. And I thought Ted mentioned at one point that the weighting of recommendation vectors might not be necessary based on some kind of Solr magic, but I have no idea what that is.

Btw, I was already thinking of doing something for my own clarification/edification that is similar to your spreadsheet, but as a web page where a mouseover on one piece highlights the other pieces that generated it, e.g. the way the links in this PageRank explorer highlight the relevant portions of the Google matrix (https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/). There are lots of other different pieces here of course, but it would show connections soup-to-nuts as much as possible.

On Friday, August 2, 2013, Pat Ferrel wrote:

I put some thought into this (actually I slept on it) and I think the answer is in the math. [...]
Re: Setting up a recommender
On 8/2/13 12:13 PM, B Lyon bradfl...@gmail.com wrote:

The way the links in this PageRank explorer highlight the relevant portions of the Google matrix (https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA/). [...]

That is pretty darn cool, great job!
Re: Setting up a recommender
This first-cut project explicitly assumes a unified user and item space. This works well for many action pairs, not for others. The reason I did this to begin with was to use multiple actions for ecom recs. Views alone were not very predictive of purchases and needed the cross-recommender treatment. We did this using Mahout matrix math, so the issue of what to write to Solr did not come up. It worked fine, but now we find the need for an online method that will make use of realtime-generated preferences, i.e. ones not in the batch training data.

The math still works for multiple item spaces, but users must be in common. More generally, the rank and ID space currently associated with users must be the same. Feel free to create examples if you want. Ted has some ideas for using multiple item spaces in presos that are on Slideshare, I think.

On Aug 2, 2013, at 10:13 AM, B Lyon bradfl...@gmail.com wrote:

I think the sheet is very helpful. [...]
Re: Setting up a recommender
We doing a hangout at 2 on the Solr recommender?
Re: Setting up a recommender
I am planning on sitting in as flaky connection allows. On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote: We doing a hangout at 2 on the Solr recommender?
Re: Setting up a recommender
Assuming Ted needs to call it, not sure if an invite has gone out, I haven't seen one. On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote: I am planning on sitting in as flaky connection allows. On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote: We doing a hangout at 2 on the Solr recommender?
Re: Setting up a recommender
So there is a lot of good discussion here and there were some key ideas.

The first idea is that the *input* to a recommender is on the right in the matrix notation. This refers inherently to the id's on the columns of the recommender product (either B'B or B'A). The columns are defined by the right-hand element of the product (B or A in B'B and B'A respectively). The results are in the row space and are defined by the left-hand operand of the product. In the case of B'A and B'B, the left-hand operand is B in both cases, so the row space is consistent.

In order to implement this in a search engine, we need documents that correspond to rows of B'A or B'B. These are the same as the columns of B. The fields of the documents will necessarily include the following:

id: the column id from B corresponding to this item
description: presentation info ... yada yada
b-a-links: contents of this row of B'A expressed as id's from the column space of A, where this row of llr-filter(B'A) contains a non-zero value
b-b-links: contents of this row of B'B expressed as id's from the column space of B
...

The following operations are now single queries:

get an item where id = x -- query is [id:x]
recommend based on behavior with regard to A items and actions h_a -- query is [b-a-links: h_a]
recommend based on behavior with regard to B items and actions h_b -- query is [b-b-links: h_b]
recommend based on a single item with id = x -- query is [b-b-links: x]
recommend based on composite behavior composed of h_a and h_b -- query is [b-a-links: h_a b-b-links: h_b]

Does this make sense by being more explicit?

Now, it is pretty clear that we could have an index of A objects as well, but the link fields would have to be a-a-links and a-b-links, of course.

On Fri, Aug 2, 2013 at 1:25 PM, Pat Ferrel pat.fer...@gmail.com wrote:

Assuming Ted needs to call it, not sure if an invite has gone out, I haven't seen one.
On Aug 2, 2013, at 12:49 PM, B Lyon bradfl...@gmail.com wrote: I am planning on sitting in as flaky connection allows. On Aug 2, 2013 3:21 PM, Pat Ferrel pat.fer...@gmail.com wrote: We doing a hangout at 2 on the Solr recommender?
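The field layout and query operations Ted lists can be mocked up with a toy in-memory index. This is a hypothetical stand-in (plain Python, a list of dicts instead of Solr, link values taken from the thread's toy example); each "query" scores docs by term overlap with the named link field, roughly what a Solr OR-query over that field does:

```python
# Hypothetical in-memory stand-in for the index described above: one doc
# per B item; link fields hold space-delimited item ids. Not Solr itself.
docs = [
    {"id": "iphone",  "b-b-links": "ipad",   "b-a-links": "iphone ipad nexus galaxy"},
    {"id": "ipad",    "b-b-links": "iphone", "b-a-links": "iphone ipad nexus galaxy"},
    {"id": "nexus",   "b-b-links": "galaxy", "b-a-links": "iphone ipad nexus galaxy"},
    {"id": "galaxy",  "b-b-links": "nexus",  "b-a-links": "iphone ipad nexus galaxy"},
    {"id": "surface", "b-b-links": "",       "b-a-links": "surface"},
]

def search(field, history):
    """Score each doc by term overlap between its link field and the
    history, roughly what a Solr OR-query over that field would do;
    drop non-hits, break score ties by id."""
    terms = set(history.split())
    scored = [(len(terms & set(d[field].split())), d["id"]) for d in docs]
    return [i for s, i in sorted(scored, key=lambda t: (-t[0], t[1])) if s > 0]

print(search("b-b-links", "iphone ipad"))       # recommend from B-action history h_b
print(search("b-a-links", "iphone ipad nexus")) # recommend from A-action history h_a
print(search("b-b-links", "iphone"))            # similar items for a single item -> ['ipad']
```

The composite-behavior case is just both fields queried at once with the scores summed, which is exactly what a single Solr query over two fields gives you.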
Re: Setting up a recommender
I really like this approach, especially as it makes it possible to individually recompute and update certain similarity matrices. Furthermore, it should enable rapid experimentation, as it's super easy to retrieve recommendations based on different behaviors.

2013/8/2 Ted Dunning ted.dunn...@gmail.com:

So there is a lot of good discussion here and there were some key ideas. [...]
Re: Setting up a recommender
Thanks, well put. In order to have the ultimate impl with two ID spaces for A and B, would we have to create different docs for A'B and B'B, since the docs' IDs must come from A or B? The fields can contain different sets of IDs, but the doc ID must be one or the other, right? Doesn't this imply separate indexes for the separate A and B item ID spaces? This is not a question for this first-cut impl but is a generalization question.

On Aug 2, 2013, at 2:06 PM, Ted Dunning ted.dunn...@gmail.com wrote:

So there is a lot of good discussion here and there were some key ideas. [...]
Re: Setting up a recommender
Got away with that stupid comment. All doc IDs will be from B items, even in the general case.

On Aug 2, 2013, at 2:39 PM, Pat Ferrel p...@occamsmachete.com wrote:

Thanks, well put. In order to have the ultimate impl with two ID spaces for A and B, would we have to create different docs for A'B and B'B? [...]
Re: Setting up a recommender
Not following, so here is what I've done, in probably too much detail:

1) ingest raw log files and split them up by action
2) turn these into Mahout preference files using Mahout-type IDs, keeping a map of IDs
3) run the Mahout item-based recommender using LLR for similarity
4) create a Mahout-style cross-recommender using cooccurrence similarity, via matrix math
5) given two similarity matrices and a user history matrix, write them to CSV files with the Mahout IDs replaced by the original external string IDs for users and items

Input log file before splitting:

u1 purchase iphone
u1 purchase ipad
u2 purchase nexus-tablet
u2 purchase galaxy
u3 purchase surface
u4 purchase iphone
u4 purchase ipad
u1 view iphone
u1 view ipad
u1 view nexus-tablet
u1 view galaxy
u2 view iphone
u2 view ipad
u2 view nexus-tablet
u2 view galaxy
u3 view surface
u4 view iphone
u4 view ipad
u4 view nexus-tablet

Input user history DRM after ID translation to Mahout IDs and splitting out action purchase, B:

user/item  iphone  ipad  nexus-tablet  galaxy  surface
u1           1      1        0           0       0
u2           0      0        1           1       0
u3           0      0        0           0       1
u4           1      1        0           0       0

Map of IDs, Mahout to original/external:

0 - iphone
1 - ipad
2 - nexus-tablet
3 - galaxy
4 - surface

To be specific, the DRM from the RecommenderJob with item-item similarities using LLR looks like this:

Input Path: out/p-recs/sims/part-r-0
Key class: class org.apache.hadoop.io.IntWritable
Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {1:0.8472157541208549}
Key: 1: Value: {0:0.8472157541208549}
Key: 2: Value: {3:0.8181382096075936}
Key: 3: Value: {2:0.8181382096075936}
Key: 4: Value: {}

This will be written to a directory for later Solr indexing as a CSV of the form:

item_id,similar_items,cross_action_similar_items
iphone,ipad,
ipad,iphone,
nexus-tablet,galaxy,
galaxy,nexus-tablet,
surface,,

By using a user's history vector as a query you get results = recommendations. So if the user is u1, the history vector is: iphone ipad. The Solr results for the query "iphone ipad" against the field similar_items will be:

1. Doc ID, ipad
2. Doc ID, iphone

If you want item similarities, for instance when a user is anonymous with no history and is looking at an iphone product page, you would fetch the doc for id = iphone and get: ipad. Perhaps a bad example for ordering, since there is only one ID in the doc, but the items in the similar_items field would be ordered by similarity strength. Likewise for the cross-action similarities, though the matrix will have cooccurrence [B'A] values in the DRM.

For item similarities there is no need to do more than fetch one doc that contains the similarities, right? I've successfully used this method with the Mahout recommender, but please correct me if something above is wrong.

On Jul 31, 2013, at 4:52 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Pat, see inline.

On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel p...@occamsmachete.com wrote:

So the XML as CSV would be:

item_id,similar_items,cross_action_similar_items
ipad,iphone,iphone nexus
iphone,ipad,ipad galaxy

Right. Doesn't matter what format. Might want quotes around space-delimited lists, but anything will do.

Note: As I mentioned before, the order of the items in the field will encode the rank of the similarity strength. This is for cases where you want to find items similar to a context item. You would fetch the doc for the context item by its item ID and show the top k items in the doc. Ted's caveat would probably be to dither them.

I always say dither, so that is an easy one. But fetching similar items of a center item by fetching the center item and then fetching each of the referenced items is typically slower, by about 2x, than running the search for mentions of the center item.

Sounds like Ted is generating data. Andrew or M Lyon, do either of you want to set the demo system up? If so you'll need to find a system--free tier AWS, Ted's box, etc.--then install all the needed stuff. I'll get the output working to CSV.
On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:

OK and yes. The docs will look like:

<add>
  <doc>
    <field name='item_id'>ipad</field>
    <field name='similar_items'>iphone</field>
    <field name='cross_action_similar_items'>iphone nexus</field>
  </doc>
  <doc>
    <field name='item_id'>iphone</field>
    <field name='similar_items'>ipad</field>
    <field name='cross_action_similar_items'>ipad galaxy</field>
  </doc>
</add>

On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:

I'm interested in helping as well. Btw, I thought that what was stored in the Solr fields were the llr-filtered items (IDs I guess) for the
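Step 5 of Pat's pipeline--translating Mahout integer IDs back to external string IDs and writing one CSV row per item, similar items ordered by similarity strength--can be sketched in a few lines. This is a hypothetical stand-alone sketch, not the actual job code; the `sims` dict mirrors the DRM dump quoted above:

```python
# Hypothetical sketch of step 5: translate Mahout integer ids back to the
# external string ids and emit one CSV row per item, with similar items
# ordered by descending similarity strength.
id_map = {0: "iphone", 1: "ipad", 2: "nexus-tablet", 3: "galaxy", 4: "surface"}

# Item-item similarity DRM from the RecommenderJob: key -> {item id: LLR score}
sims = {0: {1: 0.8472}, 1: {0: 0.8472}, 2: {3: 0.8181}, 3: {2: 0.8181}, 4: {}}

def to_csv_rows(sims, id_map):
    rows = ["item_id,similar_items"]
    for key in sorted(sims):
        # order this row's similar items by descending similarity strength
        ordered = sorted(sims[key].items(), key=lambda kv: -kv[1])
        field = " ".join(id_map[i] for i, _ in ordered)
        rows.append("%s,%s" % (id_map[key], field))
    return rows

print("\n".join(to_csv_rows(sims, id_map)))
```

The real output adds a cross_action_similar_items column built the same way from the [B'A] DRM.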
Re: Setting up a recommender
On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote:

For item similarities there is no need to do more than fetch one doc that contains the similarities, right? I've successfully used this method with the Mahout recommender but please correct me if something above is wrong.

No. First, you need to retrieve all the other documents that are referenced to get their display meta-data. So this isn't just a one-document fetch. Second, the similar items point inwards, not outwards. Thus, the query you want has the id of the current item and searches the similar_items field. The result of that search is all of the similar items. The confusion here may stem from the name of the field. A name like linked-from-items or some such might help here.

Another way to look at this is that there should be no procedural difference if you have 10 items or 20 in your history. Either way, your history is a query against the appropriate link fields. Likewise, there should be no difference between having 10 items or 2 items in your history. There shouldn't even be any difference if you have just 1 item in your history. Finding items similar to a single item is exactly like having 1 item in your history. So that should be done by searching with that one item in the appropriate link fields.
Re: Setting up a recommender
Sorry to be dense, but I think there is some miscommunication. The most important question is: am I writing the item-item similarity matrix DRM out to Solr, one row = one Solr doc? For the mapreduce Mahout item-based recommender this is in tmp/similarityMatrix. If not, then please stop me. If I'm off base here, maybe a Skype or IM session will straighten me out: pat.fer...@gmail.com or p...@occamsmachete.com

To be clear, below I'm not talking about history-based recs, which is the primary use case. I am talking about a query that does not use history, one that only finds similar items based on training data. The item-item similarity matrix DRM contains Key = item ID, Value = list of item IDs with similarity strengths. This is equivalent to the list returned by ItemBasedRecommender's:

public List<RecommendedItem> mostSimilarItems(long itemID, int howMany) throws TasteException

Specified by: mostSimilarItems in interface ItemBasedRecommender
Parameters: itemID - ID of the item for which to find most similar other items; howMany - desired number of most similar items to find
Returns: items most similar to the given item, ordered from most similar to least

To get the list from Solr you would fetch the doc associated with itemID, no? When using the Mahout mapreduce item-based recommender we get the similarity matrix and do just that. We get the row associated with the Mahout itemID and recommend the top k items from the vector. This performs well in cross-validation tests.

On Aug 1, 2013, at 9:49 AM, Ted Dunning ted.dunn...@gmail.com wrote:

[...]
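The "similarity strengths" in that DRM come from Mahout's log-likelihood ratio test (the LLR option referenced throughout the thread). A minimal sketch of the standard LLR score over a 2x2 cooccurrence contingency table, written from scratch here to mirror Dunning's formulation (as used by Mahout's LogLikelihood class):

```python
import math

def xlogx(x):
    # x * log(x), with the conventional 0 * log(0) = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """LLR score for a 2x2 contingency table:
    k11 = both items seen, k12 = item A only, k21 = item B only, k22 = neither."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    if row_entropy + col_entropy < mat_entropy:  # guard against rounding
        return 0.0
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

print(round(llr(1, 1, 1, 1), 9))     # independent counts -> 0.0
print(round(llr(10, 0, 0, 10), 3))   # strongly associated -> 27.726
```

Items whose LLR score falls below a threshold are dropped from the row, which is the "llr-filter" Ted refers to when describing the b-a-links field.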
Re: Setting up a recommender
On Thu, Aug 1, 2013 at 11:58 AM, Pat Ferrel pat.fer...@gmail.com wrote: Sorry to be dense but I think there is some miscommunication. The most important question is: am I writing the item-item similarity matrix DRM out to Solr, one row = one Solr doc? Each row = one *field* in a Solr doc. Different DRM's produce different fields in the same docs. There will also be item meta-data in the field. For the mapreduce Mahout Item-based recommender this is in tmp/similarityMatrix. If not then please stop me. If I'm off base here, maybe a skype or im session will straighten me out. pat.ferrel@gmail.comor p...@occamsmachete.com Actually, that is a grand idea. Let's do a hangout. From the who-is-free-whenhttps://docs.google.com/forms/d/1skIaqe0CBWO4qemTyHCZwS40YjXJ9FeLCqwV8cw4Gno/viewformsurvey, it looks like lots of people are available tomorrow at 2PM PDT. Would that work? To be clear below I'm not talking about history based recs, which is the primary use case. I am talking about a query that does not use history, that only finds similar items based on training data. The item-item similarity matrix DRM contains Key = item ID, Value = list of item IDs with similarity strengths. Yes. I absolutely agree that you can do this. These should, strictly speaking, be columns in the item-item matrix. The item-item matrix may or may not be symmetric. If it is symmetric, then column or row doesn't matter. This is equivalent to the list returned by ItemBasedRecommender's public ListRecommendedItem mostSimilarItems(long itemID, int howMany) throws TasteException Yes. Specified by: mostSimilarItems in interface ItemBasedRecommender Parameters: itemID - ID of item for which to find most similar other items howMany - desired number of most similar items to find Returns: items most similar to the given item, ordered from most similar to least To get the list from Solr you would fetch the doc associated with itemID, no? If you store the column, then yes. 
If you store the row, then using a query on the field containing the similar items is the right answer.

The key difference that I have is what happens in the next step. When using the Mahout mapreduce item-based recommender we get the similarity matrix and do just that: we get the row associated with the Mahout item ID and recommend the top k items from the vector. This performs well in cross-validation tests.

Good. I think that there is a row/column confusion here, but they are probably nearly identical in your application.

The key point is what happens *after* you do the query that you are suggesting. In your case, you have to retrieve the meta-data associated with each of the related items. I like to store this meta-data in a Solr field (or three), so this involves at least one additional query. You can automatically chain this second query by using the join operation that Solr provides, but the second query still happens. If you do the query the way that I suggest, this second query doesn't need to happen. You get the meta-data directly.

On Aug 1, 2013, at 9:49 AM, Ted Dunning ted.dunn...@gmail.com wrote:

On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote: For item similarities there is no need to do more than fetch one doc that contains the similarities, right? I've successfully used this method with the Mahout recommender but please correct me if something above is wrong.

No. First, you need to retrieve all the other documents that are referenced to get their display meta-data. So this isn't just a one-document fetch.
Re: Setting up a recommender
I am wondering about row/column confusion as well - fleshing out the doc/design with more specifics (which Pat is basically doing) should make things obvious eventually, imo.

The way Pat had phrased it got me wondering what rationale you use to rank the results when you are querying the columns (similar column, similar-via-action-2 column, etc.). He had mentioned the auxiliary case of simply getting the most similar items to a given docid by going to the row for that docid and using the pre-sorted values in the similar column, and I thought Ted might have hinted that you could just as well do a Solr query of the column with that single docid as the query. In the latter case, however, I wonder if the order and the list itself could be weird, as some items may show up simply because they are not similar to many things: lower LLR values that got filtered out of the list for the docid itself won't get filtered when you're generating the Solr field lists for the other not-similar-to-very-many-items things. I guess using an absolute cutoff for LLR in the filtering could deal with some of this issue. All hypothetical at the moment (for me, anyway), as real data might trivially dismiss some of these concerns as irrelevant.

I think the hangout is a good idea, too, btw, and hope to be able to sit in if it happens. Very excited about this approach.

On Thu, Aug 1, 2013 at 6:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:
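For reference, the LLR filtering being discussed is Dunning's G² test over a 2×2 cooccurrence table, and an absolute cutoff would simply drop item pairs whose score falls below some threshold. A minimal sketch, using the entropy formulation that Mahout's LogLikelihood class uses (to the best of my understanding):

```python
import math

def xlogx(x):
    # x * ln(x), with the convention 0 * ln(0) = 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # unnormalized Shannon entropy of a list of counts
    total = sum(counts)
    return xlogx(total) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 cooccurrence table:
    k11 = users who acted on both items,
    k12 / k21 = users who acted on one but not the other,
    k22 = users who acted on neither.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)
```

Counts that look independent score near zero, while strongly associated pairs score high, so an absolute cutoff (say, keep only pairs with llr above some fixed value) would address the concern about weak links surviving in sparse rows.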
Re: Setting up a recommender
Yes, storing the similar_items in a field and cross_action_similar_items in another field, all on the same doc ID'd by item ID. Agree that there may be other fields.

Storing the rows of [B'B] is OK because it's symmetric. However, we did talk about the [B'A] case and I thought we agreed to store the rows there too, because they were from B's items. This was the discussion about having different items for cross actions. The excerpt below is Ted responding to my question. So do we want the columns of [B'A]? It's only a transpose away.

On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:

[B'A] =
           iphone  ipad  nexus  galaxy  surface
  iphone        2     2      2       1        0
  ipad          2     2      2       1        0
  nexus         1     1      1       1        0
  galaxy        1     1      1       1        0
  surface       0     0      0       0        1

The rows are what we want from [B'A] since the row items are from B, right?

Yes. It is easier to understand if you have different kinds of items as well as different actions. For instance, suppose that you have user x query terms (A) and user x device (B). B'A is then device x term, so that there is a row per device and the row contains terms. This is good when searching for devices using terms.

Talking about getting the actual doc field values, which will include the similar_items field and other metadata. The actual ids in the similar_items field work well for anonymous/no-history recs, but maybe there is a second query or fetch that I'm missing? I assumed that a fetch of the doc and its fields by item ID was as fast a way to do this as possible. If there is some way to get the same result by doing a query that is faster, I'm all for it.

Can do tomorrow at 2.
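For concreteness, the [B'A] numbers above can be reproduced from the purchase (B) and view (A) histories in the sample log used elsewhere in this thread. A pure-Python sketch, with the user-by-item matrices hand-transcribed from that sample data (item names abbreviated as in the matrix above):

```python
# Columns: iphone, ipad, nexus, galaxy, surface. Rows: users u1..u4.
# B = purchase actions, A = view actions (from the sample log data).
B = [[1, 1, 0, 0, 0],
     [0, 0, 1, 1, 0],
     [0, 0, 0, 0, 1],
     [1, 1, 0, 0, 0]]
A = [[1, 1, 1, 1, 0],
     [1, 1, 1, 1, 0],
     [0, 0, 0, 0, 1],
     [1, 1, 1, 0, 0]]

def transpose_times(B, A):
    """Compute B'A: entry [i][j] counts users who did action 1 (B)
    on item i and action 2 (A) on item j."""
    n_users, n_items = len(B), len(B[0])
    return [[sum(B[u][i] * A[u][j] for u in range(n_users))
             for j in range(n_items)] for i in range(n_items)]

BtA = transpose_times(B, A)
```

Each row of BtA matches the corresponding row of the [B'A] matrix quoted above, which is a useful sanity check that rows of [B'A] are indexed by B's items.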
Re: Setting up a recommender
A few architectural questions: http://bit.ly/18vbbaT

I created a local instance of LucidWorks Search on my dev machine. I can quite easily save the similarity vectors from the DRMs into docs at special locations and index them with LucidWorks. But to ingest the docs and put them in separate fields of the same index we need some new code (unless I've missed some Lucid config magic) that does the indexing and integrates with LucidWorks.

I imagine two indexes. One index for the similarity matrix and optionally the cross-similarity matrix, in two fields of type 'string'. Another index for users' history--we could put the docs there for retrieval by user ID. The user history docs then become the query on the similarity index and would return recommendations. Or any realtime collected or generated history could be used too. Is this what you imagined, Ted? Especially WRT Lucid integration?

Someone could probably donate their free-tier EC2 instance and set this up pretty easily. Not sure if this would fit given free-tier memory, but maybe for small data sets. To get this available for actual use we'd need:

1 -- An instance with an IP address somewhere to run the ingestion and customized LucidWorks Search.
2 -- Synthetic data created using Ted's tool.
3 -- Customized Solr indexing code for integration with LucidWorks? Not sure how this is done. I can do the Solr part but have not looked into Lucid integration yet.
4 -- Flesh out the rest of Ted's outline, but 1-3 will give a minimally running example.

Assuming I've got this right, does someone want to help with these?

Another way to approach this is to create a standalone codebase that requires Mahout and Solr and supplies an API, something like the proposed Mahout SGD online recommender or Myrrix. This would be easier to consume but would lack all the UI and inspection code of LucidWorks.
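The two-index flow described above (look up or build a user's history, then use it as the query against the similarity index) might look like this over Solr's standard HTTP select API. Everything here is hypothetical: the host, collection name, and field names are placeholders, not part of the project.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and schema -- placeholders for illustration only.
SOLR_SELECT = "http://localhost:8983/solr/similarity/select"

def recommend_url(history_b, history_a, rows=10):
    """Turn a user's two action histories into one Solr request:
    action-1 history queries the [B'B] field, action-2 history queries
    the cross-action [B'A] field; the top `rows` docs are the recs."""
    q = "similar_items:({}) OR cross_action_similar_items:({})".format(
        " ".join(history_b), " ".join(history_a))
    return SOLR_SELECT + "?" + urlencode({"q": q, "rows": rows, "wt": "json"})

url = recommend_url(["iphone", "ipad"], ["iphone", "ipad", "galaxy"])
```

In the architecture sketched above, `history_b`/`history_a` would come from a fetch against the user-history index (or from realtime collected history), and the returned docs are the recommended items with their metadata already attached.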
Re: Setting up a recommender
Assuming I've got this right, does someone want to help with these?

Pat -- I would be interested in helping in any way needed. I believe Ted's tool is a start, but does not handle all the cases envisioned in the design doc, although I could be wrong on this. Anyway, I'm pretty open to helping wherever needed.

Thanks,
Andrew

On 7/31/13 12:20 PM, Pat Ferrel pat.fer...@gmail.com wrote:
Re: Setting up a recommender
OK, looks like there *is* some magic in the Lucid config. I believe all I need to do is write out the docs using Solr XML, defining fields for each similarity type and the doc name. The rest can be done by standard Lucid hand configuration. I believe this will minimally handle #3 below.

On Jul 31, 2013, at 11:20 AM, Pat Ferrel pat.fer...@gmail.com wrote:
Re: Setting up a recommender
I'm interested in helping as well. Btw, I thought that what was stored in the Solr fields were the LLR-filtered items (ids, I guess) for the could-be-recommended things.

On Jul 31, 2013 2:31 PM, Andrew Psaltis andrew.psal...@webtrends.com wrote:
Re: Setting up a recommender
OK and yes. The docs will look like:

<add>
  <doc>
    <field name='item_id'>ipad</field>
    <field name='similar_items'>iphone</field>
    <field name='cross_action_similar_items'>iphone nexus</field>
  </doc>
  <doc>
    <field name='item_id'>iphone</field>
    <field name='similar_items'>ipad</field>
    <field name='cross_action_similar_items'>ipad galaxy</field>
  </doc>
</add>

On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:
Re: Setting up a recommender
On Wed, Jul 31, 2013 at 11:20 AM, Pat Ferrel pat.fer...@gmail.com wrote:

A few architectural questions: http://bit.ly/18vbbaT I created a local instance of the LucidWorks Search on my dev machine. I can quite easily save the similarity vectors from the DRMs into docs at special locations and index them with LucidWorks. But to ingest the docs and put them in separate fields of the same index we need some new code (unless I've missed some Lucid config magic) that does the indexing and integrates with LucidWorks. I imagine two indexes. One index for the similarity matrix and optionally the cross-similarity matrix in two fields of type 'string'. Another index for users' history--we could put the docs there for retrieval by user ID. The user history docs then become the query on the similarity index and would return recommendations. Or any realtime collected or generated history could be used too. Is this what you imagined Ted? Especially WRT Lucid integration?

Yes. And I note in a later email that you discovered how Lucid provides lots of connectors for different formats. XML is fine. I have also used CSV.

Someone could probably donate their free tier EC2 instance and set this up pretty easily. Not sure if this would fit given free tier memory but maybe for small data sets.

It should fit, actually. I can donate a real-ish machine as well.

To get this available for actual use we'd need: 1 -- An instance with an IP address somewhere to run the ingestion and customized LucidWorks Search. 2 -- Synthetic data created using Ted's tool. 3 -- Customized Solr indexing code for integration with LucidWorks? Not sure how this is done. I can do the Solr part but have not looked into Lucid integration yet. 4 -- Flesh out the rest of Ted's outline but 1-3 will give a minimally running example. Assuming I've got this right, does someone want to help with these?

I will work on synthetic data later today. I have a tool that does this for Drill.
I plan to pull down MusicBrainz and use the tags on artists as hidden variables to drive synthetic user behavior. Should produce reasonable-looking recommendations.

Another way to approach this is to create a stand alone codebase that requires Mahout and Solr and supplies an API something like the proposed Mahout SGD online recommender or Myrrix. This would be easier to consume but would lack all the UI and inspection code of LucidWorks.

I think that for a demo, the inspection is crucial. Adding the API is easy and can even be done in the same instance as LW is running.
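The hidden-variable idea can be sketched quickly: give each item a tag, give each synthetic user a preferred tag, and have users act mostly on items carrying their tag. A toy sketch only; the item/tag assignments and probabilities here are invented for illustration, not taken from Ted's tool.

```python
import random

random.seed(42)  # reproducible toy data

# Hypothetical tag assignments (the hidden variables).
ITEM_TAGS = {
    "iphone": "phone", "galaxy": "phone",
    "ipad": "tablet", "nexus-tablet": "tablet", "surface": "tablet",
}

def synth_log(n_users=100, actions=("view", "purchase"), p_offtopic=0.1):
    """Generate (user, action, item) triples where each user mostly
    acts on items matching a randomly chosen hidden tag preference."""
    items = list(ITEM_TAGS)
    tags = sorted(set(ITEM_TAGS.values()))
    log = []
    for u in range(1, n_users + 1):
        tag = random.choice(tags)  # the user's hidden preference
        for action in actions:
            for item in items:
                on_topic = ITEM_TAGS[item] == tag
                if random.random() < (0.8 if on_topic else p_offtopic):
                    log.append(("u%d" % u, action, item))
    return log
```

Because the hidden tag drives both actions, the resulting cooccurrence and cross-action matrices should recover the tag structure, which is what makes the synthetic data a reasonable test for the recommender pipeline.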
Re: Setting up a recommender
The input, which we need synthesized, is a log file (TSV or CSV) that looks like this:

u1	purchase	iphone
u1	purchase	ipad
u2	purchase	nexus-tablet
u2	purchase	galaxy
u3	purchase	surface
u4	purchase	iphone
u4	purchase	ipad
u1	view	iphone
u1	view	ipad
u1	view	nexus-tablet
u1	view	galaxy
u2	view	iphone
u2	view	ipad
u2	view	nexus-tablet
u2	view	galaxy
u3	view	surface
u4	view	iphone
u4	view	ipad
u4	view	nexus-tablet

This is the example in the github project solr-recommender/src/test/resources/logged-preferences/* The columns can be in any order and can have other columns interspersed. For testing it would be nice to have one action, two, and several. This implementation is in-memory for mapping ids, so nothing huge as far as how many ids are generated. Ted can talk about the distribution of actions.

On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:
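A sketch of the ingestion step described above (not the project's actual code): parse the tab-separated log, assign dense integer ids the way Mahout needs them, and split records by action. Column positions are parameters since the columns may be interspersed with others.

```python
import csv
import io

# A tiny inline sample in the log format discussed above.
SAMPLE = (
    "u1\tpurchase\tiphone\n"
    "u1\tview\tiphone\n"
    "u1\tview\tnexus-tablet\n"
    "u2\tpurchase\tgalaxy\n"
)

def ingest(lines, user_col=0, action_col=1, item_col=2):
    """Build user/item id dictionaries and one {user_id: [item_ids]}
    map per action from a TSV log of (user, action, item) rows."""
    user_ids, item_ids, by_action = {}, {}, {}
    for row in csv.reader(lines, delimiter="\t"):
        user, action, item = row[user_col], row[action_col], row[item_col]
        uid = user_ids.setdefault(user, len(user_ids))
        iid = item_ids.setdefault(item, len(item_ids))
        by_action.setdefault(action, {}).setdefault(uid, []).append(iid)
    return user_ids, item_ids, by_action

users, items, by_action = ingest(io.StringIO(SAMPLE))
```

The per-action maps correspond to the B and A matrices: `by_action["purchase"]` is the sparse form of B and `by_action["view"]` of A, ready for the B'B and B'A computations.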
Re: Setting up a recommender
The fields actually point the other direction. They contain items which, if they appear in a history, indicate that the current document is a good recommendation. This reversal of roles is what makes search work.

Going the other way works for a single doc, but that only gives a list of ids which then have to be retrieved. Better to have the tags for the single doc on all the related docs, so that a single retrieval will pull them all in with their details.

On Wed, Jul 31, 2013 at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:
Re: Setting up a recommender
I'd vote for csv then.

On Jul 31, 2013, at 12:00 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Re: Setting up a recommender
Sorry, not sure what you are saying. If the LLR-created DRM has a row:

  Key: 0, Value: { 1: 1.0, }

where 0 -> iphone and 1 -> ipad, then wouldn't the doc look like

  <doc>
    <field name='item_id'>ipad</field>
    <field name='similar_items'>iphone</field>
  </doc>

or rather the csv equivalent?

On Jul 31, 2013, at 12:04 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Re: Setting up a recommender
Oops, mistyped… If the LLR-created DRM has a row:

  Key: 1, Value: { 0: 1.0, }

where 0 -> iphone and 1 -> ipad, then wouldn't the doc look like

  <doc>
    <field name='item_id'>ipad</field>
    <field name='similar_items'>iphone</field>
  </doc>

On Jul 31, 2013, at 12:14 PM, Pat Ferrel pat.fer...@gmail.com wrote:
Re: Setting up a recommender
So the XML as CSV would be:

item_id,similar_items,cross_action_similar_items
ipad,iphone,iphone nexus
iphone,ipad,ipad galaxy

Note: As I mentioned before, the order of the items in the field will encode the rank of the similarity strength. This is for cases where you want to find items similar to a context item. You would fetch the doc for the context item by its item ID and show the top k items in the doc. Ted's caveat would probably be to dither them.

Sounds like Ted is generating data. Andrew or B Lyon, do either of you want to set the demo system up? If so you'll need to find a system--free tier AWS, Ted's box, etc.--then install all the needed stuff. I'll get the output working to csv.

On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:
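The "position encodes rank" convention could be produced like this. A sketch only, with a made-up similarity row; in the real pipeline the weights would come from an LLR-filtered [B'B] or [B'A] row.

```python
def field_value(weights, k=10):
    """Render one similarity-DRM row as a Solr field value: item ids
    ordered by descending similarity strength, so that position in
    the space-delimited list encodes rank."""
    ranked = sorted(weights.items(), key=lambda kv: -kv[1])
    return " ".join(item for item, _ in ranked[:k])

# Hypothetical row: item ids mapped to similarity strengths.
row = {"iphone": 2.1, "nexus": 0.4, "ipad": 1.7}
value = field_value(row)  # "iphone ipad nexus"
```

Truncating to the top k at indexing time keeps the fields small while preserving the ranking needed for the fetch-by-item-ID use case.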
Re: Setting up a recommender
Slick idea IMO on the ordering in the field. Fyi, to answer your question: I am new to a lot of these pieces (and without sustained access to a non-tablet PC for the next four days), so I cannot at the moment be relied on for the demo setup given this apparent pace, but I would like to help as possible with grunt/doc stuff if someone more familiar with the relevant pieces can use it.

On Wednesday, July 31, 2013, Pat Ferrel wrote:

--
BF Lyon
http://www.nowherenearithaca.com
Re: Setting up a recommender
Pat, See inline

On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel p...@occamsmachete.com wrote:

So the XML as CSV would be:

item_id,similar_items,cross_action_similar_items
ipad,iphone,iphone nexus
iphone,ipad,ipad galaxy

Right. Doesn't matter what format. Might want quotes around space-delimited lists, but anything will do.

Note: As I mentioned before, the order of the items in the field will encode the rank of the similarity strength. This is for cases where you want to find similar items to a context item. You would fetch the doc for the context item by its item ID and show the top k items in the doc. Ted's caveat would probably be to dither them.

I always say dither so that is an easy one. But fetching similar items of a center item by fetching the center item and then fetching each of the referenced items is typically slower by about 2x than running the search for mentions of the center item.

Sounds like Ted is generating data. Andrew or M Lyon, do either of you want to set the demo system up? If so you'll need to find a system--free tier AWS, Ted's box, etc. Then install all the needed stuff. I'll get the output working to CSV.

On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:

OK and yes. The docs will look like:

<add>
  <doc>
    <field name='item_id'>ipad</field>
    <field name='similar_items'>iphone</field>
    <field name='cross_action_similar_items'>iphone nexus</field>
  </doc>
  <doc>
    <field name='item_id'>iphone</field>
    <field name='similar_items'>ipad</field>
    <field name='cross_action_similar_items'>ipad galaxy</field>
  </doc>
</add>

On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:

I'm interested in helping as well. Btw I thought that what was stored in the Solr fields were the LLR-filtered items (IDs I guess) for the could-be-recommended things.
Re: Setting up a recommender
Well its a work in progress but you can see it here: https://github.com/pferrel/solr-recommender There is no Solr integration yet, it is just ingest, create id indexes, run RecommenderJob, and XRecommenderJob. These create the item similarity matrixes, which will be put into Solr. They also create all recommendations for all users. The code is quite, er..., fresh. If you are actually going to work on the project or test it, I can fix things as they come up but not all options are supported or needed to get the overall system running. Put bugs in github. The happy path works with my trivial sample data so I'll proceed to moving the sim matrixes to Solr. I'll revisit robustifying the project later if it proves useful.
Re: Setting up a recommender
Actually I'm not sure the downsampling is best put in RowSimilarityJob since that doesn't work for the XRecommender. The similarity matrix there is calculated by [B'A] matrix multiply. RSJ would be great if it could work on two DRMs, then we could use other similarity measures (LLR please). Also I'm not sure if it's needed in RSJ since I use PreparePreferenceMatrixJob for the RecommenderJob, which calculates the main action item similarity matrix (using RSJ in any case). But for the XRecommender I modified PreparePreferenceMatrixJob to create two DRMs and called it PreparePreferenceMatrixesJob. It has downsampling in it, if you mean limiting the number of prefs per user. Check if I'm wrong. On Jul 29, 2013, at 10:17 PM, Sebastian Schelter s...@apache.org wrote: Downsampling is now moved directly into RowSimilarityJob. I'll have a look at Pat's code later this week. On 23.07.2013 19:38, Ted Dunning wrote: On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote: This pipeline lacks downsampling since I had to replace PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume Sebastian is the person to talk to about these bits? I think that is a good source. If you post your code, he may be able to comment on how to integrate the down-sampling in a general way.
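For reference, a minimal sketch of the downsampling being discussed, in the "limiting the number of prefs per user" sense: cap each user's preference list by dropping elements at random from over-long rows. This is illustrative only, not Mahout's RowSimilarityJob or PreparePreferenceMatrixJob code.

```python
import random

def downsample_prefs(prefs, max_prefs_per_user, seed=0):
    """prefs: {user_id: [item_id, ...]}. Rows longer than the cap are
    sampled at random down to max_prefs_per_user; short rows pass
    through untouched."""
    rng = random.Random(seed)
    return {user: (items if len(items) <= max_prefs_per_user
                   else rng.sample(items, max_prefs_per_user))
            for user, items in prefs.items()}
```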
Re: Setting up a recommender
In the cross-recommender the similarity matrix is calculated doing [B'A]. We want the rows to be stored as the item-item similarities in Solr, right? [B'B] is symmetric so just want to make sure I have it straight for [B'A].

B = purchases

        iphone  ipad  nexus  galaxy  surface
u1        1      1     0      0       0
u2        0      0     1      1       0
u3        0      0     0      0       1
u4        1      1     0      0       0

B' =

         u1  u2  u3  u4
iphone    1   0   0   1
ipad      1   0   0   1
nexus     0   1   0   0
galaxy    0   1   0   0
surface   0   0   1   0

A = views

        iphone  ipad  nexus  galaxy  surface
u1        1      1     1      1       0
u2        1      1     1      1       0
u3        0      0     0      0       1
u4        1      1     1      0       0

[B'A] =

        iphone  ipad  nexus  galaxy  surface
iphone    2      2     2      1       0
ipad      2      2     2      1       0
nexus     1      1     1      1       0
galaxy    1      1     1      1       0
surface   0      0     0      0       1

The rows are what we want from [B'A] since the row items are from B, right?
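The [B'A] arithmetic above can be checked with a small pure-Python sketch (no Mahout involved; rows are users, columns are items in the order iphone, ipad, nexus, galaxy, surface):

```python
B = [[1, 1, 0, 0, 0],   # u1 purchases
     [0, 0, 1, 1, 0],   # u2
     [0, 0, 0, 0, 1],   # u3
     [1, 1, 0, 0, 0]]   # u4

A = [[1, 1, 1, 1, 0],   # u1 views
     [1, 1, 1, 1, 0],   # u2
     [0, 0, 0, 0, 1],   # u3
     [1, 1, 1, 0, 0]]   # u4

def matmul(X, Y):
    # plain-Python matrix product: row of X dotted with each column of Y
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

Bt = [list(col) for col in zip(*B)]   # B', 5x4: one row per item of B
BtA = matmul(Bt, A)                   # 5x5: row items come from B
print(BtA[0])  # iphone row -> [2, 2, 2, 1, 0]
```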
Re: Setting up a recommender
On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:

[B'A] =

        iphone  ipad  nexus  galaxy  surface
iphone    2      2     2      1       0
ipad      2      2     2      1       0
nexus     1      1     1      1       0
galaxy    1      1     1      1       0
surface   0      0     0      0       1

The rows are what we want from [B'A] since the row items are from B, right?

Yes. It is easier to understand if you have different kinds of items as well as different actions. For instance, suppose that you have user x query terms (A) and user x device (B). B'A is then device x term so that there is a row per device and the row contains terms. This is good when searching for devices using terms.
Re: Setting up a recommender
Downsampling is now moved directly into RowSimilarityJob. I'll have a look at Pat's code later this week. On 23.07.2013 19:38, Ted Dunning wrote: On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote: This pipeline lacks downsampling since I had to replace PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume Sebastian is the person to talk to about these bits? I think that is a good source. If you post your code, he may be able to comment on how to integrate the down-sampling in a general way.
Re: Setting up a recommender
I've got a new configurable action splitter working with my old Mahout based recommender and cross-recommender. Need more cleanup and testing before integrating Solr or handing off. I think I'll leave the old recommenders in the code with an option to replace the last 'make recommendations' step with moving the similarity matrixes into Solr. Might be useful for results comparison. We still need more help with retrieving user history vectors, and making Solr queries. Not to mention setting up the inspection UI mentioned in Ted's paper. http://bit.ly/18vbbaT On Jul 24, 2013, at 8:32 PM, Pat Ferrel pat.fer...@gmail.com wrote: Understood, catalog categories, tags, etc will make good metadata to be included in the query and putting in separate fields allows us to separately boost each in the query. UserIDs that have interacted with the item is an interesting idea. However the specific case I'm describing is not about content similarity. Talking here about item-item similarity exactly as encoded in the similarity matrix. The order or rank of these item-item similarities should be preserved and I was proposing doing so with the order of the itemID terms in the document. The query will return history based recs ranked by the order Solr applies. The doc itself for any item contains similar items ordered by their similarity magnitude, precalculated in Mahout RowSimilarityJob. On Jul 24, 2013, at 7:19 PM, Ted Dunning ted.dunn...@gmail.com wrote: Content based item similarity is a fine thing to include in a separate field. In addition, it is reasonable to describe a person's history in terms of the meta-data on the items they have interacted with. That allows you to build a set of socially driven meta-data indicators as well. This can be useful in the restaurant example where you might find that elegant or home-style might be good indicators for different restaurants even if those terms don't appear in a restaurant description. 
Sent from my iPhone On Jul 23, 2013, at 18:26, Pat Ferrel pat.fer...@gmail.com wrote: Honestly not trying to make this more complicated but… In the purely Mahout cross-recommender we got a ranked list of similar items for any item so we could combine personal history-based recs with non-personalized item similarity-based recs wherever we had an item context. In a past ecom case the item similarity recs were quite useful when a user was looking at an item already. In that case even if the user was unknown we could make item similarity-based recs. How about if we order the items in the doc by rank in the existing fields since they are just text? Then we would do user-history-based queries on the fields for recs and docs[itemID].field to get the ordered list of items out of any doc. Doing an ensemble would require weights though. Unless someone knows a rank based method for combining results. I guess you could vote or add rank numbers of like items or the log thereof... I assume the combination of results from [B'B] and [B'A] will be a query over both fields with some boost or other to handle ensemble weighting. But if you want to add item similarity recs another method must be employed, no? From past experience I strongly suspect item similarity rank is not something we want to lose so unless someone has a better idea I'll just order the IDs in the fields and call it good for now.
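One possible rank-based combiner of the "vote or add rank numbers" sort mused about above is a Borda-style count over several ranked item lists, needing no score weights at all. Purely a sketch, not anything in the project:

```python
def combine_ranked(ranked_lists):
    """Each input list is best-first. An item scores (len - position)
    per list it appears in; the score totals give the combined order."""
    scores = {}
    for ranked in ranked_lists:
        n = len(ranked)
        for pos, item in enumerate(ranked):
            scores[item] = scores.get(item, 0) + (n - pos)
    return sorted(scores, key=lambda item: -scores[item])
```

An item appearing near the top of several lists beats one that tops only a single list, which is roughly the voting behavior described.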
Re: Setting up a recommender
On 7/23/13 7:26 PM, Pat Ferrel wrote: Honestly not trying to make this more complicated but… From past experience I strongly suspect item similarity rank is not something we want to lose so unless someone has a better idea I'll just order the IDs in the fields and call it good for now. If I understand you correctly, you are concerned about just throwing all the items in without regard to order (or weight). I think Ted's suggestion was not to worry about that, but if you do have time and want to tackle this, one thing you can do is to add an item multiple times. For example, suppose you have items A, B, C, ... with A ranked highest. Then index a document in Solr like this: A A A B B C. This will end up giving A a higher frequency count in the index. The number of repeats would be kind of arbitrary. You might want to make it a linear function of rank or a quantized version of the similarity score. But this might end up being a noise-level effect ... it's probably not worth losing sleep over. On the other hand, it's probably less useful to order the IDs since once they get put in the index the token order is stored as a position which isn't (usually) used for scoring, although I suppose some custom scorer could do that, too. -Mike
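The repeat-the-item idea above could look like this sketch: encode rank as term frequency by repeating higher-ranked item IDs more often. The repeat count here is linear in reverse rank, which is as arbitrary as noted.

```python
def rank_as_term_freq(ranked_items):
    """Turn a best-first item list into a field value where the top
    item is repeated most, so its indexed term frequency is highest."""
    n = len(ranked_items)
    tokens = []
    for pos, item in enumerate(ranked_items):
        tokens.extend([item] * (n - pos))  # best item repeated the most
    return " ".join(tokens)

print(rank_as_term_freq(["A", "B", "C"]))
# -> A A A B B C
```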
Re: Setting up a recommender
I'm most worried about losing ordering and I think I can just order the items A B C by convention. Using Mahout to do clustering we used to double or triple add the title to get artificial boosting without fields. The technique works and may be worth an experiment later, thanks. BTW it looks like similarity and TFIDF are pluggable in Solr and seem pretty easy to change. Planning to use cosine for the first cut since it's the default. On Jul 24, 2013, at 4:10 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 7/23/13 7:26 PM, Pat Ferrel wrote: Honestly not trying to make this more complicated but… From past experience I strongly suspect item similarity rank is not something we want to lose so unless someone has a better idea I'll just order the IDs in the fields and call it good for now. If I understand you correctly, you are concerned about just throwing all the items in without regard to order (or weight). I think Ted's suggestion was not to worry about that, but if you do have time and want to tackle this, one thing you can do is to add an item multiple times. For example, suppose you have items A, B, C, ... with A ranked highest. Then index a document in Solr like this: A A A B B C. This will end up giving A a higher frequency count in the index. The number of repeats would be kind of arbitrary. You might want to make it a linear function of rank or a quantized version of the similarity score. But this might end up being a noise-level effect ... it's probably not worth losing sleep over. On the other hand, it's probably less useful to order the IDs since once they get put in the index the token order is stored as a position which isn't (usually) used for scoring, although I suppose some custom scorer could do that, too. -Mike
Re: Setting up a recommender
Content based item similarity is a fine thing to include in a separate field. In addition, it is reasonable to describe a person's history in terms of the meta-data on the items they have interacted with. That allows you to build a set of socially driven meta-data indicators as well. This can be useful in the restaurant example where you might find that elegant or home-style might be good indicators for different restaurants even if those terms don't appear in a restaurant description. Sent from my iPhone On Jul 23, 2013, at 18:26, Pat Ferrel pat.fer...@gmail.com wrote: Honestly not trying to make this more complicated but… In the purely Mahout cross-recommender we got a ranked list of similar items for any item so we could combine personal history-based recs with non-personalized item similarity-based recs wherever we had an item context. In a past ecom case the item similarity recs were quite useful when a user was looking at an item already. In that case even if the user was unknown we could make item similarity-based recs. How about if we order the items in the doc by rank in the existing fields since they are just text? Then we would do user-history-based queries on the fields for recs and docs[itemID].field to get the ordered list of items out of any doc. Doing an ensemble would require weights though. Unless someone knows a rank based method for combining results. I guess you could vote or add rank numbers of like items or the log thereof... I assume the combination of results from [B'B] and [B'A] will be a query over both fields with some boost or other to handle ensemble weighting. But if you want to add item similarity recs another method must be employed, no? From past experience I strongly suspect item similarity rank is not something we want to lose so unless someone has a better idea I'll just order the IDs in the fields and call it good for now. On Jul 23, 2013, at 12:03 PM, Pat Ferrel p...@occamsmachete.com wrote: Will do. 
For what it's worth… The project I'm working on is an online recommender for video content. You go to a site I'm creating, make some picks and get recommendations immediately online. The training data comes from mining Rotten Tomatoes for critics' reviews. There are two actions: rotten and fresh. Was planning to toss the 'rotten' except for filtering them out of any recs but maybe they would work as A with an ensemble weight of -1? New thumbs up or down data would be put into the training set periodically--not online--using the process outlined below. On Jul 23, 2013, at 10:37 AM, Ted Dunning ted.dunn...@gmail.com wrote: This sounds great. Go for it. Put a comment on the design doc with a pointer to text that I should import. On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:

I can supply:

1) a Maven based project in a public github repo as a baseline that creates the following
2) ingest and split actions, in-memory, single process, from text file, one line per preference
3) create DistributedRowMatrixes one per action (max of 3) with unified item and user space
4) create the 'similarity matrix' for [B'B] using LLR and [B'A] using matrix multiply/cooccurrence.
5) can take a stab at loading Solr. I know the Mahout side and the internal to external ID translation. The Solr side sounds pretty simple for this case.

This pipeline lacks downsampling since I had to replace PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume Sebastian is the person to talk to about these bits? The job this creates uses the hadoop script to launch. Each job extends AbstractJob so runs locally or using HDFS or mapreduce (at least for the Mahout parts). I have some obligations coming up so if you want this I'll need to get moving. I can have the project ready on github in a day or two. May take longer to do the Solr integration and if someone has a passion for taking that bit on--great.
This work is in my personal plans for the next couple weeks as it happens anyway. Let me know if you want me to proceed. On Jul 22, 2013, at 3:42 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Mon, Jul 22, 2013 at 12:40 PM, Pat Ferrel p...@occamsmachete.com wrote: Yes. And the combined recommender would query on both at the same time. Pat-- doesn't it need ensemble type weighting for each recommender component? Probably a wishlist item for later? Yes. Weighting different fields differently is a very nice (and very easy feature).
Re: Setting up a recommender
Understood, catalog categories, tags, etc will make good metadata to be included in the query and putting in separate fields allows us to separately boost each in the query. UserIDs that have interacted with the item is an interesting idea. However the specific case I'm describing is not about content similarity. Talking here about item-item similarity exactly as encoded in the similarity matrix. The order or rank of these item-item similarities should be preserved and I was proposing doing so with the order of the itemID terms in the document. The query will return history based recs ranked by the order Solr applies. The doc itself for any item contains similar items ordered by their similarity magnitude, precalculated in Mahout RowSimilarityJob. On Jul 24, 2013, at 7:19 PM, Ted Dunning ted.dunn...@gmail.com wrote: Content based item similarity is a fine thing to include in a separate field. In addition, it is reasonable to describe a person's history in terms of the meta-data on the items they have interacted with. That allows you to build a set of socially driven meta-data indicators as well. This can be useful in the restaurant example where you might find that elegant or home-style might be good indicators for different restaurants even if those terms don't appear in a restaurant description. Sent from my iPhone On Jul 23, 2013, at 18:26, Pat Ferrel pat.fer...@gmail.com wrote: Honestly not trying to make this more complicated but… In the purely Mahout cross-recommender we got a ranked list of similar items for any item so we could combine personal history-based recs with non-personalized item similarity-based recs wherever we had an item context. In a past ecom case the item similarity recs were quite useful when a user was looking at an item already. In that case even if the user was unknown we could make item similarity-based recs. How about if we order the items in the doc by rank in the existing fields since they are just text? 
Then we would do user-history-based queries on the fields for recs and docs[itemID].field to get the ordered list of items out of any doc. Doing an ensemble would require weights though. Unless someone knows a rank based method for combining results. I guess you could vote or add rank numbers of like items or the log thereof... I assume the combination of results from [B'B] and [B'A] will be a query over both fields with some boost or other to handle ensemble weighting. But if you want to add item similarity recs another method must be employed, no? From past experience I strongly suspect item similarity rank is not something we want to lose so unless someone has a better idea I'll just order the IDs in the fields and call it good for now.
Re: Setting up a recommender
This sounds great. Go for it. Put a comment on the design doc with a pointer to text that I should import. On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:

I can supply:

1) a Maven based project in a public github repo as a baseline that creates the following
2) ingest and split actions, in-memory, single process, from text file, one line per preference
3) create DistributedRowMatrixes one per action (max of 3) with unified item and user space
4) create the 'similarity matrix' for [B'B] using LLR and [B'A] using matrix multiply/cooccurrence.
5) can take a stab at loading Solr. I know the Mahout side and the internal to external ID translation. The Solr side sounds pretty simple for this case.

This pipeline lacks downsampling since I had to replace PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume Sebastian is the person to talk to about these bits? The job this creates uses the hadoop script to launch. Each job extends AbstractJob so runs locally or using HDFS or mapreduce (at least for the Mahout parts). I have some obligations coming up so if you want this I'll need to get moving. I can have the project ready on github in a day or two. May take longer to do the Solr integration and if someone has a passion for taking that bit on--great. This work is in my personal plans for the next couple weeks as it happens anyway. Let me know if you want me to proceed. On Jul 22, 2013, at 3:42 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Mon, Jul 22, 2013 at 12:40 PM, Pat Ferrel p...@occamsmachete.com wrote: Yes. And the combined recommender would query on both at the same time. Pat-- doesn't it need ensemble type weighting for each recommender component? Probably a wishlist item for later? Yes. Weighting different fields differently is a very nice (and very easy) feature.
Re: Setting up a recommender
On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote: This pipeline lacks downsampling since I had to replace PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume Sebastian is the person to talk to about these bits? I think that is a good source. If you post your code, he may be able to comment on how to integrate the down-sampling in a general way.
Re: Setting up a recommender
Will do. For what it's worth… The project I'm working on is an online recommender for video content. You go to a site I'm creating, make some picks and get recommendations immediately online. The training data comes from mining Rotten Tomatoes for critics' reviews. There are two actions: rotten and fresh. Was planning to toss the 'rotten' except for filtering them out of any recs but maybe they would work as A with an ensemble weight of -1? New thumbs up or down data would be put into the training set periodically--not online--using the process outlined below. On Jul 23, 2013, at 10:37 AM, Ted Dunning ted.dunn...@gmail.com wrote: This sounds great. Go for it. Put a comment on the design doc with a pointer to text that I should import. On Tue, Jul 23, 2013 at 9:39 AM, Pat Ferrel p...@occamsmachete.com wrote:

I can supply:

1) a Maven based project in a public github repo as a baseline that creates the following
2) ingest and split actions, in-memory, single process, from text file, one line per preference
3) create DistributedRowMatrixes one per action (max of 3) with unified item and user space
4) create the 'similarity matrix' for [B'B] using LLR and [B'A] using matrix multiply/cooccurrence.
5) can take a stab at loading Solr. I know the Mahout side and the internal to external ID translation. The Solr side sounds pretty simple for this case.

This pipeline lacks downsampling since I had to replace PreparePreferenceMatrixJob and potentially LLR for [B'A]. I assume Sebastian is the person to talk to about these bits? The job this creates uses the hadoop script to launch. Each job extends AbstractJob so runs locally or using HDFS or mapreduce (at least for the Mahout parts). I have some obligations coming up so if you want this I'll need to get moving. I can have the project ready on github in a day or two. May take longer to do the Solr integration and if someone has a passion for taking that bit on--great.
This work is in my personal plans for the next couple weeks as it happens anyway. Let me know if you want me to proceed. On Jul 22, 2013, at 3:42 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Mon, Jul 22, 2013 at 12:40 PM, Pat Ferrel p...@occamsmachete.com wrote: Yes. And the combined recommender would query on both at the same time. Pat-- doesn't it need ensemble type weighting for each recommender component? Probably a wishlist item for later? Yes. Weighting different fields differently is a very nice (and very easy feature).
Re: Setting up a recommender
+10 Love the academics but I agree with this. Recently saw a VP from Netflix plead with the audience (mostly academics) to move past RMSE--focus on maximizing correct ranking, not rating prediction. Anyway I have a pipeline that does the following:

1) ingests logs, either TSV or CSV of arbitrary column ordering--will pick out the actions by position and string
2) replaces PreparePreferenceMatrixJob to create n matrixes depending on the number of actions you are splitting out. This job also creates external <-> internal item and user ID BiHashMaps for going back and forth between the log's IDs and Mahout internal IDs. It guarantees a uniform item and user ID space and sparse matrix ranks by creating one from all actions. Not completely scalable since it is not done in m/r though it uses HDFS--I have a plan to m/r the process and get rid of the hashmap.
3) performs the RowSimilarityJob on the primary matrix B and does B'A to create a cooccurrence matrix for primary and secondary actions.

It then goes on to use the rest of the Mahout pipeline on B to get recs and does a [B'A]h_v to calculate all cross-recommendations. Stores all recs from all models in a NoSQL DB. At rec request time it does a linear combination of rec and cross-rec to return the highest scored ones. The stored IDs were external so all ready for display. Does 1-3 fit the first part of 'offline to Solr'? The IDs can be written to Solr as the original external IDs from the log files, which were strings. This allows them to be treated as terms by Solr. My understanding of the Solr proposal puts B's row similarity matrix in a vector per item. That means each row is turned into terms = external IDs--not sure how the weights of each term are encoded. So the cross-recommender would just put the cross-action similarity matrix in other field(s) on the same itemID/docID, right? Then the straight out recommender queries on the B'B field(s) and the cross-recommender queries on the B'A field(s).
I suppose to keep it simple the cross-action similarity matrix could be put in a separate index. Is this about right? On Jul 21, 2013, at 5:30 PM, Sebastian Schelter s...@apache.org wrote: At the moment, the down sampling is done by PreparePreferenceMatrixJob for the collaborative filtering functionality. We just want to move it down to RowSimilarityJob to enable standalone usage. I think that the CrossRecommender should be the next thing on our agenda, after we have the deployment infrastructure. I especially like that it's capable of including different kinds of interactions, as opposed to most other (academically motivated) recommenders that focus on a single interaction type like a rating. --sebastian On 22.07.2013 02:14, Ted Dunning wrote: The row similarity downsampling is just a matter of dropping elements at random from rows that have more data than we want. If the join that puts the row together can handle two kinds of input, then RowSimilarity can be easily modified to be CrossRowSimilarity. Likewise, if we have two DRMs with the same row IDs in the same order, we can do a map-side merge. Such a merge can be very efficient on a system like MapR where you can control files to live on the same nodes. On Sun, Jul 21, 2013 at 4:43 PM, Pat Ferrel pat.fer...@gmail.com wrote: RowSimilarity downsampling? Are you referring to a mod of the matrix multiply to do cross similarity with LLR for the cross recommendations? So similarity of rows of B with rows of A? Sounds like you are proposing not only putting a recommender in Solr but also a cross-recommender? This is why getting a real data set is problematic? On Jul 21, 2013, at 3:40 PM, Ted Dunning ted.dunn...@gmail.com wrote: Pat, Yes. The first part probably just is the RowSimilarity job, especially after Sebastian puts in the down-sampling. The new part is exactly as you say, storing the DRM into Solr indexes. There is no reason to not use a real data set.
There is a strong reason to use a synthetic dataset, however, in that it can be trivially scaled up and down both in items and users. Also, the synthetic dataset doesn't require that the real data be found and downloaded. On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel pat.fer...@gmail.com wrote: Read the paper, and the preso. As to the 'offline to Solr' part. It sounds like you are suggesting an item item similarity matrix be stored and indexed in Solr. One would have to create the action matrix from user profile data (preference history), do a rowsimiarity job on it (using LLR similarity) and move the result to Solr. The first part of this is nearly identical to the current recommender job workflow and could pretty easily be created from it I think. The new part is taking the DistributedRowMatrix and storing it in a particular way in Solr, right? BTW Is there some reason not to use an existing real data set? On Jul 19, 2013, at 3:45 PM, Ted Dunning
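The request-time step described earlier in the thread (recs from [B'B]h, cross-recs from [B'A]h, combined linearly) can be sketched in a few lines. This is a toy with dense pure-Python matrices and a made-up blend weight, not the pipeline's actual code:

```python
def matvec(M, v):
    # dense matrix-vector product: one score per item row of M
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def blended_scores(BtB, BtA, history_b, history_a, cross_weight=0.5):
    """Linear combination of primary recs [B'B]h_b and cross-recs
    [B'A]h_a, as described in the pipeline above."""
    rec = matvec(BtB, history_b)    # primary-action scores
    xrec = matvec(BtA, history_a)   # cross-action scores
    return [r + cross_weight * x for r, x in zip(rec, xrec)]
```

Returning the highest-scored entries of this vector (minus items already in the user's history) gives the final recommendation list.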
Re: Setting up a recommender
On 07/22/2013 12:20 PM, Pat Ferrel wrote: My understanding of the Solr proposal puts B's row similarity matrix in a vector per item. That means each row is turned into terms = external IDs--not sure how the weights of each term are encoded. This is the key question for me. The best idea I've had is to use termFreq as a proxy for weight. It's only an integer, so there are scaling issues to consider, but you can apply a per-field weight to manage that. Also, Lucene (and Solr) doesn't provide an obvious way to load term frequencies directly: probably the simplest thing to do is just to repeat the cross-term N times and let the text analysis take care of counting them. Inefficient, but probably the quickest way to get going. Alternatively, there are some lower level Lucene indexing APIs (DocFieldConsumer et al) which I haven't really plumbed entirely, but would allow for more direct loading of fields. Then one probably wants to override the scoring in some way (unless TFIDF is the way to go somehow??)
Re: Setting up a recommender
Just to make sure if I understood correctly, Ted, could you please correct me? :)

1. Using a search engine, I will treat items as documents, where each document vector consists of other items (similar to words of documents) with co-occurrence (LLR) weights (instead of tf-idf in a search engine analogy). So for each item I will have a sparse vector that represents the relation of that item to other items, if there is an indicator that makes the item-to-item similarity (co-occurrence) non-zero. (I will only use positive feedback, I think, since I am counting co-occurrences)

2. To present recommendations, the system formulates a query, with a history of items--the session history for task based recommendation, or a long term history. And the search engine will find top-N items, based on the cosine similarities of the item (document) vectors and history (query) vectors.

3. For example, if that was a restaurant recommendation, and we knew that the restaurant was famous for its sushi, I would index this in another field, famous_for. Now if, as a user, I asked for sushi restaurants that I would enjoy, the system would add this to the query along with my history, and the famous sushi restaurant would rank higher in results, even if chances are equal that I would like a steakhouse according to the computation in 2.

4. Since this is a search engine, and a search engine can boost a particular field, the system would let the famous_for overweigh the collaborative activity, or the opposite (according to the use case, or, for example, the number of items in the history). So I can define a weighting (voting, or mixture of experts) scheme to blend different recommenders.

Are those correct? Gokhan

On Mon, Jul 22, 2013 at 9:07 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 07/22/2013 12:20 PM, Pat Ferrel wrote: My understanding of the Solr proposal puts B's row similarity matrix in a vector per item.
That means each row is turned into terms = external IDs--not sure how the weights of each term are encoded. This is the key question for me. The best idea I've had is to use termFreq as a proxy for weight. It's only an integer, so there are scaling issues to consider, but you can apply a per-field weight to manage that. Also, Lucene (and Solr) doesn't provide an obvious way to load term frequencies directly: probably the simplest thing to do is just to repeat the cross-term N times and let the text analysis take care of counting them. Inefficient, but probably the quickest way to get going. Alternatively, there are some lower level Lucene indexing APIs (DocFieldConsumer et al) which I haven't really plumbed entirely, but would allow for more direct loading of fields. Then one probably wants to override the scoring in some way (unless TFIDF is the way to go somehow??)
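Gokhan's steps 1 and 2 can be sketched with plain dicts standing in for the index; the item names are invented, and overlap counting is a crude stand-in for the engine's similarity scoring:

```python
# Item "documents": each item's indicator field holds the other items
# that the LLR test selected as significantly co-occurring with it.
index = {
    "iphone":     {"b_b_links": {"ipad", "galaxy"}},
    "ipad":       {"b_b_links": {"iphone"}},
    "steakhouse": {"b_b_links": {"winebar"}},
    "winebar":    {"b_b_links": {"steakhouse"}},
}

def recommend(history, field="b_b_links", n=2):
    """Score each unseen item by the overlap between its indicator set
    and the user's history; a real engine would apply its own
    TF-IDF-style weighting on top of this."""
    scores = {item: len(doc[field] & set(history))
              for item, doc in index.items() if item not in history}
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(recommend(["ipad", "galaxy"])[0])  # iphone
```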
Re: Setting up a recommender
My experience is that TFIDF works just fine, especially as a first cut. Adding different kinds of data, building out backend A/B testing, tuning the UI, weighting the query all come before the next round of weighting changes. Typically, the priority stack never empties enough for that task to rise to the top. On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 07/22/2013 12:20 PM, Pat Ferrel wrote: My understanding of the Solr proposal puts B's row similarity matrix in a vector per item. That means each row is turned into terms = external IDs--not sure how the weights of each term are encoded. This is the key question for me. The best idea I've had is to use termFreq as a proxy for weight. It's only an integer, so there are scaling issues to consider, but you can apply a per-field weight to manage that. Also, Lucene (and Solr) doesn't provide an obvious way to load term frequencies directly: probably the simplest thing to do is just to repeat the cross-term N times and let the text analysis take care of counting them. Inefficient, but probably the quickest way to get going. Alternatively, there are some lower level Lucene indexing APIs (DocFieldConsumer et al) which I haven't really plumbed entirely, but would allow for more direct loading of fields. Then one probably wants to override the scoring in some way (unless TFIDF is the way to go somehow??)
Re: Setting up a recommender
Inline ... slightly redundant relative to other answers, but that shouldn't be a problem. On Mon, Jul 22, 2013 at 11:56 AM, Gokhan Capan gkhn...@gmail.com wrote: Just to make sure if I understood correctly, Ted, could you please correct me?:) 1. Using a search engine, I will treat items as documents, where each document vector consists of other items (similar to words of documents) with co-occurrence (LLR) weights (instead of tf-idf in a search engine analogy). LLR will just select indicators. Weighting can be done using native TF-IDF stuff that Solr already does. So for each item I will have a sparse vector that represents the relation of that item to other items, if there is an indicator that makes the item-to-item similarity (co-occurrence) non-zero. (I will only use positive feedback, I think, since I am counting co-occurrences) Yes. Moreover, there will ultimately be multiple fields with different sets of indicators. This is how cross recommendation can be integrated. 2. To present recommendations, the system formulates a query, with a history of items --the session history for task based recommendation, or a long term history. And the search engine will find top-N items, based on the cosine similarities of the item (document) vectors and history (query) vectors. Yes. Cosine-ish ... the search engine has its own similarity calculation. That can be tuned ... later. 3. For example, if that was a restaurant recommendation, and we knew that the restaurant was famous for its sushi, I would index this in another field, famous_for. Now if, as a user, I asked for sushi restaurants that I would enjoy, the system would add this to query along with my history, and the famous sushi restaurant would rank higher in results, even if chances are equal that I would like a steakhouse according to the computation in 2. Yes. Moreover, we might put all the words in the descriptions of restaurants you have been to lately into a different history field. 
Each restaurant would also have an indicator word field against which we could query using your history words. Similarly, we could use cuisine classifiers. And we can compute a local favorite feature that is essentially a recommendation indicator from people in a particular area to restaurants. Recommendation queries can include any or all of these. Specialized pages might have a cuisine specific recommendation set for you. 4. Since this is a search engine, and a search engine can boost a particular field, the system would let the famous_for overweigh the collaborative activity, or the opposite (According to the use case, or for example, number of items in the history) So I can define a weighting (voting, or mixture of experts) scheme to blend different recommenders. yes. I would recommend doing the blending in the search engine query itself. Are those correct? Pretty much!
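Ted's suggestion to do the blending in the search engine query itself might look like a single boosted multi-field query; the field names (`b_b_links`, `famous_for`) and boost values below are invented for illustration:

```python
def blended_query(history_items, content_terms, w_cf=1.0, w_content=2.0):
    """Build a Lucene-style query string that blends a collaborative
    indicator field with a content field, each carrying its own boost,
    so the mixture-of-experts weighting happens inside the engine
    rather than in post-processing."""
    clauses = [f"b_b_links:{i}^{w_cf}" for i in history_items]
    clauses += [f"famous_for:{t}^{w_content}" for t in content_terms]
    return " ".join(clauses)

q = blended_query(["iphone", "ipad"], ["sushi"])
print(q)  # b_b_links:iphone^1.0 b_b_links:ipad^1.0 famous_for:sushi^2.0
```

Adjusting `w_cf` and `w_content` per use case (or per history length, as Gokhan suggests) is then just a matter of changing the boosts at query time.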
Re: Setting up a recommender
On Mon, Jul 22, 2013 at 9:20 AM, Pat Ferrel p...@occamsmachete.com wrote: +10 Love the academics but I agree with this. Recently saw a VP from Netflix plead with the audience (mostly academics) to move past RMSE--focus on maximizing correct ranking, not rating prediction. Anyway I have a pipeline that does *[ingest, prepare, row-similarity, not in m/r]* Is this available? replaces PreparePreferenceMatrixJob to create n matrices depending on the number of actions you are splitting out. This job also creates external-to-internal item and user ID BiHashMaps for going back and forth between the log's IDs and Mahout internal IDs. It guarantees a uniform item and user ID space and sparse matrix ranks by creating one from all actions. Not completely scalable since it is not done in m/r though it uses HDFS--I have a plan to m/r the process and get rid of the hashmap. Frankly, doing it outside of map-reduce is good for a start and should be preserved for later. It makes on-boarding new folks much easier. performs the RowSimilarityJob on the primary matrix B and does B'A to create a cooccurrence matrix for primary and secondary actions. What code do you use for B'A? Stores all recs from all models in a NoSQL DB. I recommend not doing this for the demo, but rather storing rows of B'A and B'B as fields in Solr. At rec request time it does a linear combination of rec and cross-rec to return the highest scored ones. Should be integrated into the query. Does 1-3 fit the first part of 'offline to Solr'? The IDs can be written to Solr as the original external IDs from the log files, which were strings. This allows them to be treated as terms by Solr. Yes. These early steps are very much what I was aiming for. My understanding of the Solr proposal puts B's row similarity matrix in a vector per item. For a particular item document, the corresponding row of B'A and the corresponding row of B'B go into separate fields. I think you mean B'B when you say B's row similarity matrix. 
Just checking. That means each row is turned into terms = external IDs--not sure how the weights of each term are encoded. Again, I just use native Solr weighting. So the cross-recommender would just put the cross-action similarity matrix in other field(s) on the same itemID/docID, right? Yes. Exactly. Then the straight out recommender queries on the B'B field(s) and the cross-recommender queries on the B'A field(s). I suppose to keep it simple the cross-action similarity matrix could be put in a separate index. Is this about right? Yes. And the combined recommender would query on both at the same time.
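The [B'B] and [B'A] fields discussed here can be sketched end to end with tiny made-up 0/1 action matrices (plain Python lists in place of Mahout DRMs, and a simple nonzero cutoff in place of the LLR test):

```python
def transpose_times(X, Y):
    """Compute X'Y for two user-by-item 0/1 matrices given as row lists."""
    return [[sum(X[u][i] * Y[u][j] for u in range(len(X)))
             for j in range(len(Y[0]))]
            for i in range(len(X[0]))]

# Rows = users, columns = items; B = primary action, A = secondary action
B = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
A = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
items = ["iphone", "ipad", "galaxy"]

BtB = transpose_times(B, B)  # item-item cooccurrence, primary action
BtA = transpose_times(B, A)  # cross-action cooccurrence

# One document per item: nonzero off-diagonal entries become indicator terms
docs = {items[i]: {
    "b_b_links": [items[j] for j in range(3) if j != i and BtB[i][j] > 0],
    "b_a_links": [items[j] for j in range(3) if BtA[i][j] > 0],
} for i in range(3)}
print(docs["iphone"]["b_b_links"])  # ['ipad', 'galaxy']
```

Each item's `b_b_links` and `b_a_links` then land in separate fields of the same itemID/docID, as discussed above.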
Re: Setting up a recommender
So you are proposing just grabbing the top N scoring related items and indexing them without regard to weight? Effectively quantizing the weights to 1, and 0 for everything else? I guess LLR tends to do that anyway -Mike On 07/22/2013 02:57 PM, Ted Dunning wrote: My experience is that TFIDF works just fine, especially as a first cut. Adding different kinds of data, building out backend A/B testing, tuning the UI, weighting the query all come before the next round of weighting changes. Typically, the priority stack never empties enough for that task to rise to the top. On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 07/22/2013 12:20 PM, Pat Ferrel wrote: My understanding of the Solr proposal puts B's row similarity matrix in a vector per item. That means each row is turned into terms = external IDs--not sure how the weights of each term are encoded. This is the key question for me. The best idea I've had is to use termFreq as a proxy for weight. It's only an integer, so there are scaling issues to consider, but you can apply a per-field weight to manage that. Also, Lucene (and Solr) doesn't provide an obvious way to load term frequencies directly: probably the simplest thing to do is just to repeat the cross-term N times and let the text analysis take care of counting them. Inefficient, but probably the quickest way to get going. Alternatively, there are some lower level Lucene indexing APIs (DocFieldConsumer et al) which I haven't really plumbed entirely, but would allow for more direct loading of fields. Then one probably wants to override the scoring in some way (unless TFIDF is the way to go somehow??)
Re: Setting up a recommender
inline BTW if there is an LLR cross-similarity job (replacing [B'A]) it is easy to integrate. On Jul 22, 2013, at 12:09 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Mon, Jul 22, 2013 at 9:20 AM, Pat Ferrel p...@occamsmachete.com wrote: +10 Love the academics but I agree with this. Recently saw a VP from Netflix plead with the audience (mostly academics) to move past RMSE--focus on maximizing correct ranking, not rating prediction. Anyway I have a pipeline that does *[ingest, prepare, row-similarity, not in m/r]* Is this available? Pat-- Can quickly be. In Github. I'd have to clean up a bit. replaces PreparePreferenceMatrixJob to create n matrices depending on the number of actions you are splitting out. This job also creates external-to-internal item and user ID BiHashMaps for going back and forth between the log's IDs and Mahout internal IDs. It guarantees a uniform item and user ID space and sparse matrix ranks by creating one from all actions. Not completely scalable since it is not done in m/r though it uses HDFS--I have a plan to m/r the process and get rid of the hashmap. Frankly, doing it outside of map-reduce is good for a start and should be preserved for later. It makes on-boarding new folks much easier. Pat-- It uses the hadoop version of the matrix mult and RowSimilarityJob in later steps but they work without a cluster in local mode. performs the RowSimilarityJob on the primary matrix B and does B'A to create a cooccurrence matrix for primary and secondary actions. What code do you use for B'A? Pat-- matrix transposes and multiply from Mahout. Stores all recs from all models in a NoSQL DB. I recommend not doing this for the demo, but rather storing rows of B'A and B'B as fields in Solr. Pat-- yes, just explaining for completeness At rec request time it does a linear combination of rec and cross-rec to return the highest scored ones. Should be integrated into the query. 
Pat-- yes, just explaining for completeness Does 1-3 fit the first part of 'offline to Solr'? The IDs can be written to Solr as the original external IDs from the log files, which were strings. This allows them to be treated as terms by Solr. Yes. These early steps are very much what I was aiming for. Pat-- OK, happy to contribute if possible let me know who to coordinate with. My understanding of the Solr proposal puts B's row similarity matrix in a vector per item. For a particular item document, the corresponding row of B'A and the corresponding row of B'B go into separate fields. I think you mean B'B when you say B's row similarity matrix. Just checking. Pat-- yes, exactly That means each row is turned into terms = external IDs--not sure how the weights of each term are encoded. Again, I just use native Solr weighting. Pat-- good, that makes this fairly simple I expect. Just fields with bags of term strings. So the cross-recommender would just put the cross-action similarity matrix in other field(s) on the same itemID/docID, right? Yes. Exactly. Then the straight out recommender queries on the B'B field(s) and the cross-recommender queries on the B'A field(s). I suppose to keep it simple the cross-action similarity matrix could be put in a separate index. Is this about right? Yes. And the combined recommender would query on both at the same time. Pat-- doesn't it need ensemble type weighting for each recommender component? Probably a wishlist item for later?
Re: Setting up a recommender
On Mon, Jul 22, 2013 at 12:40 PM, Pat Ferrel p...@occamsmachete.com wrote: Yes. And the combined recommender would query on both at the same time. Pat-- doesn't it need ensemble type weighting for each recommender component? Probably a wishlist item for later? Yes. Weighting different fields differently is a very nice (and very easy) feature.
Re: Setting up a recommender
Not entirely without regard to weight. Just without regard to designing weights specific to this application. The weights that Solr uses natively are intuitively what we want (rare indicators have higher weights in a log-ish kind of way). Frankly, I doubt the effectiveness here of mathematical reasoning for getting a better weighting. The deviations from optimal relative to the Solr defaults are probably as large as the deviations from the assumptions that the mathematically motivated weightings are based on. Fixing this is spending a lot for small potatoes. Fixing the data flow and getting access to more data is far higher value. On Mon, Jul 22, 2013 at 12:18 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: So you are proposing just grabbing the top N scoring related items and indexing them without regard to weight? Effectively quantizing the weights to 1, and 0 for everything else? I guess LLR tends to do that anyway -Mike On 07/22/2013 02:57 PM, Ted Dunning wrote: My experience is that TFIDF works just fine, especially as a first cut. Adding different kinds of data, building out backend A/B testing, tuning the UI, weighting the query all come before the next round of weighting changes. Typically, the priority stack never empties enough for that task to rise to the top. On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 07/22/2013 12:20 PM, Pat Ferrel wrote: My understanding of the Solr proposal puts B's row similarity matrix in a vector per item. That means each row is turned into terms = external IDs--not sure how the weights of each term are encoded. This is the key question for me. The best idea I've had is to use termFreq as a proxy for weight. It's only an integer, so there are scaling issues to consider, but you can apply a per-field weight to manage that. 
Also, Lucene (and Solr) doesn't provide an obvious way to load term frequencies directly: probably the simplest thing to do is just to repeat the cross-term N times and let the text analysis take care of counting them. Inefficient, but probably the quickest way to get going. Alternatively, there are some lower level Lucene indexing APIs (DocFieldConsumer et al) which I haven't really plumbed entirely, but would allow for more direct loading of fields. Then one probably wants to override the scoring in some way (unless TFIDF is the way to go somehow??)
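For reference, the LLR test that selects these indicators (Dunning's G² statistic on a 2x2 cooccurrence contingency table) can be sketched as:

```python
from math import log

def entropy_term(*counts):
    """Sum of k*log(k/total) over the nonzero cells."""
    total = sum(counts)
    return sum(k * log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table: k11 = users with both items,
    k12/k21 = users with only one of them, k22 = users with neither.
    Large values mean the cooccurrence is unlikely to be chance, so the
    item pair becomes an indicator."""
    return 2 * (entropy_term(k11, k12, k21, k22)
                - entropy_term(k11 + k12, k21 + k22)
                - entropy_term(k11 + k21, k12 + k22))

print(abs(llr(1, 1, 1, 1)) < 1e-9)  # True -- independent, no indicator
print(llr(10, 0, 0, 10) > 20)       # True -- strongly associated
```

Only the top-scoring pairs are kept as indicator terms; their actual weighting is then left to the engine's native scoring, as Ted argues above.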
Re: Setting up a recommender
Fair enough - thanks for clarifying. I wondered whether that would be worth the trouble, also. Maybe one of the academics Pat mentioned will test and find out for us :) On 7/22/13 6:45 PM, Ted Dunning wrote: Not entirely without regard to weight. Just without regard to designing weights specific to this application. The weights that Solr uses natively are intuitively what we want (rare indicators have higher weights in a log-ish kind of way). Frankly, I doubt the effectiveness here of mathematical reasoning for getting a better weighting. The deviations from optimal relative to the Solr defaults are probably as large as the deviations from the assumptions that the mathematically motivated weightings are based on. Fixing this is spending a lot for small potatoes. Fixing the data flow and getting access to more data is far higher value. On Mon, Jul 22, 2013 at 12:18 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: So you are proposing just grabbing the top N scoring related items and indexing them without regard to weight? Effectively quantizing the weights to 1, and 0 for everything else? I guess LLR tends to do that anyway -Mike On 07/22/2013 02:57 PM, Ted Dunning wrote: My experience is that TFIDF works just fine, especially as a first cut. Adding different kinds of data, building out backend A/B testing, tuning the UI, weighting the query all come before the next round of weighting changes. Typically, the priority stack never empties enough for that task to rise to the top. On Mon, Jul 22, 2013 at 11:07 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 07/22/2013 12:20 PM, Pat Ferrel wrote: My understanding of the Solr proposal puts B's row similarity matrix in a vector per item. That means each row is turned into terms = external IDs--not sure how the weights of each term are encoded. This is the key question for me. 
The best idea I've had is to use termFreq as a proxy for weight. It's only an integer, so there are scaling issues to consider, but you can apply a per-field weight to manage that. Also, Lucene (and Solr) doesn't provide an obvious way to load term frequencies directly: probably the simplest thing to do is just to repeat the cross-term N times and let the text analysis take care of counting them. Inefficient, but probably the quickest way to get going. Alternatively, there are some lower level Lucene indexing APIs (DocFieldConsumer et al) which I haven't really plumbed entirely, but would allow for more direct loading of fields. Then one probably wants to override the scoring in some way (unless TFIDF is the way to go somehow??)
Re: Setting up a recommender
I see Ted created a JIRA ticket for this already: https://issues.apache.org/jira/browse/MAHOUT-1288 We should consider changing the issue type (currently: bug). One might find the recording (http://www.youtube.com/watch?v=fWR1T2pY08Y) and slides (http://www.slideshare.net/tdunning/buzz-wordsdunningmultimodalrecommendation) of Ted's Berlin Buzzwords 2013 talk on the subject helpful to understand the terms used and the idea. I guess we could start with a single kind of interaction/behavior, and consider adding more later. Shall we make it a separate subproject (so on the level of mahout and site, but still under Mahout svn), or make a new Mahout submodule, or change mahout examples from a single module to a multimodule structure and add the recommender demo as a submodule there? I'm fine with the Maven tasks, to some extent Solr too (not the most recent versions, but I see it as a nice opportunity to update). Kind regards, Stevo Slavic. On Sun, Jul 21, 2013 at 12:15 AM, Ted Dunning ted.dunn...@gmail.com wrote: To kick this off, I have created a design document that is open for comments. Much detail is needed here. I will create a JIRA as well, but the google doc is much easier for collating lots of input into a coherent document. The directory that the document is stored in is accessible at http://bit.ly/18vbbaT Once we get going, we can talk about how to coordinate tasks between hangouts. One option is a public Trello project: https://trello.com/ or we can use JIRA sub-tasks. On Sat, Jul 20, 2013 at 11:25 AM, Andrew Psaltis andrew.psal...@webtrends.com wrote: I am very interested in collaborating on the off-line to Solr part. Just let me know how we want to get going. Thanks, Andrew On 7/19/13 4:45 PM, Ted Dunning ted.dunn...@gmail.com wrote: OK. I think the crux here is the off-line to Solr part so let's see who else pops up. Having a Solr maven could be very helpful. 
On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo lcguerreroc...@gmail.com wrote: I'm currently working for a portal that has a similar use case and I was thinking of implementing this in a similar way. I'm generating recommendations using python scripts based on similarity measures (content based recommendation) only using euclidean distance and some weights for each attribute. I want to use mahout's GenericItemBasedRecommender to generate these same recommendations without user data (no tracking right now of user to item relationship). I was thinking of pushing the generated recommendations to solr using atomic updates since my fields are all stored right now. Since this is very similar to what I'm trying to accomplish, I would sign up to collaborate in any way I can since I'm fairly familiar with solr and I'm starting to learn my way around mahout. On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org wrote: I would also be willing to provide guidance and advice for anyone taking this on, I can especially help with the offline analysis part. --sebastian 2013/7/19 Ted Dunning ted.dunn...@gmail.com I would be happy to supervise a project to implement a demo of this if anybody is willing to do the grunt work of gluing things together. Sooo, if you would like to work on this, here is a suggested project. This project would entail: a) build a synthetic data source b) write scripts to do the off-line analysis c) write scripts to export to Solr d) write a very quick web facade over Solr to make it look like a recommendation engine. This would include d.1) a most popular page that does combined popularity rise and recommendation d.2) a personal recommendation page that does just recommendation with dithering d.3) item pages with related items at the bottom e) work with others to provide high quality system walk-through and install directions If you want to bite on this, we should arrange a weekly video hangout. 
I am willing to commit to guiding and providing detailed technical approaches. You should be willing to commit to actually doing stuff. The goal would be to provide a fully worked out scaffolding of a practical recommendation system that presumably would become an example module in Mahout. On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com wrote: +1 as well. Sounds fun. On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner cont...@dhuebner.com wrote: +1 for getting something like that in a future release of Mahout On Jul 19, 2013, at 10:02 PM, Sebastian Schelter s...@apache.org wrote: It would be awesome if we could get a nice, easily deployable
Re: Setting up a recommender
Hi, First of all, Ted, very inspiring video, I really enjoyed the concept of cross-occurrences. Secondly, I'd be very interested in collaborating on this project and here is why. I've been recently working for my employer on a very similar project that is currently deployed into our production environment. We built a recommender system that takes instances from an ontology identified in documents as part of an NLP process as an input, and generates document recommendations as an output. We used a big training set with positive and false positive matches to improve the accuracy of the output. All these documents are indexed in Solr for which we built a recommender RequestHandler that makes use of a RecommenderQParsePlugin we also built for Solr. With this we can provide recommendations to a user that is reading a document, but in next iterations we are working towards providing recommendations based on multiple kinds of inputs not only annotations. This said, I would like to collaborate with you guys on the development part of this project, just let me know how/where we can organize the user stories and tasks. I think a conference call, maybe a hangout, to kick off the project would be useful, who should schedule it? Thanks Iker 2013/7/20 Ted Dunning ted.dunn...@gmail.com To kick this off, I have created a design document that is open for comments. Much detail is needed here. I will create a JIRA as well, but the google doc is much easier for collating lots of input into a coherent document. The directory that the document is stored in is accessible at http:// bit.ly/18vbbaT http://bit.ly/18vbbaT Once we get going, we can talk about how to coordinate tasks between hangouts. One option is a public Trello project: https://trello.com/ or we can use JIRA sub-tasks. On Sat, Jul 20, 2013 at 11:25 AM, Andrew Psaltis andrew.psal...@webtrends.com wrote: I am very interested in collaborating on the off-line to Solr part. Just let me know how we want to get going. 
Thanks, Andrew On 7/19/13 4:45 PM, Ted Dunning ted.dunn...@gmail.com wrote: OK. I think the crux here is the off-line to Solr part so let's see who else pops up. Having a solr maven could be very helpful. On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo lcguerreroc...@gmail.com wrote: I'm currently working for a portal that has a similar use case and I was thinking of implementing this in a similar way. I'm generating recommendations using python scripts based on similarity measures (content based recommendation) only using euclidean distance and some weights for each attribute. I want to use mahout's GenericItemBasedRecommender to generate these same recommendations without user data (no tracking right now of user to item relationship). I was thinking of pushing the generated recommendations to solr using atomic updates since my fields are all stored right now. Since this is very similar to what I'm trying to accomplish, I would sign up to collaborate in any way I can since I'm fairly familiar with solr and I'm starting to learn my way around mahout. On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org wrote: I would also be willing to provide guidance and advice for anyone taking this on, I can especially help with the offline analysis part. --sebastian 2013/7/19 Ted Dunning ted.dunn...@gmail.com I would be happy to supervise a project to implement a demo of this if anybody is willing to do the grunt work of gluing things together. Sooo, if you would like to work on this, here is a suggested project. This project would entail: a) build a synthetic data source b) write scripts to do the off-line analysis c) write scripts to export to Solr d) write a very quick web facade over Solr to make it look like a recommendation engine. 
This would include d.1) a most popular page that does combined popularity rise and recommendation d.2) a personal recommendation page that does just recommendation with dithering d.3) item pages with related items at the bottom e) work with others to provide high quality system walk-through and install directions If you want to bite on this, we should arrange a weekly video hangout. I am willing to commit to guiding and providing detailed technical approaches. You should be willing to commit to actually doing stuff. The goal would be to provide a fully worked out scaffolding of a practical recommendation system that presumably would become an example module in Mahout. On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com wrote: +1 as well. Sounds fun. On Fri, Jul 19, 2013 at 4:06
Re: Setting up a recommender
Read the paper, and the preso. As to the 'offline to Solr' part. It sounds like you are suggesting an item-item similarity matrix be stored and indexed in Solr. One would have to create the action matrix from user profile data (preference history), do a RowSimilarity job on it (using LLR similarity) and move the result to Solr. The first part of this is nearly identical to the current recommender job workflow and could pretty easily be created from it, I think. The new part is taking the DistributedRowMatrix and storing it in a particular way in Solr, right? BTW Is there some reason not to use an existing real data set? On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote: OK. I think the crux here is the off-line to Solr part so let's see who else pops up. Having a Solr maven could be very helpful.
Re: Setting up a recommender
Paper and presentation are very interesting to me as well. I am fairly new to this, and coming to terms with some of the terms, etc. I assume that the action matrix here is just the raw matrix of how each user has interacted with the items/types-of-items. I didn't quite get the incorporation into SOLR (not familiar with that much, either), in particular the indexing related to the generated (root LLR-based?) co-occurrence matrices for the different types of things so that it can be used in searches - so, a real newbie question: how can the co-occurrence matrix be implemented as a search index in SOLR? Just pointing me at the RTFM docs is fine :) On Sun, Jul 21, 2013 at 5:17 PM, Pat Ferrel pat.fer...@gmail.com wrote: Read the paper, and the preso. As to the 'offline to Solr' part. It sounds like you are suggesting an item-item similarity matrix be stored and indexed in Solr. One would have to create the action matrix from user profile data (preference history), do a RowSimilarity job on it (using LLR similarity) and move the result to Solr. The first part of this is nearly identical to the current recommender job workflow and could pretty easily be created from it I think. The new part is taking the DistributedRowMatrix and storing it in a particular way in Solr, right? BTW Is there some reason not to use an existing real data set? On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote: OK. I think the crux here is the off-line to Solr part so let's see who else pops up. Having a Solr maven could be very helpful. -- BF Lyon http://www.nowherenearithaca.com
Re: Setting up a recommender
Pat, Yes. The first part probably just is the RowSimilarity job, especially after Sebastian puts in the down-sampling. The new part is exactly as you say, storing the DRM into Solr indexes. There is no reason to not use a real data set. There is a strong reason to use a synthetic dataset, however, in that it can be trivially scaled up and down both in items and users. Also, the synthetic dataset doesn't require that the real data be found and downloaded. On Sun, Jul 21, 2013 at 2:17 PM, Pat Ferrel pat.fer...@gmail.com wrote: Read the paper, and the preso. As to the 'offline to Solr' part. It sounds like you are suggesting an item item similarity matrix be stored and indexed in Solr. One would have to create the action matrix from user profile data (preference history), do a rowsimiarity job on it (using LLR similarity) and move the result to Solr. The first part of this is nearly identical to the current recommender job workflow and could pretty easily be created from it I think. The new part is taking the DistributedRowMatrix and storing it in a particular way in Solr, right? BTW Is there some reason not to use an existing real data set? On Jul 19, 2013, at 3:45 PM, Ted Dunning ted.dunn...@gmail.com wrote: OK. I think the crux here is the off-line to Solr part so let's see who else pops up. Having a solr maven could be very helpful.
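Storing the DRM in Solr could start as simply as flattening each sparse row into a bag-of-item-IDs field, ordered by link strength; the ID map, field name, and weights here are invented for illustration:

```python
def drm_to_csv(rows, item_ids, field="b_b_links"):
    """Flatten sparse DRM rows (dicts mapping column index -> weight)
    into CSV docs whose field is a space-separated bag of external item
    IDs, strongest links first, suitable for a CSV bulk load into Solr."""
    lines = [f"id,{field}"]
    for row_idx in sorted(rows):
        row = rows[row_idx]
        terms = sorted(row, key=row.get, reverse=True)
        lines.append(f"{item_ids[row_idx]},"
                     f"{' '.join(item_ids[c] for c in terms)}")
    return "\n".join(lines)

item_ids = {0: "iphone", 1: "ipad", 2: "galaxy"}
csv = drm_to_csv({0: {1: 2.5, 2: 0.7}}, item_ids)
print(csv.splitlines()[1])  # iphone,ipad galaxy
```

Because the external IDs were strings in the logs, they can be indexed directly as terms, as Pat notes above.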
Re: Setting up a recommender
On Sun, Jul 21, 2013 at 8:10 AM, Iker Huerga iker.hue...@gmail.com wrote: I think a conference call, maybe a hangout, to kick off the project would be useful, who should schedule it? I will shortly do that. I think that I will need more than one kickoff to deal with timezones. I will coordinate these ahead of time on the mailing list. Due to the limitations[1] of Google hangouts with regard to saving and scheduling ahead of time, I will only be able to get the actual URL just shortly before the scheduled time. I will mail that to the mailing list and also put the URL into the shared design directory, probably in a spreadsheet. The meetings will be visible on YouTube afterwards. [1] the problem here is that I have been able to schedule a hangout, but not save that hangout to YouTube. I have also been able to save an unscheduled meetup, but was unable to figure out how to get a URL for such a hangout ahead of time. This may have changed, but I will still work around it this time to be sure we will succeed.
Re: Setting up a recommender
RowSimilarity downsampling? Are you referring to a mod of the matrix multiply to do cross-similarity with LLR for the cross-recommendations? So similarity of rows of B with rows of A? It sounds like you are proposing not only putting a recommender in Solr but also a cross-recommender. Is this why getting a real data set is problematic?

On Jul 21, 2013, at 3:40 PM, Ted Dunning ted.dunn...@gmail.com wrote: Pat, Yes. The first part probably just is the RowSimilarity job, especially after Sebastian puts in the down-sampling. The new part is exactly as you say: storing the DRM into Solr indexes. There is no reason not to use a real data set. There is a strong reason to use a synthetic dataset, however, in that it can be trivially scaled up and down both in items and users. Also, a synthetic dataset doesn't require that the real data be found and downloaded.
Re: Setting up a recommender
The row similarity downsampling is just a matter of dropping elements at random from rows that have more data than we want.

If the join that puts the row together can handle two kinds of input, then RowSimilarity can be easily modified to be CrossRowSimilarity. Likewise, if we have two DRMs with the same row ids in the same order, we can do a map-side merge. Such a merge can be very efficient on a system like MapR, where you can arrange for files to live on the same nodes.

On Sun, Jul 21, 2013 at 4:43 PM, Pat Ferrel pat.fer...@gmail.com wrote: RowSimilarity downsampling? Are you referring to a mod of the matrix multiply to do cross-similarity with LLR for the cross-recommendations? So similarity of rows of B with rows of A? It sounds like you are proposing not only putting a recommender in Solr but also a cross-recommender. Is this why getting a real data set is problematic?
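The downsampling Ted describes, dropping elements at random from over-long rows, can be sketched like this. A toy illustration only: the cap name `max_prefs` is made up here, not a Mahout parameter.

```python
import random

def downsample_row(row, max_prefs, rng=None):
    """Cap a user's interaction row by dropping elements at random.

    Rows at or below the cap pass through untouched; longer rows
    keep a uniform random sample of max_prefs elements.
    """
    rng = rng or random.Random()
    if len(row) <= max_prefs:
        return list(row)
    return rng.sample(row, max_prefs)
```

The motivation is that very active users add little new signal past their first few hundred interactions but add quadratic cost to the cooccurrence counting, so capping row length bounds the work with little loss in quality.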
Re: Setting up a recommender
At the moment, the downsampling is done by PreparePreferenceMatrixJob for the collaborative filtering functionality. We just want to move it down to RowSimilarityJob to enable standalone usage.

I think that the CrossRecommender should be the next thing on our agenda, after we have the deployment infrastructure. I especially like that it's capable of including different kinds of interactions, as opposed to most other (academically motivated) recommenders that focus on a single interaction type like a rating. --sebastian

On 22.07.2013 02:14, Ted Dunning wrote: The row similarity downsampling is just a matter of dropping elements at random from rows that have more data than we want. If the join that puts the row together can handle two kinds of input, then RowSimilarity can be easily modified to be CrossRowSimilarity. Likewise, if we have two DRMs with the same row ids in the same order, we can do a map-side merge. Such a merge can be very efficient on a system like MapR, where you can arrange for files to live on the same nodes.
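The cross-recommender arithmetic under discussion (and Frank's B'B * h and B'A * h question at the top of the thread) reduces to two matrix products applied to a user's history vector. A tiny pure-Python sketch with made-up toy matrices:

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(x, y):
    yt = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in yt] for row in x]

# Toy data: rows are users, columns are items.
# B holds the primary action (say, purchases), A a secondary action (say, views).
B = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]
A = [[1, 0, 1, 0],
     [0, 1, 1, 0],
     [1, 1, 0, 1]]

BtB = matmul(transpose(B), B)  # item-item cooccurrence within B (3x3)
BtA = matmul(transpose(B), A)  # cross-cooccurrence, B items vs A items (3x4)

h = [1, 0, 0]                         # one user's history over B items
scores_b = matvec(BtB, h)             # recommendation scores over B items
scores_a = matvec(transpose(BtA), h)  # scores over A items from B history
```

In the real pipeline the raw counts in BtB and BtA would be filtered through LLR down to sparse indicator links rather than used directly; this sketch only shows the shape of the multiplication.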
Re: Setting up a recommender
Hello, if there is a high demand for this functionality, my company (http://www.apaxo.de/us/recitems.html) could implement this. Nevertheless, we can't do it for free, so if it is possible to get a shared budget from everybody who is interested in this, then it would be possible to write it. The codehaus JIRA has an incentive functionality: https://secure.donay.com/site/index Perhaps this might also be useful for the Mahout (a.k.a. Apache) JIRA. /Manuel

On 20.07.2013 at 00:45, Ted Dunning wrote: OK. I think the crux here is the off-line-to-Solr part, so let's see who else pops up. Having a Solr maven could be very helpful.

On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo lcguerreroc...@gmail.com wrote: I'm currently working for a portal that has a similar use case, and I was thinking of implementing this in a similar way. I'm generating recommendations using Python scripts based on similarity measures (content-based recommendation), only using Euclidean distance and some weights for each attribute. I want to use Mahout's GenericItemBasedRecommender to generate these same recommendations without user data (no tracking right now of the user-to-item relationship). I was thinking of pushing the generated recommendations to Solr using atomic updates, since my fields are all stored right now. Since this is very similar to what I'm trying to accomplish, I would sign up to collaborate in any way I can, since I'm fairly familiar with Solr and I'm starting to learn my way around Mahout.

On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org wrote: I would also be willing to provide guidance and advice for anyone taking this on; I can especially help with the offline analysis part. --sebastian

2013/7/19 Ted Dunning ted.dunn...@gmail.com: I would be happy to supervise a project to implement a demo of this if anybody is willing to do the grunt work of gluing things together. Sooo, if you would like to work on this, here is a suggested project. This project would entail:
a) build a synthetic data source
b) write scripts to do the off-line analysis
c) write scripts to export to Solr
d) write a very quick web facade over Solr to make it look like a recommendation engine. This would include:
d.1) a most-popular page that does combined popularity rise and recommendation
d.2) a personal recommendation page that does just recommendation with dithering
d.3) item pages with related items at the bottom
e) work with others to provide a high-quality system walk-through and install directions
If you want to bite on this, we should arrange a weekly video hangout. I am willing to commit to guiding and providing detailed technical approaches. You should be willing to commit to actually doing stuff. The goal would be to provide a fully worked-out scaffolding of a practical recommendation system that presumably would become an example module in Mahout.

On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com wrote: +1 as well. Sounds fun.

On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner cont...@dhuebner.com wrote: +1 for getting something like that in a future release of Mahout

On Jul 19, 2013, at 10:02 PM, Sebastian Schelter s...@apache.org wrote: It would be awesome if we could get a nice, easily deployable implementation of that approach into Mahout before 1.0

2013/7/19 Ted Dunning ted.dunn...@gmail.com: My current advice is to use Hadoop (if necessary) to build a sparse item-item matrix based on each kind of behavior you have and then drop those similarities into a search engine to deliver the actual recommendations. This allows lots of flexibility in terms of which kinds of inputs you use for the recommendation and lets you blend recommendations with search and geo-location.

On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins helder.ga...@corp.terra.com.br wrote: Hi, I'm a dev working for a web portal in Brazil and I'm particularly interested in building an item-based collaborative filtering recommender for our database of news articles. After some coding, I was able to get some recommendations using a GenericItemBasedRecommender, a CassandraDataModel, and some custom classes that store item similarities and migrated item IDs into Cassandra. But now I'm in doubt about what is normally done with this recommender: should I run this as a daemon, cache the recommendations in memory, and set up a web service to consult it online? Should I pre-process these recommendations for each recent user and store them somewhere? My first idea was storing all these recs back into Cassandra, but looking into some classes it seems to me that the norm is to read the input data and store the output always using files. Is this a common practice that benefits from HDFS? My use case here is something around 70k recommendation requests per second. Thanks in advance,
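Step (a) of the outline, the synthetic data source, can be sketched simply: generate (user, item) interaction pairs with a skewed popularity distribution so the scale knobs (users, items, activity per user) can be turned independently. Everything here, including the function name and the squared-uniform popularity skew, is an illustrative choice, not a spec.

```python
import random

def synthetic_interactions(n_users, n_items, mean_prefs, seed=0):
    """Generate deduplicated (user, item) pairs.

    Activity per user is roughly exponential with the given mean;
    squaring a uniform draw skews item choice toward low item ids,
    giving a rough head-and-tail popularity shape.
    """
    rng = random.Random(seed)
    pairs = set()
    for user in range(n_users):
        n = max(1, int(rng.expovariate(1.0 / mean_prefs)))
        for _ in range(n):
            item = int(n_items * rng.random() ** 2)
            pairs.add((user, item))
    return sorted(pairs)
```

Because the generator is seeded, runs are reproducible, and scaling a benchmark up or down is just a matter of changing `n_users` and `n_items`.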
Re: Setting up a recommender
I am very interested in collaborating on the off-line to Solr part. Just let me know how we want to get going. Thanks, Andrew

On 7/19/13 4:45 PM, Ted Dunning ted.dunn...@gmail.com wrote: OK. I think the crux here is the off-line-to-Solr part, so let's see who else pops up. Having a Solr maven could be very helpful.
Re: Setting up a recommender
To kick this off, I have created a design document that is open for comments. Much detail is needed here. I will create a JIRA as well, but the Google doc is much easier for collating lots of input into a coherent document. The directory that the document is stored in is accessible at http://bit.ly/18vbbaT

Once we get going, we can talk about how to coordinate tasks between hangouts. One option is a public Trello project (https://trello.com/), or we can use JIRA sub-tasks.

On Sat, Jul 20, 2013 at 11:25 AM, Andrew Psaltis andrew.psal...@webtrends.com wrote: I am very interested in collaborating on the off-line to Solr part. Just let me know how we want to get going. Thanks, Andrew
Re: Setting up a recommender
My current advice is to use Hadoop (if necessary) to build a sparse item-item matrix based on each kind of behavior you have and then drop those similarities into a search engine to deliver the actual recommendations. This allows lots of flexibility in terms of which kinds of inputs you use for the recommendation and lets you blend recommendations with search and geo-location.

On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins helder.ga...@corp.terra.com.br wrote: Hi, I'm a dev working for a web portal in Brazil and I'm particularly interested in building an item-based collaborative filtering recommender for our database of news articles. After some coding, I was able to get some recommendations using a GenericItemBasedRecommender, a CassandraDataModel, and some custom classes that store item similarities and migrated item IDs into Cassandra. But now I'm in doubt about what is normally done with this recommender: should I run this as a daemon, cache the recommendations in memory, and set up a web service to consult it online? Should I pre-process these recommendations for each recent user and store them somewhere? My first idea was storing all these recs back into Cassandra, but looking into some classes it seems to me that the norm is to read the input data and store the output always using files. Is this a common practice that benefits from HDFS? My use case here is something around 70k recommendation requests per second. Thanks in advance, -- Atenciosamente, Helder Martins, Arquitetura do Portal e Sistemas de Backend, +55 (51) 3284-4475, Terra
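Ted's "drop the similarities into a search engine" step can be illustrated with a toy in-memory stand-in for the Solr query: each item document carries an indicators field (the items linked to it by LLR), and a user's recent history is fired at that field as an OR query; matches score by overlap. The item names and index layout here are illustrative only, echoing the iphone/ipad example format from earlier in the thread.

```python
def recommend(index, history, top_n=3):
    """Toy stand-in for the Solr query step.

    index maps each item to its indicator items (LLR-linked items);
    score each candidate by how many history items hit its indicators,
    like an OR query over the indicator field.
    """
    scored = []
    for item, indicators in index.items():
        if item in history:
            continue  # don't recommend what the user already has
        score = len(set(history) & set(indicators))
        if score > 0:
            scored.append((score, item))
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by name
    return [item for _, item in scored[:top_n]]

# Hypothetical indicator index, in the spirit of the thread's CSV example.
index = {
    "iphone": ["ipad", "galaxy"],
    "ipad": ["iphone"],
    "galaxy": ["nexus", "iphone"],
    "nexus": ["galaxy"],
}
```

For a user with history `["iphone", "ipad"]`, only "galaxy" both matches an indicator and is new to the user, so it comes back as the recommendation. In a real deployment the search engine also handles the blending with text search and geo that Ted mentions.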
Re: Setting up a recommender
It would be awesome if we could get a nice, easily deployable implementation of that approach into Mahout before 1.0.

2013/7/19 Ted Dunning ted.dunn...@gmail.com: My current advice is to use Hadoop (if necessary) to build a sparse item-item matrix based on each kind of behavior you have and then drop those similarities into a search engine to deliver the actual recommendations. This allows lots of flexibility in terms of which kinds of inputs you use for the recommendation and lets you blend recommendations with search and geo-location.
Re: Setting up a recommender
+1 as well. Sounds fun.

On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner cont...@dhuebner.com wrote: +1 for getting something like that in a future release of Mahout

On Jul 19, 2013, at 10:02 PM, Sebastian Schelter s...@apache.org wrote: It would be awesome if we could get a nice, easily deployable implementation of that approach into Mahout before 1.0

2013/7/19 Ted Dunning ted.dunn...@gmail.com: My current advice is to use Hadoop (if necessary) to build a sparse item-item matrix based on each kind of behavior you have and then drop those similarities into a search engine to deliver the actual recommendations. This allows lots of flexibility in terms of which kinds of inputs you use for the recommendation and lets you blend recommendations with search and geo-location.

-- BF Lyon http://www.nowherenearithaca.com
Re: Setting up a recommender
On Fri, Jul 19, 2013 at 12:59 PM, Ted Dunning ted.dunn...@gmail.com wrote: My current advice is to use Hadoop (if necessary) to build a sparse item-item matrix based on each kind of behavior you have and then drop those similarities into a search engine

you mean like Lucene / Katta?

to deliver the actual recommendations. This allows lots of flexibility in terms of which kinds of inputs you use for the recommendation and lets you blend recommendations with search and geo-location.

On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins helder.ga...@corp.terra.com.br wrote: Hi, I'm a dev working for a web portal in Brazil and I'm particularly interested in building an item-based collaborative filtering recommender for our database of news articles. After some coding, I was able to get some recommendations using a GenericItemBasedRecommender, a CassandraDataModel, and some custom classes that store item similarities and migrated item IDs into Cassandra. But now I'm in doubt about what is normally done with this recommender: should I run it as a daemon, cache the recommendations in memory, and set up a web service to consult it online? Or should I pre-process these recommendations for each recent user and store them somewhere? My first idea was storing all these recs back into Cassandra, but looking into some classes it seems to me that the norm is to read the input data and store the output always using files. Is this a common practice that benefits from HDFS? My use case here is something around 70k recommendation requests per second. Thanks in advance, -- Atenciosamente Helder Martins Arquitetura do Portal e Sistemas de Backend +55 (51) 3284-4475 Terra
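The "sparse item-item matrix based on each kind of behavior" step Ted describes is, in Mahout's cooccurrence pipeline, typically scored with the log-likelihood ratio (LLR) test on each item pair's 2x2 contingency table. A minimal self-contained sketch of that scoring (function and parameter names here are illustrative, not Mahout's actual API):

```python
from math import log

def entropy(*counts):
    """Shannon entropy (in nats) of an unnormalized count distribution."""
    total = sum(counts)
    return -sum(k / total * log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table:
    k11 = users who interacted with both items,
    k12 = with item A only, k21 = with item B only,
    k22 = with neither. G^2 = 2 * N * mutual_information."""
    n = k11 + k12 + k21 + k22
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * n * (row_entropy + col_entropy - matrix_entropy)
```

Independent items score near zero; strongly cooccurring items score high, so the top-scoring pairs per item become the "similar items" fields dropped into the search engine.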
Re: Setting up a recommender
I would be happy to supervise a project to implement a demo of this if anybody is willing to do the grunt work of gluing things together. Sooo, if you would like to work on this, here is a suggested project. This project would entail:

a) build a synthetic data source
b) write scripts to do the off-line analysis
c) write scripts to export to Solr
d) write a very quick web facade over Solr to make it look like a recommendation engine. This would include
d.1) a most popular page that does combined popularity rise and recommendation
d.2) a personal recommendation page that does just recommendation with dithering
d.3) item pages with related items at the bottom
e) work with others to provide high quality system walk-through and install directions

If you want to bite on this, we should arrange a weekly video hangout. I am willing to commit to guiding and providing detailed technical approaches. You should be willing to commit to actually doing stuff. The goal would be to provide a fully worked out scaffolding of a practical recommendation system that presumably would become an example module in Mahout.

On Fri, Jul 19, 2013 at 1:08 PM, B Lyon bradfl...@gmail.com wrote: +1 as well. Sounds fun.

On Fri, Jul 19, 2013 at 4:06 PM, Dominik Hübner cont...@dhuebner.com wrote: +1 for getting something like that in a future release of Mahout

On Jul 19, 2013, at 10:02 PM, Sebastian Schelter s...@apache.org wrote: It would be awesome if we could get a nice, easily deployable implementation of that approach into Mahout before 1.0

2013/7/19 Ted Dunning ted.dunn...@gmail.com: My current advice is to use Hadoop (if necessary) to build a sparse item-item matrix based on each kind of behavior you have and then drop those similarities into a search engine to deliver the actual recommendations. This allows lots of flexibility in terms of which kinds of inputs you use for the recommendation and lets you blend recommendations with search and geo-location.

On Fri, Jul 19, 2013 at 12:33 PM, Helder Martins helder.ga...@corp.terra.com.br wrote: Hi, I'm a dev working for a web portal in Brazil …

-- BF Lyon
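The "dithering" in point d.2 refers to deliberately shuffling the result order a little on each request so that lower-ranked items occasionally get exposure (and thus click feedback). One common sketch perturbs the log of each item's rank with Gaussian noise and re-sorts; the epsilon parameterization below is one illustrative choice, not a fixed recipe:

```python
import math
import random

def dither(ranked_items, epsilon=1.5, rng=None):
    """Re-rank a result list by log(rank) + Gaussian noise.

    The noise std-dev is log(epsilon), so epsilon=1.0 disables
    dithering entirely; larger epsilon lets items drift further
    from their original position, mostly near the tail where the
    log-rank gaps are small.
    """
    rng = rng or random.Random()
    sigma = math.log(epsilon)
    keyed = sorted(
        (math.log(rank + 1) + rng.gauss(0.0, sigma), item)
        for rank, item in enumerate(ranked_items)
    )
    return [item for _, item in keyed]
```

Because the noise is applied to log-rank, the top few results stay fairly stable while the long tail gets churned, which is usually the desired trade-off.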
Re: Setting up a recommender
I would also be willing to provide guidance and advice for anyone taking this on, I can especially help with the offline analysis part. --sebastian

2013/7/19 Ted Dunning ted.dunn...@gmail.com: I would be happy to supervise a project to implement a demo of this if anybody is willing to do the grunt work of gluing things together. …
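On the delivery side, "drop those similarities into a search engine" means serving a recommendation becomes an ordinary search: OR the user's recent history items against the indicator field and let the engine's relevance ranking do the scoring. A tiny sketch of building such a query string (the field name b_b_links follows the CSV output format discussed earlier in the thread; the helper name is mine):

```python
def history_query(field, history, max_terms=50):
    """Build a Solr/Lucene-style query matching a user's recent
    history against an item-document indicator field, e.g.
    b_b_links:(iphone OR ipad)."""
    # Cap the number of terms; truncating to the most recent
    # items keeps queries cheap without shrinking training data.
    terms = list(history)[:max_terms]
    return "{}:({})".format(field, " OR ".join(terms))
```

For example, history_query("b_b_links", ["iphone", "ipad"]) produces b_b_links:(iphone OR ipad); the highest-scoring item documents are the recommendations.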
Re: Setting up a recommender
OK. I think the crux here is the off-line to Solr part so let's see who else pops up. Having a Solr maven could be very helpful.

On Fri, Jul 19, 2013 at 3:39 PM, Luis Carlos Guerrero Covo lcguerreroc...@gmail.com wrote: I'm currently working for a portal that has a similar use case and I was thinking of implementing this in a similar way. I'm generating recommendations using Python scripts based on similarity measures (content-based recommendation), only using Euclidean distance and some weights for each attribute. I want to use Mahout's GenericItemBasedRecommender to generate these same recommendations without user data (there is no tracking of the user-to-item relationship right now). I was thinking of pushing the generated recommendations to Solr using atomic updates, since my fields are all stored right now. Since this is very similar to what I'm trying to accomplish, I would sign up to collaborate in any way I can: I'm fairly familiar with Solr and I'm starting to learn my way around Mahout.

On Fri, Jul 19, 2013 at 5:12 PM, Sebastian Schelter s...@apache.org wrote: I would also be willing to provide guidance and advice for anyone taking this on, I can especially help with the offline analysis part. --sebastian

2013/7/19 Ted Dunning ted.dunn...@gmail.com: I would be happy to supervise a project to implement a demo of this if anybody is willing to do the grunt work of gluing things together. …
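Luis's atomic-update idea maps to Solr's JSON atomic-update format, where each document carries a {"set": …} modifier per field so only the indicator field is overwritten while other stored fields survive (which is why his fields being stored matters). A sketch of building that payload (the field name b_b_links echoes the output format discussed earlier; the helper name is illustrative):

```python
import json

def atomic_update_payload(similar_items, field="b_b_links"):
    """Build a Solr JSON atomic-update body that sets one indicator
    field per item document, e.g. from a dict mapping item id to its
    ordered list of similar items."""
    docs = [{"id": item_id, field: {"set": " ".join(links)}}
            for item_id, links in sorted(similar_items.items())]
    return json.dumps(docs)

# POST the returned body to /solr/<collection>/update?commit=true
# with Content-Type: application/json.
payload = atomic_update_payload({"iphone": ["ipad", "galaxy"]})
```

Refreshing recommendations then becomes re-running the offline job and replaying these updates, with no full re-index of the item documents.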