*Pat*, I opened a ticket (M-1420) for putting a new script in examples/ that uses the solr-recommender. Seems there's another, related ticket from Suneel in M-1288.
Did the work described in the thread below make it into 0.9, and/or how much more is needed on it? *Ted*, if you have any code you could donate for this example from your and Ellen's book, I'd love to be able to re-use it.

Thanks,
Andrew

On Sun, Nov 17, 2013 at 3:36 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Eventually I'd like to get MAP built into the solr-recommender. Used it at
> a client who had good data. It was very helpful for exploring what data was
> useful and what wasn't. We'd run MAP with and without detail-view data, for
> instance, and take the MAP as a measure of how predictive the data was. In
> our case the MAP@ numbers went down with purchase and detail-view mixed
> together. That was why I got interested in the cross-action recommender--as
> a way to scrub less predictive actions. Didn't finish it before I lost
> access to the data, unfortunately.
>
> What form of precision calc will you use? Obviously we used mean average
> precision at different numbers of recommendations, which had the effect of
> producing a fall-off curve. We took the curve as a measure of how well
> our ranking was working.
>
> On Nov 17, 2013, at 10:47 AM, Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
>
> Hi Pat,
>
> On Nov 13, 2013, at 4:43pm, Pat Ferrel <pat.fer...@gmail.com> wrote:
>
> > Ever done an offline precision calc?
>
> No, sorry.
>
> I do (finally) have one client with some data that could be used to
> calculate precision, and a willingness to pay for the work, so I'm hoping
> to include details on that in my next blog post about text feature
> selection.
>
> -- Ken
>
>
> >> On Nov 13, 2013, at 1:39 PM, Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
> >>
> >> Hi Pat,
> >>
> >>> On Nov 13, 2013, at 9:21am, Pat Ferrel <p...@occamsmachete.com> wrote:
> >>>
> >>> A version is now checked in that uses Mahout 0.9. Haven't tested it on
> a cluster yet, only locally. I have to upgrade my cluster to Hadoop 1.2.1,
> which takes some time.
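[The offline MAP evaluation Pat describes above can be sketched in a few lines. This is a minimal illustration of mean average precision at k, assuming held-out preferences serve as the relevance judgments; the class and method names are made up for the example and are not Mahout or solr-recommender APIs.]

```java
// A minimal sketch of MAP@k for offline evaluation, as described above.
// Class and method names are illustrative only, not Mahout APIs.
import java.util.*;

public class MapAtK {

    // Average precision at k for one user: precision@i summed at each rank i
    // holding a relevant (held-out) item, divided by min(k, |relevant|).
    static double averagePrecision(List<String> recs, Set<String> relevant, int k) {
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < Math.min(k, recs.size()); i++) {
            if (relevant.contains(recs.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        int denom = Math.min(k, relevant.size());
        return denom == 0 ? 0.0 : sum / denom;
    }

    // MAP@k: mean of per-user average precision. Computing this at several
    // values of k gives the fall-off curve mentioned above.
    static double mapAtK(List<List<String>> recsPerUser, List<Set<String>> relevantPerUser, int k) {
        double sum = 0.0;
        for (int u = 0; u < recsPerUser.size(); u++) {
            sum += averagePrecision(recsPerUser.get(u), relevantPerUser.get(u), k);
        }
        return recsPerUser.isEmpty() ? 0.0 : sum / recsPerUser.size();
    }
}
```

[Running with and without an action type (e.g. detail-view) and comparing the curves is the "how predictive is this data" experiment described above.]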
> >>>
> >>> Saw the Strata slides from Ted touting dithering of results, which
> I'll implement.
> >>>
> >>> Ken, did you have anything specific for "And usually I just use Solr
> to generate a candidate list, then I do more specific scoring to find the N
> best from N*4 candidates"?
> >>
> >> If I'm looking for the top N best matches, I'll do a Solr query with
> rows=N*4.
> >>
> >> Then I use all of the data from these potential matches, and calculate
> a more sophisticated similarity score (e.g. adding a weighting based on the
> user's activity level) between my target and these candidates.
> >>
> >> Regards,
> >>
> >> -- Ken
> >>
> >>>
> >>> Was planning to try boosting by something like genre/category in the
> recs query. For instance, in the demo data, each item will soon have a set
> of tags (actually genre names) so these could be a field being queried
> along with the item-item links. The query for recs would then include the
> user history against the item-item links, and the average genre tags
> preferred by the user against item genre tags. This would return recs
> skewed towards the user's genre preference.
> >>>
> >>> Another way this could be used is when showing similar items. You'd
> have the tags for the item being viewed and so could use them to skew
> towards items with similar tags. I think this works but would turn similar
> items from a lookup (they are pre-calculated by Mahout) into another Solr
> query.
> >>>
> >>>
> >>>
> >>> On Nov 8, 2013, at 1:27 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >>>
> >>> Not planning to do anything with weights at present. An ORed query
> with Solr weights should suffice for the time being. There is a good list
> of ways to do this later if it warrants an experiment. Thanks.
> >>>
> >>> Have similar items as input, recommendations from user "likes", and
> just got recs from recently viewed working. Once you have online recs from
> the pre-calculated model, experimenting is super easy.
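[For the result dithering mentioned above: one common formulation re-sorts the ranked list by log(rank) plus Gaussian noise, so top results mostly stay on top while the tail gets shuffled on each request. The exact formula here is an assumption for illustration, not taken from the slides.]

```java
// A sketch of result dithering: re-rank by log(rank) + Gaussian noise.
// The formula choice is an assumption; epsilon = 1.0 means no shuffling,
// larger values shuffle more aggressively.
import java.util.*;

public class Dither {

    static <T> List<T> dither(List<T> ranked, double epsilon, Random rng) {
        double sd = Math.log(epsilon); // noise scale
        int n = ranked.size();
        Integer[] order = new Integer[n];
        double[] key = new double[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
            key[i] = Math.log(i + 1) + rng.nextGaussian() * sd;
        }
        // Lower key = better rank after dithering.
        Arrays.sort(order, Comparator.comparingDouble(i -> key[i]));
        List<T> out = new ArrayList<>(n);
        for (int i : order) out.add(ranked.get(i));
        return out;
    }
}
```

[The payoff is exploration: items just below the page-one cutoff occasionally surface, generating feedback data the recommender would otherwise never collect.]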
The next step will be
> to get more metadata ingested so we can try boosting by context genre, or
> recent genre viewed, which is sort of in line with "more specific scoring
> to find the N best from N*4 candidates". Also want to do what Ted calls
> dithering to vary the choices you see.
> >>>
> >>> On Nov 8, 2013, at 10:10 AM, Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
> >>>
> >>> One other thing I should have mentioned is that if you care about
> setting weights on incoming terms, you can boost them using the ^<value>
> syntax.
> >>>
> >>> E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0..."
> >>>
> >>> If you want to account for weights of terms in the index, it's a bit
> harder. You can do simple boosting by replicating terms, or you can use
> payload-based boosting, or you could code up your own Similarity class that
> takes advantage of side-channel data.
> >>>
> >>> But in my experience the gain from applying weights to terms in the
> index isn't very significant.
> >>>
> >>> And usually I just use Solr to generate a candidate list, then I do more
> specific scoring to find the N best from N*4 candidates.
> >>>
> >>> -- Ken
> >>>
> >>>> On Nov 8, 2013, at 9:54am, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>>
> >>>> For recommendation work, I suggest that it would be better to simply
> code
> >>>> out an explicit OR query.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <
> kkrugler_li...@transpac.com> wrote:
> >>>>
> >>>>> Hi Pat,
> >>>>>
> >>>>>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pat.fer...@gmail.com> wrote:
> >>>>>>
> >>>>>> Another approach would be to weight the terms in the docs by their
> >>>>> Mahout similarity strength. But that will be for another day.
> >>>>>>
> >>>>>> My current question is whether Lucene looks at word proximity. I
> see the
> >>>>> query syntax supports proximity but I don't see that it is default, so
> >>>>> that's good.
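[The `^<value>` boost syntax above, combined with Ted's explicit OR query, might be assembled like this. A sketch only: the class and method names are hypothetical, and it does no escaping of Solr special characters.]

```java
// Assembling a boosted OR query ("term^weight OR term^weight ...") from
// item ids and per-term boosts, as in the example query above.
// Hypothetical helper; no escaping of Solr special characters.
import java.util.*;

public class BoostedQuery {

    static String orQuery(LinkedHashMap<String, Double> termBoosts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : termBoosts.entrySet()) {
            if (sb.length() > 0) sb.append(" OR ");
            sb.append(e.getKey()).append('^').append(e.getValue());
        }
        return sb.toString();
    }
}
```

[A recency-weighted history query is one application: weight recently viewed items higher on the query side, which gets "recency matters most" into the score without relying on term position in the index.]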
> >>>>>
> >>>>> Based on your description of what you do (generate an OR query of N
> terms)
> >>>>> then no, you shouldn't be getting a boost from proximity.
> >>>>>
> >>>>> Note that with edismax you can specify a phrase boost, but it will
> be on
> >>>>> the entire set of terms being searched, so unlikely to come into
> play even
> >>>>> if you were using that.
> >>>>>
> >>>>> -- Ken
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <
> james.d...@ingramcontent.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> To the best of my knowledge, Lucene does not care about the position of a
> >>>>> keyword within a document.
> >>>>>>
> >>>>>> You could bucket the ids into several fields. Then use a dismax
> query
> >>>>> to boost the top-tier ids more than the second, etc.
> >>>>>>
> >>>>>> A more fine-grained approach would probably involve a custom
> Similarity
> >>>>> class that scales the score based on its position in the document.
> If we
> >>>>> did this, it might be simpler to index as 1 single-valued field so
> each id
> >>>>> was position+1 rather than position+100, etc.
> >>>>>>
> >>>>>> James Dyer
> >>>>>> Ingram Content Group
> >>>>>> (615) 213-4311
> >>>>>>
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Pat Ferrel [mailto:pat.fer...@gmail.com]
> >>>>>> Sent: Thursday, November 07, 2013 1:46 PM
> >>>>>> To: u...@mahout.apache.org
> >>>>>> Subject: Re: Solr-recommender for Mahout 0.9
> >>>>>>
> >>>>>> Interesting to think about ordering and adjacentness. The index ids
> are
> >>>>> sorted by Mahout strength so the first id is the most similar to the
> row
> >>>>> key and so forth. But the query is ordered by recency. In both
> cases the
> >>>>> first id is in some sense the most important. Does Solr/Lucene care
> about
> >>>>> closeness to the top of doc for queries or indexed docs? I don't
> recall any
> >>>>> mention of this.
> >>>>>>
> >>>>>> However adjacentness has no meaning in recommendations, though I
> think
> >>>>> it's used in default queries so I may have to account for that.
> >>>>>>
> >>>>>> The object returned is an ordered list of ids. I use only the IDs
> now
> >>>>> but there are cases when the contents are also of interest; shopping
> >>>>> cart/watchlist queries for example.
> >>>>>>
> >>>>>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <
> james.d...@ingramcontent.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> The multivalued field will obey the "positionIncrementGap" value you
> >>>>> specify (default=100). So for querying purposes, those ids will be
> 100
> >>>>> (or whatever you specified) positions apart. So a phrase search for
> >>>>> adjacent ids would not match, unless you set the slop to >=
> >>>>> positionIncrementGap. Other than this, both scenarios index the
> same.
> >>>>>>
> >>>>>> For stored fields, Solr returns an array of values for multivalued
> >>>>> fields, which is convenient when writing a UI.
> >>>>>>
> >>>>>> James Dyer
> >>>>>> Ingram Content Group
> >>>>>> (615) 213-4311
> >>>>>>
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Dominik Hübner [mailto:cont...@dhuebner.com]
> >>>>>> Sent: Thursday, November 07, 2013 11:23 AM
> >>>>>> To: u...@mahout.apache.org
> >>>>>> Subject: Re: Solr-recommender for Mahout 0.9
> >>>>>>
> >>>>>> Does anyone know what the difference is between keeping the ids in a
> >>>>> space delimited string and indexing a multivalued field of ids? I
> recently
> >>>>> tried the latter since ... it felt right, however I am not sure
> which of
> >>>>> both has which advantages.
> >>>>>>
> >>>>>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pat.fer...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> I have dismax (not edismax) but am not using it yet, using the
> default
> >>>>> query, which does use 'AND'. I had much the same thought as I slept
> on it.
> >>>>> Changing to OR is now working much, much better.
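[The positionIncrementGap behavior described above can be modeled with a toy position assigner. This is a simplification of what Lucene actually does (real analyzers control increments per token), but it shows why ids from different values of a multivalued field land >= gap positions apart, so a phrase query cannot match across values unless its slop covers the gap.]

```java
// Toy model of token positions in a multivalued field with a
// positionIncrementGap (Solr default 100). Simplified relative to real
// Lucene analysis; for illustration only.
import java.util.*;

public class PositionGap {

    static Map<String, Integer> positions(List<String> values, int gap) {
        Map<String, Integer> pos = new LinkedHashMap<>();
        int p = -1;
        boolean firstValue = true;
        for (String value : values) {
            String[] tokens = value.trim().split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                // the first token of a later value jumps by the gap
                p += (i == 0 && !firstValue) ? gap : 1;
                pos.put(tokens[i], p);
            }
            firstValue = false;
        }
        return pos;
    }
}
```

[A space-delimited string of ids in one value, by contrast, produces consecutive positions, which is the main indexing-side difference behind Dominik's question; either way an OR query of ids scores the same.]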
So obvious it almost
> bit
> >>>>> me, not good in this case...
> >>>>>>>
> >>>>>>> With only a trivially small amount of testing I'd say we have a new
> >>>>> recommender on the block.
> >>>>>>>
> >>>>>>> If anyone would like to help eyeball test the thing let me know
> >>>>> off-list. There are a few instructions I'll need to give. And it
> can't
> >>>>> handle much load right now due to intentional design limits.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <
> james.d...@ingramcontent.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>> Pat,
> >>>>>>>
> >>>>>>> Can you give us the query it generates when you enter "vampire
> werewolf
> >>>>> zombie", q/qt/defType?
> >>>>>>>
> >>>>>>> My guess is you're using the default query parser with "q.op=AND", or
> >>>>> you're using dismax/edismax with a high "mm" (min-should-match) value.
> >>>>>>>
> >>>>>>> James Dyer
> >>>>>>> Ingram Content Group
> >>>>>>> (615) 213-4311
> >>>>>>>
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Pat Ferrel [mailto:pat.fer...@gmail.com]
> >>>>>>> Sent: Wednesday, November 06, 2013 5:53 PM
> >>>>>>> To: s...@apache.org Schelter; u...@mahout.apache.org
> >>>>>>> Subject: Re: Solr-recommender for Mahout 0.9
> >>>>>>>
> >>>>>>> Done,
> >>>>>>>
> >>>>>>> BTW I have the thing running on a demo site but am getting very
> poor
> >>>>> results that I think are related to the Solr setup. I'd appreciate
> any
> >>>>> ideas.
> >>>>>>>
> >>>>>>> The sample data has 27,000 items and something like 4000 users. The
> >>>>> preference data is fairly dense since the users are professional
> reviewers
> >>>>> and the items are videos.
> >>>>>>>
> >>>>>>> 1) The number of item-item similarities that are kept is 100. Is
> this a
> >>>>> good starting point? Ted, do you recall how many you used before?
> >>>>>>> 2) The query is a simple text query made of space delimited video
> id
> >>>>> strings.
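[The q.op=AND diagnosis above matches the symptom exactly: with AND, a doc must contain every queried id, so multi-item histories return nothing, while OR matches any overlap. A toy model (hypothetical names; docs map an item id to its set of similar-item ids):]

```java
// Toy model of q.op=AND vs OR over item-similarity docs: with AND every
// queried id must appear in a doc's similarity list; with OR any one will
// do. Names are hypothetical, for illustration only.
import java.util.*;

public class QueryOp {

    static List<String> match(Map<String, Set<String>> docs, Set<String> query, boolean requireAll) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> doc : docs.entrySet()) {
            boolean matches = requireAll
                ? doc.getValue().containsAll(query)      // AND semantics
                : !Collections.disjoint(doc.getValue(), query); // OR semantics
            if (matches) hits.add(doc.getKey());
        }
        return hits;
    }
}
```

[With a two-item history like {vampire, zombie} and docs that each carry only part of it, AND returns nothing while OR returns both docs; the same effect explains the "vampire werewolf zombie" title search returning no results.]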
These are the same ids as are stored in the item-item
> similarity
> >>>>> docs that Solr indexes.
> >>>>>>>
> >>>>>>> Hit thumbs up on one video and you get several recommendations. Hit
> >>>>> thumbs up on several videos and you get no recs. I'm either using the
> wrong
> >>>>> query type or have it set up to be too restrictive. As I read
> through the
> >>>>> docs, if someone has a suggestion or pointer I'd appreciate it.
> >>>>>>>
> >>>>>>> BTW the same sort of thing happens with title search. Search for
> >>>>> "vampire werewolf zombie" and you get no results; search for "zombie"
> and you get
> >>>>> several.
> >>>>>>>
> >>>>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <s...@apache.org>
> wrote:
> >>>>>>>
> >>>>>>> Hi Pat,
> >>>>>>>
> >>>>>>> can you create issues for 1) and 2)? Then I will try to get this
> into
> >>>>>>> trunk asap.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Sebastian
> >>>>>>>
> >>>>>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
> >>>>>>>> Trying to integrate the Solr-recommender with the latest Mahout
> >>>>> snapshot. The project uses a modified RecommenderJob because it needs
> >>>>> SequenceFile output and to get the location of the
> preparePreferenceMatrix
> >>>>> directory. If #1 and #2 are addressed I can remove the modified
> Mahout code
> >>>>> from the project and rely on the default implementations in Mahout
> 0.9. #3
> >>>>> is a longer term issue related to the creation of a
> CrossRowSimilarityJob.
> >>>>>>>>
> >>>>>>>> I have dropped the modified code from the Solr-recommender
> project and
> >>>>> have a modified build of the current Mahout 0.9 snapshot. If the
> following
> >>>>> changes are made to Mahout I can test and release a Mahout 0.9
> version of
> >>>>> the Solr-recommender.
> >>>>>>>>
> >>>>>>>> 1. Option to change RecommenderJob output format
> >>>>>>>>
> >>>>>>>> Can someone add an option to output a SequenceFile.
I modified the
> >>>>> code to do the following; note the SequenceFileOutputFormat.class as
> the
> >>>>> last parameter, but this should really be determined with an option, I
> think.
> >>>>>>>>
> >>>>>>>> Job aggregateAndRecommend = prepareJob(
> >>>>>>>> new Path(aggregateAndRecommendInput), outputPath,
> >>>>> SequenceFileInputFormat.class,
> >>>>>>>> PartialMultiplyMapper.class, VarLongWritable.class,
> >>>>> PrefAndSimilarityColumnWritable.class,
> >>>>>>>> AggregateAndRecommendReducer.class, VarLongWritable.class,
> >>>>> RecommendedItemsWritable.class,
> >>>>>>>> SequenceFileOutputFormat.class);
> >>>>>>>>
> >>>>>>>> 2. Visibility of preparePreferenceMatrix directory location
> >>>>>>>>
> >>>>>>>> The Solr-recommender needs to find where the RecommenderJob is
> putting
> >>>>> its output.
> >>>>>>>>
> >>>>>>>> Mahout 0.8 RecommenderJob code was:
> >>>>>>>> public static final String DEFAULT_PREPARE_DIR =
> >>>>> "preparePreferenceMatrix";
> >>>>>>>>
> >>>>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
> >>>>> inline in the code:
> >>>>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
> >>>>>>>>
> >>>>>>>> This change to Mahout 0.9 works:
> >>>>>>>> public static final String DEFAULT_PREPARE_DIR =
> >>>>> "preparePreferenceMatrix";
> >>>>>>>> and
> >>>>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
> >>>>>>>>
> >>>>>>>> You could also make this a getter method on the RecommenderJob class
> >>>>> instead of using a public constant.
> >>>>>>>>
> >>>>>>>> 3. Downsampling
> >>>>>>>>
> >>>>>>>> The downsampling for maximum prefs per user has been moved from
> >>>>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob
> uses
> >>>>> matrix math instead of RSJ so it will no longer support downsampling
> until
> >>>>> there is a hypothetical CrossRowSimilarityJob with downsampling in
> it.
> >>>>>
> >>>>> --------------------------
> >>>>> Ken Krugler
> >>>>> +1 530-210-6378
> >>>>> http://www.scaleunlimited.com
> >>>>> custom big data solutions & training
> >>>>> Hadoop, Cascading, Cassandra & Solr