*Pat*, I opened a ticket (M-1420) for putting a new script in examples/ that uses the solr-recommender. Seems there's another, related ticket from Suneel in M-1288.
Did the work described in the thread below make it into 0.9, and/or how much more is needed on it? *Ted*, if you have any code you could donate for this example from your and Ellen's book, I'd love to be able to re-use it.

Thanks,
Andrew

On Sun, Nov 17, 2013 at 3:36 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Eventually I'd like to get MAP built into the solr-recommender. Used it at
> a client who had good data. It was very helpful for exploring what data was
> useful and what wasn't. We'd run MAP with and without detail-view data, for
> instance, and take the MAP as a measure of how predictive the data was. In
> our case the MAP@ numbers went down with purchase and detail-view mixed
> together. That was why I got interested in the cross-action recommender--as
> a way to scrub less predictive actions. Didn't finish it before I lost
> access to the data, unfortunately.
>
> What form of precision calc will you use? Obviously we used mean average
> precision at different numbers of recommendations, which had the effect of
> producing a fall-off curve. We took the curve as a measure of how well
> our ranking was working.
>
> On Nov 17, 2013, at 10:47 AM, Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
>
> Hi Pat,
>
> On Nov 13, 2013, at 4:43pm, Pat Ferrel <pat.fer...@gmail.com> wrote:
>
> > Ever done an offline precision calc?
>
> No, sorry.
>
> I do (finally) have one client with some data that could be used to
> calculate precision, and a willingness to pay for the work, so I'm hoping
> to include details on that in my next blog post about text feature
> selection.
>
> -- Ken
>
>
> >> On Nov 13, 2013, at 1:39 PM, Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
> >>
> >> Hi Pat,
> >>
> >>> On Nov 13, 2013, at 9:21am, Pat Ferrel <p...@occamsmachete.com> wrote:
> >>>
> >>> A version is now checked in that uses Mahout 0.9. Haven't tested it on
> a cluster yet, only locally. I have to upgrade my cluster to Hadoop 1.2.1,
> which takes some time.
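[The offline MAP evaluation Pat describes above can be sketched in a few lines. This is a minimal illustration of mean average precision at k, assuming held-out preferences serve as the relevance judgments; the class and method names are made up for the example and are not Mahout or solr-recommender APIs.]

```java
// A minimal sketch of MAP@k for offline evaluation, as described above.
// Class and method names are illustrative only, not Mahout APIs.
import java.util.*;

public class MapAtK {

    // Average precision at k for one user: precision@i summed at each rank i
    // holding a relevant (held-out) item, divided by min(k, |relevant|).
    static double averagePrecision(List<String> recs, Set<String> relevant, int k) {
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < Math.min(k, recs.size()); i++) {
            if (relevant.contains(recs.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        int denom = Math.min(k, relevant.size());
        return denom == 0 ? 0.0 : sum / denom;
    }

    // MAP@k: mean of per-user average precision. Computing this at several
    // values of k gives the fall-off curve mentioned above.
    static double mapAtK(List<List<String>> recsPerUser, List<Set<String>> relevantPerUser, int k) {
        double sum = 0.0;
        for (int u = 0; u < recsPerUser.size(); u++) {
            sum += averagePrecision(recsPerUser.get(u), relevantPerUser.get(u), k);
        }
        return recsPerUser.isEmpty() ? 0.0 : sum / recsPerUser.size();
    }
}
```

[Running with and without an action type (e.g. detail-view) and comparing the curves is the "how predictive is this data" experiment described above.]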
> >>>
> >>> Saw the Strata slides from Ted touting dithering of results, which
> I'll implement.
> >>>
> >>> Ken, did you have anything specific for "And usually I just use Solr
> to generate a candidate list, then I do more specific scoring to find the N
> best from N*4 candidates"?
> >>
> >> If I'm looking for the top N best matches, I'll do a Solr query with
> rows=N*4.
> >>
> >> Then I use all of the data from these potential matches, and calculate
> a more sophisticated similarity score (e.g. adding a weighting based on the
> user's activity level) between my target and these candidates.
> >>
> >> Regards,
> >>
> >> -- Ken
> >>
> >>>
> >>> Was planning to try boosting by something like genre/category in the
> recs query. For instance, in the demo data, each item will soon have a set
> of tags (actually genre names) so these could be a field being queried
> along with the item-item links. The query for recs would then include the
> user history against the item-item links, and the average genre tags
> preferred by the user against item genre tags. This would return recs
> skewed towards the user's genre preference.
> >>>
> >>> Another way this could be used is when showing similar items. You'd
> have the tags for the item being viewed and so could use them to skew
> towards items with similar tags. I think this works but would turn similar
> items from a lookup (they are pre-calculated by Mahout) into another Solr
> query.
> >>>
> >>>
> >>>
> >>> On Nov 8, 2013, at 1:27 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >>>
> >>> Not planning to do anything with weights at present. An ORed query
> with Solr weights should suffice for the time being. There is a good list
> of ways to do this later if it warrants an experiment. Thanks.
> >>>
> >>> Have similar items as input, recommendations from user "likes", and
> just got recs from recently viewed working. Once you have online recs from
> the pre-calculated model, experimenting is super easy.
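[For the result dithering mentioned above: one common formulation re-sorts the ranked list by log(rank) plus Gaussian noise, so top results mostly stay on top while the tail gets shuffled on each request. The exact formula here is an assumption for illustration, not taken from the slides.]

```java
// A sketch of result dithering: re-rank by log(rank) + Gaussian noise.
// The formula choice is an assumption; epsilon = 1.0 means no shuffling,
// larger values shuffle more aggressively.
import java.util.*;

public class Dither {

    static <T> List<T> dither(List<T> ranked, double epsilon, Random rng) {
        double sd = Math.log(epsilon); // noise scale
        int n = ranked.size();
        Integer[] order = new Integer[n];
        double[] key = new double[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
            key[i] = Math.log(i + 1) + rng.nextGaussian() * sd;
        }
        // Lower key = better rank after dithering.
        Arrays.sort(order, Comparator.comparingDouble(i -> key[i]));
        List<T> out = new ArrayList<>(n);
        for (int i : order) out.add(ranked.get(i));
        return out;
    }
}
```

[The payoff is exploration: items just below the page-one cutoff occasionally surface, generating feedback data the recommender would otherwise never collect.]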
The next step will be
> to get more metadata ingested so we can try boosting by context genre, or
> recent genre viewed, which is sort of in line with "more specific scoring
> to find the N best from N*4 candidates". Also want to do what Ted calls
> dithering to vary the choices you see.
> >>>
> >>> On Nov 8, 2013, at 10:10 AM, Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
> >>>
> >>> One other thing I should have mentioned is that if you care about
> setting weights on incoming terms, you can boost them using the ^<value>
> syntax.
> >>>
> >>> E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0..."
> >>>
> >>> If you want to account for weights of terms in the index, it's a bit
> harder. You can do simple boosting by replicating terms, or you can use
> payload-based boosting, or you could code up your own Similarity class that
> takes advantage of side-channel data.
> >>>
> >>> But in my experience the gain from applying weights to terms in the
> index isn't very significant.
> >>>
> >>> And usually I just use Solr to generate a candidate list, then I do more
> specific scoring to find the N best from N*4 candidates.
> >>>
> >>> -- Ken
> >>>
> >>>> On Nov 8, 2013, at 9:54am, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>>>
> >>>> For recommendation work, I suggest that it would be better to simply
> code
> >>>> out an explicit OR query.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <
> kkrugler_li...@transpac.com> wrote:
> >>>>
> >>>>> Hi Pat,
> >>>>>
> >>>>>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pat.fer...@gmail.com> wrote:
> >>>>>>
> >>>>>> Another approach would be to weight the terms in the docs by their
> >>>>> Mahout similarity strength. But that will be for another day.
> >>>>>>
> >>>>>> My current question is whether Lucene looks at word proximity. I
> see the
> >>>>> query syntax supports proximity but I don't see that it is default, so
> >>>>> that's good.
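[The `^<value>` boost syntax above, combined with Ted's explicit OR query, might be assembled like this. A sketch only: the class and method names are hypothetical, and it does no escaping of Solr special characters.]

```java
// Assembling a boosted OR query ("term^weight OR term^weight ...") from
// item ids and per-term boosts, as in the example query above.
// Hypothetical helper; no escaping of Solr special characters.
import java.util.*;

public class BoostedQuery {

    static String orQuery(LinkedHashMap<String, Double> termBoosts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> e : termBoosts.entrySet()) {
            if (sb.length() > 0) sb.append(" OR ");
            sb.append(e.getKey()).append('^').append(e.getValue());
        }
        return sb.toString();
    }
}
```

[A recency-weighted history query is one application: weight recently viewed items higher on the query side, which gets "recency matters most" into the score without relying on term position in the index.]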
> >>>>>
> >>>>> Based on your description of what you do (generate an OR query of N
> terms)
> >>>>> then no, you shouldn't be getting a boost from proximity.
> >>>>>
> >>>>> Note that with edismax you can specify a phrase boost, but it will
> be on
> >>>>> the entire set of terms being searched, so unlikely to come into
> play even
> >>>>> if you were using that.
> >>>>>
> >>>>> -- Ken
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <
> james.d...@ingramcontent.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> To the best of my knowledge, Lucene does not care about the position of a
> >>>>> keyword within a document.
> >>>>>>
> >>>>>> You could bucket the ids into several fields. Then use a dismax
> query
> >>>>> to boost the top-tier ids more than the second, etc.
> >>>>>>
> >>>>>> A more fine-grained approach would probably involve a custom
> Similarity
> >>>>> class that scales the score based on its position in the document.
> If we
> >>>>> did this, it might be simpler to index as 1 single-valued field so
> each id
> >>>>> was position+1 rather than position+100, etc.
> >>>>>>
> >>>>>> James Dyer
> >>>>>> Ingram Content Group
> >>>>>> (615) 213-4311
> >>>>>>
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Pat Ferrel [mailto:pat.fer...@gmail.com]
> >>>>>> Sent: Thursday, November 07, 2013 1:46 PM
> >>>>>> To: u...@mahout.apache.org
> >>>>>> Subject: Re: Solr-recommender for Mahout 0.9
> >>>>>>
> >>>>>> Interesting to think about ordering and adjacentness. The index ids
> are
> >>>>> sorted by Mahout strength so the first id is the most similar to the
> row
> >>>>> key and so forth. But the query is ordered by recency. In both
> cases the
> >>>>> first id is in some sense the most important. Does Solr/Lucene care
> about
> >>>>> closeness to the top of doc for queries or indexed docs? I don't
> recall any
> >>>>> mention of this.
> >>>>>>
> >>>>>> However adjacentness has no meaning in recommendations, though I
> think
> >>>>> it's used in default queries so I may have to account for that.
> >>>>>>
> >>>>>> The object returned is an ordered list of ids. I use only the IDs
> now
> >>>>> but there are cases when the contents are also of interest; shopping
> >>>>> cart/watchlist queries for example.
> >>>>>>
> >>>>>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <
> james.d...@ingramcontent.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> The multivalued field will obey the "positionIncrementGap" value you
> >>>>> specify (default=100). So for querying purposes, those ids will be
> 100
> >>>>> (or whatever you specified) positions apart. So a phrase search for
> >>>>> adjacent ids would not match, unless you set the slop to >=
> >>>>> positionIncrementGap. Other than this, both scenarios index the
> same.
> >>>>>>
> >>>>>> For stored fields, Solr returns an array of values for multivalued
> >>>>> fields, which is convenient when writing a UI.
> >>>>>>
> >>>>>> James Dyer
> >>>>>> Ingram Content Group
> >>>>>> (615) 213-4311
> >>>>>>
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Dominik Hübner [mailto:cont...@dhuebner.com]
> >>>>>> Sent: Thursday, November 07, 2013 11:23 AM
> >>>>>> To: u...@mahout.apache.org
> >>>>>> Subject: Re: Solr-recommender for Mahout 0.9
> >>>>>>
> >>>>>> Does anyone know what the difference is between keeping the ids in a
> >>>>> space delimited string and indexing a multivalued field of ids? I
> recently
> >>>>> tried the latter since ... it felt right, however I am not sure
> which of
> >>>>> both has which advantages.
> >>>>>>
> >>>>>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pat.fer...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> I have dismax (not edismax) but am not using it yet, using the
> default
> >>>>> query, which does use 'AND'. I had much the same thought as I slept
> on it.
> >>>>> Changing to OR is now working much, much better.
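[The positionIncrementGap behavior described above can be modeled with a toy position assigner. This is a simplification of what Lucene actually does (real analyzers control increments per token), but it shows why ids from different values of a multivalued field land >= gap positions apart, so a phrase query cannot match across values unless its slop covers the gap.]

```java
// Toy model of token positions in a multivalued field with a
// positionIncrementGap (Solr default 100). Simplified relative to real
// Lucene analysis; for illustration only.
import java.util.*;

public class PositionGap {

    static Map<String, Integer> positions(List<String> values, int gap) {
        Map<String, Integer> pos = new LinkedHashMap<>();
        int p = -1;
        boolean firstValue = true;
        for (String value : values) {
            String[] tokens = value.trim().split("\\s+");
            for (int i = 0; i < tokens.length; i++) {
                // the first token of a later value jumps by the gap
                p += (i == 0 && !firstValue) ? gap : 1;
                pos.put(tokens[i], p);
            }
            firstValue = false;
        }
        return pos;
    }
}
```

[A space-delimited string of ids in one value, by contrast, produces consecutive positions, which is the main indexing-side difference behind Dominik's question; either way an OR query of ids scores the same.]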
So obvious it almost
> bit
> >>>>> me, not good in this case...
> >>>>>>>
> >>>>>>> With only a trivially small amount of testing I'd say we have a new
> >>>>> recommender on the block.
> >>>>>>>
> >>>>>>> If anyone would like to help eyeball test the thing let me know
> >>>>> off-list. There are a few instructions I'll need to give. And it
> can't
> >>>>> handle much load right now due to intentional design limits.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <
> james.d...@ingramcontent.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>> Pat,
> >>>>>>>
> >>>>>>> Can you give us the query it generates when you enter "vampire
> werewolf
> >>>>> zombie", q/qt/defType?
> >>>>>>>
> >>>>>>> My guess is you're using the default query parser with "q.op=AND", or
> >>>>> you're using dismax/edismax with a high "mm" (min-should-match) value.
> >>>>>>>
> >>>>>>> James Dyer
> >>>>>>> Ingram Content Group
> >>>>>>> (615) 213-4311
> >>>>>>>
> >>>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Pat Ferrel [mailto:pat.fer...@gmail.com]
> >>>>>>> Sent: Wednesday, November 06, 2013 5:53 PM
> >>>>>>> To: s...@apache.org Schelter; u...@mahout.apache.org
> >>>>>>> Subject: Re: Solr-recommender for Mahout 0.9
> >>>>>>>
> >>>>>>> Done,
> >>>>>>>
> >>>>>>> BTW I have the thing running on a demo site but am getting very
> poor
> >>>>> results that I think are related to the Solr setup. I'd appreciate
> any
> >>>>> ideas.
> >>>>>>>
> >>>>>>> The sample data has 27,000 items and something like 4000 users. The
> >>>>> preference data is fairly dense since the users are professional
> reviewers
> >>>>> and the items are videos.
> >>>>>>>
> >>>>>>> 1) The number of item-item similarities that are kept is 100. Is
> this a
> >>>>> good starting point? Ted, do you recall how many you used before?
> >>>>>>> 2) The query is a simple text query made of space delimited video
> id
> >>>>> strings.
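[The q.op=AND diagnosis above matches the symptom exactly: with AND, a doc must contain every queried id, so multi-item histories return nothing, while OR matches any overlap. A toy model (hypothetical names; docs map an item id to its set of similar-item ids):]

```java
// Toy model of q.op=AND vs OR over item-similarity docs: with AND every
// queried id must appear in a doc's similarity list; with OR any one will
// do. Names are hypothetical, for illustration only.
import java.util.*;

public class QueryOp {

    static List<String> match(Map<String, Set<String>> docs, Set<String> query, boolean requireAll) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> doc : docs.entrySet()) {
            boolean matches = requireAll
                ? doc.getValue().containsAll(query)      // AND semantics
                : !Collections.disjoint(doc.getValue(), query); // OR semantics
            if (matches) hits.add(doc.getKey());
        }
        return hits;
    }
}
```

[With a two-item history like {vampire, zombie} and docs that each carry only part of it, AND returns nothing while OR returns both docs; the same effect explains the "vampire werewolf zombie" title search returning no results.]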
These are the same ids as are stored in the item-item
> similarity
> >>>>> docs that Solr indexes.
> >>>>>>>
> >>>>>>> Hit thumbs up on one video and you get several recommendations. Hit
> >>>>> thumbs up on several videos and you get no recs. I'm either using the
> wrong
> >>>>> query type or have it set up to be too restrictive. As I read
> through the
> >>>>> docs, if someone has a suggestion or pointer I'd appreciate it.
> >>>>>>>
> >>>>>>> BTW the same sort of thing happens with title search. Search for
> >>>>> "vampire werewolf zombie" and you get no results; search for "zombie"
> and you get
> >>>>> several.
> >>>>>>>
> >>>>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <s...@apache.org>
> wrote:
> >>>>>>>
> >>>>>>> Hi Pat,
> >>>>>>>
> >>>>>>> can you create issues for 1) and 2)? Then I will try to get this
> into
> >>>>>>> trunk asap.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Sebastian
> >>>>>>>
> >>>>>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
> >>>>>>>> Trying to integrate the Solr-recommender with the latest Mahout
> >>>>> snapshot. The project uses a modified RecommenderJob because it needs
> >>>>> SequenceFile output and to get the location of the
> preparePreferenceMatrix
> >>>>> directory. If #1 and #2 are addressed I can remove the modified
> Mahout code
> >>>>> from the project and rely on the default implementations in Mahout
> 0.9. #3
> >>>>> is a longer term issue related to the creation of a
> CrossRowSimilarityJob.
> >>>>>>>>
> >>>>>>>> I have dropped the modified code from the Solr-recommender
> project and
> >>>>> have a modified build of the current Mahout 0.9 snapshot. If the
> following
> >>>>> changes are made to Mahout I can test and release a Mahout 0.9
> version of
> >>>>> the Solr-recommender.
> >>>>>>>>
> >>>>>>>> 1. Option to change RecommenderJob output format
> >>>>>>>>
> >>>>>>>> Can someone add an option to output a SequenceFile.
I modified the
> >>>>> code to do the following; note the SequenceFileOutputFormat.class as
> the
> >>>>> last parameter, but this should really be determined with an option, I
> think.
> >>>>>>>>
> >>>>>>>> Job aggregateAndRecommend = prepareJob(
> >>>>>>>> new Path(aggregateAndRecommendInput), outputPath,
> >>>>> SequenceFileInputFormat.class,
> >>>>>>>> PartialMultiplyMapper.class, VarLongWritable.class,
> >>>>> PrefAndSimilarityColumnWritable.class,
> >>>>>>>> AggregateAndRecommendReducer.class, VarLongWritable.class,
> >>>>> RecommendedItemsWritable.class,
> >>>>>>>> SequenceFileOutputFormat.class);
> >>>>>>>>
> >>>>>>>> 2. Visibility of preparePreferenceMatrix directory location
> >>>>>>>>
> >>>>>>>> The Solr-recommender needs to find where the RecommenderJob is
> putting
> >>>>> its output.
> >>>>>>>>
> >>>>>>>> Mahout 0.8 RecommenderJob code was:
> >>>>>>>> public static final String DEFAULT_PREPARE_DIR =
> >>>>> "preparePreferenceMatrix";
> >>>>>>>>
> >>>>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
> >>>>> inline in the code:
> >>>>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
> >>>>>>>>
> >>>>>>>> This change to Mahout 0.9 works:
> >>>>>>>> public static final String DEFAULT_PREPARE_DIR =
> >>>>> "preparePreferenceMatrix";
> >>>>>>>> and
> >>>>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
> >>>>>>>>
> >>>>>>>> You could also make this a getter method on the RecommenderJob class
> >>>>> instead of using a public constant.
> >>>>>>>>
> >>>>>>>> 3. Downsampling
> >>>>>>>>
> >>>>>>>> The downsampling for maximum prefs per user has been moved from
> >>>>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob
> uses
> >>>>> matrix math instead of RSJ so it will no longer support downsampling
> until
> >>>>> there is a hypothetical CrossRowSimilarityJob with downsampling in
> it.
> >>>>>
> >>>>> --------------------------
> >>>>> Ken Krugler
> >>>>> +1 530-210-6378
> >>>>> http://www.scaleunlimited.com
> >>>>> custom big data solutions & training
> >>>>> Hadoop, Cascading, Cassandra & Solr