One other thing I should have mentioned is that if you care about setting weights on incoming terms, you can boost them using the ^<value> syntax.
E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0…"

If you want to account for weights of terms in the index, it's a bit harder. You can do simple boosting by replicating terms, or you can use payload-based boosting, or you could code up your own Similarity class that takes advantage of side-channel data.

But in my experience the gain from applying weights to terms in the index isn't very significant. And usually I just use Solr to generate a candidate list, then I do more specific scoring to find the N best from N*4 candidates.

-- Ken

On Nov 8, 2013, at 9:54am, Ted Dunning <ted.dunn...@gmail.com> wrote:

> For recommendation work, I suggest that it would be better to simply code
> out an explicit OR query.
>
> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
>
>> Hi Pat,
>>
>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pat.fer...@gmail.com> wrote:
>>
>>> Another approach would be to weight the terms in the docs by their
>>> Mahout similarity strength. But that will be for another day.
>>>
>>> My current question is whether Lucene looks at word proximity. I see the
>>> query syntax supports proximity but I don’t see that it is default so
>>> that’s good.
>>
>> Based on your description of what you do (generate an OR query of N terms)
>> then no, you shouldn't be getting a boost from proximity.
>>
>> Note that with edismax you can specify a phrase boost, but it will be on
>> the entire set of terms being searched, so unlikely to come into play even
>> if you were using that.
>>
>> -- Ken
>>
>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <james.d...@ingramcontent.com> wrote:
>>>
>>> Best to my knowledge, Lucene does not care about the position of a
>>> keyword within a document.
>>>
>>> You could bucket the ids into several fields. Then use a dismax query
>>> to boost the top-tier ids more than the second, etc.
>>>
>>> A more fine-grained approach would probably involve a custom Similarity
>>> class that scales the score based on its position in the document. If we
>>> did this, it might be simpler to index as 1 single-valued field so each id
>>> was position+1 rather than position+100, etc.
>>>
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>>
>>> -----Original Message-----
>>> From: Pat Ferrel [mailto:pat.fer...@gmail.com]
>>> Sent: Thursday, November 07, 2013 1:46 PM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>
>>> Interesting to think about ordering and adjacentness. The index ids are
>>> sorted by Mahout strength so the first id is the most similar to the row
>>> key and so forth. But the query is ordered by recency. In both cases the
>>> first id is in some sense the most important. Does Solr/Lucene care about
>>> closeness to the top of doc for queries or indexed docs? I don't recall any
>>> mention of this.
>>>
>>> However adjacentness has no meaning in recommendations though I think
>>> it's used in default queries so I may have to account for that.
>>>
>>> The object returned is an ordered list of ids. I use only the IDs now
>>> but there are cases when the contents are also of interest; shopping
>>> cart/watchlist queries for example.
>>>
>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <james.d...@ingramcontent.com> wrote:
>>>
>>> The multivalued field will obey the "positionIncrementGap" value you
>>> specify (default=100). So for querying purposes, those ids will be 100
>>> (or whatever you specified) positions apart. So a phrase search for
>>> adjacent ids would not match, unless you set the slop to >=
>>> positionIncrementGap. Other than this, both scenarios index the same.
>>>
>>> For stored fields, Solr returns an array of values for multivalued
>>> fields, which is convenient when writing a UI.
>>>
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>>
>>> -----Original Message-----
>>> From: Dominik Hübner [mailto:cont...@dhuebner.com]
>>> Sent: Thursday, November 07, 2013 11:23 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>
>>> Does anyone know what the difference is between keeping the ids in a
>>> space delimited string and indexing a multivalued field of ids? I recently
>>> tried the latter since ... it felt right, however I am not sure which of
>>> the two has which advantages.
>>>
>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pat.fer...@gmail.com> wrote:
>>>
>>>> I have dismax (no edismax) but am not using it yet, using the default
>>>> query, which does use 'AND'. I had much the same thought as I slept on it.
>>>> Changing to OR is now working much much better. So obvious it almost bit
>>>> me, not good in this case...
>>>>
>>>> With only a trivially small amount of testing I'd say we have a new
>>>> recommender on the block.
>>>>
>>>> If anyone would like to help eyeball test the thing let me know
>>>> off-list. There are a few instructions I'll need to give. And it can't
>>>> handle much load right now due to intentional design limits.
>>>>
>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <james.d...@ingramcontent.com> wrote:
>>>>
>>>> Pat,
>>>>
>>>> Can you give us the query it generates when you enter "vampire werewolf
>>>> zombie", q/qt/defType?
>>>>
>>>> My guess is you're using the default query parser with "q.op=AND", or
>>>> you're using dismax/edismax with a high "mm" (min-must-match) value.
>>>>
>>>> James Dyer
>>>> Ingram Content Group
>>>> (615) 213-4311
>>>>
>>>> -----Original Message-----
>>>> From: Pat Ferrel [mailto:pat.fer...@gmail.com]
>>>> Sent: Wednesday, November 06, 2013 5:53 PM
>>>> To: s...@apache.org Schelter; user@mahout.apache.org
>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>>
>>>> Done,
>>>>
>>>> BTW I have the thing running on a demo site but am getting very poor
>>>> results that I think are related to the Solr setup. I'd appreciate any
>>>> ideas.
>>>>
>>>> The sample data has 27,000 items and something like 4000 users. The
>>>> preference data is fairly dense since the users are professional reviewers
>>>> and the items videos.
>>>>
>>>> 1) The number of item-item similarities that are kept is 100. Is this a
>>>> good starting point? Ted, do you recall how many you used before?
>>>> 2) The query is a simple text query made of space delimited video id
>>>> strings. These are the same ids as are stored in the item-item similarity
>>>> docs that Solr indexes.
>>>>
>>>> Hit thumbs up on one video and you get several recommendations. Hit
>>>> thumbs up on several videos and you get no recs. I'm either using the wrong
>>>> query type or have it set up to be too restrictive. As I read through the
>>>> docs if someone has a suggestion or pointer I'd appreciate it.
>>>>
>>>> BTW the same sort of thing happens with Title search. Search for
>>>> "vampire werewolf zombie" you get no results, search for "zombie" you get
>>>> several.
>>>>
>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <s...@apache.org> wrote:
>>>>
>>>> Hi Pat,
>>>>
>>>> can you create issues for 1) and 2)? Then I will try to get this into
>>>> trunk asap.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>>>> Trying to integrate the Solr-recommender with the latest Mahout
>>>>> snapshot. The project uses a modified RecommenderJob because it needs
>>>>> SequenceFile output and to get the location of the preparePreferenceMatrix
>>>>> directory.
>>>>> If #1 and #2 are addressed I can remove the modified Mahout code
>>>>> from the project and rely on the default implementations in Mahout 0.9. #3
>>>>> is a longer term issue related to the creation of a CrossRowSimilarityJob.
>>>>>
>>>>> I have dropped the modified code from the Solr-recommender project and
>>>>> have a modified build of the current Mahout 0.9 snapshot. If the following
>>>>> changes are made to Mahout I can test and release a Mahout 0.9 version of
>>>>> the Solr-recommender.
>>>>>
>>>>> 1. Option to change RecommenderJob output format
>>>>>
>>>>> Can someone add an option to output a SequenceFile? I modified the
>>>>> code to do the following, note the SequenceFileOutputFormat.class as the
>>>>> last parameter but this should really be determined with an option I think.
>>>>>
>>>>> Job aggregateAndRecommend = prepareJob(
>>>>>     new Path(aggregateAndRecommendInput), outputPath, SequenceFileInputFormat.class,
>>>>>     PartialMultiplyMapper.class, VarLongWritable.class, PrefAndSimilarityColumnWritable.class,
>>>>>     AggregateAndRecommendReducer.class, VarLongWritable.class, RecommendedItemsWritable.class,
>>>>>     SequenceFileOutputFormat.class);
>>>>>
>>>>> 2. Visibility of preparePreferenceMatrix directory location
>>>>>
>>>>> The Solr-recommender needs to find where the RecommenderJob is putting
>>>>> its output.
>>>>>
>>>>> Mahout 0.8 RecommenderJob code was:
>>>>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>>>>
>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
>>>>> inline in the code:
>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
>>>>>
>>>>> This change to Mahout 0.9 works:
>>>>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>>>>> and
>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>>>>
>>>>> You could also make this a getter method on the RecommenderJob class
>>>>> instead of using a public constant.
>>>>>
>>>>> 3. Downsampling
>>>>>
>>>>> The downsampling for maximum prefs per user has been moved from
>>>>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
>>>>> matrix math instead of RSJ so it will no longer support downsampling until
>>>>> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
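
For anyone who wants to try the explicit OR query with ^<value> boosts described at the top of this message, here is a minimal Java sketch of building the query string. The class name BoostedOrQuery and its build method are made up for illustration; they are not part of Solr, Lucene, or Mahout. It assumes the item ids contain no query-parser special characters, so it does no escaping, and it only produces the string; sending it to Solr is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper (not a Solr or Mahout class): builds an explicit, boosted OR query. */
final class BoostedOrQuery {

    private BoostedOrQuery() { }

    /**
     * Joins each item id with its weight using the term^boost syntax,
     * separating the clauses with OR in the map's iteration order.
     * Assumption: ids contain no query-parser special characters, so nothing is escaped.
     */
    static String build(Map<String, Double> weightedIds) {
        StringJoiner query = new StringJoiner(" OR ");
        for (Map.Entry<String, Double> entry : weightedIds.entrySet()) {
            query.add(entry.getKey() + "^" + entry.getValue());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the clauses come out in the order added.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("the_kings_speech", 1.5);
        weights.put("skyfall", 0.5);
        weights.put("looper", 3.0);
        // Prints: the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0
        System.out.println(build(weights));
    }
}
```

Writing the OR operators out explicitly means the result does not depend on the q.op default that caused the empty result sets earlier in the thread.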