One other thing I should have mentioned is that if you care about setting 
weights on incoming terms, you can boost them using the ^<value> syntax.

E.g. "the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0…"
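For instance, a boosted OR query like that can be assembled from a map of term weights. A minimal sketch (the ids and weights are just placeholders):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class BoostedQuery {

    // Join each term with its ^<value> boost suffix into an OR query.
    static String build(Map<String, Double> termBoosts) {
        StringJoiner sj = new StringJoiner(" OR ");
        for (Map.Entry<String, Double> e : termBoosts.entrySet()) {
            sj.add(e.getKey() + "^" + e.getValue());
        }
        return sj.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("the_kings_speech", 1.5);
        boosts.put("skyfall", 0.5);
        boosts.put("looper", 3.0);
        // prints: the_kings_speech^1.5 OR skyfall^0.5 OR looper^3.0
        System.out.println(build(boosts));
    }
}
```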

If you want to account for weights of terms in the index, it's a bit harder. 
You can do simple boosting by replicating terms, or you can use payload-based 
boosting, or you could code up your own Similarity class that takes advantage 
of side-channel data.
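Term replication is the crudest of those: repeat each id in the indexed field roughly in proportion to its weight, so term frequency does the boosting. A toy sketch (the rounding rule here is my own choice, not anything standard):

```java
public class TermReplication {

    // Repeat an id (space-delimited) proportionally to its rounded weight,
    // with a floor of one copy so the term is never dropped entirely.
    static String replicate(String id, double weight) {
        int copies = Math.max(1, (int) Math.round(weight));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < copies; i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(id);
        }
        return sb.toString();
    }
}
```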

But in my experience the gain from applying weights to terms in the index isn't 
very significant.

And usually I just use Solr to generate a candidate list, then do more specific 
scoring to find the N best from the N*4 candidates.
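That step is just over-fetch then re-rank: pull roughly N*4 candidates from Solr, score them with whatever domain-specific function you have, and keep the top N. A generic sketch:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class Rescore {

    // Re-rank an over-fetched candidate list with a finer scorer, keeping the top n.
    static <T> List<T> topN(List<T> candidates, ToDoubleFunction<T> scorer, int n) {
        List<T> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(scorer).reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }
}
```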

-- Ken

On Nov 8, 2013, at 9:54am, Ted Dunning <ted.dunn...@gmail.com> wrote:

> For recommendation work, I suggest that it would be better to simply code
> out an explicit OR query.
> 
> 
> 
> 
> On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler 
> <kkrugler_li...@transpac.com>wrote:
> 
>> Hi Pat,
>> 
>> On Nov 7, 2013, at 7:30pm, Pat Ferrel <pat.fer...@gmail.com> wrote:
>> 
>>> Another approach would be to weight the terms in the docs by their
>> Mahout similarity strength. But that will be for another day.
>>> 
>>> My current question is whether Lucene looks at word proximity. I see the
>> query syntax supports proximity, but I don't see that it's applied by
>> default, so that's good.
>> 
>> Based on your description of what you do (generate an OR query of N terms)
>> then no, you shouldn't be getting a boost from proximity.
>> 
>> Note that with edismax you can specify a phrase boost, but it will be on
>> the entire set of terms being searched, so unlikely to come into play even
>> if you were using that.
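For reference, the dismax/edismax phrase boost is driven by the pf (phrase fields) and ps (phrase slop) parameters. A sketch of assembling such a request; the field name "title" is just an example, and no URL-encoding is done here:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class EdismaxParams {

    // Flatten request parameters into a query string (no URL-encoding).
    static String queryString(Map<String, String> params) {
        StringJoiner sj = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            sj.add(e.getKey() + "=" + e.getValue());
        }
        return sj.toString();
    }

    public static void main(String[] args) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("defType", "edismax");
        p.put("q", "vampire werewolf zombie");
        p.put("pf", "title"); // boost docs where the whole query appears as a phrase
        p.put("ps", "2");     // slop allowed within that phrase match
        System.out.println(queryString(p));
    }
}
```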
>> 
>> -- Ken
>> 
>> 
>>> 
>>> 
>>> On Nov 7, 2013, at 12:41 PM, Dyer, James <james.d...@ingramcontent.com>
>> wrote:
>>> 
>>> Best to my knowledge, Lucene does not care about the position of a
>> keyword within a document.
>>> 
>>> You could bucket the ids into several fields.  Then use a dismax query
>> to boost the top-tier ids more than the second, etc.
>>> 
>>> A more fine-grained approach would probably involve a custom Similarity
>> class that scales the score based on its position in the document.  If we
>> did this, it might be simpler to index as one single-valued field so each id
>> is at position+1 rather than position+100, etc.
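The single-valued variant is just a join of the ids, sorted by strength, into one field, so token positions run 0, 1, 2, ... instead of jumping by the gap. Sketch:

```java
import java.util.List;

public class AdjacentIds {

    // One space-delimited field keeps ids at consecutive token positions.
    static String joinByStrength(List<String> idsSortedByStrength) {
        return String.join(" ", idsSortedByStrength);
    }
}
```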
>>> 
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>> 
>>> 
>>> -----Original Message-----
>>> From: Pat Ferrel [mailto:pat.fer...@gmail.com]
>>> Sent: Thursday, November 07, 2013 1:46 PM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>> 
>>> Interesting to think about ordering and adjacency. The index ids are
>> sorted by Mahout strength so the first id is the most similar to the row
>> key and so forth. But the query is ordered by recency. In both cases the
>> first id is in some sense the most important. Does Solr/Lucene care about
>> closeness to the top of doc for queries or indexed docs? I don't recall any
>> mention of this.
>>> 
>>> However adjacency has no meaning in recommendations, though I think
>> it's used in default queries, so I may have to account for that.
>>> 
>>> The object returned is an ordered list of ids. I use only the IDs now
>> but there are cases when the contents are also of interest; shopping
>> cart/watchlist queries for example.
>>> 
>>> On Nov 7, 2013, at 10:00 AM, Dyer, James <james.d...@ingramcontent.com>
>> wrote:
>>> 
>>> The multivalued field will obey the "positionIncrementGap" value you
>> specify (default=100).  So for querying purposes, those ids will be 100
>> (or whatever you specified) positions apart.  So a phrase search for
>> adjacent ids would not match, unless you set the slop for >=
>> positionIncrementGap.  Other than this, both scenarios index the same.
>>> 
>>> For stored fields, solr returns an array of values for multivalued
>> fields, which is convenient when writing a UI.
>>> 
>>> James Dyer
>>> Ingram Content Group
>>> (615) 213-4311
>>> 
>>> 
>>> -----Original Message-----
>>> From: Dominik Hübner [mailto:cont...@dhuebner.com]
>>> Sent: Thursday, November 07, 2013 11:23 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Solr-recommender for Mahout 0.9
>>> 
>>> Does anyone know what the difference is between keeping the ids in a
>> space delimited string and indexing a multivalued field of ids? I recently
>> tried the latter since ... it felt right, however I am not sure which of
>> both has which advantages.
>>> 
>>> On 07 Nov 2013, at 18:18, Pat Ferrel <pat.fer...@gmail.com> wrote:
>>> 
>>>> I have dismax (not edismax) but am not using it yet, using the default
>> query, which does use 'AND'. I had much the same thought as I slept on it.
>> Changing to OR is now working much much better. So obvious it almost bit
>> me, not good in this case...
>>>> 
>>>> With only a trivially small amount of testing I'd say we have a new
>> recommender on the block.
>>>> 
>>>> If anyone would like to help eyeball test the thing let me know
>> off-list. There are a few instructions I'll need to give. And it can't
>> handle much load right now due to intentional design limits.
>>>> 
>>>> 
>>>> On Nov 7, 2013, at 6:11 AM, Dyer, James <james.d...@ingramcontent.com>
>> wrote:
>>>> 
>>>> Pat,
>>>> 
>>>> Can you give us the query it generates when you enter "vampire werewolf
>> zombie", along with your q/qt/defType?
>>>> 
>>>> My guess is you're using the default query parser with "q.op=AND" , or,
>> you're using dismax/edismax with a high "mm" (min-must-match) value.
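If that's the cause, rewriting the terms into an explicit OR query sidesteps the AND default. A sketch:

```java
public class ExplicitOr {

    // Turn a space-delimited term list into an explicit OR query so a
    // q.op=AND default (or a high "mm") can't require every term to match.
    static String rewrite(String spaceDelimited) {
        return String.join(" OR ", spaceDelimited.trim().split("\\s+"));
    }
}
```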
>>>> 
>>>> James Dyer
>>>> Ingram Content Group
>>>> (615) 213-4311
>>>> 
>>>> 
>>>> -----Original Message-----
>>>> From: Pat Ferrel [mailto:pat.fer...@gmail.com]
>>>> Sent: Wednesday, November 06, 2013 5:53 PM
>>>> To: s...@apache.org Schelter; user@mahout.apache.org
>>>> Subject: Re: Solr-recommender for Mahout 0.9
>>>> 
>>>> Done,
>>>> 
>>>> BTW I have the thing running on a demo site but am getting very poor
>> results that I think are related to the Solr setup. I'd appreciate any
>> ideas.
>>>> 
>>>> The sample data has 27,000 items and something like 4000 users. The
>> preference data is fairly dense since the users are professional reviewers
>> and the items are videos.
>>>> 
>>>> 1) The number of item-item similarities that are kept is 100. Is this a
>> good starting point? Ted, do you recall how many you used before?
>>>> 2) The query is a simple text query made of space delimited video id
>> strings. These are the same ids as are stored in the item-item similarity
>> docs that Solr indexes.
>>>> 
>>>> Hit thumbs up on one video and you get several recommendations. Hit
>> thumbs up on several videos and you get no recs. I'm either using the wrong
>> query type or have it set up to be too restrictive. I'm reading through the
>> docs, but if someone has a suggestion or pointer I'd appreciate it.
>>>> 
>>>> BTW the same sort of thing happens with Title search. Search for
>> "vampire werewolf zombie" and you get no results; search for "zombie" and
>> you get several.
>>>> 
>>>> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <s...@apache.org> wrote:
>>>> 
>>>> Hi Pat,
>>>> 
>>>> can you create issues for 1) and 2) ? Then I will try to get this into
>>>> trunk asap.
>>>> 
>>>> Best,
>>>> Sebastian
>>>> 
>>>> On 06.11.2013 19:13, Pat Ferrel wrote:
>>>>> Trying to integrate the Solr-recommender with the latest Mahout
>> snapshot. The project uses a modified RecommenderJob because it needs
>> SequenceFile output and to get the location of the preparePreferenceMatrix
>> directory. If #1 and #2 are addressed I can remove the modified Mahout code
>> from the project and rely on the default implementations in Mahout 0.9. #3
>> is a longer term issue related to the creation of a CrossRowSimilarityJob.
>>>>> 
>>>>> I have dropped the modified code from the Solr-recommender project and
>> have a modified build of the current Mahout 0.9 snapshot. If the following
>> changes are made to Mahout I can test and release a Mahout 0.9 version of
>> the Solr-recommender.
>>>>> 
>>>>> 1. Option to change RecommenderJob output format
>>>>> 
>>>>> Can someone add an option to output a SequenceFile? I modified the
>> code to do the following; note the SequenceFileOutputFormat.class as the
>> last parameter, but this should really be determined by an option, I think.
>>>>> 
>>>>> Job aggregateAndRecommend = prepareJob(
>>>>>        new Path(aggregateAndRecommendInput), outputPath,
>> SequenceFileInputFormat.class,
>>>>>        PartialMultiplyMapper.class, VarLongWritable.class,
>> PrefAndSimilarityColumnWritable.class,
>>>>>        AggregateAndRecommendReducer.class, VarLongWritable.class,
>> RecommendedItemsWritable.class,
>>>>>        SequenceFileOutputFormat.class);
>>>>> 
>>>>> 2. Visibility of preparePreferenceMatrix directory location
>>>>> 
>>>>> The Solr-recommender needs to find where the RecommenderJob is putting
>> its output.
>>>>> 
>>>>> Mahout 0.8 RecommenderJob code was:
>>>>> public static final String DEFAULT_PREPARE_DIR =
>> "preparePreferenceMatrix";
>>>>> 
>>>>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix"
>> inline in the code:
>>>>> Path prepPath = getTempPath("preparePreferenceMatrix");
>>>>> 
>>>>> This change to Mahout 0.9 works:
>>>>> public static final String DEFAULT_PREPARE_DIR =
>> "preparePreferenceMatrix";
>>>>> and
>>>>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>>>>> 
>>>>> You could also make this a getter method on the RecommenderJob Class
>> instead of using a public constant.
>>>>> 
>>>>> 3. Downsampling
>>>>> 
>>>>> The downsampling for maximum prefs per user has been moved from
>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses
>> matrix math instead of RSJ so it will no longer support downsampling until
>> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






