Re: Solr-recommender for Mahout 0.9

Pat Ferrel Thu, 07 Nov 2013 19:32:04 -0800

Another approach would be to weight the terms in the docs by their Mahout 
similarity strength. But that will be for another day.


My current question is whether Lucene looks at word proximity. I see the query 
syntax supports proximity but I don’t see that it is default so that’s good.


On Nov 7, 2013, at 12:41 PM, Dyer, James <james.d...@ingramcontent.com> wrote:

To the best of my knowledge, Lucene does not care about the position of a 
keyword within a document.

You could bucket the ids into several fields.  Then use a dismax query to boost 
the top-tier ids more than the second, etc.

A more fine-grained approach would probably involve a custom Similarity class 
that scales the score based on its position in the document.  If we did this, 
it might be simpler to index as 1 single-valued field so each id was position+1 
rather than position+100, etc.
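
The bucketing idea above could be sketched roughly as follows. This is only an illustration, not code from the Solr-recommender: the field names (ids_tier1, ids_tier2, ...), the tier size, and the linear boost values are all hypothetical. The only Solr-specific piece is the dismax "qf" syntax of space-separated field^boost entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: field names, tier size, and boost values are assumptions.
final class TieredIds {

    // Split an ordered id list (strongest similarity first) into consecutive
    // tiers of at most tierSize ids each.
    static List<List<String>> bucket(List<String> orderedIds, int tierSize) {
        if (tierSize <= 0) {
            throw new IllegalArgumentException("tierSize must be positive");
        }
        List<List<String>> tiers = new ArrayList<>();
        for (int start = 0; start < orderedIds.size(); start += tierSize) {
            int end = Math.min(start + tierSize, orderedIds.size());
            tiers.add(new ArrayList<>(orderedIds.subList(start, end)));
        }
        return tiers;
    }

    // Build a dismax "qf" value in which earlier tiers get larger boosts,
    // using hypothetical field names <fieldPrefix>1, <fieldPrefix>2, ...
    static String buildQf(String fieldPrefix, int tierCount) {
        StringBuilder qf = new StringBuilder();
        for (int tier = 1; tier <= tierCount; tier++) {
            if (tier > 1) {
                qf.append(' ');
            }
            int boost = tierCount - tier + 1; // first tier gets the highest boost
            qf.append(fieldPrefix).append(tier).append('^').append(boost);
        }
        return qf.toString();
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("v17", "v4", "v92", "v8", "v51");
        List<List<String>> tiers = bucket(ids, 2);
        System.out.println(tiers);
        System.out.println(buildQf("ids_tier", tiers.size()));
    }
}
```

Each tier would be indexed into its own field, and the generated string passed as the qf parameter of the dismax request. How steep the boosts should be is something to tune against real data.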

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Pat Ferrel [mailto:pat.fer...@gmail.com] 
Sent: Thursday, November 07, 2013 1:46 PM
To: user@mahout.apache.org
Subject: Re: Solr-recommender for Mahout 0.9

Interesting to think about ordering and adjacentness. The index ids are sorted 
by Mahout strength so the first id is the most similar to the row key and so 
forth. But the query is ordered by recency. In both cases the first id is in 
some sense the most important. Does Solr/Lucene care about closeness to the top 
of doc for queries or indexed docs? I don't recall any mention of this.

However adjacentness has no meaning in recommendations though I think it's used 
in default queries so I may have to account for that.

The object returned is an ordered list of ids. I use only the IDs now but there 
are cases when the contents are also of interest; shopping cart/watchlist 
queries for example.

On Nov 7, 2013, at 10:00 AM, Dyer, James <james.d...@ingramcontent.com> wrote:

The multivalued field will obey the "positionIncrementGap" value you specify 
(default=100).  So for querying purposes, those id's will be 100 (or whatever 
you specified) positions apart.  So a phrase search for adjacent ids would not 
match, unless you set the slop for >= positionIncrementGap.  Other than this, 
both scenarios index the same.
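
To make the slop arithmetic concrete, here is a small model of it. It is not Lucene code. It assumes each id is a single token, that every token advances the position by 1, and that the gap is added once between consecutive values of a multivalued field. The class and method names are made up for the illustration.

```java
// Simplified model of token positions; an assumption, not taken from Lucene.
final class PositionGapModel {

    // Position of the index-th value (0-based) of a multivalued field when
    // every value is exactly one token.
    static int positionOfValue(int index, int positionIncrementGap) {
        return index * (positionIncrementGap + 1);
    }

    // Smallest slop that lets an in-order two-term phrase match terms found
    // at firstPos and secondPos (adjacent terms need slop 0).
    static int requiredSlop(int firstPos, int secondPos) {
        return Math.max(0, secondPos - firstPos - 1);
    }

    public static void main(String[] args) {
        int gap = 100; // the default positionIncrementGap mentioned above

        // Multivalued field: neighbouring ids are separated by the gap.
        int multiValued = requiredSlop(positionOfValue(0, gap), positionOfValue(1, gap));

        // Space-delimited string in one value: neighbouring ids are adjacent.
        int singleString = requiredSlop(0, 1);

        System.out.println("multivalued field needs slop >= " + multiValued);
        System.out.println("single string needs slop >= " + singleString);
    }
}
```

Under these assumptions the multivalued case needs a slop equal to the gap, while the space-delimited case matches as an exact phrase.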

For stored fields, Solr returns an array of values for multivalued fields, 
which is convenient when writing a UI.

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: Dominik Hübner [mailto:cont...@dhuebner.com] 
Sent: Thursday, November 07, 2013 11:23 AM
To: user@mahout.apache.org
Subject: Re: Solr-recommender for Mahout 0.9

Does anyone know what the difference is between keeping the ids in a space 
delimited string and indexing a multivalued field of ids? I recently tried the 
latter since ... it felt right; however, I am not sure what advantages each 
approach has.

On 07 Nov 2013, at 18:18, Pat Ferrel <pat.fer...@gmail.com> wrote:

> I have dismax (no edismax) but am not using it yet, using the default query, 
> which does use 'AND'. I had much the same thought as I slept on it. Changing 
> to OR is now working much much better. So obvious it almost bit me, not good 
> in this case...
> 
> With only a trivially small amount of testing I'd say we have a new 
> recommender on the block.
> 
> If anyone would like to help eyeball test the thing let me know off-list. 
> There are a few instructions I'll need to give. And it can't handle much load 
> right now due to intentional design limits.
> 
> 
> On Nov 7, 2013, at 6:11 AM, Dyer, James <james.d...@ingramcontent.com> wrote:
> 
> Pat,
> 
> Can you give us the query it generates when you enter "vampire werewolf 
> zombie", q/qt/defType ?
> 
> My guess is you're using the default query parser with "q.op=AND" , or, 
> you're using dismax/edismax with a high "mm" (min-must-match) value.
> 
> James Dyer
> Ingram Content Group
> (615) 213-4311
> 
> 
> -----Original Message-----
> From: Pat Ferrel [mailto:pat.fer...@gmail.com] 
> Sent: Wednesday, November 06, 2013 5:53 PM
> To: s...@apache.org Schelter; user@mahout.apache.org
> Subject: Re: Solr-recommender for Mahout 0.9
> 
> Done,
> 
> BTW I have the thing running on a demo site but am getting very poor results 
> that I think are related to the Solr setup. I'd appreciate any ideas.
> 
> The sample data has 27,000 items and something like 4000 users. The 
> preference data is fairly dense since the users are professional reviewers 
> and the items videos.
> 
> 1) The number of item-item similarities that are kept is 100. Is this a good 
> starting point? Ted, do you recall how many you used before?
> 2) The query is a simple text query made of space delimited video id strings. 
> These are the same ids as are stored in the item-item similarity docs that 
> Solr indexes.
> 
> Hit thumbs up on one video and you get several recommendations. Hit thumbs up 
> on several videos you get no recs. I'm either using the wrong query type or 
> have it set up to be too restrictive. As I read through the docs if someone 
> has a suggestion or pointer I'd appreciate it. 
> 
> BTW the same sort of thing happens with Title search. Search for "vampire 
> werewolf zombie" you get no results, search for "zombie" you get several.
> 
> On Nov 6, 2013, at 2:18 PM, Sebastian Schelter <s...@apache.org> wrote:
> 
> Hi Pat,
> 
> can you create issues for 1) and 2) ? Then I will try to get this into
> trunk asap.
> 
> Best,
> Sebastian
> 
> On 06.11.2013 19:13, Pat Ferrel wrote:
>> Trying to integrate the Solr-recommender with the latest Mahout snapshot. 
>> The project uses a modified RecommenderJob because it needs SequenceFile 
>> output and to get the location of the preparePreferenceMatrix directory. If 
>> #1 and #2 are addressed I can remove the modified Mahout code from the 
>> project and rely on the default implementations in Mahout 0.9. #3 is a 
>> longer term issue related to the creation of a CrossRowSimilarityJob. 
>> 
>> I have dropped the modified code from the Solr-recommender project and have 
>> a modified build of the current Mahout 0.9 snapshot. If the following 
>> changes are made to Mahout I can test and release a Mahout 0.9 version of 
>> the Solr-recommender.
>> 
>> 1. Option to change RecommenderJob output format
>> 
>> Can someone add an option to output a SequenceFile? I modified the code to 
>> do the following. Note the SequenceFileOutputFormat.class as the last 
>> parameter, but this should really be determined with an option, I think.
>> 
>>  Job aggregateAndRecommend = prepareJob(
>>          new Path(aggregateAndRecommendInput), outputPath,
>>          SequenceFileInputFormat.class,
>>          PartialMultiplyMapper.class, VarLongWritable.class,
>>          PrefAndSimilarityColumnWritable.class,
>>          AggregateAndRecommendReducer.class, VarLongWritable.class,
>>          RecommendedItemsWritable.class,
>>          SequenceFileOutputFormat.class);
>> 
>> 2. Visibility of preparePreferenceMatrix directory location
>> 
>> The Solr-recommender needs to find where the RecommenderJob is putting its 
>> output. 
>> 
>> Mahout 0.8 RecommenderJob code was:
>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>> 
>> Mahout 0.9 RecommenderJob code just puts "preparePreferenceMatrix" inline in 
>> the code:
>> Path prepPath = getTempPath("preparePreferenceMatrix");
>> 
>> This change to Mahout 0.9 works:
>> public static final String DEFAULT_PREPARE_DIR = "preparePreferenceMatrix";
>> and
>> Path prepPath = getTempPath(DEFAULT_PREPARE_DIR);
>> 
>> You could also make this a getter method on the RecommenderJob Class instead 
>> of using a public constant.
>> 
>> 3. Downsampling
>> 
>> The downsampling for maximum prefs per user has been moved from 
>> PreparePreferenceMatrixJob to RowSimilarityJob. The XRecommenderJob uses 
>> matrix math instead of RSJ so it will no longer support downsampling until 
>> there is a hypothetical CrossRowSimilarityJob with downsampling in it.
>> 
>> 
> 
> 
> 
> 
>
