That's what I thought, which is also why the total number of indicators can't
be limited, right?

For the Spark version, should we allow something like an average number of
indicators per item? We will only be supporting LLR there, and as Ted and Ken
point out, that is the interesting thing to limit. It will mean a non-trivial
bit of added processing if specified, obviously.
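
Roughly what I have in mind, as a sketch only (made-up names, not actual
spark-itemsimilarity code): treat the setting as a budget of
avgPerItem * numItems entries and keep the globally strongest LLR scores up to
that budget, so dense rows can keep more than the average and sparse rows
fewer.

object AvgPerItemCap {
  // (itemA, itemB, llrScore) triples from the indicator matrix.
  // Keep the globally strongest scores up to numItems * avgPerItem entries.
  def capByAveragePerItem(indicators: Seq[(Int, Int, Double)],
                          numItems: Int,
                          avgPerItem: Int): Seq[(Int, Int, Double)] = {
    val budget = numItems * avgPerItem
    indicators.sortBy { case (_, _, llr) => -llr }.take(budget)
  }
}

The global ordering is the non-trivial part on a cluster, since it implies a
sort (or at least a top-N selection) over the whole matrix rather than per row.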

On May 27, 2014, at 12:00 PM, Sebastian Schelter <s...@apache.org> wrote:

I have added the threshold merely as a way to increase the performance of
RowSimilarityJob. If a threshold is given, some item pairs don't need to be
looked at. A simple example: if you use cooccurrence count as the similarity
measure and set a threshold of n cooccurrences, then any pair containing an
item with fewer than n interactions can be ignored. IIRC similar techniques
are implemented for cosine and Jaccard.
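
A simplified sketch of that pruning idea (illustration only, not the actual
RowSimilarityJob code):

object ThresholdPruning {
  // With cooccurrence count as the measure and a threshold of n, an item with
  // fewer than n interactions can never reach n cooccurrences with any other
  // item, so every pair containing it can be skipped up front.
  def candidatePairs(interactionCounts: Map[Int, Int],
                     threshold: Int): Seq[(Int, Int)] = {
    val eligible = interactionCounts.collect {
      case (item, count) if count >= threshold => item
    }.toSeq.sorted
    for {
      i <- eligible
      j <- eligible if j > i
    } yield (i, j)
  }
}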

Best,
Sebastian



On 05/27/2014 07:08 PM, Pat Ferrel wrote:
>> 
>> On May 27, 2014, at 8:15 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> 
>> The threshold should not normally be used in the Mahout+Solr deployment
>> style.
> 
> Understood, and that's why an alternative way of specifying a cutoff may be
> a good idea.
> 
>> 
>> This need is better supported by specifying the maximum number of
>> indicators.  This is mathematically equivalent to specifying a fraction of
>> values, but is more meaningful to users since good values for this number
>> are pretty consistent across different uses (50-100 are reasonable values
>> for most needs; larger values are quite plausible).
> 
> I assume you mean 50-100 as the average number per item.
> 
> The total for the entire indicator matrix is what Ken was asking for. But I
> was thinking about use with itemsimilarity, where the user may not know the
> dimensionality, since itemsimilarity assembles the matrix from individual
> prefs. The user probably knows the number of items in their catalog, but the
> indicator matrix dimensionality is arbitrarily smaller.
> 
> Currently the help reads:
> --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem    try to cap the number 
> of similar items per  item to this number  (default: 100)
> 
> If this were actually the average # per item it would do what you describe,
> but it looks like it's a literal cutoff per vector in the code.
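> 
> What I mean by a literal per-vector cutoff, as a hypothetical sketch (not
> the actual MR code):
> 
> object PerItemCap {
>   // Every item keeps at most maxPerItem of its strongest similarities,
>   // regardless of how its scores compare to scores in other rows.
>   def capRow(row: Seq[(Int, Double)], maxPerItem: Int): Seq[(Int, Double)] =
>     row.sortBy { case (_, score) => -score }.take(maxPerItem)
> }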
> 
> A cutoff based on the highest scores in the entire matrix seems to imply a
> sort when the total is larger than the average would allow, and I don't see
> an obvious sort being done in the MR.
> 
> Anyway, it looks like we could do this by:
> 1) total number of values in the matrix (what Ken was asking for). This
> requires that the user know the dimensionality of the indicator matrix to be
> very useful.
> 2) average number per item (what Ted describes). This seems the most
> intuitive and does not require the dimensionality to be known.
> 3) fraction of the values. This might be useful if you are more interested
> in downsampling by score; at least it seems more useful than --threshold as
> it is today, but maybe I'm missing some use cases? Is there really a need
> for a hard score threshold?
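> 
> For 3), a rough sketch of what I mean (hypothetical helper, not existing
> code): turn the fraction into an effective score threshold by taking the
> corresponding quantile of all similarity values.
> 
> object FractionCutoff {
>   // Keep the top keepFraction of all similarity values by returning the
>   // score at that quantile; everything below it gets dropped.
>   def fractionToThreshold(scores: Seq[Double],
>                           keepFraction: Double): Double = {
>     require(keepFraction > 0.0 && keepFraction <= 1.0)
>     val sorted = scores.sorted(Ordering[Double].reverse)
>     val keep = math.max(1, math.ceil(keepFraction * sorted.size).toInt)
>     sorted(keep - 1) // the weakest score that still makes the cut
>   }
> }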
> 
> 
>> 
>> 
>> On Tue, May 27, 2014 at 8:08 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>> 
>>> I was talking with Ken Krugler off list about the Mahout + Solr
>>> recommender and he had an interesting request.
>>> 
>>> When calculating the indicator/item similarity matrix using
>>> ItemSimilarityJob there is a --threshold option. Wouldn't it be better to
>>> have an option that specified the fraction of values kept in the entire
>>> matrix based on their similarity strength? This is very difficult to do
>>> with --threshold. It would be like expressing the threshold as a fraction
>>> of the total number of values rather than as a strength value. Seems like
>>> this would have the effect of tossing the least interesting similarities,
>>> whereas limiting per item (--maxSimilaritiesPerItem) could easily toss
>>> some of the most interesting.
>>> 
>>> At the very least it seems like a better way of expressing the threshold,
>>> doesn’t it?
>> 

