I really think that a hard limit on the number of indicators is just fine.
The points that I have seen raised regarding this include:

a) this doesn't limit the total size of the indicator matrix.

I agree with this.  It doesn't.  And it shouldn't.  It does limit the size
per item, which is really better for operational use.

b) an average would be better

Why?  The hard limit winds up capping almost all items at exactly the
limit, which means the limit is very nearly the average anyway.
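
To make that concrete, here is a minimal sketch of what a per-item hard
cap does (hypothetical names, not the actual Mahout code):

// Minimal sketch of a per-item hard cap (hypothetical names, not
// Mahout's actual code). Each item keeps only its k highest-scoring
// indicators, so every item with at least k candidates contributes
// exactly k, which is why the cap and the average end up nearly equal.
def capIndicators(itemScores: Map[String, Seq[(String, Double)]],
                  k: Int): Map[String, Seq[(String, Double)]] =
  itemScores.map { case (item, candidates) =>
    item -> candidates.sortBy(-_._2).take(k)
  }

With k = 50, every item that has at least 50 candidate similarities keeps
exactly 50, so the per-item average converges on the cap.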

On Wed, May 28, 2014 at 8:31 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:

> That’s what I thought, also why the total number of indicators is not
> limitable, right?
>
> For the Spark version, should we allow something like an average number of
> indicators per item? We will only be supporting LLR with that, and as Ted
> and Ken point out, that is the interesting thing to limit. It will mean a
> non-trivial bit of added processing if specified, obviously.
>
> On May 27, 2014, at 12:00 PM, Sebastian Schelter <s...@apache.org> wrote:
>
> I have added the threshold merely as a way to increase the performance of
> RowSimilarityJob. If a threshold is given, some item pairs don't need to be
> looked at. A simple example: if you use cooccurrence count as the similarity
> measure and set a threshold of n cooccurrences, then any pair containing
> an item with fewer than n interactions can be ignored. IIRC similar
> techniques are implemented for cosine and Jaccard.
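
A rough sketch of that pruning (hypothetical names, not
RowSimilarityJob's actual code):

def candidatePairs(interactionCounts: Map[String, Int],
                   threshold: Int): Seq[(String, String)] = {
  // An item seen fewer than `threshold` times can never reach a
  // cooccurrence count of `threshold` with any partner, so all of its
  // pairs are skipped before any similarity is computed.
  val eligible = interactionCounts.collect {
    case (item, count) if count >= threshold => item
  }.toSeq.sorted
  for (a <- eligible; b <- eligible if a < b) yield (a, b)
}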
>
> Best,
> Sebastian
>
>
>
> On 05/27/2014 07:08 PM, Pat Ferrel wrote:
> >>
> >> On May 27, 2014, at 8:15 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>
> >> The threshold should not normally be used in the Mahout+Solr deployment
> >> style.
> >
> > Understood and that’s why an alternative way of specifying a cutoff
> > may be a good idea.
> >
> >>
> >> This need is better supported by specifying the maximum number of
> >> indicators.  This is mathematically equivalent to specifying a fraction
> >> of values, but is more meaningful to users since good values for this
> >> number are pretty consistent across different uses (50-100 are
> >> reasonable values for most needs; larger values are quite plausible).
> >
> > Assume you mean 50-100 as the average number per item.
> >
> > The total for the entire indicator matrix is what Ken was asking for.
> > But I was thinking about the use with itemsimilarity, where the user may
> > not know the dimensionality, since itemsimilarity assembles the matrix
> > from individual prefs. The user probably knows the number of items in
> > their catalog, but the indicator matrix dimensionality is arbitrarily
> > smaller.
> >
> > Currently the help reads:
> > --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem    try to cap the
> > number of similar items per item to this number (default: 100)
> >
> > If this were actually the average # per item it would do what you
> > describe, but it looks like it’s a literal cutoff per vector in the code.
> >
> > A cutoff based on the highest scores in the entire matrix seems to
> > imply a sort when the total is larger than the average would allow, and
> > I don’t see an obvious sort being done in the MR.
> >
> > Anyway, it looks like we could do this by:
> > 1) total number of values in the matrix (what Ken was asking for). This
> > requires that the user know the dimensionality of the indicator matrix
> > to be very useful.
> > 2) average number per item (what Ted describes). This seems the most
> > intuitive and does not require the dimensionality to be known.
> > 3) fraction of the values. This might be useful if you are more
> > interested in downsampling by score; at least it seems more useful than
> > --threshold as it is today, but maybe I’m missing some use cases? Is
> > there really a need for a hard score threshold?
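
A sketch of what option 2 might look like for the Spark version
(hypothetical names; the global sortBy is the non-trivial added
processing mentioned earlier):

import org.apache.spark.rdd.RDD

// Sketch of option 2 (hypothetical names): express the budget as an
// average number of indicators per item, then keep the globally
// strongest entries. The full sort is what makes this non-trivial; a
// real job might replace it with a sampled quantile.
def downsampleByAverage(entries: RDD[(String, String, Double)],
                        numItems: Long,
                        avgPerItem: Int): RDD[(String, String, Double)] = {
  val budget = numItems * avgPerItem
  entries.sortBy(_._3, ascending = false)
    .zipWithIndex()
    .filter { case (_, rank) => rank < budget }
    .map(_._1)
}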
> >
> >
> >>
> >>
> >> On Tue, May 27, 2014 at 8:08 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> >>
> >>> I was talking with Ken Krugler off list about the Mahout + Solr
> >>> recommender and he had an interesting request.
> >>>
> >>> When calculating the indicator/item similarity matrix using
> >>> ItemSimilarityJob there is a --threshold option. Wouldn’t it be
> >>> better to have an option that specified the fraction of values kept
> >>> in the entire matrix based on their similarity strength? This is very
> >>> difficult to do with --threshold. It would be like expressing the
> >>> threshold as a fraction of the total number of values rather than as
> >>> a strength value. It seems like this would have the effect of tossing
> >>> the least interesting similarities, where limiting per item
> >>> (--maxSimilaritiesPerItem) could easily toss some of the most
> >>> interesting.
> >>>
> >>> At the very least it seems like a better way of expressing the
> >>> threshold, doesn’t it?
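
A sketch of that fraction-of-values idea (hypothetical names; it
estimates the score cutoff from a sample so the whole matrix never has
to be sorted):

import org.apache.spark.rdd.RDD

// Sketch of the fraction-of-values idea (hypothetical names): estimate
// the score at the desired quantile from a small sample, then keep
// everything above it. "Keep the top 10% of values" becomes a plain
// score threshold without a global sort. Assumes the sample is
// non-empty; fine for a sketch.
def keepTopFraction(entries: RDD[(String, String, Double)],
                    fraction: Double): RDD[(String, String, Double)] = {
  val sample = entries.map(_._3)
    .sample(withReplacement = false, 0.01)
    .collect()
    .sorted
  val idx = math.min((sample.length * (1 - fraction)).toInt,
                     sample.length - 1)
  val cutoff = sample(idx)
  entries.filter(_._3 >= cutoff)
}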
> >>
>
>
>
