I think the simplest implementation is just to get extra results from the
recommender and rescore after the rough retrieval.  Integrating this into
the actual scoring engine is very hard, since it depends on global
characteristics of the final result.

The same applies to result set clustering.
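
A minimal sketch of that over-fetch-and-rescore step in Java (the method
and the second-pass scorer are my own hypothetical names, not Mahout or
Solr API):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Ask the engine for k*n rough candidates, rescore them with a second
// pass that may look at the whole candidate set (diversity, anti-flood,
// business rules), and keep only the top n for display.
static <T> List<T> rescoreTopN(List<T> roughCandidates, int n,
                               ToDoubleFunction<T> secondPassScore) {
    List<T> out = new ArrayList<>(roughCandidates);
    out.sort(Comparator.comparingDouble(secondPassScore).reversed());
    return out.subList(0, Math.min(n, out.size()));
}

The second pass can simply be a closure over the whole candidate list,
which is what makes global criteria feasible here and hard inside the
scoring engine itself.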



On Wed, May 22, 2013 at 9:51 PM, Johannes Schulte <
johannes.schu...@gmail.com> wrote:

> Yeah, that is what I had in mind as a simple solution. For examining bigger
> result sets I always fear the cost of loading a lot of stored fields;
> that's why I thought including it in the scoring might be cool. It's not
> possible with a plain Collector that maintains a priority queue of docid
> and score, but there should be some smart way to maintain the coordinates
> of the top-ranking items so far (a rough sketch follows below). Anyway,
> that's a different story.
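>
> (A rough sketch of that idea in plain Java, deliberately avoiding the
> Lucene Collector API; the Hit class and feature vectors are hypothetical:)
>
> import java.util.Comparator;
> import java.util.PriorityQueue;
>
> // A top-k collector that also maintains the running feature sum
> // ("coordinate") of the items currently in the queue, so a diversity
> // term could be computed against their mean while collecting.
> class TopKWithCentroid {
>     static final class Hit {
>         final int docId; final double score; final double[] features;
>         Hit(int docId, double score, double[] features) {
>             this.docId = docId; this.score = score; this.features = features;
>         }
>     }
>
>     private final int k;
>     private final double[] sum;
>     private final PriorityQueue<Hit> heap =
>         new PriorityQueue<>(Comparator.comparingDouble(h -> h.score));
>
>     TopKWithCentroid(int k, int dim) { this.k = k; this.sum = new double[dim]; }
>
>     void collect(int docId, double score, double[] features) {
>         // worst item of the current top-k sits at the head of the min-heap
>         if (heap.size() == k && heap.peek().score >= score) return;
>         heap.add(new Hit(docId, score, features));
>         update(features, +1);
>         if (heap.size() > k) update(heap.poll().features, -1);
>     }
>
>     private void update(double[] f, int sign) {
>         for (int i = 0; i < f.length; i++) sum[i] += sign * f[i];
>     }
>
>     // mean coordinate of the current top-ranking items
>     double[] centroid() {
>         double[] c = sum.clone();
>         for (int i = 0; i < c.length; i++) c[i] /= Math.max(1, heap.size());
>         return c;
>     }
> }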
>
> Thanks for the help so far!
>
>
> On Thu, May 23, 2013 at 2:14 AM, Ted Dunning <ted.dunn...@gmail.com>
> wrote:
>
> > Yes, what you are describing with diversification is something that I have
> > called anti-flood. It comes from the fact that we really are optimizing a
> > portfolio of recommendations rather than a batch of independent
> > recommendations. Doing this from first principles is very hard, but there
> > are very simple heuristics that do well in many practical situations. For
> > instance, simply penalizing the score of each item based on how many
> > higher-ranked items it is excessively close to does wonders, as in the
> > sketch below.
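> >
> > (A minimal sketch of that penalty in Java; the names are mine, and dist
> > is whatever item-to-item distance you already have, e.g. tf-idf cosine
> > distance. Items are assumed sorted by descending score:)
> >
> > // Penalize each item by the number of higher-ranked items that are
> > // excessively close to it, then re-sort by the adjusted scores.
> > static double[] antiFlood(double[] score, double[][] dist,
> >                           double minDist, double penalty) {
> >     double[] adjusted = score.clone();
> >     for (int i = 1; i < score.length; i++) {
> >         int crowded = 0;
> >         for (int j = 0; j < i; j++) {
> >             if (dist[i][j] < minDist) {
> >                 crowded++;          // item j outranks i and is too close
> >             }
> >         }
> >         adjusted[i] -= penalty * crowded;
> >     }
> >     return adjusted;
> > }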
> >
> > Sent from my iPhone
> >
> > On May 22, 2013, at 13:04, Johannes Schulte <johannes.schu...@gmail.com>
> > wrote:
> >
> > > Okay, I got it! I have also always used a basic form of dithering, but
> > > we always called it shuffle, since it basically was / is
> > > Collections.shuffle on a bigger list of results and therefore doesn't
> > > take the rank or score into account. Will try that (see the sketch
> > > below for a rank-aware variant).
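> > >
> > > (A minimal Java sketch of a rank-aware variant, mirroring the
> > > log(r)-minus-noise trick shown further down the thread; the generic
> > > method and parameter names are mine:)
> > >
> > > import java.util.ArrayList;
> > > import java.util.Arrays;
> > > import java.util.Comparator;
> > > import java.util.List;
> > > import java.util.Random;
> > >
> > > // Instead of Collections.shuffle, perturb a synthetic score log(rank):
> > > // top items move a little, deep items can jump a long way.
> > > static <T> List<T> dither(List<T> ranked, double epsilon, Random rnd) {
> > >     int n = ranked.size();
> > >     Integer[] order = new Integer[n];
> > >     double[] key = new double[n];
> > >     for (int i = 0; i < n; i++) {
> > >         order[i] = i;
> > >         // exponential noise, mean epsilon; 1 - nextDouble() avoids log(0)
> > >         key[i] = Math.log(i + 1) - epsilon * -Math.log(1 - rnd.nextDouble());
> > >     }
> > >     Arrays.sort(order, Comparator.comparingDouble(i -> key[i]));
> > >     List<T> out = new ArrayList<>(n);
> > >     for (int i : order) out.add(ranked.get(i));
> > >     return out;
> > > }
> > >
> > > Calling dither(results, 0.3, new Random()) then behaves roughly like
> > > the 0.3*rexp(500) simulation below.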
> > >
> > > With diversification I really meant something that guarantees that you
> > > maximize the intra-item distance (dithering often does, but not on
> > > purpose). The theory is, if I remember correctly, that if a user goes
> > > over the list from top to bottom and didn't like the first item (which
> > > is assumed to be true if he looks at the second item; no idea if we
> > > really work that way), it makes sense to take a new shot with something
> > > completely different, and so on for the next items. I think it's also a
> > > topic for search engine results, and I remember something like
> > > "Lipschitz bandits" from my googling, but that is way above what I am
> > > capable of doing. I just recognized that both Amazon and Netflix present
> > > multiple recommendation lists grouped by category, so in a way it's
> > > similar to search engine result clustering.
> > >
> > >
> > >
> > >
> > >
> > > On Wed, May 22, 2013 at 8:30 AM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> > >
> > >> On Tue, May 21, 2013 at 10:34 PM, Johannes Schulte <
> > >> johannes.schu...@gmail.com> wrote:
> > >>
> > >>> Thanks for the list... as a non-native speaker, I have problems
> > >>> understanding the meaning of dithering here.
> > >>
> > >> Sorry about that.  Your English is good enough that I hadn't noticed
> > >> any deficit.
> > >>
> > >> Dithering is constructive mixing of the recommendation results.  The
> > >> idea is to reorder the top results only slightly and the deeper results
> > >> more so.  There are several good effects and one (slightly) bad one.
> > >>
> > >> The good effects are:
> > >>
> > >> a) there is much less of a sharp boundary at the end of the first page
> > >> of results.  This makes the statistics of the recommender work better
> > >> and also helps the recommender not get stuck recommending just the
> > >> things which already appear on the first page.
> > >>
> > >> b) results that are very deep in the list can still be shown
> > >> occasionally.  This means that if the rec engine has even a hint that
> > >> something is good, it has a chance of raising that item's ranking by
> > >> gathering more data.  This is a bit different from (a).
> > >>
> > >> c) (bonus benefit) users like seeing novel things.  Even if they have
> > >> done nothing to change their recommendations, they like seeing that you
> > >> have changed something, so they keep coming back to the recommendation
> > >> page.
> > >>
> > >> The major bad effect is that you are purposely decreasing relevance in
> > >> the short term in order to get more information that will improve
> > >> relevance in the long term.  The improvements dramatically outweigh
> > >> this small problem.
> > >>
> > >>
> > >>> I got the feeling that somewhere between a) and d) there is also
> > >>> diversification of items in the recommendation list, i.e. increasing
> > >>> the distance between the list items according to some metric like
> > >>> tf-idf on item information. I never tried that, but with Lucene/Solr
> > >>> it should be possible to use this information during scoring.
> > >>
> > >> Yes.  But no.
> > >>
> > >> This can be done entirely at the presentation tier.  I often do it by
> > >> defining a score based solely on rank, typically something like log(r).
> > >> I then perturb this synthetic score with small amounts of noise, often
> > >> distributed exponentially with a small mean, and sort the results
> > >> according to the perturbed score.
> > >>
> > >> Here are some simulated results computed using R, with r = 1:500 as
> > >> the original ranks:
> > >>
> > >>> (order((log(r) - runif(500, max=2)))[1:20])
> > >> [1]  1  2  3  6  5  4 14  9  8 10  7 17 11 15 13 22 28 12 20 39
> > >> [1]  1  2  5  3  4  8  6 10  9 16 24 31 20 30 13 18  7 14 36 38
> > >> [1]  3  1  5  2 10  4  8  7 14 21 19 26 29 13 27 15  6 12 33  9
> > >> [1]  1  2  5  3  6 17  4 20 18  7 19  9 25  8 29 21 15 27 28 12
> > >> [1]  1  2  5  3  7  4  8 11  9 15 10  6 33 37 17 27 36 16 34 38
> > >> [1]  1  4  2  5  9  3 14 13 12 17 22 25  7 15 18 36 16  6 20 29
> > >> [1]  1  3  4  7  2  6  5 12 18 17 13 24 27 10  8 20 14 34  9 46
> > >> [1]  3  1  2  6 12  8  7  5  4 19 11 26 10 15 28 35  9 20 42 25
> > >>
> > >> As you can see, the first four results are commonly single digits.
> > >> This comes about because the uniform noise that I have subtracted from
> > >> the log can only make a difference of 2 to the log, which is equivalent
> > >> to changing the rank by a factor of about 7 (e^2).  If we were to use
> > >> different noise distributions, we would get somewhat different kinds of
> > >> perturbation.  For instance, using exponentially distributed noise
> > >> gives mostly tame results with some real surprises:
> > >>
> > >>> (order((log(r) - 0.3*rexp(500)))[1:20])
> > >> [1]  1  2  3  8  4  5  9  6  7 25 14 11 13 24 10 31 34 12 22 21
> > >> [1]  1  2  5  4  3  6  7 12  8 10  9 17 13 11 14 25 64 15 47 19
> > >> [1]  1  2  3  4  5  6  7 10  8  9 11 21 13 12 15 16 14 25 18 33
> > >> [1]  1  2  3 10  4  5  7 14  6  8 13  9 15 25 16 11 20 12 17 54
> > >> [1]  1  3  2  4  7  5  6  8 11 23  9 32 18 10 13 15 12 48 14 19
> > >> [1]  1  3  2  4  5 10 12  6  9  7  8 18 16 17 11 13 25 14 15 19
> > >> [1]  6  1  2  4  3  5  9 11  7 15  8 10 14 12 19 16 13 25 39 18
> > >> [1]  1  2  3  4 30  5  7  6  9  8 16 11 10 15 12 13 37 14 31 23
> > >> [1]  1  2  3  4  9 16  5  6  8  7 10 13 11 17 15 19 12 20 14 26
> > >> [1]  1  2  3 13  5  4  7  6  8 15 12 11  9 10 36 14 24 70 19 16
> > >> [1]   1   2   6   3   5   4  11  22   7   9 250   8  10  15  12  17  13  40  16  14
> > >>
> > >>
> > >>
> > >>
> > >>> Have a nice day
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Wed, May 22, 2013 at 2:30 AM, Ted Dunning <ted.dunn...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> I have so far just used the weights that Solr applies natively.
> > >>>>
> > >>>> In my experience, what makes a recommendation engine work better is,
> > >>>> in order of importance,
> > >>>>
> > >>>> a) dithering so that you gather wider data
> > >>>>
> > >>>> b) using multiple sources of input
> > >>>>
> > >>>> c) returning results quickly and reliably
> > >>>>
> > >>>> d) the actual algorithm or weighting scheme
> > >>>>
> > >>>> If you can cover items a-c in a real business, you are very lucky.
> > >>>> The search engine approach handles (b) and (c) by nature, which
> > >>>> massively improves the likelihood of ever getting to examine (d).
> > >>>>
> > >>>>
> > >>>> On Tue, May 21, 2013 at 1:13 AM, Johannes Schulte <
> > >>>> johannes.schu...@gmail.com> wrote:
> > >>>>
> > >>>>> Thanks! Could you also add how to learn the weights you talked about,
> > >>>>> or at least a hint? Learning weights for search engine query terms
> > >>>>> always sounds like "learning to rank" to me, but that always seemed
> > >>>>> pretty complicated and I never managed to try it out.
> > >>
> >
>
