If you haven’t already, you might want to check out maximal marginal relevance (MMR). The original paper is by Carbonell and Goldstein.
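
In case it helps, here is a minimal sketch of greedy MMR reranking in Python. It assumes you already have a relevance score per document and a pairwise document similarity matrix; the names `mmr_select` and `lam` are made up for illustration and are not part of Solr or of the paper.

```python
from typing import List, Sequence


def mmr_select(relevance: Sequence[float],
               similarity: Sequence[Sequence[float]],
               k: int,
               lam: float = 0.7) -> List[int]:
    """Greedy MMR: at each step pick the remaining document that maximizes
    lam * relevance - (1 - lam) * (max similarity to documents already picked).
    Returns the indices of the picked documents, in pick order."""
    candidates = list(range(len(relevance)))
    selected: List[int] = []
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (invented): docs 0 and 1 are near-duplicates, doc 2 is different.
relevance = [0.9, 0.85, 0.3]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(mmr_select(relevance, similarity, k=2, lam=0.5))
```

With `lam=1.0` this degenerates to plain relevance ranking; lower values trade relevance for diversity.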
On Thu, Sep 27, 2018 at 7:29 PM Joel Bernstein <joels...@gmail.com> wrote:

> Yeah, I think your plan sounds fine.
>
> Do you have a specific use case for diversity of results? I've been
> wondering if diversity of results would provide better perceived relevance.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Thu, Sep 27, 2018 at 1:39 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> dceccarel...@bloomberg.net> wrote:
>
> > Yeah, I think Kmeans might be a way to implement the "top 3 stories that
> > are more distant", but you can also have a more naïve (and faster)
> > strategy like
> > - sending a threshold
> > - scan the documents according to the relevance score
> > - select the top documents that have diversity > threshold.
> >
> > I would allow to define the strategy and select it from the request.
> >
> > From: solr-user@lucene.apache.org At: 09/27/18 18:25:43 To: Diego
> > Ceccarelli (BLOOMBERG/ LONDON ) , solr-user@lucene.apache.org
> > Subject: Re: solr and diversification
> >
> > I've thought about this problem a little bit. What I was considering was
> > using Kmeans clustering to cluster the top 50 docs, then pulling the top
> > scoring doc from each cluster as the top documents. This should be fast
> > and effective at getting diversity.
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > On Thu, Sep 27, 2018 at 1:20 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> > dceccarel...@bloomberg.net> wrote:
> >
> > > Hi,
> > >
> > > I'm considering writing a component for diversifying the results. I
> > > know that diversification can be achieved by using grouping but I'm
> > > thinking about something different and query biased.
> > > The idea is to have something that gets applied after the normal
> > > retrieval and selects the top k most diverse documents based on some
> > > distance metric:
> > >
> > > Example:
> > > imagine that you are asking for 10 rows, and you set diversify.rows=3
> > > diversity.metric=tfidf diversify.field=body
> > >
> > > Solr might retrieve the top 10 rows as usual, extract tfidf vectors
> > > for the bodies and select the top 3 stories that are more distant
> > > according to the cosine similarity.
> > > This would be different from grouping because documents will be
> > > 'collapsed' or not based on the subset of documents retrieved for the
> > > query.
> > > Do you think it would make sense to have it as a component? Any
> > > feedback / idea?
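
To make the threshold strategy from the quoted thread a bit more concrete, here is a rough Python sketch of it outside Solr. It assumes the documents arrive already sorted by relevance, builds tf-idf vectors over the retrieved set only, and takes "diversity" to mean 1 minus the highest cosine similarity to any document already kept. That definition, the whitespace tokenizer, the smoothed idf, and the names `tfidf_vectors`, `cosine` and `diversify` are all assumptions for illustration, not an existing Solr API.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse tf-idf vectors using only the retrieved documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption) so ubiquitous terms keep some weight.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def diversify(docs: List[str], rows: int, threshold: float) -> List[int]:
    """Scan docs in relevance order and keep a doc only if its distance to
    every doc kept so far exceeds `threshold`. Returns kept indices."""
    vectors = tfidf_vectors(docs)
    selected: List[int] = []
    for i, vec in enumerate(vectors):
        if len(selected) == rows:
            break
        # Distance to the closest kept doc; 1.0 when nothing is kept yet.
        distance = min((1.0 - cosine(vec, vectors[j]) for j in selected),
                       default=1.0)
        if distance > threshold:
            selected.append(i)
    return selected


# Invented example bodies, listed in relevance order.
bodies = [
    "solr search engine",
    "solr search engine",
    "kmeans clustering algorithm",
    "solr search server",
]
print(diversify(bodies, rows=3, threshold=0.3))
```

This is a single pass over the retrieved rows with one comparison per kept document, so it is cheap next to clustering, but it can return fewer than `rows` documents when the threshold is strict.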