Yeah, I think your plan sounds fine. Do you have a specific use case for diversity of results? I've been wondering if diversity of results would provide better perceived relevance.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Sep 27, 2018 at 1:39 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <dceccarel...@bloomberg.net> wrote:

> Yeah, I think Kmeans might be a way to implement the "top 3 stories that
> are more distant", but you can also have a more naïve (and faster) strategy
> like
> - sending a threshold
> - scan the documents according to the relevance score
> - select the top documents that have diversity > threshold.
>
> I would allow to define the strategy and select it from the request.
>
> From: solr-user@lucene.apache.org At: 09/27/18 18:25:43
> To: Diego Ceccarelli (BLOOMBERG/ LONDON), solr-user@lucene.apache.org
> Subject: Re: solr and diversification
>
> I've thought about this problem a little bit. What I was considering was
> using Kmeans clustering to cluster the top 50 docs, then pulling the top
> scoring doc from each cluster as the top documents. This should be fast and
> effective at getting diversity.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Sep 27, 2018 at 1:20 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
> dceccarel...@bloomberg.net> wrote:
>
> > Hi,
> >
> > I'm considering writing a component for diversifying the results. I know
> > that diversification can be achieved by using grouping but I'm thinking
> > about something different and query biased.
> > The idea is to have something that gets applied after the normal retrieval
> > and selects the top k documents more diverse based on some distance metric:
> >
> > Example:
> > imagine that you are asking for 10 rows, and you set diversify.rows=3
> > diversity.metric=tfidf diversify.field=body
> >
> > Solr might retrieve the top 10 rows as usual, extract tfidf vectors
> > for the bodies and select the top 3 stories that are more distant according
> > to the cosine similarity.
> > This would be different from grouping because documents will be
> > 'collapsed' or not based on the subset of documents retrieved for the
> > query.
> >
> > Do you think it would make sense to have it as a component? Any feedback
> > / ideas?
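
For reference, the naive threshold strategy from the thread can be sketched in a few lines of Python. This is only an illustration of the idea, not Solr code: the function names (tfidf_vectors, cosine_similarity, diversify), the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for the example, and "diversity" is taken to mean cosine distance (1 - cosine similarity) to every document already selected.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (term -> weight) over the retrieved texts only.

    Assumption: whitespace tokenization and a smoothed IDF; a real component
    would reuse the index's own analysis chain and statistics.
    """
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def diversify(ranked_texts, rows, threshold):
    """Scan documents in relevance order and keep up to `rows` of them.

    A document is kept only if its cosine distance to every document already
    kept is greater than `threshold`. Returns indices into `ranked_texts`.
    """
    vectors = tfidf_vectors(ranked_texts)
    selected = []
    for i, vec in enumerate(vectors):
        if all(1.0 - cosine_similarity(vec, vectors[j]) > threshold
               for j in selected):
            selected.append(i)
        if len(selected) == rows:
            break
    return selected
```

Here ranked_texts stands for the body field of the rows already retrieved, in score order, and rows plays the role of the proposed diversify.rows. The scan is a single pass, so the top-scoring document is always kept and the result stays in relevance order; the trade-off against the Kmeans approach is that fewer than rows documents may come back if the threshold is too strict.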