I've thought about this problem a little. One option I was considering is
to run k-means clustering over the top 50 docs and then take the
top-scoring doc from each cluster as the final results. This should be fast
and effective at producing diversity.
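
Here is a rough sketch of the idea in plain Python rather than Solr code.
It is only an illustration: the function name, the (doc_id, score, vector)
tuple layout, and the choice to seed the centroids with the k best-scoring
docs are my own assumptions, not an existing Solr API. It uses ordinary
Euclidean k-means (Lloyd's algorithm) and assumes the vectors are something
like length-normalised tf-idf vectors for the field being diversified.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_diversify(docs, k, iterations=10):
    """Cluster docs into k groups and return the best-scoring doc_id per group.

    docs is a list of (doc_id, score, vector) tuples (a hypothetical layout).
    The result is ordered by descending score.
    """
    if not docs or k <= 0:
        return []
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    k = min(k, len(ranked))
    # Assumption: seed with the k best-scoring docs so the run is deterministic.
    centroids = [list(d[2]) for d in ranked[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each doc goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(d[2], centroids[c]))
            for d in ranked
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [d[2] for d, a in zip(ranked, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # ranked is sorted by score, so the first doc seen in a cluster is its best.
    picked = {}
    for d, a in zip(ranked, assignment):
        picked.setdefault(a, d[0])
    return list(picked.values())


docs = [
    ("a", 9.0, [1.0, 0.0]),
    ("b", 8.0, [0.9, 0.1]),  # near-duplicate of "a"
    ("c", 7.0, [0.0, 1.0]),
    ("d", 6.0, [0.1, 0.9]),  # near-duplicate of "c"
]
print(kmeans_diversify(docs, 2))  # ['a', 'c']
```

With 50 docs and a handful of iterations the cost is small compared to the
retrieval itself. Note that a cluster can end up empty, in which case fewer
than k docs come back.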


Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Sep 27, 2018 at 1:20 PM Diego Ceccarelli (BLOOMBERG/ LONDON) <
dceccarel...@bloomberg.net> wrote:

> Hi,
>
> I'm considering writing a component for diversifying the results. I know
> that diversification can be achieved by using grouping, but I'm thinking
> about something different and query-biased.
> The idea is to have something that gets applied after the normal retrieval
> and selects the k most diverse documents based on some distance metric:
>
> Example:
> Imagine that you are asking for 10 rows, and you set diversify.rows=3
> diversify.metric=tfidf  diversify.field=body
>
> Solr might retrieve the top 10 rows as usual, extract tfidf vectors
> for the bodies and select the 3 stories that are most distant from each
> other according to cosine similarity.
> This would be different from grouping because documents will be
> 'collapsed' or not based on the subset of documents retrieved for the
> query.
> Do you think it would make sense to have it as a component? Any feedback
> or ideas?
>
>
>
