Thank you, Jack, for the suggestion. We can try grouping by site. But since there are only about 1000 sites against an index of 5 million docs, most of the hits would be hidden, and for certain specific keywords only a handful of actual results could be displayed if results are grouped by site.
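For reference, here is a rough sketch of the kind of grouped request under discussion. It only builds the query string; the `/select` handler path, the helper name `build_grouped_query`, and the sample keywords are assumptions for illustration, not our actual setup. Raising `group.limit` above 1 is one way to let more than one document per site through, though it would not remove the limitation described above.

```python
from urllib.parse import urlencode


def build_grouped_query(keywords, per_site=1, rows=10):
    """Build Solr request parameters that collapse results on the 'site' field.

    Hypothetical helper: only the Solr parameter names come from the
    Result Grouping documentation; everything else is illustrative.
    """
    params = {
        "q": keywords,
        "rows": rows,             # with grouping on, rows counts groups (sites), not docs
        "group": "true",          # enable result grouping (field collapsing)
        "group.field": "site",    # collapse on the site field from our schema
        "group.limit": per_site,  # number of documents returned per site
        "group.ngroups": "true",  # also report how many distinct sites matched
    }
    return urlencode(params)


# Example: up to 3 documents per site for a made-up keyword query.
query_string = build_grouped_query("solar panels", per_site=3)
print("/select?" + query_string)
```

With roughly 1000 sites, a query can return at most about 1000 groups however many of the 5 million docs match, so `group.limit` is the only knob here that controls how many hits per site remain visible.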
We already group on a signature field to identify duplicate content in these 5 million+ docs, but here the duplicates are only about 3-5% at most. Is there any workaround for these limitations of grouping?

Thanks
Shyam

On Thu, Sep 5, 2013 at 9:16 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> The grouping (field collapsing) feature somewhat addresses this - group by
> a "site" field and then if more than one or a few top pages are from the
> same site they get grouped or collapsed so that you can see more sites in a
> few results.
>
> See:
> http://wiki.apache.org/solr/FieldCollapsing
> https://cwiki.apache.org/confluence/display/solr/Result+Grouping
>
> -- Jack Krupansky
>
> -----Original Message----- From: Sai Gadde
> Sent: Thursday, September 05, 2013 2:27 AM
> To: solr-user@lucene.apache.org
> Subject: Tweaking boosts for more search results variety
>
>
> Our index is aggregated content from various sites on the web. We want good
> user experience by showing multiple sites in the search results. In our
> setup we are seeing most of the results from same site on the top.
>
> Here is some information regarding queries and schema
> site - String field. We have about 1000 sites in index
> sitetype - String field. we have 3 site types
> omitNorms="true" for both the fields
>
> Doc count varies largely based on site and sitetype by a factor of 10 -
> 1000 times
> Total index size is about 5 million docs.
> Solr Version: 4.0
>
> In our queries we have a fixed and preferential boost for certain sites.
> sitetype has different and fixed boosts for 3 possible values. We turned
> off Inverse Document Frequency (IDF) for these boosts to work properly.
> Other text fields are boosted based on search keywords only.
>
> With this setup we often see a bunch of hits from a single site followed by
> next etc.,
> Is there any solution to see results from variety of sites and still keep
> the preferential boosts in place?