Hello all,

This post is getting no replies after several days on the Solr user list, so I 
thought I would rewrite it as a question about a possible feature for Solr.

In our use case we have a large number of documents and several facets, such as 
Author and Subject, that have a very large number of values.  Since we index 
the full text of nearly 10 million books, it is easy for a query to return a 
very large number of hits.

Here is the problem:

If relevance ranking is working well, in theory it doesn't matter how many hits 
the user gets, as long as the best results show up on the first page of results. 
When a particular facet has a large number of values, the general practice is to 
show a relatively small number of them, selected as the values with the highest 
counts in the entire result set.  However, for a very large result set, those 
facet counts are dominated by the many results that are not relevant to the 
query.

As an example, if you search our full-text collection for "jaguar" you get 
170,000 hits.  If I am looking for the car rather than the OS or the animal, I 
might expect to be able to click on a facet and limit my results to the car.  
However, facet values containing the word "car" or "automobile" are not among 
the top 5 values that we show.  If you click on "more" you will see "automobile 
periodicals" but not the rest of the facet values containing the word 
"automobile".  This occurs because the facet counts are computed over all 
170,000 hits: counts from at least 160,000 irrelevant hits are included 
(assuming only the top 10,000 hits are relevant).

What we would like to do is *select* which facet values to show, based on their 
counts in the *most relevant subset* of documents, but display the actual 
counts for the full result set (a small sketch of the selection logic follows 
the list):

1) Get the facet counts for the N most relevant documents (N = 10,000 for 
example).
2) Select the 5 or 30 facet values with the highest counts among those 
relevant documents.
3) Display only those 5 or 30 facet values, but display their counts computed 
against the entire result set.
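
To make the select/display split concrete, here is a minimal Python sketch of 
just the selection logic, assuming the two sets of facet counts are already in 
hand (the function and parameter names are only illustrative):

    def select_facet_values(relevant_counts, full_counts, k=5):
        """Pick the k facet values with the highest counts among the
        most relevant documents, but report each value's count over
        the full result set.

        relevant_counts -- {value: count} over the top-N relevant docs
        full_counts     -- {value: count} over the entire result set
        """
        # Rank values by their counts within the relevant subset...
        top_values = sorted(relevant_counts, key=relevant_counts.get,
                            reverse=True)[:k]
        # ...but display the counts from the full result set.
        return [(v, full_counts.get(v, 0)) for v in top_values]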

It is possible to kludge this up (subject to some scaling considerations) in 
the following way (see the sketch after the list):

1) Consider only the 1,000 most relevant documents for the calculation, so 
N = 1,000.
2) Run your query and get the unique document ids for the N most relevant 
documents (i.e. set rows=N).  Also get the facet values and counts for the top 
M facet values, where M is some very large number, and store those values and 
counts in some data structure.
3) Run a second query, identical to the first, but add a filter query 
restricting it to those 1,000 unique ids; set rows=1 but get the facet counts 
for the top 30 facet values.
4) Grab the top 5 or 30 facet values from this second query.  These are your 
most relevant facet values.
5) For each value on that list, look up the count for the whole result set 
from the facet counts you stored in step 2.
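
Here is a rough Python sketch of the whole workaround against Solr's HTTP 
interface.  The URL, the facet field name "topic", and the constants are just 
placeholders, and a production version would have to deal with the scaling 
considerations mentioned above:

    import requests

    SOLR = "http://localhost:8983/solr/select"  # placeholder URL
    FACET_FIELD = "topic"                       # placeholder facet field
    N = 1000     # most relevant documents to consider
    M = 100000   # "very large" number of facet values to store
    K = 30       # facet values to finally display

    def facet_pairs(resp, field):
        # With wt=json, Solr returns facet counts as a flat list:
        # [value1, count1, value2, count2, ...], sorted by count.
        flat = resp["facet_counts"]["facet_fields"][field]
        return dict(zip(flat[::2], flat[1::2]))

    # Step 2: first query -- ids of the top-N documents, plus facet
    # counts over the whole result set, stored for the final lookup.
    first = requests.get(SOLR, params={
        "q": "jaguar", "rows": N, "fl": "id",
        "facet": "true", "facet.field": FACET_FIELD,
        "facet.limit": M, "wt": "json"}).json()
    full_counts = facet_pairs(first, FACET_FIELD)
    ids = [doc["id"] for doc in first["response"]["docs"]]

    # Step 3: same query again, filtered to the N ids.  (One of the
    # scaling considerations: the default maxBooleanClauses of 1024
    # leaves little headroom at N = 1000.)
    fq = "id:(%s)" % " OR ".join('"%s"' % i for i in ids)
    second = requests.get(SOLR, params={
        "q": "jaguar", "rows": 0,  # only the facet counts are needed
        "fq": fq,
        "facet": "true", "facet.field": FACET_FIELD,
        "facet.limit": K, "wt": "json"}).json()

    # Steps 4 and 5: the values from the filtered query are already
    # ordered by count within the relevant subset; show each one with
    # its count from the full result set.
    for value in facet_pairs(second, FACET_FIELD):
        print(value, full_counts.get(value, 0))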

It would seem that this could be done much more efficiently inside Solr/Lucene: 
instead of fetching the unique ids for the N most relevant documents and 
sending them back to Solr in a filter query, the code already has access to 
bitsets of the internal Lucene doc ids that filter queries use.  Other steps in 
the process could probably be streamlined as well.  (A toy model of the idea 
follows.)
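
Purely as a toy model of that computation, with plain Python sets standing in 
for Lucene bitsets (this is not actual Solr/Lucene code):

    def relevant_facet_values(result_docs, top_n_docs, term_docs, k=5):
        # result_docs -- doc ids of all hits for the query
        # top_n_docs  -- doc ids of the N most relevant hits
        # term_docs   -- {facet value: doc ids containing that value}
        # Score each value by its frequency in the relevant subset...
        scored = {t: len(d & top_n_docs) for t, d in term_docs.items()}
        top = sorted(scored, key=scored.get, reverse=True)[:k]
        # ...and report its count over the whole result set.
        return [(t, len(term_docs[t] & result_docs)) for t in top]

    # Example: "automobiles" wins because it dominates the top hits,
    # even though "animals" has the same count over all six hits.
    hits = {1, 2, 3, 4, 5, 6}
    top3 = {1, 2, 3}
    terms = {"automobiles": {1, 2, 6}, "animals": {4, 5, 6}}
    print(relevant_facet_values(hits, top3, terms, k=1))
    # -> [('automobiles', 3)]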

Is there already some work being done on the faceting code along these lines?
Would it make sense to open a JIRA issue for this?


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

