RE: MultiValued facet behavior question

Bob Sandiford Wed, 22 Jun 2011 06:51:56 -0700

Hi, Bill (and others).

I post this for what it's worth - it's a very specialized solution we wrote 
for a similar requirement, and it may help with yours (and similar ones).

Caveats abound [1]

We're running 3.1.

We wanted to be able to return facets which matched on our actual search, 
rather than all facets from the entire result set.  For example, if a user 
searches for author 'Twain', we present to them a list of facets which match 
'Twain', and exclude facets where 'Twain' is not found.  (Now - we don't tell 
our users that these are 'facet' values - we just present an alpha-sorted list 
of author names with a count of associated documents.)  So, we search our Author 
search field to identify matching documents, get all the facets (i.e. normal 
Solr processing to this point), and then filter that facet set to include only 
those that match our original search.

We added our own extra facet parameter (facet.sirsidynix.filter.facets) to 
instruct Solr when to do this special facet filtering. We modified SimpleFacets 
method getTermCounts right before the final "return counts;" like this:

      // Custom SirsiDynix code.
      if (params.getBool(FacetParams.FACET_SIRSIDYNIX_FILTER_FACETS, false))
      {
          counts = filterCounts(field, counts);
      }
      return counts;

and added a method 'filterCounts()' to the same class.  It runs the search 
against each facet value: for each value it sets up a MemoryIndex instance 
based on our schema, inserts the facet value, and runs our original query 
against that MemoryIndex.  Anything that matches has a score > 0, and those 
are the only ones we keep:

    /**
     * Custom SirsiDynix code:
     * Filters counts down to only those entries that match the original
     * query.  Does this by using lucene's MemoryIndex - a very fast, in-memory,
     * single document index that can have queries run against it.
     * For each string value in count, we create a MemoryIndex and run the
     * original query against it.  Anything with a score > 0 means a 'hit', so
     * the value matches the original query, and we'll retain it.  Score 0 means
     * no hit (i.e. was a facet value that was associated with a document that
     * matched the query, but the facet value itself didn't match the query).
     * @param field name of the field that the facet values came from.
     * @param counts Lucene's list of facet values.
     * @return filtered set, only those matching the original query.
     */
    private NamedList filterCounts(String field, NamedList counts)
    {
        if (!field.endsWith("_facet"))
        {
            return counts;
        }
        // Trim off "_facet"
        String fieldBase = field.substring(0,field.length() - 6);
        // Builds fields to search against.
        // Note that original came from (e.g.) AUTHOR_facet.
        // And, original search would have been for INITIAL_AUTHOR_SRCH_boost
        // as well as SUBSEQUENT_AUTHOR_SRCH_boost (and fuzzy's).  However,
        // we're only searching one string at a time, so we'll shove it into
        // the single-valued INITIAL_xxx fields.  That will be good enough for
        // the Query to be able to correctly evaluate against the document.
        String fieldBoost = "INITIAL_" + fieldBase + "_SRCH_boost";
        String fieldFuzzy = "INITIAL_" + fieldBase + "_SRCH_fuzzy";
        NamedList newCounts = new NamedList();

        IndexSchema schema = searcher.getSchema();
        SchemaField schemaField = schema.getField(fieldBoost);
        FieldType fieldType = schemaField.getType();
        Analyzer fieldAnalyzer = fieldType.getAnalyzer();

        SchemaField schemaFuzzyField = schema.getField(fieldFuzzy);
        FieldType fuzzyFieldType = schemaFuzzyField.getType();
        Analyzer fuzzyFieldAnalyzer = fuzzyFieldType.getAnalyzer();

        for (int i = 0; i < counts.size(); i++)
        {
            String testValue = counts.getName(i);
            MemoryIndex index = new MemoryIndex();
            index.addField(fieldBoost, testValue, fieldAnalyzer);
            index.addField(fieldFuzzy, testValue, fuzzyFieldAnalyzer);
            float score = index.search(rb.getQuery());
            if (score > 0.0f)
            {
                newCounts.add(testValue, counts.getVal(i));
            }
        }

        return newCounts;
    }
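
The filtering step itself can be sketched in plain Java without Lucene or Solr. This is an illustration only, not the code above: the class name, the Map standing in for NamedList, and the containsToken predicate are all hypothetical, and the predicate is a crude stand-in for "run the original query against a MemoryIndex and keep anything with score > 0".

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;

public class FacetFilterSketch {

    // Keeps only the facet entries whose value satisfies the predicate,
    // preserving the original order of the facet list.
    public static Map<String, Integer> filterCounts(Map<String, Integer> counts,
                                                    Predicate<String> matchesQuery) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (matchesQuery.test(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    // Hypothetical stand-in for the MemoryIndex check: true when any
    // whitespace- or comma-separated token equals the term, ignoring case.
    // A real analyzer chain (stemming, synonyms, fuzzy) would behave differently.
    public static Predicate<String> containsToken(String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        return value -> {
            for (String token : value.toLowerCase(Locale.ROOT).split("[\\s,]+")) {
                if (token.equals(wanted)) {
                    return true;
                }
            }
            return false;
        };
    }

    public static void main(String[] args) {
        // Facet values as they might come back for an author search on 'Twain',
        // including a co-author that rode along on a matching document.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("Twain, Mark", 12);
        counts.put("Warner, Charles Dudley", 1);
        counts.put("Twain, Shania", 2);
        System.out.println(filterCounts(counts, containsToken("twain")));
    }
}
```

The point of the sketch is only the shape of the loop: the facet counts are computed as usual, and each value is then tested on its own against the query, so values that merely co-occur on a matching document drop out.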

A bit of explanation of our schema is in order here.

1) We've suffixed all our facet fields with "_facet" - hence that first if 
statement.
2) We have matching 'searchable' and 'facet' fields whose names basically 
differ only in the suffix.  So, we strip off '_facet' and append '_boost' and 
'_fuzzy' (our two field types for searching: one we can apply boosts to, and 
one for fuzzy matching).  (You'll see it's not exactly that, since the code 
also adds an 'INITIAL_' prefix and '_SRCH', but you can hopefully modify your 
version to match your schema.)  Basically the idea is that we can derive the 
field name(s) against which the original search was issued from the facet 
field name.
3) You'll want to read up on the MemoryIndex class to see more about how it 
works, rather than me re-iterating that here.
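
To make the naming convention in points 1) and 2) concrete, here is a small standalone sketch of the name derivation. The class and method names are hypothetical; only the '_facet' suffix and the 'INITIAL_..._SRCH_boost' / 'INITIAL_..._SRCH_fuzzy' patterns come from the code above, and your own schema will likely need different patterns.

```java
public class FacetFieldNames {

    private static final String FACET_SUFFIX = "_facet";

    // Returns the base name (e.g. "AUTHOR" for "AUTHOR_facet"), or null when
    // the field does not follow the "_facet" naming convention.
    public static String baseName(String facetField) {
        if (!facetField.endsWith(FACET_SUFFIX)) {
            return null;
        }
        return facetField.substring(0, facetField.length() - FACET_SUFFIX.length());
    }

    // Name of the boost-style search field derived from the base name.
    public static String boostField(String base) {
        return "INITIAL_" + base + "_SRCH_boost";
    }

    // Name of the fuzzy-style search field derived from the base name.
    public static String fuzzyField(String base) {
        return "INITIAL_" + base + "_SRCH_fuzzy";
    }

    public static void main(String[] args) {
        String base = baseName("AUTHOR_facet");
        System.out.println(boostField(base));
        System.out.println(fuzzyField(base));
    }
}
```

Using the suffix length rather than a literal 6 keeps the trimming correct if the suffix ever changes.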

[1] Caveats
1) We didn't do anything with the date type faceting, or with any ranges.
2) We didn't do anything with Facet prefix handling - it may or may not work if 
you need prefixes.
3) Anything else that facets do that we didn't handle - or at least, didn't 
test :)  As I say, it's a very special case for us, and this is in no way 
intended to be a general solution or fit for 'prime time' submission as a Solr 
enhancement.

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com

> -----Original Message-----
> From: Bill Bell [mailto:billnb...@gmail.com]
> Sent: Wednesday, June 22, 2011 3:49 AM
> To: solr-user@lucene.apache.org
> Subject: Re: MultiValued facet behavior question
> 
> You can type q=cardiology and match on cardiologist. If stemming did
> not
> work you can just add a synonym:
> 
> cardiology,cardiologist
> 
> But that is not the issue. The issue is around multiValue fields and
> facets. You would expect a user
> Who is searching on the multiValued field to match on some values in
> there. For example,
> they type "Cardiologist" and it matches on the value "Cardiologist". So
> it
> matches "in the multiValue field".
> So that part works. Then when I output the facet, I need a different
> behavior than the default. I need
> The facet to only output the value that matches (scored) - NOT ALL
> VALUES
> in the multiValued field.
> 
> I think it makes sense?
> 
> 
> On 6/22/11 1:42 AM, "Michael Kuhlmann" <s...@kuli.org> wrote:
> 
> >Am 22.06.2011 05:37, schrieb Bill Bell:
> >> It can get more complicated. Here is another example:
> >>
> >> q=cardiology&defType=dismax&qf=specialties
> >>
> >>
> >> (Cardiology and cardiologist are stems)...
> >>
> >> But I don't really know which value in Cardiologist match perfectly.
> >>
> >> Again, I only want it to return:
> >>
> >> Cardiologist: 3
> >
> >You would never get "Cardiologist: 3" as the facet result, because if
> >"Cardiologist" would be in your index, it's impossible to find it when
> >searching for "cardiology" (except when you manage to write some
> strange
> >tokenizer that translates "cardiology" to "Cardiologist" on query
> time,
> >including the upper case letter).
> >
> >Facets are always taken from the index, so they usually match exactly
> or
> >never when querying for it.
> >
> >-Kuli
> 
>
