Re: [Solr Wiki] Update of "SolrFacetingOverview" by JJLarrea

Erik Hatcher Wed, 27 Dec 2006 19:38:13 -0800

JJ:  Fantastic - this is excellent info, and sharing it helps a LOT!


        Erik


On Dec 27, 2006, at 7:25 PM, Apache Wiki wrote:

Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by JJLarrea:
http://wiki.apache.org/solr/SolrFacetingOverview

The comment on the change is:
Added page per 12/8/06 suggestion by Yonik

New page:
= Faceting Overview =
Solr provides a [http://incubator.apache.org/solr/docs/api/org/apache/solr/request/SimpleFacets.html Simple Faceting toolkit] which can be reused by various Request Handlers to include "Facet counts" based on some simple criteria. Both the StandardRequestHandler and the DisMaxRequestHandler currently use these utilities. Detailed descriptions of the parameters used to control faceting can be found (along with several examples) at [SimpleFacetParameters].
This page briefly provides some general background information:

= Facet Indexing =
Faceting is done on __indexed__ rather than __stored__ values. This is because the primary use for faceting is drilldown into a subset of hits resulting from a query, and so the chosen facet value is used to construct a filter query which literally matches that value in the index. For the stock Solr request handlers this is done by adding an `fq=<facet-field>:<quoted facet-value>` parameter and resubmitting the query.
Because faceting fields are often specified to serve two purposes, human-readable text and drill-down query value, they are frequently indexed differently from fields used for searching and sorting:
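
As a rough illustration of the drill-down step described above, the following sketch appends an `fq` filter to an existing parameter list and encodes the result as a query string. The field name `category`, the value `Science Fiction`, and the helper `drill_down_params` are hypothetical; only the `fq=<facet-field>:<quoted facet-value>` shape comes from this page.

```python
from urllib.parse import urlencode


def drill_down_params(params, facet_field, facet_value):
    """Return a copy of the request parameters with one extra fq filter."""
    # Escape backslashes and double quotes so the value stays one quoted phrase.
    escaped = facet_value.replace("\\", "\\\\").replace('"', '\\"')
    return params + [("fq", f'{facet_field}:"{escaped}"')]


# Hypothetical original request: a query faceted on a "category" field.
base = [("q", "robots"), ("facet", "true"), ("facet.field", "category")]

# The user clicks the "Science Fiction" facet value, so the query is resubmitted
# with a filter that literally matches that indexed value.
drilled = drill_down_params(base, "category", "Science Fiction")
query_string = urlencode(drilled)
print(query_string)
```

The original parameter list is left untouched, so each further drill-down can simply add another `fq` entry to the previous result.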
  * They are not tokenized into separate words
  * They are not mapped into lower case
  * Human-readable punctuation is not removed (other than double-quotes)
  * There is often no need to store them, since stored values would look much like indexed values and the faceting mechanism is used for value retrieval.
  * Depending on how the field is defined the SimpleFacets mechanism may only allow for a single value per field per document (see below)
As an example, if I had a field with a list of authors, such as:

  Schildt, Herbert; Wolpert, Lewis; Davies, P.
I might want to index the same data differently in three different fields (perhaps using the Solr [:SchemaXml#Copy Fields:copyField] directive):
  * For searching: Tokenized, case-folded, punctuation-stripped:
      schildt / herbert / wolpert / lewis / davies / p
  * For sorting: Untokenized, case-folded, punctuation-stripped:
      schildt herbert wolpert lewis davies p
  * For faceting: Primary author only, using a `solr.StrField`:
      Schildt, Herbert
Then when the user drills down on the "Schildt, Herbert" string I would reissue the query with an added `fq=<facet-field>:"Schildt, Herbert"` parameter.
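
To make the three indexed forms concrete, here is a small sketch that derives them from the author string in plain Python. It only imitates the effect of the analysis; it is not how Solr analyzers are configured, and the helper `fold` is a made-up name.

```python
import re

raw = "Schildt, Herbert; Wolpert, Lewis; Davies, P."


def fold(text):
    """Lower-case the text and replace punctuation with spaces."""
    return re.sub(r"[^\w\s]", " ", text.lower())


# For searching: tokenized, case-folded, punctuation-stripped.
search_tokens = fold(raw).split()

# For sorting: the same folding, but kept as a single untokenized value.
sort_key = " ".join(search_tokens)

# For faceting: primary author only, kept verbatim (case and comma intact).
facet_value = raw.split(";")[0].strip()

print(search_tokens)
print(sort_key)
print(facet_value)
```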
= Facet Operation =

Currently SimpleFacets has 3 modes of operation:

== FacetQueries ==
Any number of [:SimpleFacetParameters#facet.query:facet.query] parameters can be passed to the request handler. Each distinct facet.query will first be executed against the entire index, with the results cached as a hashed set (if fewer than hashDocSet) or a bit set (if greater) of document IDs (see [:SolrCaching#The hashDocSet Max Size:hashDocSet]). Then every time that facet.query is used for faceting a query, the cached set will be intersected against the set of document ids returned by the query to count the number of documents for which the facet.query condition is true.
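
The caching and intersection steps can be sketched as follows. This is a toy model, not Solr code: the class `FacetQueryCache`, the in-memory `index`, and the use of a Python predicate in place of a parsed query are all assumptions made for illustration, and the choice between a hashed set and a bit set is not modeled.

```python
class FacetQueryCache:
    """Toy model: run each distinct facet query once, then reuse its doc set."""

    def __init__(self, index):
        self.index = index      # doc id -> dict of field values
        self.cache = {}         # facet query string -> frozenset of doc ids
        self.executions = 0     # how many times a query hit the whole index

    def doc_set(self, query, predicate):
        # Execute against the entire index only on first use, then cache.
        if query not in self.cache:
            self.executions += 1
            self.cache[query] = frozenset(
                doc_id for doc_id, doc in self.index.items() if predicate(doc)
            )
        return self.cache[query]

    def count(self, query, predicate, result_ids):
        # Facet count = size of (cached set) intersected with (query results).
        return len(self.doc_set(query, predicate) & result_ids)


# Hypothetical four-document index with a price field.
index = {1: {"price": 5}, 2: {"price": 15}, 3: {"price": 8}, 4: {"price": 30}}
cheap = lambda doc: doc["price"] <= 10

cache = FacetQueryCache(index)
count_a = cache.count("price:[* TO 10]", cheap, {1, 2, 3})  # first search
count_b = cache.count("price:[* TO 10]", cheap, {2, 4})     # a later search
print(count_a, count_b, cache.executions)
```

The second call reuses the cached set, which is why repeated faceting on the same facet.query costs only an intersection.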
== FacetFields ==
Any number of [:SimpleFacetParameters#facet.field:facet.field] parameters can be passed to the request handler. For each facet.field, one of two approaches will be used:
  * Field Queries: If the facet field is defined in the schema as multi-valued, boolean, or tokenized, then every indexed value for the field will be iterated and a facet query will be executed and cached (as described above). This is excellent for fields where there is a small set of distinct values. For example, faceting on a field with U.S. States e.g. `Alabama, Alaska, ... Wyoming` would lead to fifty cached queries which would be used over and over again. It also works in the case when the facet field can have multiple values for each document. However, it requires excessive amounts of memory and time when the number of field values is large and especially when it exceeds the filter cache size defined in [:SolrCaching#filterCache:filterCache].
  * Field Cache: If the facet field is not tokenized, not multi-valued, and not boolean, then a field-cache approach will be used. This is currently implemented with the Lucene [http://lucene.apache.org/java/docs/api/org/apache/lucene/search/FieldCache.html FieldCache] mechanism used for results sorting. An array of integers (one for every document in the index) is allocated, pre-filled with the first indexed value for that field in each document (offset into a table of strings for fields indexed as strings), and cached. Every time that facet.field is used for faceting a query, all the document IDs resulting from the query are looked up in the field cache and any value found has its tally incremented. This is excellent for situations where the number of indexed values for the field is too large to be practical using the field queries mechanism, such as faceting against authors or titles. However it is currently much slower and more memory-intensive than the field query mechanism for fields with a small number of values.
Note at this time there is no way to manually control whether facet.field is handled via field queries or field cache other than defining in the schema whether the field is single- or multi-valued and the analyzer used: `solr.TextField` is always tokenized while `solr.StrField` is never. Control may be improved in the future, along with a means to handle multi-valued fields with a variant of the Field Cache mechanism.
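
The field-cache approach can be pictured with the sketch below. It is a simplified stand-in written for this page, not the Lucene FieldCache API: `build_field_cache` and `facet_counts` are hypothetical helpers, and the index is just a list whose position is the document ID.

```python
def build_field_cache(values):
    """Build one integer per document: an offset into a sorted table of strings.

    values: list where position = doc id and entry = the single indexed
    string for that document, or None when the document has no value.
    """
    terms = sorted({v for v in values if v is not None})
    ordinal = {term: i for i, term in enumerate(terms)}
    ords = [ordinal[v] if v is not None else -1 for v in values]
    return ords, terms


def facet_counts(ords, terms, result_ids):
    """Look up every matching document and increment the tally for its value."""
    tallies = [0] * len(terms)
    for doc_id in result_ids:
        o = ords[doc_id]
        if o >= 0:
            tallies[o] += 1
    return {terms[i]: n for i, n in enumerate(tallies) if n > 0}


# Hypothetical single-valued author facet field for five documents.
authors = ["Schildt, Herbert", "Wolpert, Lewis", "Schildt, Herbert", None, "Davies, P."]
ords, terms = build_field_cache(authors)   # built once, then cached

# Documents 0, 2 and 3 matched the user's query.
counts = facet_counts(ords, terms, [0, 2, 3])
print(counts)
```

In this model the per-request cost depends on the number of matching documents rather than on the number of distinct field values, which is the trade-off the two bullets above describe.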
----
CategoryCategory
