Potential Wiki text on the lifecycle of an analysis component

Benson Margulies Fri, 14 Jun 2013 05:08:17 -0700

I'd like to post some documentation to help other people trying to
deal with thread-safety and lifetime issues in analysis components.


Here is what I think I know. Based on corrections here, I'll post something.

Each Solr core has a schema. By default, Solr creates a schema when it
creates a core. If, however, shared schemas are enabled, then Solr
maintains a map from schema names to schema, and cores that declare
the same schema (via the name attribute in the schema XML file) share
the schema object.
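
As a rough illustration of the sharing rule just described, here is a
minimal sketch in plain Java. SketchSchema and SketchSchemaRegistry are
hypothetical names invented for this example; they are not Solr classes,
and Solr's real bookkeeping may well differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a schema object; not Solr's own schema class.
final class SketchSchema {
    final String name;

    SketchSchema(String name) {
        this.name = name;
    }
}

// Hypothetical registry that models "one schema per core" versus
// "one schema per schema name".
final class SketchSchemaRegistry {
    private final boolean shareSchemas;
    private final Map<String, SketchSchema> byName = new ConcurrentHashMap<>();

    SketchSchemaRegistry(boolean shareSchemas) {
        this.shareSchemas = shareSchemas;
    }

    // Returns the schema for a core that declares the given schema name.
    SketchSchema schemaForCore(String schemaName) {
        if (!shareSchemas) {
            // Default case: every core gets its own schema object.
            return new SketchSchema(schemaName);
        }
        // Shared case: cores declaring the same name get the same object.
        return byName.computeIfAbsent(schemaName, SketchSchema::new);
    }
}
```

The practical consequence is that, in the shared case, anything hanging
off the schema can be reached from more than one core at once.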

The schema declares a set of field types. Each field type is
represented by an object of some class that inherits from
org.apache.solr.schema.FieldType.

This class optionally stores two analyzers: the 'analyzer' for
indexing, and the queryAnalyzer for queries.

If a field type is declared with an <analyzer> element that has no
class name attribute, Solr creates an analyzer of type
org.apache.solr.analysis.TokenizerChain. These objects store a
TokenizerFactory, a list of TokenFilterFactories, and a list of
CharFilterFactories. They deliver, upon request, a java.io.Reader
built from the char filters or a TokenStreamComponents object
containing a new tokenizer and filter set.
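
The split between long-lived factories and short-lived components can be
sketched as follows. This is a simplified model in plain Java, not the
TokenizerChain source: every Sketch* type is hypothetical, and tokens are
reduced to strings to keep the example self-contained.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Hypothetical per-use tokenizer: splits the reader's text on whitespace.
final class SketchTokenizer {
    List<String> tokenize(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        try {
            for (int n = in.read(buf); n != -1; n = in.read(buf)) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> tokens = new ArrayList<>();
        for (String t : sb.toString().trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}

// Hypothetical per-use bundle: one new tokenizer plus one new filter set.
final class SketchComponents {
    final SketchTokenizer tokenizer;
    final List<UnaryOperator<String>> filters;

    SketchComponents(SketchTokenizer tokenizer, List<UnaryOperator<String>> filters) {
        this.tokenizer = tokenizer;
        this.filters = filters;
    }

    // Tokenizes the input, then passes each token through every filter in order.
    List<String> run(Reader in) {
        List<String> out = new ArrayList<>();
        for (String token : tokenizer.tokenize(in)) {
            for (UnaryOperator<String> f : filters) {
                token = f.apply(token);
            }
            out.add(token);
        }
        return out;
    }
}

// Hypothetical chain: holds only factories, which live as long as the field type.
final class SketchChain {
    private final List<UnaryOperator<Reader>> charFilterFactories;
    private final Supplier<SketchTokenizer> tokenizerFactory;
    private final List<Supplier<UnaryOperator<String>>> tokenFilterFactories;

    SketchChain(List<UnaryOperator<Reader>> charFilterFactories,
                Supplier<SketchTokenizer> tokenizerFactory,
                List<Supplier<UnaryOperator<String>>> tokenFilterFactories) {
        this.charFilterFactories = charFilterFactories;
        this.tokenizerFactory = tokenizerFactory;
        this.tokenFilterFactories = tokenFilterFactories;
    }

    // Wraps the raw reader in each char filter, in declaration order.
    Reader initReader(Reader raw) {
        Reader r = raw;
        for (UnaryOperator<Reader> f : charFilterFactories) {
            r = f.apply(r);
        }
        return r;
    }

    // Builds a brand-new tokenizer and filter set on every call.
    SketchComponents createComponents() {
        List<UnaryOperator<String>> filters = new ArrayList<>();
        for (Supplier<UnaryOperator<String>> f : tokenFilterFactories) {
            filters.add(f.get());
        }
        return new SketchComponents(tokenizerFactory.get(), filters);
    }

    // Convenience for the sketch: analyze one string from start to finish.
    List<String> analyze(String text) {
        return createComponents().run(initReader(new StringReader(text)));
    }
}
```

In this model the chain itself holds no per-document state; anything
mutable lives in the objects that createComponents() hands out.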

Solr typically runs in a multi-threaded servlet container, so each
Solr request runs in the container thread that handled the HTTP
request. For an update request, DocInverterPerField will call
Field.tokenStream to get a new token stream. It calls close() on that
stream when it is done (cf. LUCENE-2145, which notes that this only
closes the internal reader). So there is a new set of analysis
components for each field for each request.
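
To make the threading consequence concrete, here is a small sketch of
several request threads sharing one long-lived factory while each one
consumes its own stream. All names are hypothetical and the stream just
counts positions; it models the description above, not Lucene's actual
indexing chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stateful component: unsynchronized mutable state, so one
// instance must never be used by two threads at once.
final class SketchCountingStream {
    private int position = 0;
    private boolean closed = false;

    int next() {
        if (closed) {
            throw new IllegalStateException("stream already closed");
        }
        return ++position;
    }

    void close() {
        closed = true;
    }
}

// Hypothetical long-lived factory, shared by every request thread. Its only
// state is a thread-safe counter.
final class SketchStreamFactory {
    private final AtomicInteger created = new AtomicInteger();

    SketchCountingStream newStream() {
        created.incrementAndGet();
        return new SketchCountingStream();
    }

    int createdCount() {
        return created.get();
    }
}

final class SketchRequestDemo {
    // Simulates concurrent update requests; each obtains its own stream,
    // reads the given number of tokens, and closes the stream when done.
    static List<Integer> run(SketchStreamFactory factory, int requests, int tokens) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                Callable<Integer> task = () -> {
                    SketchCountingStream stream = factory.newStream();
                    try {
                        int last = 0;
                        for (int t = 0; t < tokens; t++) {
                            last = stream.next();
                        }
                        return last;
                    } finally {
                        stream.close();
                    }
                };
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because no stream is shared, every simulated request sees its own
position counter; only the factory has to tolerate concurrent callers.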

For a query, the analysis components are, not too surprisingly,
created by the query parser, since it is the query parser that must
split any relevant strings into their constituent elements.

To summarize, then, here is the typical situation.

The core has a schema. This lives for the lifetime of the core or, in
the shared case, of the core container.

The schema has field types.

Each field type has two analyzers. All of this, so far, has the
lifetime of the schema.

At update time, the analyzer is called upon to create tokenization
components with the lifetime of processing a single document.

At query time, the query analyzer is called upon to create
tokenization components with the lifetime of processing one field of
the query.
