Hey Robert,

I agree that Analyzers themselves can be heavy and load large data files and 
so on, but I’m really talking about token stream construction.  The way things 
are set up, we expect the heavy lifting to be done when the Analyzer is 
constructed, and these heavy resources should then be shared between token 
streams (I fixed a bug in the Ukrainian analyzer a while back that was getting 
this wrong, see LUCENE-9930).  So you’d build your Analyzers once, use the 
same instances at query time and at index time, and there’d be no worry about 
reloading large dictionaries on every use.
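
To make that concrete, here’s roughly the pattern I mean (just a sketch, with 
a big stopword set standing in for whatever heavy resource an analyzer needs; 
the class and field names are made up, and the package names are as of recent 
Lucene versions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WordlistLoader;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public final class BigWordlistAnalyzer extends Analyzer {

  private final CharArraySet stopwords; // the shared heavy resource

  public BigWordlistAnalyzer(Path wordlist) throws IOException {
    // All the heavy lifting happens exactly once, at construction time.
    this.stopwords = WordlistLoader.getWordSet(Files.newBufferedReader(wordlist));
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    // Each new token stream just references the shared set; nothing is
    // reloaded per stream, per field, or per document.
    TokenStream sink = new StopFilter(source, stopwords);
    return new TokenStreamComponents(source, sink);
  }
}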

But re-using token streams is different.  The reason we have 
TokenStreamComponents and ReuseStrategies (as I understand it) is not that 
they may have to load large resource files or dictionaries, but that building 
a TokenStream is itself quite a heavy operation, due to AttributeFactories and 
reflection.  My argument is that this is only heavy relative to the cost of 
indexing a single field, and that it only really matters when you have 
documents with lots of small fields in them.  For query building or 
highlighting or MoreLikeThis, the cost of building a small number of token 
streams is tiny compared to all the other heavy lifting and IO going on.  And 
so if we pushed this *TokenStream* reuse into IndexWriter, we wouldn’t need a 
close() method on Analyzer (because the thread locals go away, and we expect 
file resources etc. to be closed once the analyzer has finished building 
itself), and writing delegating or wrapping analyzers becomes much simpler.
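
To illustrate what the current reuse machinery buys us, a toy example (not 
anything definitive; behaviour as I understand the default reuse strategy, 
package names per recent Lucene versions):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ReuseDemo {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new StandardAnalyzer();

    TokenStream first = consume(analyzer, "some text");
    TokenStream second = consume(analyzer, "other text");

    // The same stream object comes back on the same thread, just reset onto
    // the new input, so the AttributeFactory/reflection cost is paid once.
    System.out.println(first == second); // true

    // Needed today because the reused streams live in a CloseableThreadLocal.
    analyzer.close();
  }

  private static TokenStream consume(Analyzer analyzer, String text) throws IOException {
    TokenStream ts = analyzer.tokenStream("field", text);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term);
    }
    ts.end();
    ts.close(); // marks the components reusable rather than destroying them
    return ts;
  }
}

If the reuse moved into IndexWriter, the explicit analyzer.close() above 
would have nothing left to do.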

Does that make more sense?

(I agree on the thread pool stuff, but we need to be careful not to blow up 
users’ systems even if they are implementing anti-patterns!)
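
For reference, the kind of setup Robert describes below looks something like 
this on the Java side (an illustration only; the load balancer’s maxconn 
would be set to the same number):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SearchPool {
  // One worker per CPU core; the load balancer is told the same number so
  // excess requests queue upstream rather than piling up inside the JVM.
  public static ExecutorService create() {
    int cores = Runtime.getRuntime().availableProcessors();
    return Executors.newFixedThreadPool(cores);
  }
}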

> On 8 Jun 2021, at 16:12, Robert Muir <rcm...@gmail.com> wrote:
> 
> Alan: a couple thoughts:
> 
> - Analyzers are not just used for formulating queries, but also may be
>   used by highlighters and other things on document results at query time.
> - Some analyzers may do too-expensive/garbage-creating stuff on
>   construction that you wouldn't want to do at query time.
> - Separately, I think Analyzer being closeable makes sense. Users still
>   need to carefully consider the lifecycle of this thing for performance,
>   and may want to release their own resources for some reason (close() is
>   a non-final method today).
> - Analyzers might require large amounts of resources (parsing files/lists,
>   ML models, who knows what).
> - For the built-in minimal resources that we ship, we try to make
>   construction cheap, use static holder classes, and so on, though I'm
>   concerned some of these are still costly. But I'm definitely worried
>   about larger files and stuff that many users might use.
> 
> I feel like some of this "large threadpool" stuff is just a java
> antipattern for search. I configure servers with fixed threadpools
> matching the number of CPU cores, and tell my load balancer about that
> number (e.g. haproxy maxconn), so that it can effectively queue and
> not overload search servers.
> 
> On Tue, Jun 8, 2021 at 10:23 AM Alan Woodward <romseyg...@gmail.com> wrote:
>> 
>> Hi all,
>> 
>> I’ve been on holiday and away from a keyboard for a week, which of course 
>> means I spent my time thinking about Lucene Analyzers and specifically 
>> their ReuseStrategies…
>> 
>> Building a TokenStream can be quite a heavy operation, and so we try to 
>> reuse already-constructed token streams as much as possible.  This is 
>> particularly important at index time, as having to create lots and lots of 
>> very short-lived token streams for documents with many short text fields 
>> could mean that we spend longer building these objects than we do pulling 
>> data from them.  To help support this, lucene Analyzers have a 
>> ReuseStrategy, which defaults to storing a map of fields to token streams in 
>> a ThreadLocal object.  Because ThreadLocals can behave badly when it comes 
>> to containers that have large thread pools, we use a special 
>> CloseableThreadLocal class that can null out its contents once the Analyzer 
>> is done with, and this leads to Analyzer itself being Closeable.  This makes 
>> extending analyzers more complicated, as delegating wrappers need to ensure 
>> that they don’t end up sharing token streams with their delegates.
>> 
>> It’s common to use the same analyzer for indexing and for parsing user 
>> queries.  At query time, reusing token streams is a lot less important - the 
>> amount of time spent building the query is typically much lower than the 
>> amount of time spent rewriting and executing it.  The fact that this re-use 
>> is only really useful for index time and that the lifecycle of the analyzer 
>> is therefore very closely tied to the lifecycle of its associated 
>> IndexWriter makes me think we should consider moving the re-use 
>> strategies into IndexWriter itself.  One option would be to have token 
>> streams be constructed once per DocumentsWriterPerThread, which would lose 
>> some re-use but would mean we could avoid ThreadLocals entirely.
>> 
>> Any thoughts?