Yes, I'm using the term "Analyzer" in a generic sense; I'm also concerned about TokenStream init costs, garbage, etc.
There are a ton of uses here other than IndexWriter: AnalyzingSuggesters building FSTs, etc. I don't think we need to try to add even more complexity because of users implementing these anti-patterns on their end. This problem of using hundreds or thousands of threads is a uniquely "Java developer" problem; I don't see these issues with applications written in other programming languages. We really can't shield them from it. If we stop reusing, they will just get different _symptoms_ (slow performance, GC issues, etc.), but the underlying problem is using all those unnecessary threads. That's what should get fixed.

On Wed, Jun 9, 2021 at 9:03 AM Alan Woodward <romseyg...@gmail.com> wrote:
>
> Hey Robert,
>
> Analyzers themselves can be heavy and load large data files, etc., I agree, but I'm really talking about token stream construction. The way things are set up, we expect the heavy lifting to be done when the Analyzer is constructed, but these heavy resources should then be shared between token streams (I fixed a bug in the Ukrainian analyzer a while back that was getting this wrong; see LUCENE-9930). So you'd build your Analyzers once and use the same instances at query and at index time, and there's no worry about reloading large dictionaries on every use.
>
> But re-using token streams is different. The reason we have TokenStreamComponents and ReuseStrategies (as I understand it) is not that they may have to load large resource files or dictionaries or whatever; it's that building a TokenStream is itself quite a heavy operation, due to AttributeFactories and reflection. My argument is that this is only heavy relative to the cost of indexing a single field, and that it only really matters when you have documents with lots of small fields in them. For query building or highlighting or MoreLikeThis, the cost of building a small number of token streams is tiny compared to all the other heavy lifting and IO going on.
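The pattern Alan describes (heavy resources loaded once at Analyzer construction, then merely borrowed by each token stream) can be sketched with a simplified, JDK-only model. The class names here are illustrative, not Lucene's real API:

```java
// Simplified, hypothetical model of the resource-sharing pattern: the
// expensive data is built exactly once, in the analyzer's constructor,
// while each token stream is a cheap wrapper borrowing a reference to it.
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class SimpleAnalyzer {
    // Stand-in for a large dictionary parsed from disk; counts how often
    // the "heavy lifting" actually happens.
    static final AtomicInteger DICTIONARY_LOADS = new AtomicInteger();
    final Map<String, String> dictionary;

    SimpleAnalyzer() {
        DICTIONARY_LOADS.incrementAndGet();   // heavy lifting happens here, once
        dictionary = Map.of("foo", "FOO", "bar", "BAR");
    }

    // Cheap per-call "token stream": splits on whitespace and applies the
    // shared dictionary, without reloading anything.
    String[] tokenStream(String text) {
        String[] tokens = text.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = dictionary.getOrDefault(tokens[i], tokens[i]);
        }
        return tokens;
    }
}
```

Under this model, analyzing a thousand fields still loads the dictionary only once; the per-field cost is just the lightweight wrapper, which is the part the ReuseStrategy machinery tries to amortize further.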
> And so if we pushed this *TokenStream* reuse into IndexWriter, we wouldn't have to have a close() method on Analyzer (because the thread locals go away, and we expect file resources etc. to be closed once the analyzer has finished building itself), and delegating or wrapping analyzers becomes much simpler.
>
> Does that make more sense?
>
> (I agree on the thread pool stuff, but we need to be careful about not blowing up users' systems even if they are implementing anti-patterns!)
>
> > On 8 Jun 2021, at 16:12, Robert Muir <rcm...@gmail.com> wrote:
> >
> > Alan: a couple of thoughts:
> >
> > Analyzers are not just used for formulating queries; they may also be used by highlighters and other things on document results at query time. Some analyzers may do too-expensive, garbage-creating stuff on construction that you wouldn't want to do at query time.
> >
> > Separately, I think Analyzer being Closeable makes sense. Users still need to carefully consider the lifecycle of this thing for performance, and may want to release their own resources for some reason (close() is a non-final method today). Analyzers might require large amounts of resources (such as parsing files/lists, ML models, who knows what). For the built-in minimal resources that we ship, we try to make construction cheap, use static holder classes, and so on. I'm concerned some of these are costly, but I'm definitely worried about larger files and stuff that many users might use.
> >
> > I feel like some of this "large threadpool" stuff is just a Java antipattern for search. I configure servers with fixed threadpools matching the number of CPU cores, and tell my load balancer about that number (e.g. haproxy maxconn), so that it can effectively queue and not overload search servers.
> >
> > On Tue, Jun 8, 2021 at 10:23 AM Alan Woodward <romseyg...@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I've been on holiday and away from a keyboard for a week, so that means I of course spent my time thinking about Lucene Analyzers and specifically their ReuseStrategies…
> >>
> >> Building a TokenStream can be quite a heavy operation, and so we try to reuse already-constructed token streams as much as possible. This is particularly important at index time, as having to create lots and lots of very short-lived token streams for documents with many short text fields could mean that we spend longer building these objects than we do pulling data from them. To help support this, Lucene Analyzers have a ReuseStrategy, which defaults to storing a map of fields to token streams in a ThreadLocal object. Because ThreadLocals can behave badly in containers that have large thread pools, we use a special CloseableThreadLocal class that can null out its contents once the Analyzer is done with, and this leads to Analyzer itself being Closeable. This makes extending analyzers more complicated, as delegating wrappers need to ensure that they don't end up sharing token streams with their delegates.
> >>
> >> It's common to use the same analyzer for indexing and for parsing user queries. At query time, reusing token streams is a lot less important: the amount of time spent building the query is typically much lower than the amount of time spent rewriting and executing it. The fact that this re-use is only really useful at index time, and that the lifecycle of the analyzer is therefore very closely tied to the lifecycle of its associated IndexWriter, makes me think that we should consider moving the re-use strategies into IndexWriter itself.
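The CloseableThreadLocal idea described above can be sketched in a simplified, JDK-only form. This is an illustrative model, not Lucene's actual implementation (which additionally uses WeakReferences and periodic purging to let dead threads' values be collected):

```java
// Hypothetical sketch of a ThreadLocal whose cached values can all be
// dropped when the owner is closed, so per-thread token streams don't
// linger in the threads of a large pool after the Analyzer is done with.
import java.io.Closeable;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

class SimpleCloseableThreadLocal<T> implements Closeable {
    private ThreadLocal<T> local = new ThreadLocal<>();
    // Tracks every thread's value so close() can reach objects cached in
    // other threads; weak keys let entries vanish with their threads.
    private final Map<Thread, T> values =
        Collections.synchronizedMap(new WeakHashMap<>());

    T get() { return local.get(); }

    void set(T value) {
        local.set(value);
        values.put(Thread.currentThread(), value);
    }

    @Override
    public void close() {
        values.clear();   // drop per-thread cached objects immediately
        local = null;     // further use is an error, as with a closed Analyzer
    }
}
```

An Analyzer would keep its per-thread map of field name to cached token stream inside such a holder, which is exactly why Analyzer itself ends up being Closeable and why wrapping analyzers gets awkward.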
> >> One option would be to have token streams be constructed once per DocumentsWriterPerThread, which would lose some re-use but would mean we could avoid ThreadLocals entirely.
> >>
> >> Any thoughts?
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
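Alan's per-DocumentsWriterPerThread suggestion might look roughly like the following; this is a hypothetical sketch of the scoping idea, not a worked-out design, and `PerWriterCache` is an invented name:

```java
// Instead of ThreadLocal caching inside the Analyzer, each indexing
// thread's writer (the DocumentsWriterPerThread in Lucene) would own its
// own field -> token-stream cache. Since a DWPT is used by one thread at
// a time, no ThreadLocal or synchronization is needed, and reuse ends
// naturally when the writer is flushed and discarded.
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class PerWriterCache<T> {
    private final Map<String, T> byField = new HashMap<>();

    // Build the (expensive) object once per field per writer, then reuse it.
    T getOrCreate(String field, Supplier<T> factory) {
        return byField.computeIfAbsent(field, f -> factory.get());
    }

    void clear() { byField.clear(); }  // called when the writer is done
}
```

As the thread notes, this trades some reuse (each new writer rebuilds its streams) for dropping ThreadLocals, and with them the need for Analyzer.close().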