Re: Analyzer lifecycles
Alan, I'd also like to comment on this:

> The reason we have TokenStreamComponents and ReuseStrategies (as I understand it) is not because they may have to load large resource files or dictionaries or whatever, but it's because building a TokenStream is itself quite a heavy operation due to AttributeFactories and reflection.

That's actually not my major concern. There are plenty of TokenStreams doing a fair amount of work: creating objects, setting up buffers, anything to make the actual processing fast. This makes sense today because they are reused. But if we *sometimes* reuse token streams (IndexWriter) and *other times don't* (query time), it just adds more pain to keeping the analyzers efficient: now they have to optimize for two cases.

I also don't want all this adding up to increased garbage for users that are actually search-heavy. Some of these users don't care about index speed at all and are more concerned with QPS, latencies, etc. This is very different from e.g. logging use cases where people are just indexing all day and maybe rarely searching. The current analyzer design is efficient for both use cases.

On Wed, Jun 9, 2021 at 9:03 AM Alan Woodward wrote:
>
> Hey Robert,
>
> Analyzers themselves can be heavy and load large data files, etc, I agree, but I'm really talking about token stream construction. The way things are set up, we expect the heavy lifting to be done when the Analyzer is constructed, but these heavy resources should then be shared between token streams (I fixed a bug in the Ukrainian analyzer a while back that was getting this wrong, see LUCENE-9930). So you'd build your Analyzers once and use the same instances at query and at index time, and there's no worry about reloading large dictionaries on every use.
>
> But re-using token streams is different.
> The reason we have TokenStreamComponents and ReuseStrategies (as I understand it) is not because they may have to load large resource files or dictionaries or whatever, but it's because building a TokenStream is itself quite a heavy operation due to AttributeFactories and reflection. My argument is that this is only heavy relative to the cost of indexing a single field, and that this only really matters when you have documents with lots of small fields in them. For query building or highlighting or MoreLikeThis, the cost of building a small number of token streams is tiny compared to all the other heavy lifting and IO going on. And so if we pushed this *TokenStream* reuse into IndexWriter we wouldn't have to have a close() method on Analyzer (because the thread locals go away, and we expect file resources etc. to be closed once the analyzer has finished building itself), and delegating or wrapping analyzers becomes much simpler.
>
> Does that make more sense?
>
> (I agree on the thread pool stuff, but we need to be careful about not blowing up users' systems even if they are implementing anti-patterns!)
>
> > On 8 Jun 2021, at 16:12, Robert Muir wrote:
> >
> > Alan: a couple thoughts:
> >
> > Analyzers are not just used for formulating queries, but also may be used by highlighters and other things on document results at query time. Some analyzers may do too-expensive/garbage-creating stuff on construction that you wouldn't want to do at query time. Separately, I think Analyzer being closable makes sense. Users still need to carefully consider the lifecycle of this thing for performance, and may want to return their own resources for some reason (close() is a non-final method today). Analyzers might require large amounts of resources (such as parsing files/lists, ML models, who knows what).
> > For the built-in minimal resources that we ship, we try to make construction cheap and use static holder classes, and so on. I'm concerned some of these are costly. But I'm definitely worried about longer files and stuff that many users might use.
> >
> > I feel like some of this "large threadpool" stuff is just a java antipattern for search. I configure servers with fixed threadpools matching the number of CPU cores, and tell my load balancer about that number (e.g. haproxy maxconn), so that it can effectively queue and not overload search servers.
> >
> > On Tue, Jun 8, 2021 at 10:23 AM Alan Woodward wrote:
> >>
> >> Hi all,
> >>
> >> I've been on holiday and away from a keyboard for a week, so that means I of course spent my time thinking about lucene Analyzers and specifically their ReuseStrategies…
> >>
> >> Building a TokenStream can be quite a heavy operation, and so we try and reuse already-constructed token streams as much as possible. This is particularly important at index time, as having to create lots and lots of very short-lived token streams for documents with many short text fields could mean that we spend longer building these objects than we
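Robert's point is that token streams front-load work (objects, buffers) so per-token processing is fast, and that this only pays off if the stream is reused. A minimal plain-Java sketch of the per-thread reuse pattern under discussion (class names are hypothetical, not Lucene's actual API):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Lucene's API: a "stream" that is expensive to
// construct, cached per thread so repeated use on the same indexing thread
// pays the construction cost only once.
class ReusableStream {
    static final AtomicInteger constructions = new AtomicInteger();

    ReusableStream() {
        // stands in for attribute/reflection setup and buffer allocation
        constructions.incrementAndGet();
    }

    // One cached instance per thread, mirroring the ThreadLocal-based reuse
    // that Analyzer's ReuseStrategy performs today.
    private static final ThreadLocal<ReusableStream> CACHE =
            ThreadLocal.withInitial(ReusableStream::new);

    static ReusableStream get() {
        return CACHE.get();
    }
}
```

If reuse is dropped only at query time, every call site like `ReusableStream.get()` would instead pay the constructor cost, which is the extra garbage Robert is worried about for search-heavy users.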
Re: Analyzer lifecycles
Yes, I'm using the term "Analyzer" in a generic sense; I'm also concerned about TokenStream init costs, garbage, etc. There are a ton of uses here other than IndexWriter: AnalyzingSuggesters building FSTs, etc.

I don't think we need to add even more complexity because of users implementing these anti-patterns on their end. This problem of using hundreds/thousands of threads is a uniquely "java-developer" problem; I don't see these issues with applications written in other programming languages. We really can't shield them from it. If we stop reusing, they will just get different _symptoms_ (slow performance, GC issues, etc), but the underlying problem is using all those unnecessary threads. That's what should get fixed.

On Wed, Jun 9, 2021 at 9:03 AM Alan Woodward wrote:
>
> Hey Robert,
>
> Analyzers themselves can be heavy and load large data files, etc, I agree, but I'm really talking about token stream construction. The way things are set up, we expect the heavy lifting to be done when the Analyzer is constructed, but these heavy resources should then be shared between token streams (I fixed a bug in the Ukrainian analyzer a while back that was getting this wrong, see LUCENE-9930). So you'd build your Analyzers once and use the same instances at query and at index time, and there's no worry about reloading large dictionaries on every use.
>
> But re-using token streams is different. The reason we have TokenStreamComponents and ReuseStrategies (as I understand it) is not because they may have to load large resource files or dictionaries or whatever, but it's because building a TokenStream is itself quite a heavy operation due to AttributeFactories and reflection. My argument is that this is only heavy relative to the cost of indexing a single field, and that this only really matters when you have documents with lots of small fields in them.
> For query building or highlighting or MoreLikeThis, the cost of building a small number of token streams is tiny compared to all the other heavy lifting and IO going on. And so if we pushed this *TokenStream* reuse into IndexWriter we wouldn't have to have a close() method on Analyzer (because the thread locals go away, and we expect file resources etc. to be closed once the analyzer has finished building itself), and delegating or wrapping analyzers becomes much simpler.
>
> Does that make more sense?
>
> (I agree on the thread pool stuff, but we need to be careful about not blowing up users' systems even if they are implementing anti-patterns!)
>
> > On 8 Jun 2021, at 16:12, Robert Muir wrote:
> >
> > Alan: a couple thoughts:
> >
> > Analyzers are not just used for formulating queries, but also may be used by highlighters and other things on document results at query time. Some analyzers may do too-expensive/garbage-creating stuff on construction that you wouldn't want to do at query time. Separately, I think Analyzer being closable makes sense. Users still need to carefully consider the lifecycle of this thing for performance, and may want to return their own resources for some reason (close() is a non-final method today). Analyzers might require large amounts of resources (such as parsing files/lists, ML models, who knows what). For the built-in minimal resources that we ship, we try to make construction cheap and use static holder classes, and so on. I'm concerned some of these are costly. But I'm definitely worried about longer files and stuff that many users might use.
> >
> > I feel like some of this "large threadpool" stuff is just a java antipattern for search. I configure servers with fixed threadpools matching the number of CPU cores, and tell my load balancer about that number (e.g.
> > haproxy maxconn), so that it can effectively queue and not overload search servers.
> >
> > On Tue, Jun 8, 2021 at 10:23 AM Alan Woodward wrote:
> >>
> >> Hi all,
> >>
> >> I've been on holiday and away from a keyboard for a week, so that means I of course spent my time thinking about lucene Analyzers and specifically their ReuseStrategies…
> >>
> >> Building a TokenStream can be quite a heavy operation, and so we try and reuse already-constructed token streams as much as possible. This is particularly important at index time, as having to create lots and lots of very short-lived token streams for documents with many short text fields could mean that we spend longer building these objects than we do pulling data from them. To help support this, lucene Analyzers have a ReuseStrategy, which defaults to storing a map of fields to token streams in a ThreadLocal object. Because ThreadLocals can behave badly when it comes to containers that have large thread pools, we use a special CloseableThreadLocal class that can null out its contents once the Analyzer
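The server configuration Robert describes, a fixed pool sized to the machine's cores so the load balancer can queue excess work, can be sketched with the standard `java.util.concurrent` API (the class name here is illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the pattern: size the search worker pool to the CPU core count,
// and advertise that same number to the load balancer (e.g. haproxy maxconn)
// so excess requests queue upstream instead of spawning hundreds of threads.
class SearchPool {
    static ExecutorService create() {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(cores);
    }
}
```

With a bounded pool like this, per-thread caches such as Lucene's ThreadLocal-based reuse stay small; the pathological case Robert calls a "java antipattern" is the huge, churning pool where each short-lived thread accumulates its own cached state.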
Re: Analyzer lifecycles
Hey Robert,

Analyzers themselves can be heavy and load large data files, etc, I agree, but I'm really talking about token stream construction. The way things are set up, we expect the heavy lifting to be done when the Analyzer is constructed, but these heavy resources should then be shared between token streams (I fixed a bug in the Ukrainian analyzer a while back that was getting this wrong, see LUCENE-9930). So you'd build your Analyzers once and use the same instances at query and at index time, and there's no worry about reloading large dictionaries on every use.

But re-using token streams is different. The reason we have TokenStreamComponents and ReuseStrategies (as I understand it) is not because they may have to load large resource files or dictionaries or whatever, but it's because building a TokenStream is itself quite a heavy operation due to AttributeFactories and reflection. My argument is that this is only heavy relative to the cost of indexing a single field, and that this only really matters when you have documents with lots of small fields in them. For query building or highlighting or MoreLikeThis, the cost of building a small number of token streams is tiny compared to all the other heavy lifting and IO going on. And so if we pushed this *TokenStream* reuse into IndexWriter we wouldn't have to have a close() method on Analyzer (because the thread locals go away, and we expect file resources etc. to be closed once the analyzer has finished building itself), and delegating or wrapping analyzers becomes much simpler.

Does that make more sense?

(I agree on the thread pool stuff, but we need to be careful about not blowing up users' systems even if they are implementing anti-patterns!)

> On 8 Jun 2021, at 16:12, Robert Muir wrote:
>
> Alan: a couple thoughts:
>
> Analyzers are not just used for formulating queries, but also may be used by highlighters and other things on document results at query time.
> Some analyzers may do too-expensive/garbage-creating stuff on construction that you wouldn't want to do at query time. Separately, I think Analyzer being closable makes sense. Users still need to carefully consider the lifecycle of this thing for performance, and may want to return their own resources for some reason (close() is a non-final method today). Analyzers might require large amounts of resources (such as parsing files/lists, ML models, who knows what). For the built-in minimal resources that we ship, we try to make construction cheap and use static holder classes, and so on. I'm concerned some of these are costly. But I'm definitely worried about longer files and stuff that many users might use.
>
> I feel like some of this "large threadpool" stuff is just a java antipattern for search. I configure servers with fixed threadpools matching the number of CPU cores, and tell my load balancer about that number (e.g. haproxy maxconn), so that it can effectively queue and not overload search servers.
>
> On Tue, Jun 8, 2021 at 10:23 AM Alan Woodward wrote:
>>
>> Hi all,
>>
>> I've been on holiday and away from a keyboard for a week, so that means I of course spent my time thinking about lucene Analyzers and specifically their ReuseStrategies…
>>
>> Building a TokenStream can be quite a heavy operation, and so we try and reuse already-constructed token streams as much as possible. This is particularly important at index time, as having to create lots and lots of very short-lived token streams for documents with many short text fields could mean that we spend longer building these objects than we do pulling data from them. To help support this, lucene Analyzers have a ReuseStrategy, which defaults to storing a map of fields to token streams in a ThreadLocal object.
>> Because ThreadLocals can behave badly when it comes to containers that have large thread pools, we use a special CloseableThreadLocal class that can null out its contents once the Analyzer is done with, and this leads to Analyzer itself being Closeable. This makes extending analyzers more complicated, as delegating wrappers need to ensure that they don't end up sharing token streams with their delegates.
>>
>> It's common to use the same analyzer for indexing and for parsing user queries. At query time, reusing token streams is a lot less important - the amount of time spent building the query is typically much lower than the amount of time spent rewriting and executing it. The fact that this re-use is only really useful for index time and that the lifecycle of the analyzer is therefore very closely tied to the lifecycle of its associated IndexWriter makes me think that we should think about moving the re-use strategies into IndexWriter itself. One option would be to have token streams be constructed once per DocumentsWriterPerThread, which would lose some
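The "null out its contents on close" behavior that makes Analyzer itself Closeable can be illustrated with a small plain-Java sketch (this is the general shape of the pattern, not Lucene's actual CloseableThreadLocal internals):

```java
import java.io.Closeable;

// Hypothetical sketch of a closeable thread-local cache: when the owner is
// closed, the per-thread contents are dropped so a long-lived thread in a
// container's pool does not pin the cached objects forever.
class ClosableCache<T> implements Closeable {
    private ThreadLocal<T> slot = new ThreadLocal<>();

    T get() {
        return slot == null ? null : slot.get();
    }

    void set(T value) {
        if (slot != null) {
            slot.set(value);
        }
    }

    @Override
    public void close() {
        if (slot != null) {
            slot.remove(); // drop this thread's cached entry
            slot = null;   // further get()/set() become no-ops
        }
    }
}
```

This is exactly the coupling Alan wants to remove: because the cache lives inside the analyzer, the analyzer must expose close(), and any wrapping analyzer has to be careful not to share the delegate's cached streams.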
Re: Analyzer lifecycles
Alan: a couple thoughts:

Analyzers are not just used for formulating queries, but also may be used by highlighters and other things on document results at query time. Some analyzers may do too-expensive/garbage-creating stuff on construction that you wouldn't want to do at query time. Separately, I think Analyzer being closable makes sense. Users still need to carefully consider the lifecycle of this thing for performance, and may want to return their own resources for some reason (close() is a non-final method today). Analyzers might require large amounts of resources (such as parsing files/lists, ML models, who knows what). For the built-in minimal resources that we ship, we try to make construction cheap and use static holder classes, and so on. I'm concerned some of these are costly. But I'm definitely worried about longer files and stuff that many users might use.

I feel like some of this "large threadpool" stuff is just a java antipattern for search. I configure servers with fixed threadpools matching the number of CPU cores, and tell my load balancer about that number (e.g. haproxy maxconn), so that it can effectively queue and not overload search servers.

On Tue, Jun 8, 2021 at 10:23 AM Alan Woodward wrote:
>
> Hi all,
>
> I've been on holiday and away from a keyboard for a week, so that means I of course spent my time thinking about lucene Analyzers and specifically their ReuseStrategies…
>
> Building a TokenStream can be quite a heavy operation, and so we try and reuse already-constructed token streams as much as possible. This is particularly important at index time, as having to create lots and lots of very short-lived token streams for documents with many short text fields could mean that we spend longer building these objects than we do pulling data from them. To help support this, lucene Analyzers have a ReuseStrategy, which defaults to storing a map of fields to token streams in a ThreadLocal object.
> Because ThreadLocals can behave badly when it comes to containers that have large thread pools, we use a special CloseableThreadLocal class that can null out its contents once the Analyzer is done with, and this leads to Analyzer itself being Closeable. This makes extending analyzers more complicated, as delegating wrappers need to ensure that they don't end up sharing token streams with their delegates.
>
> It's common to use the same analyzer for indexing and for parsing user queries. At query time, reusing token streams is a lot less important - the amount of time spent building the query is typically much lower than the amount of time spent rewriting and executing it. The fact that this re-use is only really useful for index time and that the lifecycle of the analyzer is therefore very closely tied to the lifecycle of its associated IndexWriter makes me think that we should think about moving the re-use strategies into IndexWriter itself. One option would be to have token streams be constructed once per DocumentsWriterPerThread, which would lose some re-use but would mean we could avoid ThreadLocals entirely.
>
> Any thoughts?
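Alan's per-DocumentsWriterPerThread alternative can be sketched in plain Java (names hypothetical): the cached components are owned by the per-thread writer state object rather than by a ThreadLocal inside the analyzer, so no close() hook is needed to evict them.

```java
// Hypothetical sketch of the proposal: reuse follows the indexing worker
// object (like a DocumentsWriterPerThread) instead of a ThreadLocal inside
// the Analyzer. When the worker is discarded, its cache goes with it.
class IndexingThreadState {
    private Object cachedComponents; // stands in for cached TokenStreamComponents

    Object components() {
        if (cachedComponents == null) {
            // heavy construction happens once per writer thread state
            cachedComponents = new Object();
        }
        return cachedComponents;
    }
}
```

The trade-off named in the thread is that query-time callers would no longer get reuse at all, which is the two-cases concern Robert raises in his reply.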
Analyzer lifecycles
Hi all,

I've been on holiday and away from a keyboard for a week, so that means I of course spent my time thinking about lucene Analyzers and specifically their ReuseStrategies…

Building a TokenStream can be quite a heavy operation, and so we try and reuse already-constructed token streams as much as possible. This is particularly important at index time, as having to create lots and lots of very short-lived token streams for documents with many short text fields could mean that we spend longer building these objects than we do pulling data from them. To help support this, lucene Analyzers have a ReuseStrategy, which defaults to storing a map of fields to token streams in a ThreadLocal object. Because ThreadLocals can behave badly when it comes to containers that have large thread pools, we use a special CloseableThreadLocal class that can null out its contents once the Analyzer is done with, and this leads to Analyzer itself being Closeable. This makes extending analyzers more complicated, as delegating wrappers need to ensure that they don't end up sharing token streams with their delegates.

It's common to use the same analyzer for indexing and for parsing user queries. At query time, reusing token streams is a lot less important - the amount of time spent building the query is typically much lower than the amount of time spent rewriting and executing it. The fact that this re-use is only really useful for index time and that the lifecycle of the analyzer is therefore very closely tied to the lifecycle of its associated IndexWriter makes me think that we should think about moving the re-use strategies into IndexWriter itself. One option would be to have token streams be constructed once per DocumentsWriterPerThread, which would lose some re-use but would mean we could avoid ThreadLocals entirely.

Any thoughts?

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
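The per-field, per-thread reuse described in the message above (a ThreadLocal holding a map from field name to cached token stream) can be sketched in plain Java; class and method names here are illustrative, not Lucene's actual ReuseStrategy API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of field-to-stream reuse: each thread keeps its own
// map from field name to cached components, so the expensive construction
// runs once per field per thread.
class PerFieldReuse {
    static int built = 0; // counts heavy constructions, for illustration

    private static final ThreadLocal<Map<String, Object>> STREAMS =
            ThreadLocal.withInitial(HashMap::new);

    static Object tokenStreamFor(String field) {
        return STREAMS.get().computeIfAbsent(field, f -> {
            built++;             // heavy construction, once per field per thread
            return new Object(); // stands in for cached TokenStreamComponents
        });
    }
}
```

Because the map lives in a ThreadLocal, a container with a large, churning thread pool accumulates one map per live thread, which is the behavior that motivates CloseableThreadLocal and the close() method on Analyzer.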