Alan, I'd also like to comment on this:

The reason we have TokenStreamComponents and ReuseStrategies (as I
understand it) is not because they may have to load large resource
files or dictionaries or whatever, but it’s because building a
TokenStream is itself quite a heavy operation due to
AttributeFactories and reflection.

That's actually not my major concern. There are plenty of
TokenStream constructors doing a fair amount of work: creating objects,
setting up buffers, anything to make the actual processing fast. This
makes sense today because they are reused. But if we *sometimes* reuse
token streams (IndexWriter) and *other times don't* (query time), it
just adds more pain to keeping the analyzers efficient: now they have
to optimize for two cases.
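
For a sense of what I mean, here's a made-up filter in the usual style
(ScratchBufferFilter is illustrative, not a real Lucene class):

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class ScratchBufferFilter extends TokenFilter {
      // allocated once at construction so incrementToken() stays garbage-free
      private final char[] scratch = new char[256];
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

      public ScratchBufferFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (input.incrementToken() == false) {
          return false;
        }
        // ... transform termAtt in place using scratch, no per-token allocation ...
        return true;
      }
    }

Run once and reused, that constructor cost is noise; rebuilt per field
per query, it's pure garbage.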

I also don't want all of this adding up to increased garbage for
users who are actually search-heavy. Some of these users don't care
about index speed at all and are more concerned with QPS, latencies,
etc. This is very different from, e.g., logging use cases where people
are just indexing all day and maybe rarely searching. The current
analyzer design is efficient for both use cases.

On Wed, Jun 9, 2021 at 9:03 AM Alan Woodward <romseyg...@gmail.com> wrote:
>
> Hey Robert,
>
> Analyzers themselves can be heavy and load large data files, etc., I agree,
> but I’m really talking about token stream construction.  The way things are 
> set up, we expect the heavy lifting to be done when the Analyzer is 
> constructed, but these heavy resources should then be shared between token 
> streams (I fixed a bug in the Ukrainian analyzer a while back that was 
> getting this wrong, see LUCENE-9930).  So you’d build your Analyzers once and 
> use the same instances at query and at index time, and there’s no worry about 
> reloading large dictionaries on every use.
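>
> As a sketch of that shape (Dictionary and DictionaryFilter are made-up
> stand-ins for whatever heavy resource is involved):
>
>     import java.io.IOException;
>     import java.nio.file.Path;
>     import org.apache.lucene.analysis.Analyzer;
>     import org.apache.lucene.analysis.Tokenizer;
>     import org.apache.lucene.analysis.standard.StandardTokenizer;
>
>     public final class DictionaryAnalyzer extends Analyzer {
>       private final Dictionary dict;  // heavy resource, loaded exactly once
>
>       public DictionaryAnalyzer(Path dictFile) throws IOException {
>         this.dict = Dictionary.load(dictFile);  // expensive, construction-time
>       }
>
>       @Override
>       protected TokenStreamComponents createComponents(String fieldName) {
>         Tokenizer source = new StandardTokenizer();
>         // each new token stream references the shared dict; nothing reloads
>         return new TokenStreamComponents(source, new DictionaryFilter(source, dict));
>       }
>     }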
>
> But re-using token streams is different. The reason we have 
> TokenStreamComponents and ReuseStrategies (as I understand it) is not because 
> they may have to load large resource files or dictionaries or whatever, but 
> it’s because building a TokenStream is itself quite a heavy operation due to 
> AttributeFactories and reflection.  My argument is that this is only heavy 
> relative to the cost of indexing a single field, and that this only really 
> matters when you have documents with lots of small fields in them.  For query 
> building or highlighting or MoreLikeThis, the cost of building a small 
> number of token streams is tiny compared to all the other heavy lifting and 
> IO going on.  And so if we pushed this *TokenStream* reuse into IndexWriter 
> we wouldn’t have to have a close() method on Analyzer (because the thread 
> locals go away, and we expect file resources etc to be closed once the 
> analyzer has finished building itself), and delegating or wrapping analyzers 
> becomes much simpler.
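>
> To see why, even a trivial delegating wrapper today has to care about
> reuse strategies. This one uses the real DelegatingAnalyzerWrapper API;
> if reuse moved into IndexWriter, the strategy plumbing would just go away:
>
>     import org.apache.lucene.analysis.Analyzer;
>     import org.apache.lucene.analysis.DelegatingAnalyzerWrapper;
>
>     public final class MyWrapper extends DelegatingAnalyzerWrapper {
>       private final Analyzer delegate;
>
>       public MyWrapper(Analyzer delegate) {
>         // this strategy exists only to make reuse defer to the delegate
>         super(Analyzer.PER_FIELD_REUSE_STRATEGY);
>         this.delegate = delegate;
>       }
>
>       @Override
>       protected Analyzer getWrappedAnalyzer(String fieldName) {
>         return delegate;
>       }
>     }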
>
> Does that make more sense?
>
> (I agree on the thread pool stuff, but we need to be careful about not 
> blowing up users' systems even if they are implementing anti-patterns!)
>
> > On 8 Jun 2021, at 16:12, Robert Muir <rcm...@gmail.com> wrote:
> >
> > Alan: a couple of thoughts:
> >
> > - Analyzers are not just used for formulating queries; they may also be
> >   used by highlighters and other things operating on document results at
> >   query time.
> > - Some analyzers may do too-expensive/garbage-creating stuff on
> >   construction that you wouldn't want to do at query time.
> > - Separately, I think Analyzer being Closeable makes sense. Users still
> >   need to carefully consider the lifecycle of this thing for performance,
> >   and may want to release their own resources for some reason (close() is
> >   a non-final method today).
> > - Analyzers might require large amounts of resources (such as parsing
> >   files/lists, ML models, who knows what).
> > - For the built-in minimal resources that we ship, we try to make
> >   construction cheap, use static holder classes, and so on (a sketch of
> >   the holder idiom follows below); even so, I'm concerned some of these
> >   are costly. And I'm definitely worried about the larger files and
> >   resources that many users might use.
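> >
> > A sketch of that holder idiom (simplified; loadStopSet() is a made-up
> > helper standing in for parsing a bundled resource file):
> >
> >     // The set is parsed exactly once, lazily, when the JVM first
> >     // initializes the holder class; all analyzer instances share it.
> >     private static final class DefaultSetHolder {
> >       static final CharArraySet DEFAULT_STOP_SET = loadStopSet();
> >     }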
> >
> > I feel like some of this "large threadpool" stuff is just a Java
> > antipattern for search. I configure servers with fixed threadpools
> > matching the number of CPU cores, and tell my load balancer about that
> > number (e.g. haproxy maxconn), so that it can queue effectively and
> > not overload the search servers.
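> >
> > Concretely, that's just plain java.util.concurrent on the server side
> > (the maxconn half lives in the haproxy config, not in Java):
> >
> >     // Fixed pool sized to the machine; the load balancer is told the
> >     // same number, so excess requests queue upstream of the server.
> >     int cores = Runtime.getRuntime().availableProcessors();
> >     ExecutorService searchPool = Executors.newFixedThreadPool(cores);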
> >
> > On Tue, Jun 8, 2021 at 10:23 AM Alan Woodward <romseyg...@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I’ve been on holiday and away from a keyboard for a week, so that means I 
> >> of course spent my time thinking about Lucene Analyzers and specifically 
> >> their ReuseStrategies…
> >>
> >> Building a TokenStream can be quite a heavy operation, and so we try to 
> >> reuse already-constructed token streams as much as possible.  This is 
> >> particularly important at index time, as having to create lots and lots of 
> >> very short-lived token streams for documents with many short text fields 
> >> could mean that we spend longer building these objects than we do pulling 
> >> data from them.  To help support this, Lucene Analyzers have a 
> >> ReuseStrategy, which defaults to storing a map of fields to token streams 
> >> in a ThreadLocal object.  Because ThreadLocals can behave badly when it 
> >> comes to containers that have large thread pools, we use a special 
> >> CloseableThreadLocal class that can null out its contents once the 
> >> Analyzer is done with, and this leads to Analyzer itself being Closeable.  
> >> This makes extending analyzers more complicated, as delegating wrappers 
> >> need to ensure that they don’t end up sharing token streams with their 
> >> delegates.
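> >>
> >> To make that concrete, the reuse path inside tokenStream() is roughly
> >> this shape (a simplified sketch, not the exact source):
> >>
> >>     TokenStreamComponents comps =
> >>         reuseStrategy.getReusableComponents(this, fieldName);
> >>     if (comps == null) {
> >>       comps = createComponents(fieldName);  // the expensive part
> >>       reuseStrategy.setReusableComponents(this, fieldName, comps);
> >>     }
> >>     comps.setReader(reader);  // point the cached stream at the new input
> >>     return comps.getTokenStream();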
> >>
> >> It’s common to use the same analyzer for indexing and for parsing user 
> >> queries.  At query time, reusing token streams is a lot less important - 
> >> the amount of time spent building the query is typically much lower than 
> >> the amount of time spent rewriting and executing it.  The fact that this 
> >> re-use is only really useful at index time, and that the lifecycle of the 
> >> analyzer is therefore very closely tied to the lifecycle of its associated 
> >> IndexWriter, makes me think we should consider moving the re-use 
> >> strategies into IndexWriter itself.  One option would be to have token 
> >> streams be constructed once per DocumentsWriterPerThread, which would lose 
> >> some re-use but would mean we could avoid ThreadLocals entirely.
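> >>
> >> Purely as an illustration of that option (not working code, since
> >> createComponents is protected today and the real plumbing would differ):
> >>
> >>     // A DocumentsWriterPerThread is single-threaded by design, so it
> >>     // could own a plain per-field cache, with no ThreadLocal at all:
> >>     private final Map<String, TokenStreamComponents> perFieldCache =
> >>         new HashMap<>();
> >>
> >>     TokenStreamComponents comps = perFieldCache.computeIfAbsent(
> >>         fieldName, f -> analyzer.createComponents(f));  // hypothetical access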
> >>
> >> Any thoughts?