Re: Analyzer lifecycles

2021-06-09 Thread Robert Muir
Alan, I'd also like to comment on this:

> The reason we have TokenStreamComponents and ReuseStrategies (as I
> understand it) is not because they may have to load large resource
> files or dictionaries or whatever, but it’s because building a
> TokenStream is itself quite a heavy operation due to
> AttributeFactories and reflection.

That's actually not my major concern. There are plenty of TokenStreams
doing a fair amount of work up front: creating objects, setting up
buffers, anything to make the actual processing fast. This makes sense
today because they are reused. But if we *sometimes* reuse token streams
(IndexWriter) and *other times don't* (query time), it just adds more
pain to keeping the analyzers efficient. Now they have to optimize for
two cases.
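
A toy sketch of the pattern I mean (a hypothetical filter, not anything
that ships in Lucene): the scratch buffer is allocated eagerly in the
constructor so that incrementToken() can stay allocation-free, which only
pays off if the stream is actually reused:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class ScratchBufferFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final char[] scratch = new char[4096]; // paid on every construction

      public ScratchBufferFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
          return false;
        }
        // Hot path: copy into the pre-allocated buffer, transform in place,
        // write back. No per-token allocation.
        int len = Math.min(termAtt.length(), scratch.length);
        System.arraycopy(termAtt.buffer(), 0, scratch, 0, len);
        termAtt.copyBuffer(scratch, 0, len);
        return true;
      }
    }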

I also don't want all this stuff adding up to increased garbage for
users who are actually search-heavy. Some of these users don't care
about index speed at all and are more concerned with QPS, latencies,
etc. This is very different from e.g. logging use-cases where people
are just indexing all day and maybe rarely searching. The current
analyzer design is efficient for both use-cases.

Re: Analyzer lifecycles

2021-06-09 Thread Robert Muir
Yes, I'm using the term "Analyzer" in a generic sense; I'm also
concerned about TokenStream init costs, garbage, etc.

There are a ton of uses here other than IndexWriter: AnalyzingSuggesters
building FSTs, and so on.

I don't think we need to try to add even more complexity because of
users implementing these anti-patterns on their end. This problem of
using hundreds/thousands of threads is a uniquely "java-developer"
problem. I don't see these issues with applications written in other
programming languages. We really can't shield them from it. If we stop
reusing, they will just get different _symptoms_ (slow performance, GC
issues, etc), but the underlying problem is using all those
unnecessary threads. That's what should get fixed.

Re: Analyzer lifecycles

2021-06-09 Thread Alan Woodward
Hey Robert,

Analyzers themselves can be heavy and load large data files, etc, I agree, but 
I’m really talking about token stream construction.  The way things are set up, 
we expect the heavy lifting to be done when the Analyzer is constructed, but 
these heavy resources should then be shared between token streams (I fixed a 
bug in the Ukrainian analyzer a while back that was getting this wrong, see 
LUCENE-9930).  So you’d build your Analyzers once and use the same instances at 
query and at index time, and there’s no worry about reloading large 
dictionaries on every use.
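
To sketch the shape of that fix (SharedResourceAnalyzer is a made-up
name, and a tiny stop set stands in for the heavy dictionary): the
shared resource belongs to Analyzer construction, while
createComponents() does only cheap per-stream wiring:

    import java.util.Arrays;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;

    public final class SharedResourceAnalyzer extends Analyzer {
      // Stand-in for a heavy resource (imagine a large dictionary): built
      // once with the Analyzer, shared by every token stream it hands out.
      private final CharArraySet stopWords =
          new CharArraySet(Arrays.asList("a", "an", "the"), true);

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        // The LUCENE-9930-style bug is (re)building the heavy resource in
        // this method, which runs for every new token stream rather than
        // once per Analyzer.
        TokenStream filtered = new StopFilter(source, stopWords);
        return new TokenStreamComponents(source, filtered);
      }
    }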

But re-using token streams is different. The reason we have 
TokenStreamComponents and ReuseStrategies (as I understand it) is not because 
they may have to load large resource files or dictionaries or whatever, but 
it’s because building a TokenStream is itself quite a heavy operation due to 
AttributeFactories and reflection.  My argument is that this is only heavy 
relative to the cost of indexing a single field, and that this only really 
matters when you have documents with lots of small fields in them.  For query 
building or highlighting or MoreLikeThis, the cost of building a small number 
of token streams is tiny compared to all the other heavy lifting and IO going 
on.  And so if we pushed this *TokenStream* reuse into IndexWriter we wouldn’t 
have to have a close() method on Analyzer (because the thread locals go away, 
and we expect file resources etc to be closed once the analyzer has finished 
building itself), and delegating or wrapping analyzers becomes much simpler.
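
For a sense of scale, this is roughly what query-time analysis looks like
with the standard API (the class name is made up): one short-lived stream
per query string, whose construction cost is tiny next to rewriting and
executing the query:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class QueryAnalysisExample {
      // Standard consume loop for a token stream obtained at query time.
      public static List<String> analyze(Analyzer analyzer, String field, String text)
          throws IOException {
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream(field, text)) {
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();                      // required before incrementToken()
          while (ts.incrementToken()) {
            terms.add(term.toString());
          }
          ts.end();                        // finalize end-of-stream attributes
        }
        return terms;
      }
    }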

Does that make more sense?

(I agree on the thread pool stuff, but we need to be careful about not blowing 
up users' systems even if they are implementing anti-patterns!)

Re: Analyzer lifecycles

2021-06-08 Thread Robert Muir
Alan: a couple thoughts:

- Analyzers are not just used for formulating queries; they may also be
  used by highlighters and other things working over document results at
  query time.
- Some analyzers may do too-expensive or garbage-creating stuff on
  construction that you wouldn't want to do at query time.
- Separately, I think Analyzer being Closeable makes sense. Users still
  need to carefully consider the lifecycle of this thing for performance,
  and may want to release their own resources for some reason (close() is
  a non-final method today).
- Analyzers might require large amounts of resources (such as parsed
  files/lists, ML models, who knows what). For the built-in minimal
  resources that we ship, we try to make construction cheap, use static
  holder classes (see the sketch just below), and so on. I'm concerned
  some of these are still costly, but I'm definitely more worried about
  larger files and other resources that many users might use.
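
The static-holder idiom I mean looks roughly like this (a condensed
sketch of the pattern the shipped analyzers use; the names here are made
up). The heavy set is built once, lazily, when the holder class is first
loaded, and shared by all analyzer instances:

    import java.util.Arrays;
    import org.apache.lucene.analysis.CharArraySet;

    final class DefaultSetHolder {
      // Class loading gives us thread-safe, lazy, once-only initialization.
      static final CharArraySet DEFAULT_STOP_SET = CharArraySet.unmodifiableSet(
          new CharArraySet(Arrays.asList("a", "an", "and", "the"), true));

      private DefaultSetHolder() {}
    }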

I feel like some of this "large threadpool" stuff is just a Java
anti-pattern for search. I configure servers with fixed thread pools
matching the number of CPU cores, and tell my load balancer about that
number (e.g. haproxy maxconn), so that it can effectively queue and
not overload search servers.
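
In Java terms, that sizing looks something like this (a sketch; the class
name is made up), with queueing left to the load balancer rather than to
an oversized pool:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    final class SearchThreadPool {
      // One fixed pool sized to the machine; excess load queues at the
      // load balancer (e.g. haproxy maxconn) instead of as extra threads.
      static final ExecutorService SEARCH_POOL =
          Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    }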

Analyzer lifecycles

2021-06-08 Thread Alan Woodward
Hi all,

I’ve been on holiday and away from a keyboard for a week, so that means I of 
course spent my time thinking about lucene Analyzers and specifically their 
ReuseStrategies…

Building a TokenStream can be quite a heavy operation, and so we try and reuse 
already-constructed token streams as much as possible.  This is particularly 
important at index time, as having to create lots and lots of very short-lived 
token streams for documents with many short text fields could mean that we 
spend longer building these objects than we do pulling data from them.  To help 
support this, lucene Analyzers have a ReuseStrategy, which defaults to storing 
a map of fields to token streams in a ThreadLocal object.  Because ThreadLocals 
can behave badly when it comes to containers that have large thread pools, we 
use a special CloseableThreadLocal class that can null out its contents once 
we're done with the Analyzer, and this is why Analyzer itself is Closeable.  
This makes extending analyzers more complicated, as delegating wrappers need to 
ensure that they don’t end up sharing token streams with their delegates.
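
A stripped-down sketch of what that default reuse amounts to (the real
logic lives in Analyzer and its PER_FIELD_REUSE_STRATEGY; this standalone
class just mirrors the shape): one field-to-components map per thread,
torn down on close():

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.util.CloseableThreadLocal;

    final class PerFieldReuseSketch implements AutoCloseable {
      // One field -> components map per thread; Object stands in for
      // Analyzer.TokenStreamComponents to keep the sketch self-contained.
      private final CloseableThreadLocal<Map<String, Object>> stored =
          new CloseableThreadLocal<>();

      Object getReusable(String fieldName) {
        Map<String, Object> byField = stored.get();
        return byField == null ? null : byField.get(fieldName);
      }

      void setReusable(String fieldName, Object components) {
        Map<String, Object> byField = stored.get();
        if (byField == null) {
          byField = new HashMap<>();
          stored.set(byField);
        }
        byField.put(fieldName, components);
      }

      @Override
      public void close() {
        // Drops the per-thread maps; this teardown is the reason the real
        // Analyzer has to be Closeable at all.
        stored.close();
      }
    }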

It’s common to use the same analyzer for indexing and for parsing user queries. 
 At query time, reusing token streams is a lot less important - the amount of 
time spent building the query is typically much lower than the amount of time 
spent rewriting and executing it.  The fact that this re-use is only really 
useful for index time and that the lifecycle of the analyzer is therefore very 
closely tied to the lifecycle of its associated IndexWriter makes me think that 
we should think about moving the re-use strategies into IndexWriter itself.  
One option would be to have token streams be constructed once per 
DocumentsWriterPerThread, which would lose some re-use but would mean we could 
avoid ThreadLocals entirely.
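
In sketch form, that could be something like the following (hypothetical,
none of it current Lucene API): each DocumentsWriterPerThread owns a plain
per-field map, and no ThreadLocal is needed because a DWPT is confined to
a single indexing thread anyway:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    // Hypothetical per-DWPT cache; "C" stands in for whatever reusable
    // analysis components would look like in this design.
    final class PerDwptAnalysisCache<C> {
      private final Map<String, C> componentsByField = new HashMap<>();

      // Single-threaded by construction, so no ThreadLocal and no locking.
      C get(String fieldName, Function<String, C> builder) {
        return componentsByField.computeIfAbsent(fieldName, builder);
      }

      // Discarded along with the DWPT after a flush, so nothing lingers
      // and Analyzer needs no close() for reuse purposes.
      void clear() {
        componentsByField.clear();
      }
    }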

Any thoughts?