RE: GC cost of creating String resource description on IndexInput clone

Uwe Schindler Sat, 20 Feb 2021 03:17:08 -0800

Hi,


I was thinking about something similar when working on the Java 17 replacement 
for MMapDirectory. My idea would still be late evaluation, but using a lambda. 
Same way how you can do lazy evaluation in Log4j2 when logging with 
debug/trace/…

 

Instead of having “String resourceDescription“ as a field in each Indexinput, 
we can use “Supplier<String> resourceDescription” and initialize it like: () -> 
foo + bar + parent.toString +…

 

Just because this might be an option to do, I fully agree with Mike Drob and 
Robert Muir: The risk of memory leaks for such a small improvement is too high. 
It also depends on the GC you use. If you use old ConcMarkSweepGC, I agree this 
is an overhead, but nowadays, Lucene should use G1GC.

 

In addition, the strings are not for debugging, they are really useful when 
something goes wrong, e.g. when I/O errors occur.

 

Uwe

 

-----

Uwe Schindler

Achterdiek 19, D-28357 Bremen

https://www.thetaphi.de

eMail: [email protected]

 

From: Robert Muir <[email protected]> 
Sent: Saturday, February 20, 2021 11:56 AM
To: [email protected]
Subject: Re: GC cost of creating String resource description on IndexInput clone

 

The issue is that clone or not, they are both IndexInput.java. So if we go with 
your proposal, then *sometimes* the code will have this reference and *other 
times* it won't and the reference will be null. In that non-clone case, where 
would its resource description (filename) come from?  I predict too many bugs. 

 

Making this part of the code complex would be the wrong tradeoff for 1% 
performance.

 

 

On Fri, Feb 19, 2021 at 4:58 PM Viral Gandhi <[email protected] 
<mailto:[email protected]> > wrote:

Hello everyone,

We recently added Java Flight Recorder (JFR) to our internal benchmarking. 
While looking through top heap allocations from JFR output, we noticed the 
following to be in top 5 contributors to garbage creation.

org.apache.lucene.store.IndexInput#getFullSliceDescription()
  at org.apache.lucene.store.ByteBufferIndexInput#buildSlice()
   at org.apache.lucene.store.ByteBufferIndexInput#slice()
   at org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl#slice()
   at org.apache.lucene.store.IndexInput#randomAccessSlice()

It seems that we construct a new String resource description for each 
clone/slice of IndexInput instance. Resource description for a clone/slice 
instance is a String concatenation of resource description from original 
instance and current slice description. Also, clone/slice can happen multiple 
times per query per segment. 

Can we avoid upfront String allocation by late-evaluating IndexInput.toString() 
on the cloned instance? One approach could be to hold a reference of the base 
IndexInput instance in the cloned instance. Then while evaluating 
IndexInput.toString() on a cloned instance, we can construct a resource 
description by concatenating base instance's toString() with current resource 
description. My understanding is that IndexInput.toString() is primarily used 
for debugging purposes hence we can benefit from the late-evaluation. 

With this approach, we are seeing sustainable GC time reduction of ~6% and gain 
of ~1% to red-line QPS (throughput). My intention to start this thread is to 
collect feedback on this approach as well as to discuss any other ideas.

 

Thanks,

Viral Gandhi

RE: GC cost of creating String resource description on IndexInput clone

Reply via email to