Re: GC improvement when skipping bytes in DataInput

Robert Muir Wed, 17 Feb 2021 15:34:24 -0800

There is zero concurrency issue. I think you are reading discussion about
old patches that didn't get committed, ignore that and look at the code.
Look at DataInput javadoc: "DataInput may only be used from one thread,
because it is not thread safe (it keeps internal state like file position)."
Look at IndexInput javadoc: "IndexInput may only be used from one thread,
because it is not thread safe (it keeps internal state like file position)."
This applies to ChecksumIndexInput, too, it is a subclass.


Let's try to avoid adding more complex logic to the situation. Base classes
like DataInput shouldn't have "slow" methods that do things like read bytes
and copy them around just to increment a position pointer... that is not
good. It is supposed to just have some simple decoding helpers (like vint
decode). So it is at the wrong "level" to have such a method, as it can't
do it in an efficient way.

A good start would be to rename DataInput.skipBytes() to a deprecated
DataInput.skipBytesSlowly() and add a new abstract DataInput.skipBytes().
Now you just have to implement skipBytes() with all the subclasses, but you
can always start with skipBytesSlowly() // TODO: fix this, so it allows
incremental progress. For IndexInput, you can make skipBytes() just call
seek(), that is an easy win. ByteArrayDataInput is already good to go,
perhaps it is the only one with a correct implementation :)


On Wed, Feb 17, 2021 at 5:10 PM Greg Miller <[email protected]> wrote:

> Thanks for pointing me to the existing issue! I think I generally agree
> that no threadlocal would be needed if the functionality of skipBytes() and
> seek() were collapsed as per the issue you pointed me to. In the current
> state though, concurrency causes problems for delegating implementations of
> DataInput, specifically ChecksumIndexInput (as detailed in LUCENE-5583
> <https://issues.apache.org/jira/browse/LUCENE-5583>).
>
> I'll spend a little more time thinking about LUCENE-9480 and comment
> there. Seems like a better approach to eliminate the need for reading in
> bytes just to skip, assuming it can be made to play nice with checksums,
> etc. Thanks again!
>
> Cheers,
> -Greg
>
> On Wed, Feb 17, 2021 at 1:47 PM Robert Muir <[email protected]> wrote:
>
>> See this already-open issue:
>> https://issues.apache.org/jira/browse/LUCENE-9480
>>
>> No threadlocals are necessary. DataInput/IndexInput are already intended
>> for use by one thread.
>>
>> For e.g. mmapdirectory this should be "+=". no buffer is required.
>>
>> On Wed, Feb 17, 2021 at 4:15 PM Greg Miller <[email protected]> wrote:
>>
>>> Hi folks-
>>>
>>> I work on a Lucene-based search system and we recently added Java Flight
>>> Recorder to our benchmark tooling. When looking through results, we found
>>> DataInput#skipBytes() to be a top contributor to garbage creation. We're
>>> using Lucene84SkipReader and always skipping over Impacts in our use-case.
>>> At first glance, it appeared pretty obvious that creating new instances of
>>> the skipBuffer byte[] for each instance of DataInput was the culprit.
>>>
>>> It looks like alternatives were discussed originally in LUCENE-5583
>>> <https://issues.amazon.com/issues/LUCENE-4947>, one of which being a
>>> thread-local implementation of the skip buffer (since it can't be a static
>>> field without breaking delegating subclasses, like ChecksumIndexInput). At
>>> the time, a thread-local was advised against
>>> <https://issues.apache.org/jira/browse/LUCENE-5583?focusedCommentId=13964258&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13964258>
>>>  by
>>> Uwe due to GC expense, but in our benchmarks, bringing in a thread-local
>>> implementation reduced overall GC time by ~7%.
>>>
>>> I'd like to revisit this implementation decision and discuss ways in
>>> which we can reduce this unnecessary garbage creation. It seems like moving
>>> to a thread-local implementation is a win here, but I'd love to hear more
>>> thoughts or alternative suggestions from the group. I'm new to this
>>> community, so I'm not sure the best way to proceed. Should I open a Jira
>>> issue as a next step? Thanks in advance!
>>>
>>> Cheers,
>>> -Greg
>>>
>>

Re: GC improvement when skipping bytes in DataInput

Reply via email to