Re: GC improvement when skipping bytes in DataInput

Robert Muir Wed, 17 Feb 2021 17:15:37 -0800

Sure, we can always create followups, but we can start with trying to make
the implementations efficient.


We'd probably want to create more followup issues later anyway, e.g. to
address any TODOs, and ultimately to fix the API: it is silly to have
seek(long) and skipBytes(long) that do exactly the same thing.

It is just especially egregious that one of these has a default
implementation that will allocate buffers and sequentially read+throw away
up to 2^63-1 bytes, so let's fix that first.

On Wed, Feb 17, 2021 at 7:58 PM Greg Miller <gsmil...@gmail.com> wrote:

> Fair points. I'm going to see if I can carve out some time to take up your
> suggested approach. Would you suggest using LUCENE-9480
> <https://issues.apache.org/jira/browse/LUCENE-9480> to report back on any
> progress made? There are a few different ideas captured there, and it
> sounds like the leading suggestion is to collapse the ideas of DataInput
> and IndexInput, so I'm not sure if that's the best place to track the
> suggested approach we've discussed here or not.
>
> Thanks again for the discussion!
>
> Cheers,
> -Greg
>
> On Wed, Feb 17, 2021 at 4:09 PM Robert Muir <rcm...@gmail.com> wrote:
>
>>
>>
>> On Wed, Feb 17, 2021 at 6:53 PM Greg Miller <gsmil...@gmail.com> wrote:
>>
>>>
>>> Right, I am looking at the code but maybe we're talking about two
>>> different things, so let me clarify. I agree that there is no concurrency
>>> issue with the current code and I apologise if my point was confusing. The
>>> reason skipBytes was made an instance variable as opposed to a static one
>>> was to *avoid *creating a concurrency issue (which certainly would
>>> exist if it had been made static). Making it an instance variable is
>>> wasteful for GC though, no? My suggestion of moving to a threadlocal hits a
>>> "happy medium" where we're not allocating these silly buffers for each
>>> DataInput instance but making sure each thread has a separate one. Does
>>> this make more sense now?
>>>
>>>
>> Adding a threadlocal isn't a happy medium here. The first thing I see is
>> thread churn issues (apps that use non-fixed threadpools). See javadocs for
>> ClosableThreadLocal for more information. We can't even use that
>> CloseableThreadLocal hack here, because nobody calls close() on clones of
>> IndexInputs. So threadlocal would seriously make matters worse for some
>> apps using the library, all for something that should be a "+=" :)
>>
>> I'm not trying to suggest perfection, just start by deprecating the slow
>> impl, make it abstract and "hoist" the responsibility of implementation
>> upwards to subclasses that are better prepared to handle them. For the
>> majority use case (e.g. IndexInput), your buffer trivially goes away since
>> you implemented it with seek(). The ones that are left can remain slow,
>> that's fine, you speed up 90% easily, and we can see what is needed to fix
>> the leftovers. But I really don't think these will be difficult either,
>> e.g. for ChecksumIndexInput we can probably leave it abstract and implement
>> the skipping directly on BufferedChecksumIndexInput (it already has a
>> buffer, so it should be able to use that one, and avoid 2 buffers like
>> today).
>>
>

Re: GC improvement when skipping bytes in DataInput

Reply via email to