Sounds good; thanks again. I'll see if I can cook something up this week.
As I thought about this a little further, I think I'll need to avoid a
default "skipSlowly" in DataInput in order to see the same benefits I
observed with my local change, since it requires doing away with allocating
a byte[] for skipping in each DataInput. So if DataInput still has a
default "slow skip" implementation, that doesn't quite solve the
accumulation of byte array garbage (although there would be other benefits
of course). After looking at the code a little more though, it doesn't seem
too difficult to create a "proper" skipBytes implementation in all the
necessary places.

Thanks again!

Cheers,
-Greg

On Wed, Feb 17, 2021 at 5:15 PM Robert Muir <rcm...@gmail.com> wrote:

> Sure, we can always create followups, but we can start with trying to make
> the implementations efficient.
>
> We'd probably want to create more followup issues later anyway, e.g. to
> address any TODOs, and ultimately to fix the API: it is silly to have
> seek(long) and skipBytes(long) that do exactly the same thing.
>
> It is just especially egregious that one of these has a default
> implementation that will allocate buffers and sequentially read+throw away
> up to 2^63-1 bytes, so let's fix that first.
>
> On Wed, Feb 17, 2021 at 7:58 PM Greg Miller <gsmil...@gmail.com> wrote:
>
>> Fair points. I'm going to see if I can carve out some time to take up
>> your suggested approach. Would you suggest using LUCENE-9480
>> <https://issues.apache.org/jira/browse/LUCENE-9480> to report back on
>> any progress made? There are a few different ideas captured there, and it
>> sounds like the leading suggestion is to collapse the ideas of DataInput
>> and IndexInput, so I'm not sure if that's the best place to track the
>> suggested approach we've discussed here or not.
>>
>> Thanks again for the discussion!
>>
>> Cheers,
>> -Greg
>>
>> On Wed, Feb 17, 2021 at 4:09 PM Robert Muir <rcm...@gmail.com> wrote:
>>
>>>
>>>
>>> On Wed, Feb 17, 2021 at 6:53 PM Greg Miller <gsmil...@gmail.com> wrote:
>>>
>>>>
>>>> Right, I am looking at the code but maybe we're talking about two
>>>> different things, so let me clarify. I agree that there is no concurrency
>>>> issue with the current code and I apologise if my point was confusing. The
>>>> reason skipBytes was made an instance variable as opposed to a static one
>>>> was to *avoid *creating a concurrency issue (which certainly would
>>>> exist if it had been made static). Making it an instance variable is
>>>> wasteful for GC though, no? My suggestion of moving to a threadlocal hits a
>>>> "happy medium" where we're not allocating these silly buffers for each
>>>> DataInput instance but making sure each thread has a separate one. Does
>>>> this make more sense now?
>>>>
>>>>
>>> Adding a threadlocal isn't a happy medium here. The first thing I see is
>>> thread churn issues (apps that use non-fixed threadpools). See javadocs for
>>> ClosableThreadLocal for more information. We can't even use that
>>> CloseableThreadLocal hack here, because nobody calls close() on clones of
>>> IndexInputs. So threadlocal would seriously make matters worse for some
>>> apps using the library, all for something that should be a "+=" :)
>>>
>>> I'm not trying to suggest perfection, just start by deprecating the slow
>>> impl, make it abstract and "hoist" the responsibility of implementation
>>> upwards to subclasses that are better prepared to handle them. For the
>>> majority use case (e.g. IndexInput), your buffer trivially goes away since
>>> you implemented it with seek(). The ones that are left can remain slow,
>>> that's fine, you speed up 90% easily, and we can see what is needed to fix
>>> the leftovers. But I really don't think these will be difficult either,
>>> e.g. for ChecksumIndexInput we can probably leave it abstract and implement
>>> the skipping directly on BufferedChecksumIndexInput (it already has a
>>> buffer, so it should be able to use that one, and avoid 2 buffers like
>>> today).
>>>
>>

Reply via email to