Sounds good; thanks again. I'll see if I can cook something up this week. As I thought about this a little further, I think I'll need to avoid a default "skipSlowly" in DataInput in order to see the same benefits I observed with my local change, since it requires doing away with allocating a byte[] for skipping in each DataInput. So if DataInput still has a default "slow skip" implementation, that doesn't quite solve the accumulation of byte array garbage (although there would be other benefits of course). After looking at the code a little more though, it doesn't seem too difficult to create a "proper" skipBytes implementation in all the necessary places.
Thanks again! Cheers, -Greg On Wed, Feb 17, 2021 at 5:15 PM Robert Muir <rcm...@gmail.com> wrote: > Sure, we can always create followups, but we can start with trying to make > the implementations efficient. > > We'd probably want to create more followup issues later anyway, e.g. to > address any TODOs, and ultimately to fix the API: it is silly to have > seek(long) and skipBytes(long) that do exactly the same thing. > > It is just especially egregious that one of these has a default > implementation that will allocate buffers and sequentially read+throw away > up to 2^63-1 bytes, so let's fix that first. > > On Wed, Feb 17, 2021 at 7:58 PM Greg Miller <gsmil...@gmail.com> wrote: > >> Fair points. I'm going to see if I can carve out some time to take up >> your suggested approach. Would you suggest using LUCENE-9480 >> <https://issues.apache.org/jira/browse/LUCENE-9480> to report back on >> any progress made? There are a few different ideas captured there, and it >> sounds like the leading suggestion is to collapse the ideas of DataInput >> and IndexInput, so I'm not sure if that's the best place to track the >> suggested approach we've discussed here or not. >> >> Thanks again for the discussion! >> >> Cheers, >> -Greg >> >> On Wed, Feb 17, 2021 at 4:09 PM Robert Muir <rcm...@gmail.com> wrote: >> >>> >>> >>> On Wed, Feb 17, 2021 at 6:53 PM Greg Miller <gsmil...@gmail.com> wrote: >>> >>>> >>>> Right, I am looking at the code but maybe we're talking about two >>>> different things, so let me clarify. I agree that there is no concurrency >>>> issue with the current code and I apologise if my point was confusing. The >>>> reason skipBytes was made an instance variable as opposed to a static one >>>> was to *avoid *creating a concurrency issue (which certainly would >>>> exist if it had been made static). Making it an instance variable is >>>> wasteful for GC though, no? My suggestion of moving to a threadlocal hits a >>>> "happy medium" where we're not allocating these silly buffers for each >>>> DataInput instance but making sure each thread has a separate one. Does >>>> this make more sense now? >>>> >>>> >>> Adding a threadlocal isn't a happy medium here. The first thing I see is >>> thread churn issues (apps that use non-fixed threadpools). See javadocs for >>> ClosableThreadLocal for more information. We can't even use that >>> CloseableThreadLocal hack here, because nobody calls close() on clones of >>> IndexInputs. So threadlocal would seriously make matters worse for some >>> apps using the library, all for something that should be a "+=" :) >>> >>> I'm not trying to suggest perfection, just start by deprecating the slow >>> impl, make it abstract and "hoist" the responsibility of implementation >>> upwards to subclasses that are better prepared to handle them. For the >>> majority use case (e.g. IndexInput), your buffer trivially goes away since >>> you implemented it with seek(). The ones that are left can remain slow, >>> that's fine, you speed up 90% easily, and we can see what is needed to fix >>> the leftovers. But I really don't think these will be difficult either, >>> e.g. for ChecksumIndexInput we can probably leave it abstract and implement >>> the skipping directly on BufferedChecksumIndexInput (it already has a >>> buffer, so it should be able to use that one, and avoid 2 buffers like >>> today). >>> >>