On Tue, 20 May 2025 23:01:08 GMT, John R Rose <jr...@openjdk.org> wrote:
>> This change adds a smaller "fast-forward" table to accelerate random-access queries within a larger stream.
>>
>> I am not against this change, but I think it should be refactored (either now or later) into a library that can be used with other streamy data. I'm thinking of debug-info, dependency lists, line number tables, reloc-info, or maybe PC descs as future clients of such a thing.
>>
>> It could be a query accelerator (index) for any kind of "streamy data" that can get long and needs occasional random access.
>>
>> It is good to switch to binary search for big tables, while preserving the speed of linear search for small ones. A similar technique can also be seen in `PcDescContainer::find_pc_desc_internal`. Both binary-search algorithms are complicated and difficult to validate. If we had a suitable library, we could tune it and test it carefully, and use it to simplify field streams, PC descs, and whatever else we need to work with.
>>
>> BTW, it's sometimes useful to add a front-side cache for such searches. This is a bookmark of the last query point. It helps with queries that are correlated; you don't re-search from scratch on every step. This is the function of `last_pc_desc` in the PC desc query; it can accelerate the binary search.
>>
>> The fast-forward table here uses an ad hoc var-int scheme which is delicate and possibly buggy.
>>
>> One way to do without the specialized table is to inject synchronization markers into the streamy data itself. Note that the null byte is never used to encode UNSIGNED5 data. Therefore, if you have a stream of UNSIGNED5 tokens, you can add boundaries encoded as null bytes. In the present application, you could perform binary search on the raw stream pointer, with each probe first re-synchronizing to the next null byte and then decoding with a short sprint of linear access. The placement of null bytes is a tuning tactic: more null bytes add footprint but decrease the length of the short sprints.
>>
>> I'm not asking for any changes at this point because I have not read the PR carefully enough. And I don't intend to hold up reviews, but I do want us to put refactoring on our roadmap.
>
>> @rose00 While refactoring might be useful, this PR is trying to address a performance regression (in fact by improving the scenario significantly), and I'd prefer that to land in 25
>
> It will land when it's ready. It's not quite ready yet. I'm not asking for reusability, but I am saying that doing a one-off index hack here is exchanging one kind of technical debt (maintenance complexity) for another (performance pothole). If we hustle to make 25, we need a plan to fix it after 25.
>
>> rather than to delay just for the sake of better 'reusability'.
>
> Yes, 'reusability'. That's how big systems are maintained.
>
>> And TBH I don't consider the binary search algorithm too difficult to validate.
>
> TBH I do. I've seen enough OBOEs for one lifetime. I don't consider even that simple an algorithm to be fully tested unless it is either unit tested (gtest) or stress tested (threshold set small), and preferably both. Testing it for just a few classes that have many fields is not full coverage. (Normally for this kind of thing, your hard-coded constant 16 would be made a debug-only global, and we'd stress-test with it set to 1 or 2.) Writing a unit test requires enough factorization to write the unit test. Which also tends towards virtuous reusability.
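As a side note for anyone following along, here is a minimal sketch of how I read the null-byte resynchronization idea quoted above ("inject synchronization markers into the streamy data itself"): binary search over raw byte offsets, with each probe re-synchronizing at the next injected 0x00 byte, followed by a short linear sprint. Everything in it is an assumption made for illustration; the (key, value) record layout, the toy `read_varint` stand-in for the real UNSIGNED5 codec, and the `find_ge` helper are not the PR's code or any existing HotSpot API.

```c++
#include <cstdint>
#include <cstddef>

// Toy varint reader (7 data bits per byte, high bit = continuation).
// Stand-in for UNSIGNED5: with a minimal-length encoder and values >= 1,
// it never produces a 0x00 byte, so 0x00 can serve as a sync marker.
static uint32_t read_varint(const uint8_t* s, size_t& pos) {
  uint32_t v = 0;
  int shift = 0;
  uint8_t b;
  do {
    b = s[pos++];
    v |= uint32_t(b & 0x7f) << shift;
    shift += 7;
  } while (b & 0x80);
  return v;
}

// Assumed stream layout: (key, value) records with strictly increasing keys,
// and a 0x00 sync byte injected before every Nth record.
// Returns the byte offset of the first record whose key is >= target,
// or len if there is none.
size_t find_ge(const uint8_t* s, size_t len, uint32_t target) {
  size_t lo = 0, hi = len;
  size_t start = 0;                              // best known entry point for the final sprint
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    size_t sync = mid;
    while (sync < len && s[sync] != 0) sync++;   // re-synchronize to the next 0x00 marker
    if (sync == len) { hi = mid; continue; }     // no marker at or after mid
    size_t p = sync + 1;                         // a record starts right after the marker
    uint32_t key = read_varint(s, p);
    if (key <= target) { start = sync + 1; lo = sync + 1; }
    else               { hi = mid; }
  }
  // Short sprint of linear access from the chosen entry point.
  size_t p = start;
  while (p < len) {
    if (s[p] == 0) { p++; continue; }            // skip sync markers during the sprint
    size_t rec = p;
    uint32_t key = read_varint(s, p);
    if (key >= target) return rec;
    (void)read_varint(s, p);                     // skip the value field of this record
  }
  return len;
}
```

More sync bytes shorten the final sprint at the cost of footprint, which matches the tuning remark in the quote.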
>> While a simple cache might be useful, I don't have a good example to validate its usefulness. And turning a read-only scenario into a mutating scenario can turn out badly for multithreaded code, as in https://bugs.openjdk.org/browse/JDK-8180450 .
>
> No worries about this one; I'm not asking for such a thing, just noting in passing that it sometimes helps. Probably not in this case, since you didn't resort the original array, and the usual sequential encounter order is direct, not requiring a cache. If we were re-engineering PC desc stuff on top of CompressedStream, we'd want such a cache. Maybe some day. Some one-element caches have the problem you point out, but if you look at the one-element cache for PC descs you'll see one with no such reported problems. (IIRC I wrote both.)
>
>> While you suggest the 0 byte as a synchronization marker, I've tried to synchronize the stream in #24713 and experiments have shown that while it improves things, it was not sufficient in my case. That's why I've abandoned that approach.
>
> I looked at that code; thanks. That wasn't what I meant. The point of a null resynch byte is to allow *approximate* entry into the main stream. It is an optional technique that would work like this:
>
> ...

@rose00 I've applied your suggestion, moving the search table to a separate array. While I saw keeping related data closer together as a pro, this solution has the definite advantage that the implementation details of the search table do not leak into the Java code at all. Should I turn `SEARCH_TABLE_THRESHOLD` into a global variable, as you mention? It might offer an opportunity for fine-tuning. From a testing point of view, it would scale the number of classes that have the search table attached; it wouldn't change behaviour. If we're worried about off-by-one errors, I would rather be concerned about the decisions on 1/2/3-byte encoding in the search table.

> It will land when it's ready. It's not quite ready yet. I'm not asking for reusability, but I am saying that doing a one-off index hack here is exchanging one kind of technical debt (maintenance complexity) for another (performance pothole). If we hustle to make 25, we need a plan to fix it after 25.

I am not trying to rush it, but to keep the scope limited. Most importantly, I wanted to avoid making other parts of the codebase reuse the implementation. From this point, I can imagine abstracting it a bit more, if you find that useful, maybe at the cost of some virtual method calls.

> It is an optional technique that would work like this

Thanks for the explanation, I get it now. But yes, let's try not to complicate it now. There are no definite criteria for how much of a CPU/memory tradeoff the JDK can afford - that's also why this PR went through several iterations. Initially I was looking for a solution consuming the fewest extra bytes, disregarding the expense in 'maintainability'.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24847#issuecomment-2897709062