Re: RFR: 8352075: Perf regression accessing fields [v5]

Radim Vansa Wed, 21 May 2025 07:45:52 -0700

On Tue, 20 May 2025 23:01:08 GMT, John R Rose <jr...@openjdk.org> wrote:


>> This change adds smaller "fast-forward" table to accelerate random-access 
>> queries within a larger stream.
>> 
>> I am not against this change, but I think it should be refactored (either 
>> now or later) into a library that can be used with other streamy data.  I'm 
>> thinking of debug-info, dependency lists, line number tables, reloc-info, or 
>> maybe PC descs as future clients of such a thing.
>> 
>> It could be a query accelerator (index), for any kind of "streamy data" that 
>> can get long, and needs occasional random access.
>> 
>> It is good to switch to binary search for big tables, while preserving the 
>> speed of linear search for small ones.  A similar technique can also be seen 
>> in `PcDescContainer::find_pc_desc_internal`.  Both binary-search algorithms 
>> are complicated and difficult to validate.   If we had a suitable library, 
>> we could tune it and test it carefully, and use it to simplify field 
>> streams, PC descs, and whatever else we need to work with.
>> 
>> BTW, it's sometimes useful to add a front-side cache for such searches.  
>> This is a bookmark of the last query point.  This helps with queries which 
>> are correlated; you don't re-search from scratch on every step.  This is the 
>> function of `last_pc_desc` in the PC desc query; it can accelerate the 
>> binary search.
>> 
>> The fast-forward table here uses an ad hoc var-int scheme which is delicate 
>> and possibly buggy.
>> 
>> One way to do without the specialized table is to inject synchronization 
>> markers into the streamy data itself.  Note that the null byte is never used 
>> to encode UNSIGNED5 data.  Therefore, if you have a stream of UNSIGNED5 
>> tokens, you can add boundaries encoded null bytes.   In the present 
>> application, you could perform binary search on the raw stream pointer, with 
>> each probe first re-synchronizing to the first null byte, and then decoding 
>> with a short sprint of linear access.  The placement of null bytes is a 
>> tuning tactic:  More null bytes add footprint but decrease the length of the 
>> short sprints.
>> 
>> I'm not asking for any changes at this point because I have not read the PR 
>> carefully enough.  And I don't intend to hold up reviews, but I do want us 
>> to put refactoring on our roadmap.
>
>> @rose00 While refactoring might be useful, this PR is trying to address a 
>> performance regression (in fact by improving the scenario significantly), 
>> and I'd prefer that to land in 25
> 
> It will land when it's ready.  It's not quite ready yet.  I'm not asking for 
> reusability, but I am saying that doing a one-off index hack here is 
> exchanging one kind of technical debt (maintenance complexity) for another 
> (performance pothole).  If we hustle to make 25, we need a plan to fix it 
> after 25.
> 
>> rather than to delay just for the sake of better 'reusability'.
> 
> Yes, 'reusability'.  That's how big systems are maintained.
> 
>> And TBH I don't consider binary search algorithm too difficult to validate.
> 
> TBH I do.  I've seen enough OBOEs for one lifetime.  I don't consider even 
> that simple an algorithm to be fully tested unless it is either unit tested 
> (gtest) or stress tested (threshold set small), and preferably both.  Testing 
> it for just a few classes that have many fields is not full coverage.  
> (Normally for this kind of thing, your hard-coded constant 16 would be made a 
> debug-only globals, and we'd stress-test with it set to 1 or 2.)   Writing a 
> unit test requires enough factorization to write the unit test.  Which also 
> tends towards virtuous reusability.
> 
>> While a simple cache might be useful, I don't have a good example to 
>> validate its usefulness. And turning a read-only scenario into mutating 
>> scenario can turn out badly for multithreaded code, as in 
>> https://bugs.openjdk.org/browse/JDK-8180450 .
> 
> No worries about this one; I'm not asking for such a thing, just noting in 
> passing that it sometimes helps.  Probably not in this case, since you didn't 
> resort the original array, and the usual sequential encounter order is 
> direct, not requiring a cache.  If we were re-engineering PC desc stuff on 
> top of CompressedStream, we'd want such a cache.  Maybe some day.  Some 
> one-element caches have the problem you point out, but if you look at the 
> one-element cache for PC descs you'll see one with no such reported problems. 
>  (IIRC I wrote both.)
> 
>> While you suggest 0 byte as synchronization markers, I've tried to 
>> synchronize the stream in #24713 and experiments have shown that while it 
>> improves things, it was not sufficient in my case. That's why I've abandoned 
>> that approach.
> 
> I looked at that code; thanks.  That wasn't what I meant.  The point of a 
> null resynch byte is to allow *approximate* entry into the main stream.  It 
> is an optional technique that would work like this:
> 
>  ...

@rose00 I've attempted to extract the binary search in 
https://github.com/openjdk/jdk/commit/62a40f6b8aafb4a986add0bdac0425dfaf197d62 
, but I am not convinced this is better.
The `SearchTable::lookup` is static now - I am not sure if I could have `class 
SearchTable: private Array<u1> { ... }` and have that allocated in metaspace...

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24847#issuecomment-2898186528

Re: RFR: 8352075: Perf regression accessing fields [v5]

Reply via email to