Re: RFR: 8352075: Perf regression accessing fields [v5]

John R Rose Tue, 20 May 2025 16:04:24 -0700

On Fri, 16 May 2025 23:51:39 GMT, John R Rose <jr...@openjdk.org> wrote:


>> Radim Vansa has updated the pull request incrementally with five additional 
>> commits since the last revision:
>> 
>>  - Revert change in array.hpp
>>  - Revert changes in VerifyRawIndexesTest
>>  - Improve FieldInfo::print
>>  - Load constant for SORTED_FIELD_TABLE_THRESHOLD
>>  - Replace JumpTable with SortedFieldTable
>
> This change adds smaller "fast-forward" table to accelerate random-access 
> queries within a larger stream.
> 
> I am not against this change, but I think it should be refactored (either now 
> or later) into a library that can be used with other streamy data.  I'm 
> thinking of debug-info, dependency lists, line number tables, reloc-info, or 
> maybe PC descs as future clients of such a thing.
> 
> It could be a query accelerator (index), for any kind of "streamy data" that 
> can get long, and needs occasional random access.
> 
> It is good to switch to binary search for big tables, while preserving the 
> speed of linear search for small ones.  A similar technique can also be seen 
> in `PcDescContainer::find_pc_desc_internal`.  Both binary-search algorithms 
> are complicated and difficult to validate.   If we had a suitable library, we 
> could tune it and test it carefully, and use it to simplify field streams, PC 
> descs, and whatever else we need to work with.
> 
> BTW, it's sometimes useful to add a front-side cache for such searches.  This 
> is a bookmark of the last query point.  This helps with queries which are 
> correlated; you don't re-search from scratch on every step.  This is the 
> function of `last_pc_desc` in the PC desc query; it can accelerate the binary 
> search.
> 
> The fast-forward table here uses an ad hoc var-int scheme which is delicate 
> and possibly buggy.
> 
> One way to do without the specialized table is to inject synchronization 
> markers into the streamy data itself.  Note that the null byte is never used 
> to encode UNSIGNED5 data.  Therefore, if you have a stream of UNSIGNED5 
> tokens, you can add boundaries encoded null bytes.   In the present 
> application, you could perform binary search on the raw stream pointer, with 
> each probe first re-synchronizing to the first null byte, and then decoding 
> with a short sprint of linear access.  The placement of null bytes is a 
> tuning tactic:  More null bytes add footprint but decrease the length of the 
> short sprints.
> 
> I'm not asking for any changes at this point because I have not read the PR 
> carefully enough.  And I don't intend to hold up reviews, but I do want us to 
> put refactoring on our roadmap.

> @rose00 While refactoring might be useful, this PR is trying to address a 
> performance regression (in fact by improving the scenario significantly), and 
> I'd prefer that to land in 25

It will land when it's ready.  It's not quite ready yet.  I'm not asking for 
reusability, but I am saying that doing a one-off index hack here is exchanging 
one kind of technical debt (maintenance complexity) for another (performance 
pothole).  If we hustle to make 25, we need a plan to fix it after 25.

> rather than to delay just for the sake of better 'reusability'.

Yes, 'reusability'.  That's how big systems are maintained.

> And TBH I don't consider binary search algorithm too difficult to validate.

TBH I do.  I've seen enough OBOEs for one lifetime.  I don't consider even that 
simple an algorithm to be fully tested unless it is either unit tested (gtest) 
or stress tested (threshold set small), and preferably both.  Testing it for 
just a few classes that have many fields is not full coverage.  (Normally for 
this kind of thing, your hard-coded constant 16 would be made a debug-only 
globals, and we'd stress-test with it set to 1 or 2.)   Writing a unit test 
requires enough factorization to write the unit test.  Which also tends towards 
virtuous reusability.

> While a simple cache might be useful, I don't have a good example to validate 
> its usefulness. And turning a read-only scenario into mutating scenario can 
> turn out badly for multithreaded code, as in 
> https://bugs.openjdk.org/browse/JDK-8180450 .

No worries about this one; I'm not asking for such a thing, just noting in 
passing that it sometimes helps.  Probably not in this case, since you didn't 
resort the original array, and the usual sequential encounter order is direct, 
not requiring a cache.  If we were re-engineering PC desc stuff on top of 
CompressedStream, we'd want such a cache.  Maybe some day.  Some one-element 
caches have the problem you point out, but if you look at the one-element cache 
for PC descs you'll see one with no such reported problems.  (IIRC I wrote 
both.)

> While you suggest 0 byte as synchronization markers, I've tried to 
> synchronize the stream in #24713 and experiments have shown that while it 
> improves things, it was not sufficient in my case. That's why I've abandoned 
> that approach.

I looked at that code; thanks.  That wasn't what I meant.  The point of a null 
resynch byte is to allow *approximate* entry into the main stream.  It is an 
optional technique that would work like this:

 - The secondary index stores approximate address ranges, not exact byte 
pointers, to the data in the primary stream.
 - For example, for compactness, the secondary index might scale down by (say) 
2 cache line sizes (discarding 7 lower bits).
 - As a tradeoff, this makes the secondary index smaller (discard 8 bits per 
entry).
 - But, it means you have a search a bit when you jump into the primary stream.
 - Since compressed data is not self-synchronizing, you have to first search 
for a null byte (after or maybe before).
 - The search starts just after a found null byte.
 - The length of the search is bounded by the scale factor (128 bytes, say).
 - The stream logic that traverses the CompressedStream tests for zero bytes 
and skips them (ignores them).

The thing to try first, though, is to store exact offsets into the secondary 
index.  Then you don't need to search, nor resynchronize, nor plant zero bytes 
in the stream.  Much simpler.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24847#issuecomment-2896016101

Re: RFR: 8352075: Perf regression accessing fields [v5]

Reply via email to