On Tue, 20 May 2025 23:01:08 GMT, John R Rose <jr...@openjdk.org> wrote:
>> This change adds a smaller "fast-forward" table to accelerate random-access queries within a larger stream.
>>
>> I am not against this change, but I think it should be refactored (either now or later) into a library that can be used with other streamy data. I'm thinking of debug-info, dependency lists, line number tables, reloc-info, or maybe PC descs as future clients of such a thing.
>>
>> It could be a query accelerator (index) for any kind of "streamy data" that can get long and needs occasional random access.
>>
>> It is good to switch to binary search for big tables, while preserving the speed of linear search for small ones. A similar technique can also be seen in `PcDescContainer::find_pc_desc_internal`. Both binary-search algorithms are complicated and difficult to validate. If we had a suitable library, we could tune it and test it carefully, and use it to simplify field streams, PC descs, and whatever else we need to work with.
>>
>> BTW, it's sometimes useful to add a front-side cache for such searches. This is a bookmark of the last query point. It helps with queries that are correlated; you don't re-search from scratch on every step. This is the function of `last_pc_desc` in the PC desc query; it can accelerate the binary search.
>>
>> The fast-forward table here uses an ad hoc var-int scheme which is delicate and possibly buggy.
>>
>> One way to do without the specialized table is to inject synchronization markers into the streamy data itself. Note that the null byte is never used to encode UNSIGNED5 data. Therefore, if you have a stream of UNSIGNED5 tokens, you can add boundaries encoded as null bytes. In the present application, you could perform binary search on the raw stream pointer, with each probe first re-synchronizing to the next null byte and then decoding with a short sprint of linear access. The placement of null bytes is a tuning tactic: more null bytes add footprint but decrease the length of the short sprints.
>>
>> I'm not asking for any changes at this point because I have not read the PR carefully enough. And I don't intend to hold up reviews, but I do want us to put refactoring on our roadmap.
>
>> @rose00 While refactoring might be useful, this PR is trying to address a performance regression (in fact by improving the scenario significantly), and I'd prefer that to land in 25
>
> It will land when it's ready. It's not quite ready yet. I'm not asking for reusability, but I am saying that doing a one-off index hack here is exchanging one kind of technical debt (maintenance complexity) for another (performance pothole). If we hustle to make 25, we need a plan to fix it after 25.
>
>> rather than to delay just for the sake of better 'reusability'.
>
> Yes, 'reusability'. That's how big systems are maintained.
>
>> And TBH I don't consider the binary search algorithm too difficult to validate.
>
> TBH I do. I've seen enough OBOEs for one lifetime. I don't consider even that simple an algorithm to be fully tested unless it is either unit tested (gtest) or stress tested (threshold set small), and preferably both. Testing it for just a few classes that have many fields is not full coverage. (Normally for this kind of thing, your hard-coded constant 16 would be made a debug-only global, and we'd stress-test with it set to 1 or 2.) Writing a unit test requires enough factorization to write the unit test. Which also tends towards virtuous reusability.
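As a side note for anyone following along, here is a minimal sketch of how I read the null-byte resynchronization idea quoted above ("inject synchronization markers into the streamy data itself"): binary search over raw byte offsets, with each probe re-synchronizing at the next injected 0x00 byte, followed by a short linear sprint. Everything in it is an assumption made for illustration; the (key, value) record layout, the toy `read_varint` stand-in for the real UNSIGNED5 codec, and the `find_ge` helper are not the PR's code or any existing HotSpot API.

```c++
#include <cstdint>
#include <cstddef>

// Toy varint reader (7 data bits per byte, high bit = continuation).
// Stand-in for UNSIGNED5: with a minimal-length encoder and values >= 1,
// it never produces a 0x00 byte, so 0x00 can serve as a sync marker.
static uint32_t read_varint(const uint8_t* s, size_t& pos) {
  uint32_t v = 0;
  int shift = 0;
  uint8_t b;
  do {
    b = s[pos++];
    v |= uint32_t(b & 0x7f) << shift;
    shift += 7;
  } while (b & 0x80);
  return v;
}

// Assumed stream layout: (key, value) records with strictly increasing keys,
// and a 0x00 sync byte injected before every Nth record.
// Returns the byte offset of the first record whose key is >= target,
// or len if there is none.
size_t find_ge(const uint8_t* s, size_t len, uint32_t target) {
  size_t lo = 0, hi = len;
  size_t start = 0;                              // best known entry point for the final sprint
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    size_t sync = mid;
    while (sync < len && s[sync] != 0) sync++;   // re-synchronize to the next 0x00 marker
    if (sync == len) { hi = mid; continue; }     // no marker at or after mid
    size_t p = sync + 1;                         // a record starts right after the marker
    uint32_t key = read_varint(s, p);
    if (key <= target) { start = sync + 1; lo = sync + 1; }
    else               { hi = mid; }
  }
  // Short sprint of linear access from the chosen entry point.
  size_t p = start;
  while (p < len) {
    if (s[p] == 0) { p++; continue; }            // skip sync markers during the sprint
    size_t rec = p;
    uint32_t key = read_varint(s, p);
    if (key >= target) return rec;
    (void)read_varint(s, p);                     // skip the value field of this record
  }
  return len;
}
```

More sync bytes shorten the final sprint at the cost of footprint, which matches the tuning remark in the quote.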
>> While a simple cache might be useful, I don't have a good example to validate its usefulness. And turning a read-only scenario into a mutating scenario can turn out badly for multithreaded code, as in https://bugs.openjdk.org/browse/JDK-8180450 .
>
> No worries about this one; I'm not asking for such a thing, just noting in passing that it sometimes helps. Probably not in this case, since you didn't resort the original array, and the usual sequential encounter order is direct, not requiring a cache. If we were re-engineering PC desc stuff on top of CompressedStream, we'd want such a cache. Maybe some day. Some one-element caches have the problem you point out, but if you look at the one-element cache for PC descs you'll see one with no such reported problems. (IIRC I wrote both.)
>
>> While you suggest the 0 byte as a synchronization marker, I've tried to synchronize the stream in #24713 and experiments have shown that while it improves things, it was not sufficient in my case. That's why I've abandoned that approach.
>
> I looked at that code; thanks. That wasn't what I meant. The point of a null resynch byte is to allow *approximate* entry into the main stream. It is an optional technique that would work like this:
>
> ...

@rose00 I've applied your suggestion, moving the search table to a separate array. While I saw keeping related data closer together as a pro, this solution has the definite advantage that the implementation details of the search table do not leak into the Java code at all. Should I turn `SEARCH_TABLE_THRESHOLD` into a global variable, as you mention? It might offer an opportunity for fine-tuning. From a testing point of view, it would scale the number of classes that have the search table attached; it wouldn't change behaviour. If we're worried about off-by-one errors, I would rather be concerned about the decisions on 1/2/3-byte encoding in the search table.

> It will land when it's ready. It's not quite ready yet. I'm not asking for reusability, but I am saying that doing a one-off index hack here is exchanging one kind of technical debt (maintenance complexity) for another (performance pothole). If we hustle to make 25, we need a plan to fix it after 25.

I am not trying to rush it, but to keep the scope limited. Most importantly, I wanted to avoid making other parts of the codebase reuse the implementation. From this point, I can imagine abstracting it a bit more, if you find that useful, maybe at the cost of some virtual method calls.

> It is an optional technique that would work like this

Thanks for the explanation, I get it now. But yes, let's try not to complicate it now. There are no definite criteria for how much of a CPU/memory tradeoff the JDK can afford - that's also why this PR went through several iterations. Initially I was looking for a solution consuming the fewest extra bytes, disregarding the expense in 'maintainability'.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24847#issuecomment-2897709062