Hi Ilia, Thanks for the review.
> On Feb 17, 2026, at 00:16, Ilia Evdokimov <[email protected]> > wrote: > > I think it would be beneficial to split it into two logically independent > parts. Right now the patch introduces two different changes at once: > 1. Replacing the original shift-based singleton handling with the cursor > based eviction mechanism. > 2. Adding the hash-based lookup. > Splitting the patch would make review easier because I can first focus purely > on the algorithmic transformation (shift -> cursor) and verify semantic > equivalence. I split v5 accordingly. The first patch changes the singleton handling from shifting to a cursor-based eviction scheme, and the second patch adds the hash lookup. > On Feb 26, 2026, at 19:46, Ilia Evdokimov <[email protected]> > wrote: > > Have you benchmarked this change except in first message in this thread? Yes. I reran the same bench scripts from the first post, unchanged. For this rerun I kept the same comparison setup as in that first message: one run on the pre-change path with no hash lookup, and one run with hash lookup forced on for all statistics targets by setting the threshold to 0. Across the full bench run, the hash-enabled variant shows a 3.963x geomean speedup. For statistics_target >= 100, all measured points improved, with a 5.578x geomean speedup in that region. A few representative rows are: - bench_aclitem.x_unique, target 10000: 204740.298 ms -> 941.056 ms (217.564x) - bench_xid.x_unique, target 10000: 138083.532 ms -> 984.214 ms (140.298x) At lower targets, the effect is smaller and mixed. For example: - bench_xid.x_match, target 50: 42.245 ms -> 43.690 ms (0.967x) > I would probably not rely too heavily on EQJOINSEL_MCV_HASH_THRESHOLD as a > direct reference point here. That threshold is somewhat heuristic as well, > and the cost trade-offs in eqjoinsel() are not identical to what we have in > compute_distinct_stats(). In particular, here the break-even point will > strongly depend on the data type and value width. For example, with wider > datums (e.g. bytea), the comparison cost increases, and the threshold at > which hashing amortizes its setup cost may shift noticeably. We don't need to > compute the exact optimal threshold - just get a reasonable order. I agree that EQJOINSEL_MCV_HASH_THRESHOLD is not a strong direct reference point here, and I am not treating 100 as something to inherit from eqjoinsel(). I reran the bench mainly to sanity-check the threshold for compute_distinct_stats() itself. In this setup, forcing hashing at all statistics targets gives mixed results below 100, while from 100 upward the effect is consistently positive. So in the attached v5 I kept 100 as a conservative heuristic for this relatively narrow path, rather than as a claim that it is an exact or universally optimal cutoff. That seemed enough to establish a reasonable order for the threshold, even if it is not meant as an exact break-even point for every type. > While reviewing the patch more closely, I noticed that > compute_distinct_stats() is only used for types where we have =, != but not > <. In practice, most common scalar types go through compute_scalar_stats() > instead. > > That makes me wonder how often this optimization would actually trigger in > real workloads. Since compute_scalar_stats() is the more common path, there's > chance that the hash-table based improvement in compute_distinct_stats() may > not provide a noticeable overall benefit. I agree that this is a relatively narrow optimization. In core, the main direct built-in cases on this path are xid, cid, and aclitem, plus some derived cases over equality-only types. So the scope here is limited, but when compute_distinct_stats() is exercised, especially with higher statistics targets, the improvement is substantial. I am attaching v5 as a two-patch series, together with bench-compare-result.out containing the full no-hash vs all-hash comparison. -- Best regards, Chengpeng Yan
v5-0001-ANALYZE-use-cursor-eviction-for-distinct-value-tr.patch
Description: v5-0001-ANALYZE-use-cursor-eviction-for-distinct-value-tr.patch
v5-0002-ANALYZE-hash-accelerate-MCV-tracking-for-equality.patch
Description: v5-0002-ANALYZE-hash-accelerate-MCV-tracking-for-equality.patch
bench-compare-result.out
Description: bench-compare-result.out
