Hi Alexander,

Thank you for the patch. I applied v10 to a clean checkout, built it, and
ran the regression suite — all tests pass. Nice work.
I have attached a review report below. Detailed gcov coverage in HTML
format is included in coverage.tgz.


Unicode Normalization Performance Patch Review
==============================================

Patch: Improve the performance of Unicode Normalization Forms
Commitfest: https://commitfest.postgresql.org/patch/5802

Review Methodology:
Quality assessment focused on key code paths and coverage analysis. Code
coverage was measured using gcov (modified lines only), supplemented by
normalization-check (NormalizationTest.txt).

Limitations:
- Not a security audit or formal verification

TABLE OF CONTENTS
-----------------
1. Executive Summary
2. Issues Found
   2.1 Critical / Major
   2.2 Minor (Review Needed)
3. Test Coverage
   3.1 Covered Areas
   3.2 Untested Items
4. Code Coverage Analysis
   4.1 Overall Coverage
   4.2 Uncovered Code Risk Assessment
5. Commit Analysis
6. Recommendations
7. Conclusion

1. EXECUTIVE SUMMARY
--------------------

Overall assessment: GOOD

The patch replaces two separate lookup mechanisms (bsearch for FRONTEND,
perfect hash for BACKEND) with a single two-level sparse array. This
unifies code paths that diverged in 2020 for pragmatic reasons, and
shrinks the generated table from ~19K to ~9K lines. The approach is
well-established in Unicode processing libraries (CPython, ICU, Tcl).

No critical issues found.

Performance was measured by the author and independently confirmed by
Jeff Davis (100 iterations over ICU v76.1 collation test strings; the
ICU line shows ICU's own implementation for comparison):

    Without patch:  NFD 9.1s   NFC 14.5s
    With patch:     NFD 1.6s   NFC  3.0s
    ICU v76.1:      NFD 1.9s   NFC  1.1s

Thread: https://postgr.es/m/[email protected]

Rating by Area:
- Code Quality:  Good (clean unification, no style violations)
- Test Coverage: Adequate (95.4% modified-line coverage)
- Documentation: N/A (no user-facing documentation changes)
- Build/Regress: Pass

2. ISSUES FOUND
---------------

2.1 Critical / Major

None.

2.2 Minor (Review Needed)

#1 [Design] GenerateSparseArray.pm - zero value dual semantics

Index value 0 serves as both "entry not found" and a valid array index.
This works today because all callers reserve position 0 as a null entry,
but the convention is implicit and unenforced. A future caller that does
not follow it would silently return wrong data. Jeff Davis suggested
#define EMPTY 0xFFFF to make the sentinel explicit. The author's
response was non-committal. This remains open. (A sketch of the
convention follows at the end of this section.)

#2 [Code] unicode_norm.c:476-480 - OOM path untestable

ALLOC() failure in unicode_normalize() returns NULL, but this branch is
unreachable in BACKEND (palloc() throws on OOM) and, in FRONTEND,
untestable without fault injection. Recommend manual review of all
FRONTEND callers to confirm that a NULL return is handled correctly (the
second sketch below shows the pattern to look for).
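To make #1 concrete, here is a minimal, self-contained toy model of a
two-level sparse array that relies on the implicit zero convention. All
names, sizes, and payload values are illustrative; this is not what
GenerateSparseArray.pm actually emits:

    /* Toy two-level sparse array; code points limited to < 0x200 here. */
    #include <stdint.h>
    #include <stdio.h>

    #define SHIFT 8
    #define MASK  ((1 << SHIFT) - 1)

    /* data[0] is the null entry every current caller must reserve */
    static const uint32_t data[] = {0, 0xDEAD, 0xBEEF};

    /* level 2: blocks with no entries all share one zero-filled leaf */
    static const uint16_t leaf_empty[1 << SHIFT];
    static const uint16_t leaf_0[1 << SHIFT] = {[0x61] = 1, [0x62] = 2};

    /* level 1: the high bits of the code point select a leaf */
    static const uint16_t *const blocks[] = {leaf_0, leaf_empty};

    static uint32_t
    lookup(uint32_t cp)
    {
        uint16_t idx = blocks[cp >> SHIFT][cp & MASK];

        /*
         * The implicit convention under review: idx == 0 doubles as
         * "no entry".  With an explicit sentinel this would read
         * "if (idx == EMPTY)" with EMPTY defined as 0xFFFF, and slot 0
         * of data[] would become usable for real payload.
         */
        return (idx == 0) ? 0 : data[idx];
    }

    int
    main(void)
    {
        printf("%x %x\n", lookup(0x61), lookup(0x1FF));  /* dead 0 */
        return 0;
    }

Note the trade-off: the zero convention costs one data slot per table,
while 0xFFFF costs one index value; either is cheap, but only the
explicit sentinel fails loudly if a generator caller forgets the rule.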
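And for #2, a minimal sketch of the caller-side pattern to audit in
FRONTEND code. check_normalize() is a hypothetical wrapper written for
this review; unicode_normalize() and the SASLPREP_* result codes are the
existing APIs the report refers to:

    #include "common/saslprep.h"
    #include "common/unicode_norm.h"

    /*
     * Hypothetical wrapper showing the expected FRONTEND convention:
     * a NULL return from unicode_normalize() means ALLOC() failed and
     * must be propagated, never dereferenced.
     */
    static pg_saslprep_rc
    check_normalize(const pg_wchar *input, pg_wchar **output)
    {
        *output = unicode_normalize(UNICODE_NFKC, input);
        if (*output == NULL)
            return SASLPREP_OOM;    /* reachable in FRONTEND only */
        return SASLPREP_SUCCESS;
    }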
3. TEST COVERAGE
----------------

3.1 Covered Areas

Existing tests in src/test/regress/sql/unicode.sql exercise all four
normalization forms (NFC, NFD, NFKC, NFKD), idempotency, the empty
string, the IS NORMALIZED predicate, and error cases.

3.2 Untested Items

The following lines were uncovered before this review. Sample tests have
been added to unicode.sql and verified via pg_regress.

Line                    Code path
-------------------------------------------------------------------------------
unicode_norm.c:337      canonical_reorder() in Hangul decomposition path
unicode_norm.c:414      canonical_reorder() in non-Hangul decomposition path
unicode_norm.c:476-480  ALLOC() failure — FRONTEND OOM only, untestable
unicode_norm.c:550      FREE(classes) when input decomposes to >512 chars

Tests added for lines 337, 414, 550:

    -- Multiple combining marks before a Hangul syllable trigger canonical
    -- reorder before the Hangul fast path writes its jamo decomposition.
    SELECT normalize(U&'\0308\0301\AC00', NFD) = U&'\0308\0301\1100\1161'
           COLLATE "C" AS test_hangul_reorder;

    -- Multiple combining marks before a precomposed character trigger
    -- canonical reorder when the starter from its decomposition is written.
    SELECT normalize(U&'\0308\0301\00E0', NFD) = U&'\0308\0301\0061\0300'
           COLLATE "C" AS test_decomp_reorder;

    -- class_buf is uint8[512]; repeat(U+00E4, 300) decomposes to 600 chars,
    -- forcing heap allocation for the combining class buffer.
    SELECT length(normalize(repeat(U&'\00E4', 300), NFD)) = 600
           AS test_large_decomp;

4. CODE COVERAGE ANALYSIS
-------------------------

4.1 Overall Coverage (modified lines only)

95.4% (146 / 153 lines) — measured on unicode_norm.c.
GenerateSparseArray.pm, PrettyLine.pm, generate-unicode_norm_table.pl,
and unicode_norm_table.h are build-time artifacts; runtime coverage is
not applicable.

4.2 Uncovered Code Risk Assessment

The 7 uncovered lines fall into two groups:

- Lines 337, 414, 550: reachable, but only with edge-case inputs.
  Tests have been added (see section 3.2).
- Lines 476-480: OOM path, unreachable in BACKEND. The sole FRONTEND
  production caller (saslprep.c) handles NULL correctly.

No uncovered code presents a functional or security risk.

5. COMMIT ANALYSIS
------------------

4 commits (2 preparatory + 2 core):

Commit  Area           Files  +/-           Key Content
-------------------------------------------------------------------------------
1       Tools          1      +132/-0       PrettyLine.pm - line formatter
2       Tools          1      +374/-0       GenerateSparseArray.pm - sparse array generator
3       Core           7      +6485/-12534  Sparse array tables + lookup, remove perfect hash
4       Refactor+Perf  3      +3746/-3223   unicode_normalize() rewrite, compat decomp separation

Net change: -5,020 lines. unicode_norm_hashfunc.h deleted (-3,025
lines); unicode_norm_table.h compacted from ~19K to ~9K lines.

6. RECOMMENDATIONS
------------------

1. Add tests for lines 337, 414, and 550 to the regression suite.
   Sample SQL provided in section 3.2 — all three verified correct.
2. Revisit the zero value semantics in GenerateSparseArray.pm (see
   section 2.2 #1). The current code is correct, but the "0 means not
   found" convention is implicit. Consider #define EMPTY 0xFFFF before
   adding more callers.
3. Review FRONTEND callers of unicode_normalize() to confirm that a
   NULL return is handled correctly (see section 2.2 #2).

7. CONCLUSION
-------------

The patch is a well-scoped improvement that closes a gap left open in
2020: the FRONTEND path was kept on bsearch to avoid inflating libpq,
while the BACKEND used a perfect hash. The sparse array is smaller than
the hash table yet faster than bsearch, making it viable for both
contexts.

The only item worth discussing before commit is whether to include the
three new regression tests from section 3.2. Everything else is clean.
coverage.tgz
Description: GNU Zip compressed data
