Hi Alexander,

Thank you for the patch. I applied v10 to a clean checkout, built it, and
ran the regression suite — all tests pass. Nice work.
I have attached a review report below. Detailed gcov coverage in HTML
format is included in coverage.tgz.


Unicode Normalization Performance Patch Review
==============================================

Patch: Improve the performance of Unicode Normalization Forms
Commitfest: https://commitfest.postgresql.org/patch/5802

Review Methodology:
Quality assessment focused on key code paths and coverage analysis. Code
coverage was measured using gcov (modified lines only), supplemented by
normalization-check (NormalizationTest.txt).

Limitations:
- Not a security audit or formal verification

TABLE OF CONTENTS
-----------------
1. Executive Summary
2. Issues Found
   2.1 Critical / Major
   2.2 Minor (Review Needed)
3. Test Coverage
   3.1 Covered Areas
   3.2 Untested Items
4. Code Coverage Analysis
   4.1 Overall Coverage
   4.2 Uncovered Code Risk Assessment
5. Commit Analysis
6. Recommendations
7. Conclusion

1. EXECUTIVE SUMMARY
--------------------

Overall assessment: GOOD

The patch replaces two separate lookup mechanisms (bsearch for FRONTEND,
perfect hash for BACKEND) with a single two-level sparse array. This
unifies code paths that diverged in 2020 for pragmatic reasons, and
shrinks the generated table from ~19K to ~9K lines. The approach is
well-established in Unicode processing libraries (CPython, ICU, Tcl).

No critical issues found.

Performance was measured by the author and independently confirmed by
Jeff Davis (100 iterations over ICU v76.1 collation test strings; the
ICU line shows ICU's own implementation for comparison):

    Without patch:  NFD 9.1s   NFC 14.5s
    With patch:     NFD 1.6s   NFC  3.0s
    ICU v76.1:      NFD 1.9s   NFC  1.1s

Thread: https://postgr.es/m/[email protected]

Rating by Area:
- Code Quality:  Good (clean unification, no style violations)
- Test Coverage: Adequate (95.4% modified-line coverage)
- Documentation: N/A (no user-facing documentation changes)
- Build/Regress: Pass

2. ISSUES FOUND
---------------

2.1 Critical / Major

None.

2.2 Minor (Review Needed)

#1 [Design] GenerateSparseArray.pm - zero value dual semantics

Index value 0 serves as both "entry not found" and a valid array index.
This works today because all callers reserve position 0 as a null entry,
but the convention is implicit and unenforced. A future caller that does
not follow it would silently return wrong data. Jeff Davis suggested
#define EMPTY 0xFFFF to make the sentinel explicit. The author's
response was non-committal. This remains open. (A sketch of the
convention follows at the end of this section.)

#2 [Code] unicode_norm.c:476-480 - OOM path untestable

ALLOC() failure in unicode_normalize() returns NULL, but this branch is
unreachable in BACKEND (palloc() throws on OOM) and, in FRONTEND,
untestable without fault injection. Recommend manual review of all
FRONTEND callers to confirm that a NULL return is handled correctly (the
second sketch below shows the pattern to look for).
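To make #1 concrete, here is a minimal, self-contained toy model of a
two-level sparse array that relies on the implicit zero convention. All
names, sizes, and payload values are illustrative; this is not what
GenerateSparseArray.pm actually emits:

    /* Toy two-level sparse array; code points limited to < 0x200 here. */
    #include <stdint.h>
    #include <stdio.h>

    #define SHIFT 8
    #define MASK  ((1 << SHIFT) - 1)

    /* data[0] is the null entry every current caller must reserve */
    static const uint32_t data[] = {0, 0xDEAD, 0xBEEF};

    /* level 2: blocks with no entries all share one zero-filled leaf */
    static const uint16_t leaf_empty[1 << SHIFT];
    static const uint16_t leaf_0[1 << SHIFT] = {[0x61] = 1, [0x62] = 2};

    /* level 1: the high bits of the code point select a leaf */
    static const uint16_t *const blocks[] = {leaf_0, leaf_empty};

    static uint32_t
    lookup(uint32_t cp)
    {
        uint16_t idx = blocks[cp >> SHIFT][cp & MASK];

        /*
         * The implicit convention under review: idx == 0 doubles as
         * "no entry".  With an explicit sentinel this would read
         * "if (idx == EMPTY)" with EMPTY defined as 0xFFFF, and slot 0
         * of data[] would become usable for real payload.
         */
        return (idx == 0) ? 0 : data[idx];
    }

    int
    main(void)
    {
        printf("%x %x\n", lookup(0x61), lookup(0x1FF));  /* dead 0 */
        return 0;
    }

Note the trade-off: the zero convention costs one data slot per table,
while 0xFFFF costs one index value; either is cheap, but only the
explicit sentinel fails loudly if a generator caller forgets the rule.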
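And for #2, a minimal sketch of the caller-side pattern to audit in
FRONTEND code. check_normalize() is a hypothetical wrapper written for
this review; unicode_normalize() and the SASLPREP_* result codes are the
existing APIs the report refers to:

    #include "common/saslprep.h"
    #include "common/unicode_norm.h"

    /*
     * Hypothetical wrapper showing the expected FRONTEND convention:
     * a NULL return from unicode_normalize() means ALLOC() failed and
     * must be propagated, never dereferenced.
     */
    static pg_saslprep_rc
    check_normalize(const pg_wchar *input, pg_wchar **output)
    {
        *output = unicode_normalize(UNICODE_NFKC, input);
        if (*output == NULL)
            return SASLPREP_OOM;    /* reachable in FRONTEND only */
        return SASLPREP_SUCCESS;
    }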
3. TEST COVERAGE
----------------

3.1 Covered Areas

Existing tests in src/test/regress/sql/unicode.sql exercise all four
normalization forms (NFC, NFD, NFKC, NFKD), idempotency, the empty
string, the IS NORMALIZED predicate, and error cases.

3.2 Untested Items

The following lines were uncovered before this review. Sample tests have
been added to unicode.sql and verified via pg_regress.

Line                    Code path
-------------------------------------------------------------------------------
unicode_norm.c:337      canonical_reorder() in Hangul decomposition path
unicode_norm.c:414      canonical_reorder() in non-Hangul decomposition path
unicode_norm.c:476-480  ALLOC() failure — FRONTEND OOM only, untestable
unicode_norm.c:550      FREE(classes) when input decomposes to >512 chars

Tests added for lines 337, 414, 550:

    -- Multiple combining marks before a Hangul syllable trigger canonical
    -- reorder before the Hangul fast path writes its jamo decomposition.
    SELECT normalize(U&'\0308\0301\AC00', NFD) = U&'\0308\0301\1100\1161'
           COLLATE "C" AS test_hangul_reorder;

    -- Multiple combining marks before a precomposed character trigger
    -- canonical reorder when the starter from its decomposition is written.
    SELECT normalize(U&'\0308\0301\00E0', NFD) = U&'\0308\0301\0061\0300'
           COLLATE "C" AS test_decomp_reorder;

    -- class_buf is uint8[512]; repeat(U+00E4, 300) decomposes to 600 chars,
    -- forcing heap allocation for the combining class buffer.
    SELECT length(normalize(repeat(U&'\00E4', 300), NFD)) = 600
           AS test_large_decomp;

4. CODE COVERAGE ANALYSIS
-------------------------

4.1 Overall Coverage (modified lines only)

95.4% (146 / 153 lines) — measured on unicode_norm.c.
GenerateSparseArray.pm, PrettyLine.pm, generate-unicode_norm_table.pl,
and unicode_norm_table.h are build-time artifacts; runtime coverage is
not applicable.

4.2 Uncovered Code Risk Assessment

The 7 uncovered lines fall into two groups:

- Lines 337, 414, 550: reachable, but only with edge-case inputs.
  Tests have been added (see section 3.2).
- Lines 476-480: OOM path, unreachable in BACKEND. The sole FRONTEND
  production caller (saslprep.c) handles NULL correctly.

No uncovered code presents a functional or security risk.

5. COMMIT ANALYSIS
------------------

4 commits (2 preparatory + 2 core):

Commit  Area           Files  +/-           Key Content
-------------------------------------------------------------------------------
1       Tools          1      +132/-0       PrettyLine.pm - line formatter
2       Tools          1      +374/-0       GenerateSparseArray.pm - sparse array generator
3       Core           7      +6485/-12534  Sparse array tables + lookup, remove perfect hash
4       Refactor+Perf  3      +3746/-3223   unicode_normalize() rewrite, compat decomp separation

Net change: -5,020 lines. unicode_norm_hashfunc.h deleted (-3,025
lines); unicode_norm_table.h compacted from ~19K to ~9K lines.

6. RECOMMENDATIONS
------------------

1. Add tests for lines 337, 414, and 550 to the regression suite.
   Sample SQL provided in section 3.2 — all three verified correct.
2. Revisit the zero value semantics in GenerateSparseArray.pm (see
   section 2.2 #1). The current code is correct, but the "0 means not
   found" convention is implicit. Consider #define EMPTY 0xFFFF before
   adding more callers.
3. Review FRONTEND callers of unicode_normalize() to confirm that a
   NULL return is handled correctly (see section 2.2 #2).

7. CONCLUSION
-------------

The patch is a well-scoped improvement that closes a gap left open in
2020: the FRONTEND path was kept on bsearch to avoid inflating libpq,
while the BACKEND used a perfect hash. The sparse array is smaller than
the hash table yet faster than bsearch, making it viable for both
contexts.

The only item worth discussing before commit is whether to include the
three new regression tests from section 3.2. Everything else is clean.
coverage.tgz
Description: GNU Zip compressed data
