Re: [PR] Opennlp 1850 whitespace utf normalizae (opennlp)

via GitHub Fri, 19 Jun 2026 12:16:50 -0700


krickert commented on PR #1101:
URL: https://github.com/apache/opennlp/pull/1101#issuecomment-4754147085


   Nice! Thanks for the fast feedback — I knew there was a lot packed in here 
and totally expected
   the questions. This code had caffeine behind it plus some LLM-assisted 
review passes, but I read every token it produced and I own all of it. 
   
   I'll work through everything after I digest, but here's where my head's at 
on the big ones:
   
   ## Splitting it up — I'd vote two stacks, not stacking onto the other PRs
   
   Agreed it's too much for one review. Rather than stack onto the in-flight 
PRs (this stuff is
   fairly self-contained and would just tangle with those), I'd propose 
splitting **this** along its
   natural dependency seam into two stacked PRs:
   
   **Stack 1 — Unicode normalization foundation** (the original OPENNLP-1850 
scope):
   `CharClass` / `CodePointSet`, the `White_Space` / `Dash` sets, the 
`CharSequenceNormalizer` family
   (NFC, NFKC, whitespace, dash, case-fold, accent-fold, quotes, digits, …), 
`NormalizedText` +
   `OffsetMap`, `TextNormalizer` / `TextAnalyzer`, plus the DL 
`InferenceOptions` whitespace/dash
   opt-ins. No segmentation, no perf engine — lowest risk, lands first.
   
   **Stack 2 — UAX #29 tokenizer + the advanced layers** (builds on Stack 1):
   `WordSegmenter` / `WordTokenizer` / `WordType` and the boundary engine (this 
is where the perf
   work and the transition/lookup tables live), the layered `Term` model, 
confusable folding
   (UTS #39), and the per-language profiles — all of which sit on top of Stack 
1's normalizers + the
   tokenizer.
   
   That contains the perf/lookup-table conversation entirely in Stack 2 and 
lets the foundation land
   clean.  Is that ok?
   
   ## "Take the lookup tables out of the API" — yep, 100% agreed
   
   Good catch and an easy yes. `opennlp-api` should stay a thin contract 
module. I'll push the
   data/engine-bearing classes (`CharClass`, `CodePointSet`, 
`UnicodeWhitespace`, `UnicodeDash`) down
   into `opennlp-runtime` and leave only the contracts in the API — 
`CharSequenceNormalizer`, and the
   lightweight value types (`NormalizedText` / `OffsetMap`) if we still want 
them reachable from the
   DL side, since those carry zero tables. The UAX #29 tables and the 
confusables data already live in
   `opennlp-runtime`, so nothing table-shaped is left in the API once this 
moves.
   
   Key point though: this is a **module-placement** fix, not an API redesign. 
The interface shape
   itself I think is sound (more below) - we're just relocating the 
implementation.
   
   ## The speed overlap, and the shape
   
   At first as designed I discovered performance concerns which drove a lot of 
the approach, so let me just put the numbers down. Measured with the project's 
own JMH harness (2 forks, warmed up), against **Lucene 10.3.2** 
`StandardTokenizer` and today's OpenNLP `SimpleTokenizer`, on a Latin corpus 
(Mchars/s):
   
   | tokenizer                         | Mchars/s | vs OpenNLP today | vs 
Lucene 10 |
   | --------------------------------- | -------- | ---------------- | 
------------ |
   | OpenNLP `SimpleTokenizer` (today) | 282      | 1.00×            | 0.88×    
    |
   | Lucene 10.3.2 `StandardTokenizer` | 320      | 1.13×            | 1.00×    
    |
   | **new — boundary scan**           | **388**  | **1.38×**        | 
**1.21×**    |
   | **new — streaming tokenize**      | **335**  | **1.19×**        | 
**1.05×**    |
   
   (On the CJK/emoji-heavy corpus the boundary scan is ~parity with Lucene and 
still ~1.7× over `SimpleTokenizer`.)
   
   Two things I'd flag:
   
   - OpenNLP's tokenizer is currently *behind* Lucene; this puts it *ahead* — 
while also being
     current-Unicode-correct (17.0) and offset-preserving, which 
`StandardTokenizer` isn't doing for
     free.
   - The boundary output is byte-identical before and after the perf work — 
1944/1944 on the official
     UAX #29 conformance suite the whole way through. The speed comes from a 
transition table that's
     *derived from* the readable rule cascade at class-load, so the rules stay 
the source of truth and
     the table can't drift from them.
   
   On the overlap worry: this is tokenizer-algorithm perf, which is orthogonal 
to the ME thread-safety / allocation work — different code, complementary in 
spirit.  Since there's no collision I think it's OK.
   
   And on shape: the public surface (the `Tokenizer` interface impl, the 
streaming `TokenHandler`, the `Term` / `Dimension` layering) is independent of 
the engine internals — we could return or even swap the boundary engine without 
touching the interface. So I'd like to keep the interface as-is and treat the 
engine/tables as the thing we iterate on. Is that ok?
   
   The Lucene head-to-head benchmark isn't meant to be a competition - but a 
baselike.  I can add it to the benchmark though but don't want to complicate 
the build. No problem checking it in temporarily if you want to check it out 
though.
   
   Lookin' forward to some feedback. Thanks again for the quick turnaround!  
This was honestly a lot of fun.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Opennlp 1850 whitespace utf normalizae (opennlp)

Reply via email to