krickert commented on PR #1103:
URL: https://github.com/apache/opennlp/pull/1103#issuecomment-4780105455

   
   @rzo1 Thanks — both of your points on the foundation are addressed.
   
   **Split into 1a + 1b (done).** I split the foundation along the history 
exactly where you suggested:
   
   - **#1108 — engine (1a):** 
`CharClass`/`CodePointSet`/`UnicodeWhitespace`/`UnicodeDash`, the 
per-code-point rungs, `Dimension`, the non-aligned `TextNormalizer`, and 
`confusables.txt` with all its `LICENSE`/`NOTICE`/`rat-excludes` bookkeeping. 
Mostly mechanical substitution, and where the license review belongs.
   - **#1109 — offset/alignment layer (1b):** `Alignment`, `AlignedText`, 
`OffsetAwareNormalizer`, `buildAligned()`, the `*Aligned` `CharClass` variants, 
and the dense span-mapping tests (binary-search mapping, expansion/deletion 
edge cases). The conceptually hard ~800 lines, isolated for a focused read.
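
For readers new to the 1b layer, the binary-search mapping mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: `OffsetMapSketch`, its fields, and its segment model are hypothetical and are not the `Alignment` API in #1109. Each segment records where it starts in the normalized and the original text; an expansion (one original character producing several normalized ones) collapses to a single original offset, and a deletion appears as a jump in the original offsets.

```java
import java.util.Arrays;

/** Hypothetical sketch of normalized-to-original offset mapping; not the #1109 code. */
final class OffsetMapSketch {
  private final int[] normStarts;   // strictly increasing, first entry 0
  private final int[] origStarts;   // original offset where each segment begins
  private final boolean[] oneToOne; // true: offsets advance in lockstep; false: expansion

  OffsetMapSketch(int[] normStarts, int[] origStarts, boolean[] oneToOne) {
    this.normStarts = normStarts.clone();
    this.origStarts = origStarts.clone();
    this.oneToOne = oneToOne.clone();
  }

  /** Maps an offset in the normalized text back to an offset in the original text. */
  int toOriginal(int normOffset) {
    int i = Arrays.binarySearch(normStarts, normOffset);
    if (i < 0) {
      // Not an exact segment start: binarySearch returned -(insertionPoint) - 1,
      // so the segment containing the offset is insertionPoint - 1.
      i = -i - 2;
    }
    if (i < 0) {
      throw new IndexOutOfBoundsException("offset before start: " + normOffset);
    }
    // Inside an expansion every normalized offset maps to the segment's original start.
    return oneToOne[i] ? origStarts[i] + (normOffset - normStarts[i]) : origStarts[i];
  }
}
```

As a usage example, take an original string made of the ligature U+FB01, two spaces, and `x`, normalized to `fi x`. The segments would be `{0, 2, 3}` (normalized starts), `{0, 1, 3}` (original starts), and `{false, true, true}`: the first segment is the ligature expansion, and the jump from 1 to 3 in the original starts reflects the deleted second space.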
   
   `#1104` (tokenizer) is now based on `#1109`, so the stack is **1a → 1b → 
tokenizer → DL → docs**, each well under your target of ~1.5k lines of real 
code, and the 10k-line `confusables.txt` data file is contained in 1a. I closed 
`#1103` with a pointer to the two replacements.
   
   **Static-initializer resource loading (done, and generalized).** Agreed on 
the rule. All three bundled-data loaders that did classpath I/O in a `static 
{}` block now load lazily on first use through a double-checked accessor, so a 
resource the loader can't see surfaces as a catchable exception at call time 
rather than an `ExceptionInInitializerError` that poisons the class:
   
   - `Confusables` (1a / #1108)
   - `WordBreakProperty` and `ExtendedPictographic` (tokenizer / #1104) — the 
latter two wrap their tables in a small immutable holder loaded via the same 
pattern.
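
The lazy-loading pattern described above might look roughly like the sketch below. It is a minimal illustration, not the code in #1108 or #1104: the class name `LazyTableSketch`, the resource name, and the `List<String>` table type are all assumptions. The point it shows is that a missing resource raises an ordinary exception from the accessor, whereas a failure inside a `static {}` block would leave the class in an erroneous state, with later uses failing with `NoClassDefFoundError`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical lazily loaded data holder; names are illustrative only. */
final class LazyTableSketch {
  // Assumed resource name, resolved relative to this class's package.
  private static final String RESOURCE = "hypothetical-table.txt";

  // volatile is what makes double-checked locking safe under the Java memory model.
  private static volatile List<String> table;

  private LazyTableSketch() { }

  /** Returns the table, loading it on first use; throws if the resource is missing. */
  static List<String> get() {
    List<String> local = table;
    if (local == null) {
      synchronized (LazyTableSketch.class) {
        local = table;
        if (local == null) {
          local = load();
          table = local; // published only after a successful load
        }
      }
    }
    return local;
  }

  private static List<String> load() {
    try (InputStream in = LazyTableSketch.class.getResourceAsStream(RESOURCE)) {
      if (in == null) {
        throw new UncheckedIOException(new IOException("resource not found: " + RESOURCE));
      }
      String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
      return text.lines().collect(Collectors.toUnmodifiableList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because a failed load never assigns the field, a caller can catch the exception and retry later (for example after fixing the classpath), and the class itself stays usable.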
   
   The `List.of(...)` static blocks in `UnicodeWhitespace`/`UnicodeDash` are 
left as-is (no I/O, no classloader risk), as you noted.
   
   Each layer builds and tests green on its own (`mvn -pl … -am verify`, plus 
checkstyle + forbiddenapis across the full reactor).
   

