Re: [PR] OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) (opennlp)

via GitHub Tue, 23 Jun 2026 06:32:38 -0700


rzo1 commented on PR #1104:
URL: https://github.com/apache/opennlp/pull/1104#issuecomment-4779749205


   Same resource-loading concern as in #1103, and here it shows up twice. Both 
`ExtendedPictographic` and `WordBreakProperty` load their bundled `.txt` data 
inside a `static {}` block:
   
   ```java
   static {
     try (InputStream in = 
WordBreakProperty.class.getResourceAsStream(RESOURCE)) {
       ...
   ```
   
   In an application container the classloader that loads these classes is 
often not the loader the app runs with (servlet containers, OSGi, 
shaded/relocated jars). When the class's loader can't see the resource, the 
static block throws and the class is permanently poisoned: 
`ExceptionInInitializerError` first, then `NoClassDefFoundError` with the 
original cause gone on every later use. And because these two feed the UAX #29 
`WordSegmenter`/`WordTokenizer`, a poisoned `WordBreakProperty` takes the whole 
tokenizer down, not just one lookup.
   
   Suggestion is the same as on the foundation PR: move the load behind a lazy, 
recoverable accessor (double-checked `properties()` that calls `load()` on 
first use) instead of an eager `static {}` block, so a failure throws a normal 
catchable exception rather than killing the class for the lifetime of the 
loader. As noted there, this is a general rule for any resource/model loading, 
not specific to these two classes. (The `getResourceAsStream` in 
`WordBoundaryConformanceTest` is test-only, so it's fine.)
   
   ### On review size
   
   Like the foundation PR, this one is also a lot for a human to QM-review in 
one pass. Of the +6,762 lines, 3,976 are bundled data (`WordBreakTest.txt`, 
`WordBreakProperty.txt`, `ExtendedPictographic.txt`), so the real code is 
~2,800 lines, but it still bundles three independent concepts that could land 
as smaller stacked PRs:
   
   - **2a, UAX #29 tokenizer.** `WordSegmenter`, `WordTokenizer`, `WordType`, 
`WordToken`, `WordBreak`, `WordBreakProperty`, `ExtendedPictographic`, the 
bundled data, and the conformance tests. Self-contained and the bulk of the 
work.
   - **2b, Term model.** `Term`, `TermAnalyzer` and their tests.
   - **2c, NormalizationProfile registry.** `NormalizationProfile`, 
`NormalizationProfiles` and their tests.
   
   Smaller PRs over one large one give the reviewer a real chance to read each 
layer rather than skim a 6.7k diff. If re-stacking is more churn than it's 
worth, at minimum the description should point reviewers past the data files to 
the ~2,800 lines that actually need eyes. Generally I'd favour the step-by-step 
route: smaller, conceptually-scoped PRs over one big one, even when they have 
to stack.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] OPENNLP-1850: UAX #29 word tokenizer and the layered Term model (2/4) (opennlp)

Reply via email to