krickert commented on PR #1003:
URL: https://github.com/apache/opennlp/pull/1003#issuecomment-4179934825

   All fixes are in. Here's what changed since my last comment.
   
   ## The cache issue
   
   `DefaultPOSContextGenerator`'s cache appears to be dead code in practice. 
The modern POS pipeline uses `ConfigurablePOSContextGenerator`, which wraps an 
`AdaptiveFeatureGenerator` chain. `DefaultPOSContextGenerator` is the legacy 
path, and its cache (`contextsCache`) was a per-sentence cache keyed by 
(`index`, `tokens`) identity.
   
   But more importantly, the JMH data already proves it. The `allCaches=true` 
vs `allCaches=false` param in `POSTaggerMEBenchmark` toggles 
`CachedFeatureGenerator`; that's the only cache that shows measurable impact. 
The `ConfigurablePOSContextGenerator` cache (which we did restore as 
`ThreadLocal`) showed zero impact (70K vs 70K, within error). 
`DefaultPOSContextGenerator`'s cache appears to be less useful, as it was a 
simpler version of the same thing.
   
   We kept the 2-arg constructor `DefaultPOSContextGenerator(int cacheSize, 
Dictionary dict)` with `@Deprecated(since = "3.0.0")` so external callers don't 
break, but the `cacheSize` param is ignored.
   
   I suspect the performance gains were earlier before it was thread safe?  
From the tests we have, it seemed like a good choice to remove.
   
   ## Additional thread-safety issues found and fixed
   
   After the initial round of fixes, I found 5 more stateful feature generators 
used by NameFinderME that had shared mutable state. These would have caused 
silent corruption under concurrent `find()` calls :
   
   | Feature Generator | Mutable State | Fix |
   |-------------------|--------------|-----|
   | `PreviousMapFeatureGenerator` | `Map<String,String> previousMap` | 
`ThreadLocal<Map>` |
   | `PreviousTwoMapFeatureGenerator` | `Map<String,String> previousMap` | 
`ThreadLocal<Map>` |
   | `AdditionalContextFeatureGenerator` | `String[][] currentContext` | 
`ThreadLocal<String[][]>` |
   | `DocumentBeginFeatureGenerator` | `String[] firstSentence` | 
`ThreadLocal<String[]>` |
   | `InSpanGenerator` | `String[] currentSentence`, `Span[] currentNames` | 
`ThreadLocal<CacheState>` |
   
   I have a 32 core machine, but the first run didn't error out.  Luckily, the 
second run revealed NPEs  under heavy NameFinder concurrency - the feature 
generators were sharing sentence-level state across threads.
   
   
   ## Other fixes in this push
   
   - **LemmatizerME.predictSES() null guard**: Added `if (seq == null)` guard, 
consistent with POSTaggerME, ChunkerME, and NameFinderME. Previously would NPE 
if `bestSequence()` returned null.
   
   - **System property renamed**: `opennlp.cache.disabled` → 
`opennlp.featuregen.cache.disabled` to avoid confusion with other caches 
(BeamSearch, ConfigurablePOS).
   
   - **ThreadLocal lifecycle documentation**: Added javadoc notes to all ME 
classes and key infrastructure classes (BeamSearch, CachedFeatureGenerator, 
ConfigurablePOSContextGenerator) warning about classloader pinning in container 
environments (Jakarta EE). Example:
   
     > **Note:** In container environments with classloader isolation (e.g. 
Jakarta EE), ensure instances do not outlive the application's lifecycle, as 
underlying components use ThreadLocal state that may pin the classloader.
   
   - **BeamSearch compile fix**: The `prev` queue variable was incorrectly 
marked `final` despite being reassigned in the queue-swap loop. Removed `final`.
   
   ## Updated JMH Results (32 threads)
   
   Re-ran all benchmarks after the fixes:
   Benchmark | Mode | Cnt | Score | Error | Units
   -- | -- | -- | -- | -- | --
   TokenizerMEBenchmark.newInstancePerCall | thrpt | 5 | 568968 | ± 15154 | 
ops/s
   TokenizerMEBenchmark.instancePerThread | thrpt | 5 | 575028 | ± 20304 | ops/s
   TokenizerMEBenchmark.sharedInstance | thrpt | 5 | 569963 | ± 24899 | ops/s
   SentenceDetectorMEBenchmark.newInstancePerCall | thrpt | 5 | 842962 | ± 
21355 | ops/s
   SentenceDetectorMEBenchmark.instancePerThread | thrpt | 5 | 862440 | ± 32795 
| ops/s
   SentenceDetectorMEBenchmark.sharedInstance | thrpt | 5 | 865518 | ± 16740 | 
ops/s
   POSTaggerMEBenchmark.newInstancePerCall | thrpt | 5 | 25828 | ± 3883 | ops/s
   POSTaggerMEBenchmark.instancePerThread | thrpt | 5 | 70077 | ± 2353 | ops/s
   POSTaggerMEBenchmark.sharedInstance | thrpt | 5 | 70556 | ± 1704 | ops/s
   
   POSTagger: **2.71x speedup** for reuse vs new-instance-per-call (70K vs 
25.8K ops/s). Tokenizer and SentenceDetector remain within error bars.
   
   ### Cache impact (POSTagger, 32 threads)
   
   Benchmark | allCaches | Mode | Cnt | Score | Error | Units
   -- | -- | -- | -- | -- | -- | --
   POSTaggerMEBenchmark.instancePerThread | true | thrpt | 5 | 70077 | ± 2353 | 
ops/s
   POSTaggerMEBenchmark.instancePerThread | false | thrpt | 5 | 42094 | ± 889 | 
ops/s
   POSTaggerMEBenchmark.newInstancePerCall | true | thrpt | 5 | 25828 | ± 3883 
| ops/s
   POSTaggerMEBenchmark.newInstancePerCall | false | thrpt | 5 | 23510 | ± 5164 
| ops/s
   POSTaggerMEBenchmark.sharedInstance | true | thrpt | 5 | 70556 | ± 1704 | 
ops/s
   POSTaggerMEBenchmark.sharedInstance | false | thrpt | 5 | 41764 | ± 1783 | 
ops/s
   
   CachedFeatureGenerator: **1.67x** impact (70K vs 42K). Confirmed it's doing 
real work.
   
   ## Summary of all changes
   
   **Thread-safe source files (18):**
   - 7 ME classes: TokenizerME, SentenceDetectorME, POSTaggerME, LemmatizerME, 
ChunkerME, NameFinderME, LanguageDetectorME
   - 3 context generators: DefaultSDContextGenerator, 
ConfigurablePOSContextGenerator, DefaultPOSContextGenerator
   - 6 feature generators: CachedFeatureGenerator, PreviousMapFeatureGenerator, 
PreviousTwoMapFeatureGenerator, AdditionalContextFeatureGenerator, 
DocumentBeginFeatureGenerator, InSpanGenerator
   - 1 search: BeamSearch
   - 1 subclass update: SentenceContextGenerator (Thai)
   
   **Deprecated (7):** All ThreadSafe*ME wrappers
   
   **Tests:** 8 JUnit tests covering all 7 ME classes + probs() concurrency, 3 
JMH benchmarks
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to