krickert commented on PR #1003:
URL: https://github.com/apache/opennlp/pull/1003#issuecomment-4157717642

   Thanks for the detailed feedback. We've addressed all four points made by 
@rzo1 . Here's a summary of what changed and the JMH data behind each decision.
   
   ## 1. Benchmarks (JMH)
   
   Replaced all hand-rolled `System.nanoTime()` benchmarks with proper JMH. 
Three benchmark classes in `src/jmh/java`:
   
   - `TokenizerMEBenchmark`
   - `SentenceDetectorMEBenchmark`
   - `POSTaggerMEBenchmark` (with `@Param` for cache configuration)
   
   Also fixed the existing JMH profile - the annotation processor wasn't wired 
into the compiler plugin, so the `BenchmarkList` was never generated. Added 
`maven-compiler-plugin` config with `annotationProcessorPaths` to the `jmh` 
profile.
   
   ### Approaches measured
   
   | Approach             | Description                                         
                                                                                
                           | Example                                            
       |
   | -------------------- | 
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 | --------------------------------------------------------- |
   | `newInstancePerCall` | Fresh ME per operation - the traditional pattern 
and baseline. Each call pays full constructor cost (BeamSearch, context 
generators, feature generator chain). | `new POSTaggerME(model).tag(tokens)`    
                  |
   | `instancePerThread`  | One ME per thread, reused across operations. No 
cross-thread sharing, no contention. Eliminates per-call constructor overhead.  
                               | `POSTaggerME tagger = new POSTaggerME(model);` 
then reuse |
   | `sharedInstance`     | Single ME shared by all threads. Maximum memory 
efficiency.                                                                     
                               | Pass one instance to all threads               
           |
   
   ### JMH Results (32 threads, all cores)
   
   ```
   Benchmark                                         Mode  Cnt       Score      
 Error  Units
   TokenizerMEBenchmark.newInstancePerCall          thrpt    5  570469 ±  6885  
ops/s
   TokenizerMEBenchmark.instancePerThread           thrpt    5  576365 ± 25758  
ops/s
   TokenizerMEBenchmark.sharedInstance              thrpt    5  570312 ± 12754  
ops/s
   
   SentenceDetectorMEBenchmark.newInstancePerCall   thrpt    5  837841 ±  7903  
ops/s
   SentenceDetectorMEBenchmark.instancePerThread    thrpt    5  853319 ± 25920  
ops/s
   SentenceDetectorMEBenchmark.sharedInstance       thrpt    5  849994 ± 31635  
ops/s
   
   POSTaggerMEBenchmark.newInstancePerCall          thrpt    5   24886 ±  2725  
ops/s
   POSTaggerMEBenchmark.instancePerThread           thrpt    5   62727 ±  2410  
ops/s
   POSTaggerMEBenchmark.sharedInstance              thrpt    5   61666 ±  7119  
ops/s
   ```
   
   Tokenizer and SentenceDetector: all approaches within error bars 
(lightweight constructors).
   POSTagger: **2.52x speedup** for `instancePerThread` vs `newInstancePerCall`.
   
   ## 2. Caches
   
   We restored all caches as ThreadLocal (per-thread, not shared). Same 
behavior as the originals in single-threaded use, safe under concurrency.
   
   We also added a `contextCacheSize` parameter to `POSTaggerME` and a 
`DISABLE_CACHE_PROPERTY` system property to `CachedFeatureGenerator` so the 
cache impact can be measured independently via JMH `@Param`.
   
   ### JMH Cache Impact Results (POSTagger, 32 threads)
   
   ```
   Benchmark                                (allCaches)   Mode  Cnt      Score  
    Error  Units
   POSTaggerMEBenchmark.instancePerThread          true  thrpt    5  64349 ± 
3216  ops/s
   POSTaggerMEBenchmark.instancePerThread         false  thrpt    5  39702 ±  
870  ops/s
   POSTaggerMEBenchmark.newInstancePerCall         true  thrpt    5  25394 ± 
2467  ops/s
   POSTaggerMEBenchmark.newInstancePerCall        false  thrpt    5  23954 ± 
2324  ops/s
   POSTaggerMEBenchmark.sharedInstance             true  thrpt    5  64663 ± 
2735  ops/s
   POSTaggerMEBenchmark.sharedInstance            false  thrpt    5  39620 ± 
1139  ops/s
   ```
   
   This told us which caches matter and which don't:
   
   | Cache                             | Restored as | JMH Impact               
           | Notes                                                              
                                                        |
   | --------------------------------- | ----------- | 
----------------------------------- | 
--------------------------------------------------------------------------------------------------------------------------
 |
   | `CachedFeatureGenerator`          | ThreadLocal | **1.62x** (64K vs 39K 
ops/s)        | Saves real work - caches outcome-independent features across 
beam candidates at the same token position                    |
   | `ConfigurablePOSContextGenerator` | ThreadLocal | **None** (65K vs 64K, 
within error) | Cache key includes prior tags, which differ per beam candidate 
- near-zero hit rate                                        |
   | `BeamSearch.contextsCache`        | ThreadLocal | **N/A**                  
           | Every caller in the codebase passes `cacheSize=0`. Never enabled 
for any ME class. Restored for API backward compatibility |
   
   ### Regarding the BeamSearch cache specifically
   
   > you note this was buggy (stale references from the shared probs[] buffer). 
That may be valid, but removing it rather than fixing it (e.g., storing copies) 
conflates a bug fix with the thread-safety change.
   
   We restored it as ThreadLocal with per-thread `probs[]` buffers, which fixes 
the stale-reference issue. However, we also checked every `new BeamSearch(...)` 
call in the codebase - every single one passes `cacheSize=0` (either via the 
2-arg constructor or explicitly). The cache has never been enabled by any 
caller in the project's history. We kept the 3-arg constructor for external API 
compatibility.
   
   ## 3. Thread-safety tests
   
   Addressed all sub-points:
   
   - **Contention forcing**: All tests now use `CyclicBarrier` - threads wait 
at the barrier before starting, ensuring they hit the critical section 
simultaneously.
   - **LemmatizerME**: Added `sharedLemmatizerProducesCorrectResults()` test.
   - **Thread/iteration counts**: `Math.max(8, availableProcessors())` threads, 
200 reps per thread.
   - **probs()**: Added `probsDoesNotThrowUnderConcurrency()` test - verifies 
`probs()` returns valid data (non-null, non-empty) under concurrent `tag()` 
calls without throwing. The returned values are last-writer-wins by design 
(documented in volatile field comments) - the core processing methods are what 
we guarantee correct under concurrency.
   
   ## 4. Missing ME classes
   
   All 7 ME classes are now covered:
   
   | Class              | Source change                          | 
Thread-safety test                               |
   | ------------------ | -------------------------------------- | 
------------------------------------------------ |
   | TokenizerME        | `volatile` + method-local              | 
`sharedTokenizerProducesCorrectResults()`        |
   | SentenceDetectorME | `volatile` + method-local              | 
`sharedSentenceDetectorProducesCorrectResults()` |
   | POSTaggerME        | `volatile` + method-local + null guard | 
`sharedPOSTaggerProducesCorrectResults()`        |
   | LemmatizerME       | `volatile` + method-local              | 
`sharedLemmatizerProducesCorrectResults()`       |
   | ChunkerME          | `volatile` + method-local + null guard | 
`sharedChunkerProducesCorrectResults()`          |
   | NameFinderME       | `volatile` + method-local + null guard | 
`sharedNameFinderProducesCorrectResults()`       |
   | LanguageDetectorME | Already thread-safe (stateless)        | 
`sharedLangDetectorProducesCorrectResults()`     |
   
   All 7 ME classes are annotated `@ThreadSafe`.
   
   ## 5. ThreadSafe*ME wrappers deprecated
   
   Since the ME classes are now themselves thread-safe, the `ThreadSafe*ME` 
wrappers are redundant. We deprecated all 7:
   
   - `ThreadSafeTokenizerME` → use `TokenizerME` directly
   - `ThreadSafeSentenceDetectorME` → use `SentenceDetectorME` directly
   - `ThreadSafePOSTaggerME` → use `POSTaggerME` directly
   - `ThreadSafeLemmatizerME` → use `LemmatizerME` directly
   - `ThreadSafeChunkerME` → use `ChunkerME` directly
   - `ThreadSafeNameFinderME` → use `NameFinderME` directly
   - `ThreadSafeLanguageDetectorME` → use `LanguageDetectorME` directly
   
   We also replaced all internal usages of `ThreadSafe*ME` with direct ME usage:
   
   - `Muc6NameSampleStreamFactory`: `ThreadSafeTokenizerME` → `TokenizerME`
   - `TwentyNewsgroupSampleStreamFactory`: `ThreadSafeTokenizerME` → 
`TokenizerME`
   - `POSTaggerMEIT`: `ThreadSafeTokenizerME` / `ThreadSafePOSTaggerME` → 
`TokenizerME` / `POSTaggerME`
   
   No internal code uses the wrappers anymore.
   
   ## Open item
   
   > a benchmark against a real-world dataset (e.g., from the eval/test 
corpora) would be far more convincing
   
   Agreed - this would strengthen the perf claims. The JMH benchmarks currently 
use the project's test data (`AnnotatedSentences.txt`). We're happy to add an 
eval-corpus benchmark as a follow-up, or include it in this PR if you'd prefer.
   
   Do you have any real-world dataset tests around that we can run it against 
quickly?  It's the only way I'd feel confident as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to