krickert commented on PR #1003:
URL: https://github.com/apache/opennlp/pull/1003#issuecomment-4157717642
Thanks for the detailed feedback. We've addressed all four points made by
@rzo1 . Here's a summary of what changed and the JMH data behind each decision.
## 1. Benchmarks (JMH)
Replaced all hand-rolled `System.nanoTime()` benchmarks with proper JMH.
Three benchmark classes in `src/jmh/java`:
- `TokenizerMEBenchmark`
- `SentenceDetectorMEBenchmark`
- `POSTaggerMEBenchmark` (with `@Param` for cache configuration)
Also fixed the existing JMH profile - the annotation processor wasn't wired
into the compiler plugin, so the `BenchmarkList` was never generated. Added
`maven-compiler-plugin` config with `annotationProcessorPaths` to the `jmh`
profile.
### Approaches measured
| Approach | Description
| Example
|
| -------------------- |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
| --------------------------------------------------------- |
| `newInstancePerCall` | Fresh ME per operation - the traditional pattern
and baseline. Each call pays full constructor cost (BeamSearch, context
generators, feature generator chain). | `new POSTaggerME(model).tag(tokens)`
|
| `instancePerThread` | One ME per thread, reused across operations. No
cross-thread sharing, no contention. Eliminates per-call constructor overhead.
| `POSTaggerME tagger = new POSTaggerME(model);`
then reuse |
| `sharedInstance` | Single ME shared by all threads. Maximum memory
efficiency.
| Pass one instance to all threads
|
### JMH Results (32 threads, all cores)
```
Benchmark Mode Cnt Score
Error Units
TokenizerMEBenchmark.newInstancePerCall thrpt 5 570469 ± 6885
ops/s
TokenizerMEBenchmark.instancePerThread thrpt 5 576365 ± 25758
ops/s
TokenizerMEBenchmark.sharedInstance thrpt 5 570312 ± 12754
ops/s
SentenceDetectorMEBenchmark.newInstancePerCall thrpt 5 837841 ± 7903
ops/s
SentenceDetectorMEBenchmark.instancePerThread thrpt 5 853319 ± 25920
ops/s
SentenceDetectorMEBenchmark.sharedInstance thrpt 5 849994 ± 31635
ops/s
POSTaggerMEBenchmark.newInstancePerCall thrpt 5 24886 ± 2725
ops/s
POSTaggerMEBenchmark.instancePerThread thrpt 5 62727 ± 2410
ops/s
POSTaggerMEBenchmark.sharedInstance thrpt 5 61666 ± 7119
ops/s
```
Tokenizer and SentenceDetector: all approaches within error bars
(lightweight constructors).
POSTagger: **2.52x speedup** for `instancePerThread` vs `newInstancePerCall`.
## 2. Caches
We restored all caches as ThreadLocal (per-thread, not shared). Same
behavior as the originals in single-threaded use, safe under concurrency.
We also added a `contextCacheSize` parameter to `POSTaggerME` and a
`DISABLE_CACHE_PROPERTY` system property to `CachedFeatureGenerator` so the
cache impact can be measured independently via JMH `@Param`.
### JMH Cache Impact Results (POSTagger, 32 threads)
```
Benchmark (allCaches) Mode Cnt Score
Error Units
POSTaggerMEBenchmark.instancePerThread true thrpt 5 64349 ±
3216 ops/s
POSTaggerMEBenchmark.instancePerThread false thrpt 5 39702 ±
870 ops/s
POSTaggerMEBenchmark.newInstancePerCall true thrpt 5 25394 ±
2467 ops/s
POSTaggerMEBenchmark.newInstancePerCall false thrpt 5 23954 ±
2324 ops/s
POSTaggerMEBenchmark.sharedInstance true thrpt 5 64663 ±
2735 ops/s
POSTaggerMEBenchmark.sharedInstance false thrpt 5 39620 ±
1139 ops/s
```
This told us which caches matter and which don't:
| Cache | Restored as | JMH Impact
| Notes
|
| --------------------------------- | ----------- |
----------------------------------- |
--------------------------------------------------------------------------------------------------------------------------
|
| `CachedFeatureGenerator` | ThreadLocal | **1.62x** (64K vs 39K
ops/s) | Saves real work - caches outcome-independent features across
beam candidates at the same token position |
| `ConfigurablePOSContextGenerator` | ThreadLocal | **None** (65K vs 64K,
within error) | Cache key includes prior tags, which differ per beam candidate
- near-zero hit rate |
| `BeamSearch.contextsCache` | ThreadLocal | **N/A**
| Every caller in the codebase passes `cacheSize=0`. Never enabled
for any ME class. Restored for API backward compatibility |
### Regarding the BeamSearch cache specifically
> you note this was buggy (stale references from the shared probs[] buffer).
That may be valid, but removing it rather than fixing it (e.g., storing copies)
conflates a bug fix with the thread-safety change.
We restored it as ThreadLocal with per-thread `probs[]` buffers, which fixes
the stale-reference issue. However, we also checked every `new BeamSearch(...)`
call in the codebase - every single one passes `cacheSize=0` (either via the
2-arg constructor or explicitly). The cache has never been enabled by any
caller in the project's history. We kept the 3-arg constructor for external API
compatibility.
## 3. Thread-safety tests
Addressed all sub-points:
- **Contention forcing**: All tests now use `CyclicBarrier` - threads wait
at the barrier before starting, ensuring they hit the critical section
simultaneously.
- **LemmatizerME**: Added `sharedLemmatizerProducesCorrectResults()` test.
- **Thread/iteration counts**: `Math.max(8, availableProcessors())` threads,
200 reps per thread.
- **probs()**: Added `probsDoesNotThrowUnderConcurrency()` test - verifies
`probs()` returns valid data (non-null, non-empty) under concurrent `tag()`
calls without throwing. The returned values are last-writer-wins by design
(documented in volatile field comments) - the core processing methods are what
we guarantee correct under concurrency.
## 4. Missing ME classes
All 7 ME classes are now covered:
| Class | Source change |
Thread-safety test |
| ------------------ | -------------------------------------- |
------------------------------------------------ |
| TokenizerME | `volatile` + method-local |
`sharedTokenizerProducesCorrectResults()` |
| SentenceDetectorME | `volatile` + method-local |
`sharedSentenceDetectorProducesCorrectResults()` |
| POSTaggerME | `volatile` + method-local + null guard |
`sharedPOSTaggerProducesCorrectResults()` |
| LemmatizerME | `volatile` + method-local |
`sharedLemmatizerProducesCorrectResults()` |
| ChunkerME | `volatile` + method-local + null guard |
`sharedChunkerProducesCorrectResults()` |
| NameFinderME | `volatile` + method-local + null guard |
`sharedNameFinderProducesCorrectResults()` |
| LanguageDetectorME | Already thread-safe (stateless) |
`sharedLangDetectorProducesCorrectResults()` |
All 7 ME classes are annotated `@ThreadSafe`.
## 5. ThreadSafe*ME wrappers deprecated
Since the ME classes are now themselves thread-safe, the `ThreadSafe*ME`
wrappers are redundant. We deprecated all 7:
- `ThreadSafeTokenizerME` → use `TokenizerME` directly
- `ThreadSafeSentenceDetectorME` → use `SentenceDetectorME` directly
- `ThreadSafePOSTaggerME` → use `POSTaggerME` directly
- `ThreadSafeLemmatizerME` → use `LemmatizerME` directly
- `ThreadSafeChunkerME` → use `ChunkerME` directly
- `ThreadSafeNameFinderME` → use `NameFinderME` directly
- `ThreadSafeLanguageDetectorME` → use `LanguageDetectorME` directly
We also replaced all internal usages of `ThreadSafe*ME` with direct ME usage:
- `Muc6NameSampleStreamFactory`: `ThreadSafeTokenizerME` → `TokenizerME`
- `TwentyNewsgroupSampleStreamFactory`: `ThreadSafeTokenizerME` →
`TokenizerME`
- `POSTaggerMEIT`: `ThreadSafeTokenizerME` / `ThreadSafePOSTaggerME` →
`TokenizerME` / `POSTaggerME`
No internal code uses the wrappers anymore.
## Open item
> a benchmark against a real-world dataset (e.g., from the eval/test
corpora) would be far more convincing
Agreed - this would strengthen the perf claims. The JMH benchmarks currently
use the project's test data (`AnnotatedSentences.txt`). We're happy to add an
eval-corpus benchmark as a follow-up, or include it in this PR if you'd prefer.
Do you have any real-world dataset tests around that we can run it against
quickly? It's the only way I'd feel confident as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]