krickert commented on PR #1003:
URL: https://github.com/apache/opennlp/pull/1003#issuecomment-4269034171

   Thanks for the careful review - all three items are now fixed on 
`feature/thread-safe-me` and pushed. Summary below.
   
   ### 1. `POSTaggerNameFeatureGenerator` - race + potential `AIOOBE` (P0)
   
   **Confirmed.** The shared `cachedTokens` / `cachedTags` instance fields 
could be torn between threads, leading to either stale tags or 
`ArrayIndexOutOfBoundsException` when one thread shrank the array while another 
was indexing into it.
   
   **Fix:** moved the per-sentence cache into a per-thread `CacheState` held in 
a `ThreadLocal`, so each thread sees its own `(tokens, tags)` pair. Class is 
now annotated `@ThreadSafe` and the Javadoc documents the contract.
   
   ```java
   private static final class CacheState {
       String[] cachedTokens;
       String[] cachedTags;
   }
   
   private final ThreadLocal<CacheState> threadState =
           ThreadLocal.withInitial(CacheState::new);
   
   @Override
   public void createFeatures(List<String> feats, String[] toks, int index,
                              String[] preds) {
       final CacheState state = threadState.get();
       if (!Arrays.equals(state.cachedTokens, toks)) {
           state.cachedTokens = toks;
           state.cachedTags = posTagger.tag(toks);
       }
       feats.add("pos=" + state.cachedTags[index]);
   }
   ```
   
   Added 
`POSTaggerNameFeatureGeneratorTest#testConcurrentCreateFeaturesIsThreadSafe` - 
16 threads × 5k iterations across mixed-length sentences, asserts no exceptions 
and consistent results.
   
   Commit: `cb4d7f2e`
   
   ### 2. `DictionaryFeatureGenerator` - `isg` publication
   
   You were right that this is *almost* a write-once field, but the setter 
(`setDictionary`) is public, so a late reconfiguration on one thread could race 
with `createFeatures` on another. Cheap to make correct:
   
   - `isg` field is now `volatile`
   - `createFeatures` reads it once into a local before delegating
   - Class annotated `@ThreadSafe`, Javadoc clarifies the late-binding semantics
   
   Commit: `c6274c46`
   
   ### 3. `NameFinderME` - nested `ThreadLocal` overhead
   
   Confirmed and simplified. The previous design stored a fresh 
`AdditionalContextFeatureGenerator` per thread inside `NameFinderState`; since 
AFG itself owns a `ThreadLocal<String[][]>`, each AFG was only ever touched by 
one thread, making its inner `ThreadLocal` pure overhead.
   
   Now there is **one shared `AdditionalContextFeatureGenerator`** for the 
`NameFinderME` instance, and the per-call `additionalContext` is set/cleared 
via `clearForCurrentThread()` around `find(...)`. The per-thread `bestSequence` 
slot moved to `LastResultOwnerOrThreadLocal`, matching the pattern used by 
`POSTaggerME` / `ChunkerME`. Net effect: one fewer allocation per `find()` 
call, one fewer `ThreadLocal` lookup per token, and a smaller object graph per 
thread.
   
   Commit: `19a2ac68`
   
   ### Verification
   
   - `mvn -pl opennlp-core/opennlp-runtime verify` - green, including:
     - `ThreadSafetyBenchmarkIT` (POS / Chunker / NameFinder / Lemmatizer 
concurrent equivalence)
     - new 
`POSTaggerNameFeatureGeneratorTest#testConcurrentCreateFeaturesIsThreadSafe`
   - 32-thread × 200-iteration ad-hoc stress run on `NameFinderME` and 
`POSTaggerME`: 1 unique output per scenario, identical to the single-thread 
baseline.
   
   
   Again, thanks again for the review.  I'm going to use this for a ton of 
docs, so the performance gains are going to be seen on day one for me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to