Re: [PR] OPENNLP-1832: Add SymSpell-based SpellChecker component (opennlp)

via GitHub Mon, 01 Jun 2026 22:38:35 -0700


rzo1 commented on PR #1057:
URL: https://github.com/apache/opennlp/pull/1057#issuecomment-4599026521


   > How do you see this fitting into the typical OpenNLP pipeline?
   
   I think there are some concrete use cases for spell correction in OpenNLP 
pipelines:
   
   - Noisy text upstream of classification: doccat / sentiment / langdetect on 
social media data, product reviews, support tickets, chat logs, where typos and 
merged/split words are common.
   - Query correction for OpenNLP-driven query understanding (NER on queries, 
intent doccat). Short queries are typo-heavy and the win is mostly at the token 
level.
   - Per-token correction / lookup before NER.
   - OCR / ASR post-processing before NER or relation extraction (this is the 
use case I am coming from): single-character errors and 
missing/inserted spaces are exactly SymSpell's target shape.
   - Wrapping a noisy web-crawled corpus's ObjectStream<String> with 
SpellCorrectingObjectStream (token-count preserving, so parallel annotations 
stay aligned) when building training material for tokenizer / POS / NER.
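
The token-count-preserving wrapping in the last bullet could look roughly like the sketch below. It is not the PR's SpellCorrectingObjectStream: `LineStream` is a hypothetical stand-in for `opennlp.tools.util.ObjectStream<String>` so the example runs without the OpenNLP jar, and a toy lookup map stands in for SymSpell. All class and method names here are assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class TokenPreservingCorrectionSketch {

    /** Hypothetical stand-in for opennlp.tools.util.ObjectStream<String>. */
    interface LineStream {
        String read(); // returns null once the stream is exhausted
    }

    /**
     * Corrects each space-separated token independently. A replacement that
     * would introduce a space is rejected, so the token count never changes.
     * The map is a toy lookup (assumption), not a SymSpell dictionary.
     */
    static String correctLine(String line, Map<String, String> corrections) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : line.split(" ")) {
            String fixed = corrections.getOrDefault(token, token);
            out.add(fixed.contains(" ") ? token : fixed);
        }
        return out.toString();
    }

    /** Wraps another stream and corrects every line it yields. */
    static final class CorrectingStream implements LineStream {
        private final LineStream source;
        private final Map<String, String> corrections;

        CorrectingStream(LineStream source, Map<String, String> corrections) {
            this.source = source;
            this.corrections = corrections;
        }

        @Override
        public String read() {
            String line = source.read();
            return line == null ? null : correctLine(line, corrections);
        }
    }

    /** Builds an in-memory stream over a fixed list of lines. */
    static LineStream of(List<String> lines) {
        Iterator<String> it = lines.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        Map<String, String> toyDictionary =
                Map.of("teh", "the", "borwn", "brown", "alot", "a lot");
        LineStream corrected = new CorrectingStream(
                of(List.of("teh quick borwn fox", "thanks alot")), toyDictionary);
        for (String line = corrected.read(); line != null; line = corrected.read()) {
            System.out.println(line);
        }
    }
}
```

Because every token is replaced one-for-one, a parallel stream of POS tags or NER spans indexed by token position still lines up with the corrected text. The cost is that split-word fixes such as "alot" to "a lot" are deliberately skipped in this mode.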
   
   I think there are some additional use cases as well, which might benefit 
from spell correction streams in OpenNLP itself. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
