rzo1 commented on PR #1057: URL: https://github.com/apache/opennlp/pull/1057#issuecomment-4599026521
> How do you see this fitting into the typical OpenNLP pipeline?

I think there are some concrete use cases for spell correction in OpenNLP pipelines:

- Noisy text upstream of classification: doccat / sentiment / langdetect on social media data, product reviews, support tickets, and chat logs, where typos and merged/split words are common.
- Query correction for OpenNLP-driven query understanding (NER on queries, intent doccat). Short queries are typo-heavy, and the win is mostly at the token level.
- Per-token correction / lookup before NER.
- OCR / ASR post-processing before NER or relation extraction (this is the use case I am coming from): single-character errors and missing/inserted spaces are exactly SymSpell's target shape.
- Wrapping a noisy web-crawled corpus's ObjectStream<String> with SpellCorrectingObjectStream (token-count preserving, so parallel annotations stay aligned) when building training material for tokenizer / POS / NER.

I think there are additional use cases as well that might benefit from spell correction streams in OpenNLP itself.
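To make the "token-count preserving" point concrete, here is a minimal sketch of the idea. It is not the SpellCorrectingObjectStream from the PR: a plain `Iterator<String>` stands in for `ObjectStream<String>`, a fixed lookup map stands in for SymSpell, and the names `TokenPreservingCorrection`, `correctLine`, and `correcting` are hypothetical. The assumption is that input lines are already tokenized and separated by single spaces.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; not the API proposed in the PR. */
class TokenPreservingCorrection {

    /**
     * Corrects a space-separated line token by token.
     * Each input token yields exactly one output token, so annotations
     * indexed by token position stay aligned with the corrected text.
     */
    static String correctLine(String line, Map<String, String> corrections) {
        // limit -1 keeps trailing empty tokens, so the count is never reduced
        String[] tokens = line.split(" ", -1);
        for (int i = 0; i < tokens.length; i++) {
            String replacement = corrections.get(tokens[i]);
            // Skip replacements that would split or delete a token
            if (replacement != null
                    && !replacement.isEmpty()
                    && replacement.chars().noneMatch(Character::isWhitespace)) {
                tokens[i] = replacement;
            }
        }
        return String.join(" ", tokens);
    }

    /** Wraps a stream of lines so each line is corrected lazily on read. */
    static Iterator<String> correcting(Iterator<String> lines,
                                       Map<String, String> corrections) {
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return lines.hasNext();
            }

            @Override
            public String next() {
                return correctLine(lines.next(), corrections);
            }
        };
    }

    public static void main(String[] args) {
        // Stand-in dictionary; a real setup would query a spell checker instead
        Map<String, String> fixes = Map.of("teh", "the", "wrold", "world");
        Iterator<String> corrected =
                correcting(List.of("teh wrold", "hello teh").iterator(), fixes);
        while (corrected.hasNext()) {
            System.out.println(corrected.next());
        }
    }
}
```

The sketch refuses any replacement that contains whitespace, which is what keeps the token count fixed. Merged or split words (the OCR / ASR case above) would need a separate step that is allowed to change the token count, and so could not be used where parallel annotations must stay aligned.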
