Hrayr Matevosyan created OPENNLP-1563:
-----------------------------------------

             Summary: SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
                 Key: OPENNLP-1563
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1563
             Project: OpenNLP
          Issue Type: Bug
          Components: Tokenizer
    Affects Versions: 2.3.3
            Reporter: Hrayr Matevosyan

The tokenizePos implementation of SimpleTokenizer incorrectly tokenizes words containing non-spacing letters, such as Arabic diacritics. For example, the Arabic word "طُوّر" is split into its individual characters, ["ط", "ُ", "و", "ّ", "ر"], instead of being returned as a single token.
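
To illustrate the reported behaviour without depending on OpenNLP, here is a minimal, self-contained sketch of a character-class tokenizer. It is not the SimpleTokenizer source; the class name MarkAwareTokenizerSketch, the CharClass enum, and the marksAsLetters flag are all hypothetical and exist only for this example. The assumption is that a split like the one above can arise when characters in the Unicode category "non-spacing mark" (Mn) are not classified as alphabetic, since Character.isLetter returns false for them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not OpenNLP code: a tiny character-class tokenizer.
class MarkAwareTokenizerSketch {

    enum CharClass { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

    // Classify one code point. When marksAsLetters is true, non-spacing
    // marks (Unicode category Mn) are grouped with letters.
    static CharClass classify(int codePoint, boolean marksAsLetters) {
        if (Character.isWhitespace(codePoint)) {
            return CharClass.WHITESPACE;
        }
        if (Character.isLetter(codePoint)) {
            return CharClass.ALPHABETIC;
        }
        if (marksAsLetters
                && Character.getType(codePoint) == Character.NON_SPACING_MARK) {
            return CharClass.ALPHABETIC;
        }
        if (Character.isDigit(codePoint)) {
            return CharClass.NUMERIC;
        }
        return CharClass.OTHER;
    }

    // Start a new token whenever the character class changes; every
    // OTHER character becomes a token of its own.
    static List<String> tokenize(String text, boolean marksAsLetters) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start index of the current token, -1 if none
        CharClass state = CharClass.WHITESPACE;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            CharClass cls = classify(cp, marksAsLetters);
            if (cls != state || cls == CharClass.OTHER) {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                }
                start = (cls == CharClass.WHITESPACE) ? -1 : i;
            }
            state = cls;
            i += Character.charCount(cp);
        }
        if (start >= 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The word from the report, written with escapes:
        // TAH, DAMMA (Mn), WAW, SHADDA (Mn), REH
        String word = "\u0637\u064F\u0648\u0651\u0631";
        System.out.println(tokenize(word, false)); // marks treated as OTHER
        System.out.println(tokenize(word, true));  // marks treated as letters
    }
}
```

With marksAsLetters set to false, the sketch yields the same five fragments listed in the report; with it set to true, the word comes back as one token. Whether a fix in OpenNLP should take this exact shape is an open question; the sketch only shows that classifying Mn characters together with letters is enough to keep such words intact.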