[ https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849994#comment-17849994 ]
ASF GitHub Bot commented on OPENNLP-1563:
-----------------------------------------

rzo1 merged PR #602:
URL: https://github.com/apache/opennlp/pull/602


> SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
> -----------------------------------------------------------------------------
>
>                 Key: OPENNLP-1563
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1563
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Tokenizer
>    Affects Versions: 2.3.3
>            Reporter: Hrayr Matevosyan
>            Priority: Major
>
> The tokenizePos implementation of SimpleTokenizer incorrectly tokenizes
> words containing non-spacing letters. For example, the Arabic word "طُوّر"
> gets tokenized to individual letters ["ط", "ُ", "و", "ّ", "ر"].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
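
For illustration only, the behavior the issue asks for can be sketched in plain Java. This is not the change merged in PR #602 and not OpenNLP's SimpleTokenizer code; the class MarkAwareTokenizer and its methods isWordChar and tokenize are hypothetical names. The sketch assumes that a combining mark (for example the Arabic damma U+064F or shadda U+0651) should stay attached to the letter run around it rather than start a new token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not OpenNLP code: groups letters together with
// the combining marks that follow them.
class MarkAwareTokenizer {

    // Treat letters and Unicode combining marks (Mn, Mc, Me) as word characters.
    static boolean isWordChar(int cp) {
        int type = Character.getType(cp);
        return Character.isLetter(cp)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Split text into runs of word characters; every other non-whitespace
    // code point becomes a token of its own, and whitespace is dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start index of the current word run, or -1 if none
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isWordChar(cp)) {
                if (start < 0) {
                    start = i;
                }
            } else {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
                if (!Character.isWhitespace(cp)) {
                    tokens.add(new String(Character.toChars(cp)));
                }
            }
            i += Character.charCount(cp);
        }
        if (start >= 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The word from the issue, written with escapes:
        // tah, damma (Mn), waw, shadda (Mn), reh
        String word = "\u0637\u064F\u0648\u0651\u0631";
        List<String> tokens = tokenize(word);
        System.out.println(tokens.size() + " token(s): " + tokens);
    }
}
```

With this grouping, the five code points of the example word come back as a single token, because the two marks have general category Mn and are kept in the same run as the neighboring letters.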