[ https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850101#comment-17850101 ]
ASF GitHub Bot commented on OPENNLP-1563:
-----------------------------------------

jzonthemtn commented on PR #602:
URL: https://github.com/apache/opennlp/pull/602#issuecomment-2135699051

   Thanks @demq!


> SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
> -----------------------------------------------------------------------------
>
>                 Key: OPENNLP-1563
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1563
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Tokenizer
>    Affects Versions: 2.3.3
>            Reporter: Hrayr Matevosyan
>            Priority: Major
>             Fix For: 2.3.4
>
>
> The tokenizePos implementation of SimpleTokenizer incorrectly tokenizes
> words containing non-spacing letters. For example, the Arabic word "طُوّر"
> gets tokenized to individual letters ["ط", "ُ", "و", "ّ", "ر"].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
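
For readers who want to reproduce the behaviour described in the issue, here is a minimal, self-contained sketch. It is not OpenNLP's SimpleTokenizer code and not the change made in PR #602; the class name MarkAwareTokenizerSketch and its methods are hypothetical. It assumes the split arises because combining marks such as U+064F and U+0651 are not letters according to Character.isLetter, so a tokenizer that breaks on every change of character class isolates them. The mark-aware mode shows one plausible remedy: keeping code points of type Character.NON_SPACING_MARK attached to the token that precedes them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of OpenNLP. */
final class MarkAwareTokenizerSketch {

    enum Kind { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

    /** Classifies a code point using only the basic java.lang.Character checks. */
    static Kind kindOf(int cp) {
        if (Character.isWhitespace(cp)) return Kind.WHITESPACE;
        if (Character.isLetter(cp)) return Kind.ALPHABETIC;
        if (Character.isDigit(cp)) return Kind.NUMERIC;
        return Kind.OTHER; // combining marks end up here, since they are not letters
    }

    /**
     * Splits text on whitespace and on every change of character class.
     * With markAware = true, a non-spacing mark stays in the current token.
     */
    static List<String> tokenize(String text, boolean markAware) {
        List<String> tokens = new ArrayList<>();
        int start = -1;                 // start index of the open token, -1 if none
        Kind current = Kind.WHITESPACE; // class of the open token
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int width = Character.charCount(cp);
            boolean isMark = Character.getType(cp) == Character.NON_SPACING_MARK;
            if (markAware && isMark && start >= 0) {
                i += width;             // keep the mark attached to its base character
                continue;
            }
            Kind kind = kindOf(cp);
            if (kind == Kind.WHITESPACE) {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
                current = kind;
            } else if (kind != current || kind == Kind.OTHER) {
                tokens.add(text.substring(start, i));
                start = i;
                current = kind;
            }
            i += width;
        }
        if (start >= 0) tokens.add(text.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        // The Arabic word from the issue, written with escapes to avoid encoding problems.
        String word = "\u0637\u064F\u0648\u0651\u0631";
        System.out.println(tokenize(word, false).size()); // naive mode
        System.out.println(tokenize(word, true).size());  // mark-aware mode
    }
}
```

In this sketch the naive mode yields five one-character tokens for the word from the issue, matching the reported output, while the mark-aware mode keeps the word as a single token.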