[ https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Zowalla resolved OPENNLP-1563.
--------------------------------------
    Fix Version/s: 2.3.4
       Resolution: Fixed

> SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
> -----------------------------------------------------------------------------
>
>                 Key: OPENNLP-1563
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1563
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Tokenizer
>    Affects Versions: 2.3.3
>            Reporter: Hrayr Matevosyan
>            Priority: Major
>             Fix For: 2.3.4
>
>
> The tokenizePos implementation of SimpleTokenizer incorrectly tokenizes
> words containing non-spacing letters. For example, the Arabic word "طُوّر"
> is split into individual characters: ["ط", "ُ", "و", "ّ", "ر"].

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
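
The behaviour described in the report can be reproduced without OpenNLP. The sketch below is not the SimpleTokenizer source and not the patch that went into 2.3.4; it is a minimal, hypothetical tokenizer that splits on changes of character class, with a flag that decides whether Unicode combining marks (such as the Arabic damma U+064F and shadda U+0651) stay attached to the token they follow. The class name, the CharClass enum, and the attachMarks flag are all assumptions made for illustration. The word is written with escapes on the assumption that it consists of the five code points listed in the report.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and structure are assumptions, not OpenNLP code.
final class MarkAwareTokenizerSketch {

    // Coarse character classes used only by this sketch.
    enum CharClass { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

    static CharClass classify(int codePoint) {
        if (Character.isWhitespace(codePoint)) return CharClass.WHITESPACE;
        if (Character.isLetter(codePoint)) return CharClass.ALPHABETIC;
        if (Character.isDigit(codePoint)) return CharClass.NUMERIC;
        return CharClass.OTHER; // combining marks end up here: they are not letters
    }

    // True for Unicode general categories Mn, Mc and Me.
    static boolean isCombiningMark(int codePoint) {
        int type = Character.getType(codePoint);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Splits text wherever the character class changes. Every OTHER character
    // becomes its own token. With attachMarks set, a combining mark that
    // follows an open token is kept inside that token.
    static List<String> tokenize(String text, boolean attachMarks) {
        List<String> tokens = new ArrayList<>();
        int start = -1;                          // start of the open token, -1 if none
        CharClass current = CharClass.WHITESPACE;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            boolean attaches = attachMarks && start >= 0 && isCombiningMark(cp);
            if (!attaches) {
                CharClass cls = classify(cp);
                if (cls != current || cls == CharClass.OTHER) {
                    if (start >= 0) tokens.add(text.substring(start, i));
                    start = (cls == CharClass.WHITESPACE) ? -1 : i;
                    current = cls;
                }
            }
            i += Character.charCount(cp);
        }
        if (start >= 0) tokens.add(text.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed spelling of the word from the report:
        // tah, damma (Mn), waw, shadda (Mn), reh.
        String word = "\u0637\u064F\u0648\u0651\u0631";
        System.out.println(tokenize(word, false).size()); // prints 5
        System.out.println(tokenize(word, true).size());  // prints 1
    }
}
```

With attachMarks set to false, each mark is classified as OTHER and breaks the word apart, which matches the five-element result quoted in the report. With it set to true, the word comes back as a single token.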