Hrayr Matevosyan created OPENNLP-1563:
-----------------------------------------

             Summary: SimpleTokenizer.tokenizePos incorrectly splits words with 
non-spacing letters
                 Key: OPENNLP-1563
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1563
             Project: OpenNLP
          Issue Type: Bug
          Components: Tokenizer
    Affects Versions: 2.3.3
            Reporter: Hrayr Matevosyan


The tokenizePos implementation of SimpleTokenizer incorrectly tokenizes words 
containing non-spacing letters (combining marks such as Arabic diacritics). 
For example, the Arabic word "طُوّر" is split into individual characters 
["ط", "ُ", "و", "ّ", "ر"] instead of being returned as a single token.
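
To illustrate the behaviour being asked for, here is a minimal, self-contained 
sketch of a character-class tokenizer that keeps non-spacing marks (Unicode 
category Mn) attached to the token they follow. This is not the OpenNLP 
implementation and not a proposed patch: the class MarkAwareTokenizerSketch, 
the CharClass enum and the tokenize method are hypothetical names used only 
for this example, and it returns strings rather than Span offsets. It relies 
only on java.lang.Character from the JDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, NOT OpenNLP code: a simple character-class tokenizer
// that does not split a token at a non-spacing mark.
class MarkAwareTokenizerSketch {

    // Coarse character classes (assumed for this example).
    enum CharClass { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

    static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.ALPHABETIC;
        if (Character.isDigit(c)) return CharClass.NUMERIC;
        return CharClass.OTHER;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;                       // start of current token, -1 if none
        CharClass state = CharClass.WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // A non-spacing mark (e.g. Arabic damma U+064F, shadda U+0651)
            // modifies the preceding character, so it stays in the open token.
            boolean isMark = Character.getType(c) == Character.NON_SPACING_MARK;
            if (isMark && start != -1) {
                continue;
            }
            CharClass cls = classify(c);
            if (cls == CharClass.WHITESPACE) {
                if (start != -1) {            // whitespace closes the open token
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start == -1) {
                start = i;                    // first character of a new token
            } else if (cls != state || cls == CharClass.OTHER) {
                // Class change (or any further OTHER character) starts a new token.
                tokens.add(text.substring(start, i));
                start = i;
            }
            state = cls;
        }
        if (start != -1) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }
}
```

With this sketch, calling MarkAwareTokenizerSketch.tokenize on the word from 
the report (written with escapes as "\u0637\u064F\u0648\u0651\u0631") yields 
one token containing all five code units, because the two marks never open a 
new token.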



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
