[ https://issues.apache.org/jira/browse/OPENNLP-141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693668#comment-17693668 ]
ASF GitHub Bot commented on OPENNLP-141:
----------------------------------------

mawiesne commented on code in PR #506:
URL: https://github.com/apache/opennlp/pull/506#discussion_r1118109618


##########
opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java:
##########
@@ -25,24 +25,45 @@
 public class Factory {
 
-  public static final String DEFAULT_ALPHANUMERIC = "^[A-Za-z0-9]+$";
+  public static final Pattern DEFAULT_ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");

Review Comment:
   Others might reference this constant, which is why I left its visibility untouched. Even though the type changed to Pattern, the original form can easily be accessed via `.pattern()`, which returns it as a String.


> Tokenizers alpha numeric optimization only recognizes a-z as alpha chars
> ------------------------------------------------------------------------
>
>                 Key: OPENNLP-141
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-141
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Tokenizer
>    Affects Versions: tools-1.5.0-sourceforge
>            Reporter: Jörn Kottmann
>            Assignee: Martin Wiesner
>            Priority: Minor
>
> The Tokenizer has an optimization which skips tokens which are only made of
> numerics or alpha chars. In foreign languages the alpha chars contain umlauts
> and other letters which are not included in the a-z range.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
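
For readers following the review comment on `DEFAULT_ALPHANUMERIC`, here is a minimal sketch of how a caller that still needs the String form could recover it from the precompiled `Pattern`, and of why the ASCII-only character class misses the letters described in the issue. The class name `AlphanumericPatternSketch`, both helper methods, and the `UNICODE_ALPHANUMERIC` constant are hypothetical and are not taken from PR #506; the Unicode-aware expression is only one possible way to cover non-ASCII letters.

```java
import java.util.regex.Pattern;

public class AlphanumericPatternSketch {

    // Same regular expression as in the diff, compiled once and reused.
    static final Pattern DEFAULT_ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");

    // Assumption for illustration only (not from the PR): letters and decimal
    // digits of any script, expressed with Unicode general categories.
    static final Pattern UNICODE_ALPHANUMERIC = Pattern.compile("^[\\p{L}\\p{Nd}]+$");

    // True if the token consists only of ASCII letters and digits.
    static boolean isAsciiAlphanumeric(String token) {
        return DEFAULT_ALPHANUMERIC.matcher(token).matches();
    }

    // True if the token consists only of Unicode letters and decimal digits.
    static boolean isUnicodeAlphanumeric(String token) {
        return UNICODE_ALPHANUMERIC.matcher(token).matches();
    }

    public static void main(String[] args) {
        // pattern() returns the expression the Pattern was compiled from.
        String original = DEFAULT_ALPHANUMERIC.pattern();
        System.out.println(original);

        // "Stra\u00dfe" is "Strasse" written with a sharp s (U+00DF).
        System.out.println(isAsciiAlphanumeric("Strasse"));
        System.out.println(isAsciiAlphanumeric("Stra\u00dfe"));
        System.out.println(isUnicodeAlphanumeric("Stra\u00dfe"));
    }
}
```

Code that previously passed the String constant to `String.matches` or `Pattern.compile` would either call `.pattern()` as above or, more simply, use the `Pattern` directly through `matcher(...)`, which avoids recompiling the expression on every call.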