[ https://issues.apache.org/jira/browse/OPENNLP-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794561#comment-17794561 ]
ASF GitHub Bot commented on OPENNLP-1479:
-----------------------------------------

rzo1 commented on PR #559:
URL: https://github.com/apache/opennlp/pull/559#issuecomment-1846694280

   > Is there a spec for this behavior?

   The Penn Treebank guidelines suggest tokenizing `can't` as `ca` + `n't` and `don't` as `do` + `n't`. The Python guys in [NLTK](https://www.nltk.org/_modules/nltk/tokenize/treebank.html) adhere to this convention (if the Penn Treebank tokenizer is used).

   Another example is the English phrase

   a 12-ft boat.

   How shall we handle the hyphenated length expression? Is this one, two, or even three tokens?

   From a very quick literature review, it seems that this ambiguity is an implementation detail and not really defined (as it depends on the actual use case). Looking at the [Stanford Tokenizer](https://stanfordnlp.github.io/CoreNLP/tokenize.html), they have a bunch of configuration options for the many normalization steps that happen during tokenization.


> Write better tests for pattern verification (tokenizers)
> --------------------------------------------------------
>
>                 Key: OPENNLP-1479
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1479
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Tokenizer
>    Affects Versions: 2.1.1
>            Reporter: Bruno P. Kinoshita
>            Assignee: Lara Marinov
>            Priority: Major
>             Fix For: 2.3.2
>
>
> From [https://github.com/apache/opennlp/pull/516#issuecomment-1455015772]
> At the moment our tests verify that the tokenizer objects are created
> correctly (i.e. tests getters and setters, constructor, etc.), without
> verifying the actual behavior when used in conjunction with other classes
> (factory, tokenizer, trainers, etc).
> It would be best to test the patterns used in the factories for different
> languages with some interesting sample data (maybe something from Project
> Gutenberg, open source news sites, etc.).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
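
To make the two choices raised in the comment above concrete, here is a minimal sketch in plain Python (standard library only). It is not the OpenNLP or NLTK implementation: `ptb_like_tokenize` and its `split_hyphens` flag are hypothetical names introduced just for illustration, and the rules cover only `n't` contractions and word-internal hyphens.

```python
import re

# Hypothetical helper, not part of OpenNLP or NLTK.
# Matches a word ending in "n't" and captures the stem and the clitic.
_NT = re.compile(r"\b(\w+)(n't)\b", re.IGNORECASE)
# Matches a hyphen that sits between two word characters.
_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def ptb_like_tokenize(text, split_hyphens=False):
    """Whitespace tokenizer with a Penn-Treebank-style split of "n't".

    split_hyphens is an assumed option showing that "12-ft" can end up
    as one token or three, depending on the use case.
    """
    # "can't" -> "ca n't", "don't" -> "do n't"
    spaced = _NT.sub(r"\1 \2", text)
    if split_hyphens:
        # "12-ft" -> "12 - ft"
        spaced = _HYPHEN.sub(" - ", spaced)
    return spaced.split()


if __name__ == "__main__":
    print(ptb_like_tokenize("I can't go"))
    print(ptb_like_tokenize("a 12-ft boat"))
    print(ptb_like_tokenize("a 12-ft boat", split_hyphens=True))
```

A behavioral test in the spirit of this issue would pin down whichever of these outputs the tokenizer under test is expected to produce, rather than only checking getters and setters.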