[ 
https://issues.apache.org/jira/browse/OPENNLP-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794561#comment-17794561
 ] 

ASF GitHub Bot commented on OPENNLP-1479:
-----------------------------------------

rzo1 commented on PR #559:
URL: https://github.com/apache/opennlp/pull/559#issuecomment-1846694280

   > Is there a spec for this behavior?
   
   The Penn Treebank guidelines suggest tokenizing `can't` as `ca` + `n't` and 
`don't` as `do` + `n't`. 
[NLTK](https://www.nltk.org/_modules/nltk/tokenize/treebank.html) follows 
this convention when its Penn Treebank tokenizer is used. 
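   
   As a rough sketch of what that convention means in practice (plain Python 
regex, not NLTK's or OpenNLP's actual code; `split_contractions` is a made-up 
helper name and only covers the `n't` case):

```python
import re

# Simplified illustration of the Penn Treebank convention for "n't"
# contractions. This is an assumption-laden sketch, not the NLTK or
# OpenNLP implementation, and it ignores every other Treebank rule.
NT_CONTRACTION = re.compile(r"\b(\w+)(n't)\b", flags=re.IGNORECASE)


def split_contractions(text):
    """Split "n't" off the preceding word, then split on whitespace."""
    spaced = NT_CONTRACTION.sub(r"\1 \2", text)
    return spaced.split()


print(split_contractions("I can't and don't"))
```

   With this rule `can't` becomes `ca` + `n't` and `don't` becomes `do` + `n't`.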
   
   Another example is the English phrase `a 12-ft boat`. How should we handle the 
hyphenated length expression? Is it one, two, or even three tokens? 
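   
   To make the three readings concrete, here is a small sketch (plain Python, 
all function names are hypothetical and none of this reflects what OpenNLP 
currently does):

```python
import re

PHRASE = "a 12-ft boat"


def keep_hyphenated(text):
    # Whitespace split: "12-ft" stays a single token.
    return text.split()


def split_dropping_hyphen(text):
    # Word characters only: "12-ft" becomes two tokens, the hyphen is lost.
    return re.findall(r"\w+", text)


def split_keeping_hyphen(text):
    # The hyphen is its own token: "12-ft" becomes three tokens.
    return re.findall(r"\w+|-", text)


for tokenize in (keep_hyphenated, split_dropping_hyphen, split_keeping_hyphen):
    print(tokenize.__name__, tokenize(PHRASE))
```

   All three are defensible, which is why a test has to pin down which one we 
expect.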
   
   From a very quick literature review, it seems that this ambiguity is an 
implementation detail and not really defined (as it depends on the actual 
use case).
   
   Looking at the 
[Stanford Tokenizer](https://stanfordnlp.github.io/CoreNLP/tokenize.html), 
they have a bunch of configuration options for the normalization that happens 
during tokenization.




> Write better tests for pattern verification (tokenizers)
> --------------------------------------------------------
>
>                 Key: OPENNLP-1479
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1479
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Tokenizer
>    Affects Versions: 2.1.1
>            Reporter: Bruno P. Kinoshita
>            Assignee: Lara Marinov
>            Priority: Major
>             Fix For: 2.3.2
>
>
> From [https://github.com/apache/opennlp/pull/516#issuecomment-1455015772]
> At the moment our tests verify that the tokenizer objects are created 
> correctly (i.e. tests getters and setters, constructor, etc.), without 
> verifying the actual behavior when used in conjunction with other classes 
> (factory, tokenizer, trainers, etc).
> It would be best to test the patterns used in the factories for different 
> languages with some interesting sample data (maybe something from project 
> gutenberg, open source news sites, etc.).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
