rzo1 commented on PR #559:
URL: https://github.com/apache/opennlp/pull/559#issuecomment-1846694280

   > Is there a spec for this behavior?
   
   The Penn Treebank guidelines suggest tokenizing as `ca` + `n't` and `do` + 
`n't`. The Python folks in 
[NLTK](https://www.nltk.org/_modules/nltk/tokenize/treebank.html) adhere to 
this convention (when their Penn Treebank tokenizer is used). 
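   A minimal sketch of that splitting rule (hypothetical illustration, not OpenNLP or NLTK code; the class and regex are my own):

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   // Penn Treebank-style handling of "n't" contractions:
   // "can't" -> ["ca", "n't"], "don't" -> ["do", "n't"].
   public class ContractionSplitter {
       // Word characters followed by the literal suffix n't.
       private static final Pattern NT = Pattern.compile("(?i)(\\w+)(n't)");

       public static List<String> split(String token) {
           List<String> parts = new ArrayList<>();
           Matcher m = NT.matcher(token);
           if (m.matches()) {
               parts.add(m.group(1)); // stem, e.g. "ca" or "do"
               parts.add(m.group(2)); // "n't"
           } else {
               parts.add(token);      // leave other tokens untouched
           }
           return parts;
       }
   }
   ```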
   
   Another example is the English phrase *a 12-ft boat*. How should we handle 
the hyphenated length expression? Is it one, two, or even three tokens? 
   
   From a quick literature review, it seems that this ambiguity is an 
implementation detail and not really standardized (it depends on the actual 
use case).
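   To make the three readings concrete, here is a toy sketch (my own illustration, not code from any of the mentioned libraries) of three plausible policies applied to the same hyphenated token:

   ```java
   import java.util.Arrays;
   import java.util.List;

   // Three plausible tokenization policies for a hyphenated expression
   // like "12-ft": one, two, or three tokens.
   public class HyphenPolicies {
       // Policy 1: keep the whole expression as a single token.
       public static List<String> keepWhole(String t) {
           return List.of(t);
       }

       // Policy 2: split on the hyphen and drop it -> two tokens.
       public static List<String> splitDrop(String t) {
           return Arrays.asList(t.split("-"));
       }

       // Policy 3: split around the hyphen, keeping it -> three tokens.
       public static List<String> splitKeep(String t) {
           return Arrays.asList(t.split("(?<=-)|(?=-)"));
       }
   }
   ```

   Which policy is "right" is exactly the use-case question above, which is presumably why the Stanford tokenizer makes it configurable.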
   
    Looking at the [Stanford 
Tokenizer](https://stanfordnlp.github.io/CoreNLP/tokenize.html), it offers a 
bunch of configuration options for the normalization that happens 
during tokenization.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
