[ https://issues.apache.org/jira/browse/CTAKES-254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pei Chen resolved CTAKES-254.
-----------------------------
    Resolution: Fixed
      Assignee: Pei Chen

Fixed in trunk

> Apostrophe in contraction breaks TokenizerPTB
> ---------------------------------------------
>
>                 Key: CTAKES-254
>                 URL: https://issues.apache.org/jira/browse/CTAKES-254
>             Project: cTAKES
>          Issue Type: Bug
>          Components: ctakes-core
>    Affects Versions: 3.1
>            Reporter: Pei Chen
>            Assignee: Pei Chen
>            Priority: Blocker
>             Fix For: 3.1.1
>
>
> Sample text: "on n'tion"
> A single character followed by an apostrophe will break the TokenizerPTB.
> What the heck?
> Results in an OutOfBoundsException at
> org.apache.ctakes.core.nlp.tokenizer.TokenizerPTB.setNumPosition(TokenizerPTB.java 1147)
> Sean Finan already had a patch for this some time ago, but just wanted to see
> if we missed something else here.
> See below to add a check for an empty string in the token.
> Starting at line 1145:
> // START
>    private void setNumPosition(WordToken wta, String tokenText) {
>       if ( tokenText.isEmpty() ) {
>          // was getting ioobE from tokenText.charAt(..)
>          // Possibilities like this (empty, null) should always be checked
>          // - but I wonder that we get (want) empty tokens at all.
>          // I believe that working with zero-length words is a bug, and this
>          // is not a fix; it merely avoids a crash.
>          wta.setNumPosition( TokenizerAnnotator.TOKEN_NUM_POS_NONE );
>          return;
>       }
>       if (isDigit(tokenText.charAt(0))) {
> // END

--
This message was sent by Atlassian JIRA
(v6.1#6144)
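
For anyone who wants to reproduce the guard outside of cTAKES, here is a minimal standalone sketch of the same idea. It is not the code in trunk: the class name, the helper method, and the constant values are hypothetical stand-ins for WordToken and the TokenizerAnnotator.TOKEN_NUM_POS_* constants, and the digit branch is simplified to a single "starts with a digit" result because the rest of setNumPosition is not shown in this message.

```java
// Standalone sketch of the empty-token guard; not the actual cTAKES code.
final class NumPositionGuard {
    // Hypothetical stand-ins for TokenizerAnnotator.TOKEN_NUM_POS_* (real values not shown here).
    static final int TOKEN_NUM_POS_NONE = 0;
    static final int TOKEN_NUM_POS_FIRST = 1;

    // Returns NONE for null or empty text instead of calling charAt(0),
    // which throws StringIndexOutOfBoundsException on an empty string.
    static int numPosition(String tokenText) {
        if (tokenText == null || tokenText.isEmpty()) {
            return TOKEN_NUM_POS_NONE;
        }
        // Simplified assumption: only report whether the token starts with a digit.
        return Character.isDigit(tokenText.charAt(0)) ? TOKEN_NUM_POS_FIRST : TOKEN_NUM_POS_NONE;
    }

    public static void main(String[] args) {
        System.out.println(numPosition(""));    // prints 0, no exception
        System.out.println(numPosition("n"));   // prints 0
        System.out.println(numPosition("3rd")); // prints 1
    }
}
```

As the comment in the patch says, this only avoids the crash. Why the tokenizer produces a zero-length token for input like "on n'tion" is a separate question.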