Tristan Nixon created OPENNLP-857:
-------------------------------------

             Summary: ParserTool should take use Tokenizer instance. It should not use java.util.StringTokenizer
                 Key: OPENNLP-857
                 URL: https://issues.apache.org/jira/browse/OPENNLP-857
             Project: OpenNLP
          Issue Type: Improvement
          Components: Parser
    Affects Versions: 1.6.0
            Reporter: Tristan Nixon

It would be nice if the ParserTool would make use of a real tokenizer. In addition to being the "right" thing to do, it would obviate issues like OPENNLP-240 when using the parser tool. While I realize that java.util.StringTokenizer effectively does the same work as WhitespaceTokenizer, it seems odd to use the former when the latter exists.

To this end, I'm attaching a patch that adds an additional method:

    public static Parse[] parseLine(String line, Parser parser, Tokenizer tokenizer, int numParses)

I've left the existing method

    public static Parse[] parseLine(String line, Parser parser, int numParses)

in for convenience and backwards compatibility. It simply calls the new method with WhitespaceTokenizer.INSTANCE.

For good measure, I've added a new command-line argument, -tk, which takes the name of a tokenizer model. If none is specified, it will fall back on the current behavior of using the whitespace tokenizer.
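
The delegation pattern described above can be sketched roughly as follows. This is not the attached patch: to stay self-contained it avoids the OpenNLP classes entirely. The class ParseLineSketch, the nested Tokenizer interface, the WHITESPACE constant, and the method names are all hypothetical stand-ins (for opennlp.tools.tokenize.Tokenizer, WhitespaceTokenizer.INSTANCE, and the two parseLine overloads), and only the tokenization step is shown, not the parsing itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class ParseLineSketch {

    /** Hypothetical stand-in for the OpenNLP Tokenizer interface, reduced to one method. */
    public interface Tokenizer {
        String[] tokenize(String s);
    }

    /** Hypothetical stand-in for WhitespaceTokenizer.INSTANCE: splits on runs of whitespace. */
    public static final Tokenizer WHITESPACE = s -> {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    };

    /** The behavior being replaced: tokens produced by java.util.StringTokenizer. */
    public static String[] tokensWithStringTokenizer(String line) {
        StringTokenizer st = new StringTokenizer(line);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens.toArray(new String[0]);
    }

    /** New-style overload: the caller supplies the tokenizer. */
    public static String[] tokensForParse(String line, Tokenizer tokenizer) {
        return tokenizer.tokenize(line);
    }

    /** Old-style signature kept for compatibility; delegates with the whitespace default. */
    public static String[] tokensForParse(String line) {
        return tokensForParse(line, WHITESPACE);
    }

    public static void main(String[] args) {
        String line = "The parser works .";
        // Default path: whitespace tokenization.
        System.out.println(Arrays.toString(tokensForParse(line)));
        // prints [The, parser, works, .]

        // For ordinary whitespace-separated input the old and new defaults agree.
        System.out.println(Arrays.equals(tokensForParse(line), tokensWithStringTokenizer(line)));
        // prints true
    }
}
```

Keeping the shorter signature as a thin wrapper means existing callers see no change, while new callers can pass any tokenizer they like.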