Tristan Nixon created OPENNLP-857:
-------------------------------------

             Summary: ParserTool should use a Tokenizer instance. It should 
not use java.util.StringTokenizer
                 Key: OPENNLP-857
                 URL: https://issues.apache.org/jira/browse/OPENNLP-857
             Project: OpenNLP
          Issue Type: Improvement
          Components: Parser
    Affects Versions: 1.6.0
            Reporter: Tristan Nixon


It would be nice if the ParserTool would make use of a real tokenizer. In 
addition to being the "right" thing to do, it would obviate issues like 
OPENNLP-240 when using the parser tool.

While I realize that java.util.StringTokenizer effectively does the same work 
as WhitespaceTokenizer, it seems odd to use the former when the latter exists.

To this end, I'm attaching a patch that adds an additional method:

    public static Parse[] parseLine(String line, Parser parser,
        Tokenizer tokenizer, int numParses)

I've left the existing method

    public static Parse[] parseLine(String line, Parser parser, int numParses)

in place for convenience and backwards compatibility. It simply calls the new 
method with WhitespaceTokenizer.INSTANCE.
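
To illustrate the delegation described above, here is a minimal, self-contained sketch. It is not the attached patch: the Tokenizer interface and the WHITESPACE constant are simplified stand-ins for opennlp.tools.tokenize.Tokenizer and WhitespaceTokenizer.INSTANCE, and the method names tokensFor and legacyTokens are hypothetical. Only the tokenizing step is shown; building the Parse objects is left out.

```java
import java.util.StringTokenizer;

public class ParseLineSketch {

    /** Stand-in for opennlp.tools.tokenize.Tokenizer (only the method used here). */
    interface Tokenizer {
        String[] tokenize(String s);
    }

    /** Stand-in for WhitespaceTokenizer.INSTANCE: splits on runs of whitespace. */
    static final Tokenizer WHITESPACE = s -> {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    };

    /** New-style overload: the caller decides which tokenizer is used. */
    static String[] tokensFor(String line, Tokenizer tokenizer) {
        return tokenizer.tokenize(line);
    }

    /** Existing-style overload: kept for compatibility, delegates with the default. */
    static String[] tokensFor(String line) {
        return tokensFor(line, WHITESPACE);
    }

    /** Roughly what the current code does with java.util.StringTokenizer. */
    static String[] legacyTokens(String line) {
        StringTokenizer st = new StringTokenizer(line);
        String[] out = new String[st.countTokens()];
        for (int i = 0; i < out.length; i++) {
            out[i] = st.nextToken();
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "The quick  brown fox .";
        // Both lines print the same tokens for whitespace-separated input.
        System.out.println(String.join("|", tokensFor(line)));
        System.out.println(String.join("|", legacyTokens(line)));
    }
}
```

With this shape, existing callers keep the old behavior, while new callers can pass any other Tokenizer implementation.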

For good measure, I've added a new command-line argument -tk, which takes the 
name of a tokenizer model. If none is specified, it will fall back on the 
current behavior of using the whitespace tokenizer.
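
The fallback could look roughly like the sketch below. This is an assumption about the shape of the logic, not the code in the patch: the Tokenizer interface is a simplified stand-in for the OpenNLP one, the loader function stands in for however a tokenizer model is actually loaded, and the file names in main are made up for the example.

```java
import java.util.function.Function;

public class TokenizerOptionSketch {

    /** Stand-in for opennlp.tools.tokenize.Tokenizer (only the method used here). */
    interface Tokenizer {
        String[] tokenize(String s);
    }

    /** Stand-in for WhitespaceTokenizer.INSTANCE. */
    static final Tokenizer WHITESPACE = s -> {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    };

    /** Returns the value following "-tk" in args, or null if the flag is absent. */
    static String tokenizerModelName(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-tk".equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    /**
     * Picks the tokenizer: the one built from the named model if -tk was given,
     * otherwise the whitespace tokenizer (the current behavior).
     * The loader is hypothetical; it abstracts over model loading.
     */
    static Tokenizer chooseTokenizer(String[] args, Function<String, Tokenizer> loader) {
        String name = tokenizerModelName(args);
        return name == null ? WHITESPACE : loader.apply(name);
    }

    public static void main(String[] args) {
        // Hypothetical arguments; the model file names are placeholders.
        String[] withFlag = {"-tk", "en-token.bin", "en-parser.bin"};
        String[] withoutFlag = {"en-parser.bin"};
        Function<String, Tokenizer> fakeLoader = name -> s -> s.split(" ");

        System.out.println(chooseTokenizer(withFlag, fakeLoader) == WHITESPACE);
        System.out.println(chooseTokenizer(withoutFlag, fakeLoader) == WHITESPACE);
    }
}
```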



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)