On 3/28/2013 9:54 AM, Ian Jackson wrote:
I used the prebuilt models for the SentenceModel (en-sent.bin), TokenizerModel 
(en-token.bin), and ParserModel (en-parser-chunker.bin) with the following 
sentence:
    The "quick" brown fox jumps over the lazy dog.

The result marks the part of speech for the quotes as JJ (for the open) and NN 
(for the close) as follows:
(TOP (NP (NP (DT The) (JJ ") (JJ quick) (NN ") (JJ brown) (NN fox) (NNS jumps)) 
(PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))

If I alter the sentence as follows, changing the double quotes to two backward 
quotes (opening) and two single forward quotes (closing) 
[http://www.cis.upenn.edu/~treebank/tokenization.html]:
    The `` quick '' brown fox jumps over the lazy dog

The results are as follows:
(TOP (NP (NP (DT The) (`` ``) (JJ quick) ('' '') (JJ brown) (NN fox) (NNS 
jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))

Does a method exist to configure the tokenizer to handle quotes within a 
sentence?

Training the models with double quotes instead of the single forward/backward quotes would do the trick. That would also explain why the tokenizer model doesn't do well with my sentences... I've had to train my own models for a lot of the stuff I'm doing these days.
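
If retraining isn't practical, the substitution you did by hand could also be applied as a pre-processing step before the text reaches the tokenizer. Here is a rough sketch in Python (easy enough to port to Java). It assumes straight double quotes are balanced and simply alternate open/close; the function name to_ptb_quotes is made up for illustration and is not part of OpenNLP:

```python
import re

def to_ptb_quotes(text: str) -> str:
    """Replace straight double quotes with Penn Treebank style `` and ''.

    Assumption: quotes are balanced and alternate open/close within the text.
    """
    out = []
    is_open = True  # the next double quote seen is treated as an opening quote
    for ch in text:
        if ch == '"':
            # pad with spaces so the quote marks become separate tokens
            out.append(" `` " if is_open else " '' ")
            is_open = not is_open
        else:
            out.append(ch)
    # collapse the extra whitespace introduced by the padding
    return re.sub(r"\s+", " ", "".join(out)).strip()

print(to_ptb_quotes('The "quick" brown fox jumps over the lazy dog.'))
```

This only handles the straight double-quote case from the example sentence; curly quotes, nested quotes, or an unmatched quote would need extra handling.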

Thanks,
James
