In my opinion the tokenizer is working properly and the issue is with the
quotes, which are unknown to the parser model. I would preprocess the
tokenized text, replacing the quotes with the ones known to the model, which
follows the treebank convention.
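
A minimal sketch of that preprocessing step, in case it helps. The class and
method names (QuoteNormalizer, normalizeQuotes) are made up for illustration
and are not part of OpenNLP. It assumes the double quotes in a sentence are
balanced and not nested, so they simply alternate between opening and closing:

```java
public class QuoteNormalizer {

    // Hypothetical helper, not an OpenNLP API: replaces each straight
    // double-quote token with the treebank-style `` (opening) or '' (closing).
    // Assumption: quotes are balanced and not nested, so they alternate.
    public static String[] normalizeQuotes(String[] tokens) {
        String[] out = new String[tokens.length];
        boolean open = true; // the next double quote seen is an opening quote
        for (int i = 0; i < tokens.length; i++) {
            if ("\"".equals(tokens[i])) {
                out[i] = open ? "``" : "''";
                open = !open;
            } else {
                out[i] = tokens[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokens as the tokenizer would produce them for the example sentence.
        String[] tokens = {"The", "\"", "quick", "\"", "brown", "fox",
                           "jumps", "over", "the", "lazy", "dog", "."};
        // The parser expects whitespace-separated tokens.
        System.out.println(String.join(" ", normalizeQuotes(tokens)));
    }
}
```

The normalized tokens, joined with single spaces, can then be handed to the
parser in place of the raw tokenizer output.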


On Thu, Mar 28, 2013 at 11:13 PM, James Kosin <[email protected]> wrote:

> On 3/28/2013 9:54 AM, Ian Jackson wrote:
>
>> I used the prebuilt models for the SentenceModel (en-sent.bin),
>> TokenizerModel (en-token.bin), and ParserModel (en-parser-chunker.bin) with
>> the following sentence:
>>     The "quick" brown fox jumps in over the lazy dog.
>>
>> The result marks the part of speech for the quotes as JJ (for the open)
>> and NN (for the close) as follows:
>> (TOP (NP (NP (DT The) (JJ ") (JJ quick) (NN ") (JJ brown) (NN fox) (NNS
>> jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
>>
>> If I alter the sentence as follows, changing double quotes to two single
>> forward quotes and backward quotes
>> [http://www.cis.upenn.edu/~treebank/tokenization.html]:
>>     The `` quick '' brown fox jumps over the lazy dog
>>
>> The results are as follows:
>> (TOP (NP (NP (DT The) (`` ``) (JJ quick) ('' '') (JJ brown) (NN fox) (NNS
>> jumps)) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))) (. .)))
>>
>> Does a method exist to configure the tokenizer to handle quotes within
>> a sentence?
>>
> Training the models with the double quotes instead of the single
> forward/backward quotes would do the trick.
> That would explain why the tokenizer model doesn't do well with my sentences...
>  I've had to train my own models for a lot of the stuff I'm doing these
> days.
>
> Thanks,
> James
>
