That works great. It was not clear to me where or how the detokenizer rules
are used. I thought they were for attaching a given token to the one before or
after it. I can see in the training file that it has added <SPLIT> tags before
"'s". Can I include spaces in the rule (e.g. "'s ", note the trailing space)?
That would make it explicit.
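
For reference, a minimal sketch of how the <SPLIT> tags appear to end up in
the training file: tokens that were adjacent in the original text (no
whitespace between them) are joined with <SPLIT>, and all others with a single
space. The class and method names here are hypothetical, and this is an
illustration of the idea only, not the actual TokenSample code.

```java
import java.util.List;

public class SplitTagSketch {

    // Hypothetical helper: renders one training line from the raw text and
    // the [start, end) character offsets of each token.
    public static String toTrainingLine(String text, List<int[]> spans) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < spans.size(); i++) {
            int[] span = spans.get(i);
            if (i > 0) {
                int prevEnd = spans.get(i - 1)[1];
                // Adjacent tokens get an explicit <SPLIT>; otherwise a space.
                sb.append(prevEnd == span[0] ? "<SPLIT>" : " ");
            }
            sb.append(text, span[0], span[1]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "it's fine";
        // Token offsets: "it" [0,2), "'s" [2,4), "fine" [5,9)
        List<int[]> spans = List.of(
                new int[] {0, 2}, new int[] {2, 4}, new int[] {5, 9});
        System.out.println(toTrainingLine(text, spans));
    }
}
```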


Thanks

Rohana


-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: 04 March 2011 12:34
To: [email protected]
Subject: Re: Tokenizer issue - Quotation marks

On 3/4/11 1:06 PM, Rohana Rajapakse wrote:
> Hi Jorn,
>
>
>
> I have modified the toString() method in TokenSample.java as given below. 
> This is to add a <SPLIT> token before the token 's.
>
> This helped me to train a tokenizer model that splits e.g. "it's" into two
> tokens, "it" and "'s", while at the same time the detokenizer rule (the same
> as the rule for double quotes) splits single quotes from expressions that are
> enclosed between a pair of single quotes.
>
> This does not handle other cases of single quotes (e.g. don't, can't, etc.,
> and names like O'Conner).
>
Had a look at the change. The tokenization information must be provided to
TokenSample; this class then just encapsulates that knowledge. So it is not
its responsibility to figure out how things should be tokenized.

In your case I think you can just add "'s" to your detokenizer
dictionary like this:
<entry operation="MOVE_LEFT">
<token>'s</token>
</entry>

Doesn't that fix your issue?
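
To illustrate the intent of a MOVE_LEFT entry, here is a minimal sketch under
the assumption that such a token is simply attached to the token on its left
with no space in between. The class and method names are hypothetical, and
this is not the OpenNLP detokenizer implementation.

```java
import java.util.List;
import java.util.Set;

public class MoveLeftSketch {

    // Hypothetical helper: joins tokens with spaces, except that any token
    // listed in moveLeft is glued directly onto the preceding token.
    public static String detokenize(List<String> tokens, Set<String> moveLeft) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (i > 0 && !moveLeft.contains(token)) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("it", "'s", "a", "test");
        System.out.println(detokenize(tokens, Set.of("'s")));
    }
}
```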

Jörn





