Thanks Jörn. I can see in the code that, depending on the operation, a token will be merged with the token to its left or right. But I can't see where (in the code) it adds a <SPLIT> token. Can you please point me to the right place in the code?

Thanks
Rohana

-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: 04 March 2011 15:05
To: [email protected]
Subject: Re: Tokenizer issue - Quotation marks

On 3/4/11 3:46 PM, Rohana Rajapakse wrote:
> That works great. It was not clear to me where/how the detokenizer rules are
> used. I thought it's for combining a given token to the one before or after
> it.

Exactly, the converter detokenizes your input tokens with the detokenizer. Now it knows which tokens will be merged together and is able to add the <SPLIT> tag between the tokens instead of just concatenating them.

> I can see in the training file that it has added <SPLIT> tags before "'s".
> Can I include spaces in the rule (e.g. "'s " note the trailing space). This
> will make it explicit.

Not sure I understand you correctly here. I mean we already have tokenized data; the tokens do not contain any information about the spaces between them.

Let's say we have these two strings:
1: "A sample"
2: "A    sample"

Now we use a white space tokenizer to tokenize them and the result would be like this:
1: "A", "sample"
2: "A", "sample"

In the token representation we do not have any white spaces anymore.

Jörn
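
To make the two points in Jörn's reply concrete, here is a minimal sketch in Python. It is not the OpenNLP code and does not show where OpenNLP inserts the tag. All names (`RULES`, `whitespace_tokenize`, `to_training_line`) and the toy rule table are assumptions made up for illustration. It shows that white space tokenization discards the spacing, and how a converter that knows the merge operation of each token could write a <SPLIT> tag between merged tokens and a plain space elsewhere.

```python
# Hypothetical sketch; these names are illustrative and are not OpenNLP's API.
MERGE_TO_LEFT = "MERGE_TO_LEFT"
MERGE_TO_RIGHT = "MERGE_TO_RIGHT"
NO_OPERATION = "NO_OPERATION"

# Toy rule table standing in for a detokenizer dictionary (assumed content).
RULES = {"'s": MERGE_TO_LEFT, ".": MERGE_TO_LEFT, "(": MERGE_TO_RIGHT}


def whitespace_tokenize(text):
    """Split on runs of white space; the spacing itself is discarded."""
    return text.split()


def to_training_line(tokens, rules=RULES, split_tag="<SPLIT>"):
    """Join tokens, writing split_tag where two tokens merge and a space elsewhere."""
    ops = [rules.get(tok, NO_OPERATION) for tok in tokens]
    parts = []
    for i, tok in enumerate(tokens):
        if i > 0:
            # Two neighbours are merged if the right one attaches to the left,
            # or the left one attaches to the right.
            merged = ops[i] == MERGE_TO_LEFT or ops[i - 1] == MERGE_TO_RIGHT
            parts.append(split_tag if merged else " ")
        parts.append(tok)
    return "".join(parts)


# Both strings give the same tokens, so the original spacing cannot be recovered.
print(whitespace_tokenize("A sample"))
print(whitespace_tokenize("A    sample"))

# The tag is derived from the merge operations, not from any stored white space.
print(to_training_line(["Rohana", "'s", "question", "."]))
```

Because the decision is taken per token pair from the merge operations, a trailing space in a rule such as "'s " would have nothing to match against in this sketch: the tokens never contain white space.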
