On 3/4/11 3:46 PM, Rohana Rajapakse wrote:
That works great. It was not clear to me where/how the detokenizer rules are
used; I thought they were for combining a given token with the one before or after it.
Exactly. The converter runs your input tokens through the detokenizer, so it
knows which tokens will be merged together and can add the <SPLIT> tag
between those tokens instead of just concatenating them.
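
To illustrate the idea, here is a minimal sketch in plain Java (this is not
the actual OpenNLP converter code; the ATTACH_LEFT rule set and the
toTrainingLine method are made up for illustration):

import java.util.List;
import java.util.Set;

public class SplitTagDemo {

    // Hypothetical rule set: tokens the detokenizer would merge with the
    // preceding token (i.e. no space before them in the original text).
    private static final Set<String> ATTACH_LEFT = Set.of("'s", ",", ".");

    // Rebuild a training line: insert <SPLIT> where two tokens are merged,
    // and a plain space where they are not.
    static String toTrainingLine(List<String> tokens) {
        StringBuilder line = new StringBuilder(tokens.get(0));
        for (int i = 1; i < tokens.size(); i++) {
            String token = tokens.get(i);
            line.append(ATTACH_LEFT.contains(token) ? "<SPLIT>" : " ");
            line.append(token);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Prints: Rohana<SPLIT>'s question
        System.out.println(toTrainingLine(List.of("Rohana", "'s", "question")));
    }
}
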
I can see in the training file that it has added <SPLIT> tags before "'s".
Can I include spaces in the rule (e.g. "'s ", note the trailing space)? That
would make it explicit.
I'm not sure I understand you correctly here. We already have tokenized data,
and the tokens do not contain any information about the spaces between them.
Let's say we have these two strings (the second one contains a double space):
1: "A sample"
2: "A  sample"
Now we use a whitespace tokenizer to tokenize them, and the result would look
like this:
1: "A", "sample"
2: "A", "sample"
In the token representation we do not have any whitespace information anymore.
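
To make that concrete, a quick sketch (plain Java; String.split stands in
here for a whitespace tokenizer):

import java.util.Arrays;

public class WhitespaceTokenizerDemo {
    public static void main(String[] args) {
        String one = "A sample";  // single space
        String two = "A  sample"; // double space
        // Split on runs of whitespace, as a simple whitespace tokenizer would.
        System.out.println(Arrays.toString(one.split("\\s+"))); // [A, sample]
        System.out.println(Arrays.toString(two.split("\\s+"))); // [A, sample]
        // Both token sequences are identical; the original spacing is lost.
    }
}
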
Jörn