On 3/4/11 3:46 PM, Rohana Rajapakse wrote:
That works great. It was not clear to me where/how the detokenizer rules are
used; I thought they were for combining a given token with the one before or after it.
Exactly. The converter runs your input tokens through the detokenizer, so it
knows which tokens will be merged together and can add the <SPLIT> tag
between those tokens instead of just concatenating them.
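
To illustrate the idea, here is a minimal sketch in plain Java (this is not
the actual OpenNLP converter code; the ATTACH_LEFT rule set and the
toTrainingLine method are made up for illustration):

import java.util.List;
import java.util.Set;

public class SplitTagDemo {

    // Hypothetical rule set: tokens the detokenizer would merge with the
    // preceding token (i.e. no space before them in the original text).
    private static final Set<String> ATTACH_LEFT = Set.of("'s", ",", ".");

    // Rebuild a training line: insert <SPLIT> where two tokens are merged,
    // and a plain space where they are not.
    static String toTrainingLine(List<String> tokens) {
        StringBuilder line = new StringBuilder(tokens.get(0));
        for (int i = 1; i < tokens.size(); i++) {
            String token = tokens.get(i);
            line.append(ATTACH_LEFT.contains(token) ? "<SPLIT>" : " ");
            line.append(token);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Prints: Rohana<SPLIT>'s question
        System.out.println(toTrainingLine(List.of("Rohana", "'s", "question")));
    }
}
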
I can see in the training file that it has added <SPLIT> tags before "'s".
Can I include spaces in the rule (e.g. "'s ", note the trailing space)? That
would make it explicit.
I'm not sure I understand you correctly here. We already have tokenized data,
and the tokens do not contain any information about the spaces between them.
Let's say we have these two strings (the second one contains a double space):
1: "A sample"
2: "A  sample"
Now we use a whitespace tokenizer to tokenize them, and the result would look
like this:
1: "A", "sample"
2: "A", "sample"
In the token representation we do not have any whitespace information anymore.
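
To make that concrete, a quick sketch (plain Java; String.split stands in
here for a whitespace tokenizer):

import java.util.Arrays;

public class WhitespaceTokenizerDemo {
    public static void main(String[] args) {
        String one = "A sample";  // single space
        String two = "A  sample"; // double space
        // Split on runs of whitespace, as a simple whitespace tokenizer would.
        System.out.println(Arrays.toString(one.split("\\s+"))); // [A, sample]
        System.out.println(Arrays.toString(two.split("\\s+"))); // [A, sample]
        // Both token sequences are identical; the original spacing is lost.
    }
}
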
Jörn