On 2/23/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi James,
>
> It works for double quotes, but not for single quotes (i.e. fails for
> 'mistakes'). Is it a training issue then (not having cases with words
> enclosed within single/double quotes)?
>
> I have noticed that your model file is much smaller than the model file
> available to download. Is it because your training data set is smaller?
> How does it affect tokenizing overall?
>
> Are there training sets available to download?
>
> Regards
>
> Rohana

Rohana,
It doesn't take many sentences to produce a good model. I only trained for simple cases because freely available training data for these models is scarce, and even when you can find some, most of it has to be hand-parsed into the correct format and hand-tokenized. The tokenizer expects the sentence detector to have run over the data first, producing one sentence per line. My model was created from about 75 sentences in total, and I only added two of those to help with the double-quote characters around words.

The only time I'd expect single quotes is in quotes of quotes, which doesn't happen very often. I think the real issue was not having any single words with quotes around them in the training data. In the context you brought up, the quotes around "mistakes" aren't marking a direct quote; they set the word apart to represent a thought or idea outside of how it would normally be used, i.e. the military doesn't make "mistakes" since they are in the business of war, which is messy by design.

I don't think the small sampling of text is really a problem, although I would have to train the model a bit differently. Even so, my model had issues with the same sentences you gave, meaning it was at least as good as the larger model, and there may well be sentences the larger model gets right that mine would not.

As for the downloadable model, I can't comment much beyond noting that most of the models are trained on news stories rather than ordinary text documents. It really comes down to what you are looking to parse; in any case, a tokenizer is a fairly simple thing to train, and I've put a rough sketch of what that looks like below my signature.

James
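In case it helps: the training file is just plain text with one sentence per line, tokens separated by whitespace, and a <SPLIT> tag wherever two tokens touch with no whitespace between them. A couple of made-up lines covering the quoting cases would look like:

  The general said they don't make "<SPLIT>mistakes<SPLIT>"<SPLIT>.
  He called it a '<SPLIT>mistake<SPLIT>'<SPLIT>, nothing more<SPLIT>.

And here is a minimal training sketch against the OpenNLP 1.5 API as I remember it; the file names (train.tok, my-en-token.bin) are only placeholders, so don't read anything into them.

  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import java.io.OutputStream;

  import opennlp.tools.tokenize.TokenSample;
  import opennlp.tools.tokenize.TokenSampleStream;
  import opennlp.tools.tokenize.TokenizerME;
  import opennlp.tools.tokenize.TokenizerModel;
  import opennlp.tools.util.ObjectStream;
  import opennlp.tools.util.PlainTextByLineStream;

  public class TrainTokenizer {
      public static void main(String[] args) throws Exception {
          // train.tok holds lines like the ones above: UTF-8, one sentence per line
          ObjectStream<String> lines =
              new PlainTextByLineStream(new FileInputStream("train.tok"), "UTF-8");
          ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

          // "true" turns on the alphanumeric optimization, i.e. runs of plain
          // alphanumeric characters are not considered for splitting, which is
          // fine for ordinary English text
          TokenizerModel model = TokenizerME.train("en", samples, true);

          OutputStream out = new FileOutputStream("my-en-token.bin");
          model.serialize(out);
          out.close();

          // quick sanity check on the sentence you reported
          String[] tokens = new TokenizerME(model)
              .tokenize("The military doesn't make \"mistakes\".");
          for (String t : tokens) System.out.println(t);
      }
  }

The default cutoff and iteration counts are usually fine for a file this small; if I remember right there is also an overload of TokenizerME.train that takes them explicitly if you want to experiment.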
