On Thu, Aug 11, 2011 at 11:09 PM, James Kosin <[email protected]> wrote:
> On 8/11/2011 1:19 PM, [email protected] wrote:
>> On Wed, Aug 10, 2011 at 6:23 AM, Jörn Kottmann <[email protected]> wrote:
>>> On 8/10/11 2:10 AM, [email protected] wrote:
>>>> I think it would be much better, but we have different sample classes
>>>> (one for each tool) and no common parent. As far as I can see there is
>>>> no way to compare two samples without knowing the tool, and it makes it
>>>> harder to implement the monitor. That is why I avoided using the sample
>>>> itself and added 3 methods that cover the different kinds of samples we
>>>> have.
>>>>
>>>> Oops, accidentally replied to the issues list.
>>>
>>> You need to know the sample class, and since they do not have a common
>>> parent you always need to write some custom code to extract the
>>> knowledge from them. This code we have to write somewhere; now it is in
>>> the individual evaluators, but it could also be moved to the command
>>> line monitors. Extracting this information in the evaluators itself
>>> might be a bit easier, since it is going through the samples anyway.
>>>
>>> So going down this road might be a bit more work, but to me it looks
>>> like the solution is also much more usable.
>>
>> Maybe we can leave it to a major release, when we will have more
>> flexibility in what we can do. What do you think?
>> Also, to me it is more important to improve dictionary creation to avoid
>> errors like the one I was having, so I would choose to spend some effort
>> there instead of this. Is it OK?
>
> William,
>
> If you could change the changes I've already made, I'd be very
> appreciative. I'm going to try and expand the testing we are doing now on
> the dictionary, but I'd like some real feedback if at all possible.

Thank you, James. I'll be able to get back to the dictionary and tagger in
a couple of days.

The issue I have now is related to the model outcomes and the tagset
supported by the dictionary. If I use my full dictionary, there will be
words associated with tags that are not among the model's outcomes. This
happens when I am using a corpus that doesn't cover the full range of
tags. For example, the 4k-sentence news corpus I am using does not include
any occurrence of a verb in the present second person singular, because of
the journalistic style. If the text I am processing contains any
occurrence of a verb in the present second person singular, it will crash
the tagger!

To fix that, I am thinking about optionally filtering the dictionary
entries according to the known outcomes, which will only be available
after the model has been trained by our training tool or by the cross
validator. So after training we could iterate over the dictionary entries
and remove the tags that are unknown to the model. But I am not sure if it
is the best approach.

William
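
P.S. To make the filtering idea concrete, below is a rough sketch of the
step I have in mind. It is written against plain Java collections rather
than the real POSDictionary / POSModel classes, so the word-to-tags map and
the outcome set just stand in for whatever we would actually read from the
dictionary and the trained model; the class name, word and tag labels are
made up for illustration only.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TagDictionaryFilter {

  /**
   * Removes from every dictionary entry the tags the trained model cannot
   * emit. Entries left with no tags at all are dropped entirely, so the
   * tagger is never forced into a tag it has no outcome for.
   */
  public static Map<String, List<String>> filter(
      Map<String, List<String>> wordToTags, Set<String> modelOutcomes) {

    Map<String, List<String>> filtered = new HashMap<String, List<String>>();

    for (Map.Entry<String, List<String>> entry : wordToTags.entrySet()) {
      List<String> keptTags = new ArrayList<String>();
      for (String tag : entry.getValue()) {
        if (modelOutcomes.contains(tag)) {
          keptTags.add(tag);
        }
      }
      if (!keptTags.isEmpty()) {
        filtered.put(entry.getKey(), keptTags);
      }
    }
    return filtered;
  }

  public static void main(String[] args) {
    // Made-up word and tag labels, just to show the behaviour.
    Map<String, List<String>> dict = new HashMap<String, List<String>>();
    dict.put("falas", new ArrayList<String>(Arrays.asList("v-pr-2s", "n")));

    // Pretend the trained model only ever saw these outcomes.
    Set<String> outcomes = new HashSet<String>(Arrays.asList("n", "v-pr-3s"));

    // Prints {falas=[n]}: the unsupported "v-pr-2s" tag is gone.
    System.out.println(filter(dict, outcomes));
  }
}

The open question for me is whether dropping an entry whose tags are all
unknown is the right behaviour, or whether such words should simply fall
back to the model with no dictionary restriction.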
