I am trying to find out what the correlation is between the amount of
training data and the accuracy of find calls. In other words, at what point
does adding more training data start to matter less and less, so that we run
into diminishing returns?
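
One way to look at this empirically is a learning curve: train on growing
prefixes of the (shuffled) training data, score each model on the same
held-out set, and see where the gain per step flattens out. Below is a
minimal sketch of that idea. It does not call OpenNLP; the class
LearningCurve, its methods, and the trainAndScore callback are hypothetical
names for illustration, and the callback is assumed to train a name finder
on the given samples and return something like an F-measure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of OpenNLP.
final class LearningCurve {

    /** Evenly spaced training-set sizes, ending at the full data set. */
    static int[] sizes(int total, int steps) {
        int[] result = new int[steps];
        for (int i = 0; i < steps; i++) {
            result[i] = Math.max(1, (int) ((long) total * (i + 1) / steps));
        }
        return result;
    }

    /**
     * Scores one model per training-set size. trainAndScore is an assumed
     * callback: train on the given samples, evaluate on a fixed held-out
     * set, and return the score.
     */
    static <S> double[] scores(List<S> samples, int[] sizes,
                               ToDoubleFunction<List<S>> trainAndScore) {
        double[] result = new double[sizes.length];
        for (int i = 0; i < sizes.length; i++) {
            result[i] = trainAndScore.applyAsDouble(samples.subList(0, sizes[i]));
        }
        return result;
    }

    /**
     * Returns the first size after which the next step improves the score
     * by less than minGain, or -1 if every step still gains at least that.
     */
    static int plateauStart(int[] sizes, double[] scores, double minGain) {
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] - scores[i - 1] < minGain) {
                return sizes[i - 1];
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Integer> samples = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            samples.add(i);
        }
        int[] sizes = sizes(samples.size(), 10);
        // Synthetic stand-in for "train a model and evaluate it".
        double[] scores = scores(samples, sizes,
                subset -> 1.0 - 1.0 / Math.sqrt(subset.size()));
        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%5d samples -> score %.3f%n", sizes[i], scores[i]);
        }
        System.out.println("gain drops below 0.005 after "
                + plateauStart(sizes, scores, 0.005) + " samples");
    }
}
```

The threshold minGain is a judgment call, and a single run can be noisy, so
averaging over several shuffles would give a steadier curve.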

One more thing: it would be nice to see something like a Statistic object
populated after finder.train, to see how well the model has been trained.

thanks,

On Tue, Jan 25, 2011 at 8:41 AM, Jörn Kottmann <[email protected]> wrote:

> On 1/25/11 3:22 PM, Paul Cowan wrote:
>
>> Hi,
>>
>> Thanks for your comments on the JIRA.
>>
>> Should I be expecting exact results if the training data and the sample
>> data are exactly the same, or is there just too little training data to
>> tell at this stage?
>>
>>
> If you are training with a cutoff of 5 then the results might not be
> identical, and even if they are, you want good results on "unknown" data.
>
> That is why you need a certain amount of training data to get the model
> going.
>
> When we have natural language text we divide it into sentences to extract
> a unit we can pass on to the name finder. To me it seems more difficult to
> get such a unit when working directly on html data. In your case I think
> the previous map feature does not really help, so you could pass a bigger
> chunk to the find method than you usually would.
>
> Maybe even an entire page you crawl at a time. But then you need to have
> a good way of tokenizing this page, because your tokenization should take
> the html into account; having an html element as a token would make sense
> in my eyes. But you could also try to just use the simple tokenizer and
> play a little with the feature generation, e.g. increasing the window
> size to 5 or even more.
>
> After you have this you still need to annotate training data, which might
> not be that nice with our "text" format, because it would mean that you
> have to place an entire page into one line.
>
> But it should not be hard to come up with a new format; then you write a
> small parser and create the NameSample object yourself.
>
> Hope that helps,
> Jörn
>
>
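
The suggestion above of treating each html element as one token can be
sketched with a single regular expression. This is only an illustration
under simplifying assumptions: the class HtmlAwareTokenizer is a
hypothetical name, not an OpenNLP class, and it does not handle comments,
script blocks, or attribute values that contain ">".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
final class HtmlAwareTokenizer {

    // Alternatives are tried in order: a whole tag, a run of letters or
    // digits, or any single other non-whitespace character.
    private static final Pattern TOKEN =
            Pattern.compile("<[^>]*>|[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    /** Splits a page into tokens, keeping each html tag as one token. */
    static String[] tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(html);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String token : tokenize("<p>Hello, <b>World</b>!</p>")) {
            System.out.println(token);
        }
    }
}
```

The result is a plain String array, which is the kind of token sequence the
find method takes, so a whole page tokenized this way could be passed in as
one unit.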
