That is a big help.

I will try a few of these ideas out and see how I get on.

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 25 January 2011 14:41, Jörn Kottmann <[email protected]> wrote:

> On 1/25/11 3:22 PM, Paul Cowan wrote:
>
>> Hi,
>>
>> Thanks for your comments on the JIRA.
>>
>> Should I be expecting exact results if the training data and the sample
>> data are exactly the same, or is there just too little training data to
>> tell at this stage?
>>
>>
> If you are training with a cutoff of 5, the results might not be identical,
> and even if they are, you want good results on "unknown" data.
>
> That is why you need a certain amount of training data to get the model
> going.
>
> When we have natural language text we divide it into sentences to extract
> a unit we can pass on to the name finder. It seems to me that it is more
> difficult to get such a unit when working directly on HTML data. In your
> case I think the previous map feature does not really help. So you could
> pass a bigger chunk to the find method than you usually would.
>
> Maybe even an entire page you crawl at a time. But then you need to have a
> good way of tokenizing this page, because your tokenization should take the
> HTML into account; having an HTML element as a token would make sense in my
> eyes. But you could also try to just use the simple tokenizer and play a
> little with the feature generation, e.g. increasing the window size to 5 or
> even more.
>
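The idea of treating each HTML element as one token can be sketched roughly as follows. This is an illustration in Python, not OpenNLP code: the name `tokenize_html` and the regular expression are assumptions, and for real use the same logic would have to be ported to Java behind OpenNLP's tokenizer interface. It also treats any `<...>` run as a tag, so stray `<` and `>` characters in the text could be misread.

```python
import re

# One alternative per token type, tried left to right:
#   <[^<>]+>   a whole HTML tag, kept as a single token
#   \w+        a run of letters, digits or underscores
#   [^\w\s]    any other single non-space character (punctuation)
TOKEN_RE = re.compile(r"<[^<>]+>|\w+|[^\w\s]")


def tokenize_html(page):
    """Split an HTML string into tokens, keeping each tag whole."""
    return TOKEN_RE.findall(page)


tokens = tokenize_html('<td class="name">Paul Cowan</td>, Scotland')
print(tokens)
```

With tags kept whole, a wider feature window (5 or more tokens) can see the markup around a candidate name, which is the context a sentence would otherwise provide.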
> After you have this you still need to annotate training data, which might
> not be that nice with our "text" format, because it would mean that you
> have to place an entire page into one line.
>
> But it should not be hard to come up with a new format; then you write a
> small parser and create the NameSample object yourself.
>
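One possible shape for such a format and parser, sketched in Python. Everything here is hypothetical: the format (one token per line, `<START:type>` and `<END>` markers on their own lines, a blank line between pages) and the name `parse_samples` are made up for illustration. The output is a token list plus (start, end, type) spans with an exclusive end, which is the kind of data a NameSample would be built from on the Java side.

```python
def parse_samples(lines):
    """Yield (tokens, spans) per sample; spans are (start, end, type), end exclusive."""
    tokens, spans = [], []
    open_start, open_type = None, None
    for raw in lines:
        line = raw.strip()
        if not line:  # a blank line closes the current sample
            if open_start is not None:
                raise ValueError("unclosed <START> before end of sample")
            if tokens:
                yield tokens, spans
            tokens, spans = [], []
        elif line.startswith("<START:") and line.endswith(">"):
            if open_start is not None:
                raise ValueError("nested <START> markers are not supported")
            # The name begins at the next token; the type sits between ':' and '>'.
            open_start, open_type = len(tokens), line[len("<START:"):-1]
        elif line == "<END>":
            if open_start is None:
                raise ValueError("<END> without matching <START>")
            spans.append((open_start, len(tokens), open_type))
            open_start, open_type = None, None
        else:
            tokens.append(line)  # any other line is exactly one token
    if open_start is not None:
        raise ValueError("unclosed <START> at end of input")
    if tokens:
        yield tokens, spans


sample = """<td>
<START:person>
Paul
Cowan
<END>
</td>

<p>
Scotland
</p>
""".splitlines()

samples = list(parse_samples(sample))
print(samples)
```

Putting one token per line avoids having to squeeze a whole page into a single line, and it lets tokens that contain spaces (such as a tag with attributes) survive unchanged.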
> Hope that helps,
> Jörn
>
>
