That is a big help. I will try a few of these ideas out and see how I get on.
Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/

On 25 January 2011 14:41, Jörn Kottmann <[email protected]> wrote:

> On 1/25/11 3:22 PM, Paul Cowan wrote:
>
>> Hi,
>>
>> Thanks for your comments on the JIRA.
>>
>> Should I be expecting exact results if the training data and the sample
>> data are exactly the same, or is there just too little training data to
>> tell at this stage?
>>
>
> If you are training with a cutoff of 5 then the results might not be
> identical, and even if they are, you want good results on "unknown" data.
>
> That is why you need a certain amount of training data to get the model
> going.
>
> When we have natural language text we divide it into sentences to extract
> a unit we can pass on to the name finder. To me it seems more difficult to
> get such a unit when working directly on HTML data. In your case I think
> the previous map feature does not really help. So you could pass a bigger
> chunk to the find method than you usually would.
>
> Maybe even an entire page you crawl at a time. But then you need to have a
> good way of tokenizing this page, because your tokenization should take
> the HTML into account; having an HTML element as a token would make sense
> in my eyes. But you could also try to just use the simple tokenizer and
> play a little with the feature generation, e.g. increasing the window size
> to 5 or even more.
>
> After you have this you still need to annotate training data, which might
> not be that nice with our "text" format, because it would mean that you
> have to place an entire page into one line.
>
> But it should not be hard to come up with a new format; then you write a
> small parser and create the NameSample object yourself.
>
> Hope that helps,
> Jörn
>
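
For reference, the suggestion of having an HTML element as a token could be sketched roughly as below. This is only a minimal illustration under stated assumptions, not anything that ships with OpenNLP: the class name `HtmlTokenSketch` and its `tokenize` method are hypothetical, the regular expression ignores comments, scripts and malformed markup, and plain text is simply split on whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: splits an HTML fragment into
// tokens and keeps every tag (for example <td class="x">) as one token.
final class HtmlTokenSketch {

    // First alternative: a whole tag from '<' up to the next '>'.
    // Second alternative: a run of characters that are neither
    // whitespace nor '<', i.e. one whitespace-delimited piece of text.
    private static final Pattern TOKEN = Pattern.compile("<[^>]+>|[^\\s<]+");

    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(html);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }
}
```

Calling `HtmlTokenSketch.tokenize("<td class=\"name\">Paul Cowan</td>")` gives the four tokens `<td class="name">`, `Paul`, `Cowan` and `</td>`. The list could then be converted with `toArray(new String[0])` and handed to the name finder's find method, on the assumption that the model was trained on data tokenized the same way.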
