I meant that I have written an HtmlTokenizer and not an html parser. Cheers
Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/


On 31 January 2011 14:52, Paul Cowan <[email protected]> wrote:

> I have written an html parser which I am using to tokenize an html document
> (new line characters removed) and pass it into the find method of
> NameFinderME.
>
> I am getting good results for some basic model testing on identically
> trained html and sample html (without the <START:organization>...<END> tags
> and with different company names).
>
> When it comes to training the model, I am calling the static train method
> of the NameFinderME.
>
> I have noticed that the tokenization of the training data happens in the
> read method of NameSampleDataStream, which in turn calls the static parse
> method of NameSample.
>
> This method uses the WhitespaceTokenizer to tokenize.
>
> Am I right in saying that I should be using the same tokenizer for both
> training and finding?
>
> Should I write something to take care of the NameSample object creation
> that uses my HtmlTokenizer, or maybe it makes sense to extend the
> NameSampleDataStream to allow for the use of other tokenizers?
>
> Cheers
>
> Paul Cowan
>
> Cutting-Edge Solutions (Scotland)
>
> http://thesoftwaresimpleton.blogspot.com/
>
>
> On 26 January 2011 04:33, Khurram <[email protected]> wrote:
>
>> I am trying to find out what the correlation is between the amount of
>> training data and the accuracy of find calls. In other words, at what
>> point does adding more training data start to matter less and less and
>> we run into diminishing returns...
>>
>> One more thing: it would be nice to see something like a Statistic object
>> populated after finder.train to see how well you have trained the model.
>>
>> Thanks,
>>
>> On Tue, Jan 25, 2011 at 8:41 AM, Jörn Kottmann <[email protected]> wrote:
>>
>> > On 1/25/11 3:22 PM, Paul Cowan wrote:
>> >
>> >> Hi,
>> >>
>> >> Thanks for your comments on the JIRA.
>> >>
>> >> Should I be expecting exact results if the training data and the sample
>> >> data are exactly the same, or is there just too little training data to
>> >> tell at this stage?
>> >>
>> >
>> > If you are training with a cutoff of 5 then the results might not be
>> > identical, and even if they are, you want good results on "unknown" data.
>> >
>> > That is why you need a certain amount of training data to get the model
>> > going.
>> >
>> > When we have natural language text we divide it into sentences to extract
>> > a unit we can pass on to the name finder. For me it seems that it is more
>> > difficult to get such a unit when working directly on html data. In your
>> > case I think the previous map feature does not really help. So you could
>> > pass a bigger chunk to the find method than you usually would.
>> >
>> > Maybe even an entire page you crawl at a time. But then you need to have
>> > a good way of tokenizing this page, because your tokenization should take
>> > the html into account; having an html element as a token would make sense
>> > in my eyes. But you could also try to just use the simple tokenizer and
>> > play a little with the feature generation, e.g. increasing the window
>> > size to 5 or even more.
>> >
>> > After you have this you still need to annotate training data, which might
>> > not be that nice with our "text" format, because it would mean that you
>> > have to place an entire page into one line.
>> >
>> > But it should not be hard to come up with a new format; then you write a
>> > small parser and create the NameSample object yourself.
>> >
>> > Hope that helps,
>> > Jörn
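[Editor's note: to make Jörn's suggestion of writing a small parser and creating the NameSample objects yourself more concrete, here is a minimal sketch against the OpenNLP 1.5-era API. It is not code from the thread; it assumes the poster's HtmlTokenizer implements opennlp.tools.tokenize.Tokenizer, that each annotated page arrives as one string per read() from an ObjectStream<String>, and that the tokenizer keeps the <START:type> and <END> markers as single tokens.]

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.namefind.NameSample;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.FilterObjectStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.Span;

// Turns a stream of annotated html pages (one page per read()) into NameSamples,
// using the same Tokenizer that will later be applied before calling find().
public class HtmlNameSampleStream extends FilterObjectStream<String, NameSample> {

    private final Tokenizer tokenizer;

    public HtmlNameSampleStream(ObjectStream<String> pages, Tokenizer tokenizer) {
        super(pages);
        this.tokenizer = tokenizer;
    }

    public NameSample read() throws IOException {
        String page = samples.read();
        if (page == null) {
            return null; // end of the training data
        }

        // Tokenize the whole page, then strip the <START:type> ... <END> markers
        // and record the covered token ranges as Spans.
        String[] raw = tokenizer.tokenize(page);

        List<String> tokens = new ArrayList<String>();
        List<Span> names = new ArrayList<Span>();

        int nameStart = -1;
        String nameType = null;

        for (String token : raw) {
            if (token.startsWith("<START:") && token.endsWith(">")) {
                nameType = token.substring("<START:".length(), token.length() - 1);
                nameStart = tokens.size();
            } else if (token.equals("<END>") && nameStart != -1) {
                names.add(new Span(nameStart, tokens.size(), nameType));
                nameStart = -1;
            } else {
                tokens.add(token);
            }
        }

        return new NameSample(tokens.toArray(new String[tokens.size()]),
            names.toArray(new Span[names.size()]), true);
    }
}

Training then stays close to the usual pattern, e.g. NameFinderME.train("en", "organization", new HtmlNameSampleStream(pages, new HtmlTokenizer()), ...), with the exact train() overload depending on the OpenNLP release.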

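[Editor's note: and on the finding side, a correspondingly small sketch, again assuming HtmlTokenizer implements Tokenizer and that a NameFinderME has already been built from a trained model, keeps the token boundaries identical to those used during training, which is the point of Paul's question about using the same tokenizer in both places.]

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.util.Span;

public class HtmlNameFinding {

    // Applies the same html-aware tokenization at find() time that was used for training.
    public static Span[] findNames(NameFinderME nameFinder, Tokenizer tokenizer, String page) {
        String[] tokens = tokenizer.tokenize(page);
        Span[] names = nameFinder.find(tokens);
        nameFinder.clearAdaptiveData(); // one page is treated as one document, so reset afterwards
        return names;
    }
}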