I meant that I have written an HtmlTokenizer and not an HTML parser.

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 31 January 2011 14:52, Paul Cowan <[email protected]> wrote:

> I have written an html parser which I am using to tokenize an html document
> (newline characters removed); the resulting tokens are passed into the find
> method of NameFinderME.
>
> I am getting good results from some basic model testing where the training
> html and the sample html are identical, apart from the <START:organization>
> ... <END> tags being removed and different company names being used.
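
One possible shape for the find side of that flow, as a rough sketch (the model
path, the htmlWithoutNewlines string and the htmlTokenizer variable are
placeholders for the poster's own setup; NameFinderME, TokenNameFinderModel and
Span live in opennlp.tools.namefind and opennlp.tools.util):

    // Load a trained model and run the name finder over one tokenized document.
    TokenNameFinderModel model =
        new TokenNameFinderModel(new FileInputStream("en-ner-organization.bin"));
    NameFinderME nameFinder = new NameFinderME(model);

    String[] tokens = htmlTokenizer.tokenize(htmlWithoutNewlines); // poster's own tokenizer
    Span[] names = nameFinder.find(tokens);

    // Span.spansToStrings maps the returned token spans back to surface strings.
    for (String name : Span.spansToStrings(names, tokens)) {
        System.out.println(name);
    }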
>
> When it comes to training the model, I am calling the static train method
> of NameFinderME.
>
> I have noticed that the tokenization of the training data happens in the
> read method of NameSampleDataStream which in turn calls the static parse
> method of NameSample.
>
> This method uses the WhitespaceTokenizer to tokenize.
>
> Am I right in saying that I should be using the same tokenizer for both
> training and finding?
>
> Should I write something that takes care of NameSample object creation using
> my HtmlTokenizer, or would it make more sense to extend NameSampleDataStream
> to allow other tokenizers to be used?
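
One way the second option could look, as a rough sketch: wrap the line stream
in a filter that runs the custom tokenizer first and re-joins the tokens with
single spaces, so that NameSample.parse (which splits on whitespace) recovers
exactly the tokens that NameFinderME.find will see later. This assumes the
custom tokenizer keeps the <START:organization> and <END> markers as separate
tokens; HtmlTokenizer stands for the poster's own class.

    // FilterObjectStream, ObjectStream: opennlp.tools.util
    // NameSample: opennlp.tools.namefind; Tokenizer: opennlp.tools.tokenize
    public class TokenizedNameSampleStream
            extends FilterObjectStream<String, NameSample> {

        private final Tokenizer tokenizer;

        public TokenizedNameSampleStream(ObjectStream<String> lines, Tokenizer tokenizer) {
            super(lines);
            this.tokenizer = tokenizer;
        }

        public NameSample read() throws IOException {
            String line = samples.read();   // protected field of FilterObjectStream
            if (line == null) {
                return null;                // end of the training data
            }
            // Pre-tokenize with the custom tokenizer, then re-join with single
            // spaces so the whitespace-based NameSample.parse sees those tokens.
            StringBuilder retokenized = new StringBuilder();
            for (String token : tokenizer.tokenize(line)) {
                retokenized.append(token).append(' ');
            }
            return NameSample.parse(retokenized.toString().trim(), false);
        }
    }

An instance of this can be passed to NameFinderME.train wherever a
NameSampleDataStream would normally go.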
>
>
> Cheers
>
> Paul Cowan
>
> Cutting-Edge Solutions (Scotland)
>
> http://thesoftwaresimpleton.blogspot.com/
>
>
>
> On 26 January 2011 04:33, Khurram <[email protected]> wrote:
>
>> I am trying to find out what the correlation is between the amount of
>> training data and the accuracy of find calls. In other words, at what point
>> does adding more training data start to matter less and less, so that we run
>> into diminishing returns...
>>
>> One more thing: it would be nice to see something like a statistics object
>> populated after finder.train to see how well the model has been trained.
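
One way to get numbers out rather than eyeballing results: hold some annotated
data back from training and run the evaluator over it each time the training
set grows, which makes the diminishing-returns point visible. A rough sketch;
TokenNameFinderEvaluator and FMeasure are existing OpenNLP classes, while
model and heldOutSamples are placeholders.

    // TokenNameFinderEvaluator: opennlp.tools.namefind; FMeasure: opennlp.tools.util.eval
    TokenNameFinderEvaluator evaluator =
        new TokenNameFinderEvaluator(new NameFinderME(model));

    // heldOutSamples: an ObjectStream<NameSample> that was NOT used for training
    evaluator.evaluate(heldOutSamples);

    FMeasure f = evaluator.getFMeasure();
    System.out.println("precision=" + f.getPrecisionScore()
        + " recall=" + f.getRecallScore()
        + " F1=" + f.getFMeasure());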
>>
>> thanks,
>>
>> On Tue, Jan 25, 2011 at 8:41 AM, Jörn Kottmann <[email protected]>
>> wrote:
>>
>> > On 1/25/11 3:22 PM, Paul Cowan wrote:
>> >
>> >> Hi,
>> >>
>> >> Thanks for your comments on the JIRA.
>> >>
>> >> Should I be expecting exact results if the training data and the sample
>> >> data
>> >> are exactly the same or is there just too little training data to tell
>> at
>> >> this stage?
>> >>
>> >>
>> > If you are training with a cutoff of 5 then the results might not be
>> > identical, and even if they are, you want good results on "unknown" data.
>> >
>> > That is why you need a certain amount of training data to get the model
>> > going.
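
For a tiny experimental corpus the cutoff can be lowered explicitly; a rough
sketch follows (the exact train overload varies between OpenNLP releases, so
treat the last call as illustrative):

    // TrainingParameters: opennlp.tools.util; the default cutoff is 5
    TrainingParameters params = TrainingParameters.defaultParams();
    params.put(TrainingParameters.CUTOFF_PARAM, "1");
    params.put(TrainingParameters.ITERATIONS_PARAM, "100");

    // Newer releases take a TokenNameFinderFactory as shown here; older ones
    // use different train(...) signatures.
    TokenNameFinderModel model = NameFinderME.train(
        "en", "organization", samples, params, new TokenNameFinderFactory());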
>> >
>> > When we have natural language text we divide it into sentences to extract
>> > a unit we can pass on to the name finder. To me it seems that it is more
>> > difficult to get such a unit when working directly on html data. In your
>> > case I think the previous map feature does not really help. So you could
>> > pass a bigger chunk to the find method than you usually would.
>> >
>> > Maybe even an entire page you crawl at a time. But then you need a good
>> > way of tokenizing this page, because your tokenization should take the
>> > html into account; having an html element as a token would make sense in
>> > my eyes. But you could also try to just use the simple tokenizer and play
>> > a little with the feature generation, e.g. increasing the window size to 5
>> > or even more.
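
A rough sketch of what a wider-window feature generator could look like,
mirroring the default set of generators but with a window of 5 instead of 2
(all classes are in opennlp.tools.util.featuregen; how the generator is plugged
in, via a train overload or a factory, depends on the OpenNLP version):

    // Same generators as the default, but tokens up to 5 positions away are
    // included as context; the previous-map generator could be dropped here.
    AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 5, 5),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 5, 5),
        new OutcomePriorFeatureGenerator(),
        new PreviousMapFeatureGenerator(),
        new BigramNameFeatureGenerator());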
>> >
>> > After you have this you still need to annotate training data, which might
>> > not be that nice with our "text" format, because it would mean that you
>> > have to place an entire page on one line.
>> >
>> > But it should not be hard to come up with a new format; then you write a
>> > small parser and create the NameSample object yourself.
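
Creating the NameSample yourself is mostly a matter of producing the token
array for a page plus token-index spans for the names. A rough sketch with
made-up example tokens (Span and NameSample are OpenNLP classes):

    // Tokens as produced by a custom html tokenizer for one page (example data).
    String[] tokens = { "<div>", "Acme", "Corp", "</div>", "<p>", "contact", "us", "</p>" };

    // One organization covering tokens 1..2; the end index of a Span is exclusive.
    Span[] names = { new Span(1, 3, "organization") };

    // 'true' clears the adaptive data, i.e. marks the start of a new "document".
    NameSample sample = new NameSample(tokens, names, true);

A stream of such samples can then be handed straight to NameFinderME.train.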
>> >
>> > Hope that helps,
>> > Jörn
>> >
>> >
>>
>
>
