[
https://issues.apache.org/jira/browse/OPENNLP-67?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984030#action_12984030
]
Paul commented on OPENNLP-67:
-----------------------------
Thank you for your detailed feedback, Jörn. I do agree that the training data
file is too large and I will trim it down before resubmitting.
I will also get a better understanding of both tokenization and feature
generation before resubmitting the patch.
One thing I am unsure about is how to break up an HTML file for the name finder.
Sentence detection using a model like en-sent.bin will obviously not work
on HTML. Would I need to train my own model, or should I look at doing this
programmatically?
Could you recommend a strategy for breaking up the HTML?
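For the programmatic route, here is a minimal sketch of one possible approach, using only the Java standard library. It treats block-level tags as segment boundaries and strips the remaining inline tags, so each segment can be tokenized and passed to the name finder like a sentence. The class name HtmlSegmenter, the method segments, and the list of boundary tags are all assumptions made for illustration; none of them are part of OpenNLP or of the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not an OpenNLP class.
public class HtmlSegmenter {

    // Assumed set of tags that mark a segment boundary; adjust as needed.
    private static final Pattern BLOCK_TAG = Pattern.compile(
            "(?i)</?(p|div|br|li|ul|ol|tr|td|th|table|h[1-6])\\b[^>]*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Splits an HTML string into plain-text segments, one per block element.
    public static List<String> segments(String html) {
        List<String> result = new ArrayList<>();
        // Flatten existing whitespace so only block tags create boundaries.
        String flat = WHITESPACE.matcher(html).replaceAll(" ");
        // Turn every block-level tag into a newline.
        String withBreaks = BLOCK_TAG.matcher(flat).replaceAll("\n");
        for (String piece : withBreaks.split("\n")) {
            // Replace remaining (inline) tags with a space, then tidy up.
            String text = ANY_TAG.matcher(piece).replaceAll(" ");
            text = WHITESPACE.matcher(text).replaceAll(" ").trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }
}
```

This sketch does not decode entities such as &amp;amp; and does not skip script or style content, and a regular expression is not a real HTML parser, so a proper parser may be the better choice for messy input. The same segmentation would have to be applied to both the training data and the test document.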
> NameFinderMe detecting organisations in an HTML sample with limited training
> ----------------------------------------------------------------------------
>
> Key: OPENNLP-67
> URL: https://issues.apache.org/jira/browse/OPENNLP-67
> Project: OpenNLP
> Issue Type: Question
> Components: Name Finder
> Affects Versions: tools-1.5.0-sourceforge
> Reporter: Paul
> Attachments: htmltest.patch
>
>
> I have attached a patch named htmltest.patch.
> The patch contains a test named NameFinderMEHtmlTest and 2 embedded resources
> named html1.train and html.html. Obviously html1.train is the training
> sample which is a sample HTML document marked up with <START:organization>
> Org <END> tags. html.html is the same HTML document without the training
> markup. The HTML has been preprocessed with all the line break characters
> removed.
> In the NameFinderMEHtmlTest I am training a model on that data and then
> using find to retrieve the names.
> Was my assumption wrong in thinking that NameFinderME would find the exact
> names from the html? I mean exact in this context because both the training
> html and the test html are the same. The NameFinderMEHtmlTest fails because
> it does not find the first name in full; it finds only part of it. Is this
> because of the limited training data, or does the find method perform badly
> against HTML documents?
> I am new to OpenNLP so there is an element of guesswork as to which streams
> etc. I should be using.