[
https://issues.apache.org/jira/browse/OPENNLP-67?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jörn Kottmann closed OPENNLP-67.
--------------------------------
Resolution: Fixed
The issue can be closed. I used the HTML sample from Paul to create a unit test
which tests that training data with HTML tags is parsed correctly. Thanks for
your help in providing the data, Paul.
We will continue the discussion we started in this issue on the mailing
list, since that is a better place for a more general discussion about how
to train the name finder on HTML data.
> NameFinderMe detecting organisations in an HTML sample with limited training
> ----------------------------------------------------------------------------
>
> Key: OPENNLP-67
> URL: https://issues.apache.org/jira/browse/OPENNLP-67
> Project: OpenNLP
> Issue Type: Question
> Components: Name Finder
> Affects Versions: tools-1.5.0-sourceforge
> Reporter: Paul
> Attachments: html.patch
>
>
> I have attached a patch named htmltest.patch.
> The patch contains a test named NameFinderMEHtmlTest and 2 embedded resources
> named html1.train and html.html. html1.train is the training sample, a
> sample HTML document marked up with <START:organization> Org <END> tags.
> html.html is the same HTML document without the training markup. The HTML
> has been preprocessed so that all the line break characters are removed.
> In the NameFinderMEHtmlTest I am training on the data and then using find to
> retrieve the names.
> Was my assumption wrong in thinking that NameFinderME would find the exact
> names from the HTML? I say exact in this context because the training
> HTML and the test HTML are the same. The NameFinderMEHtmlTest fails because
> it does not find the whole of the first name; it finds only part of it. Is
> this because of the limited training, or is the find method performing badly
> against HTML documents?
> I am new to OpenNLP, so there is an element of guesswork as to which streams
> etc. I should be using.
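
For readers unfamiliar with the <START:organization> ... <END> markup described
above, here is a minimal sketch of how such a whitespace-tokenized training line
can be split into plain tokens plus name spans. It is not OpenNLP's own parser.
The function name parse_name_sample, the "default" fallback type, and the sample
line (including the organisation name) are all made up for illustration. The
sketch assumes every tag and token is separated by whitespace, and it treats
HTML tags as ordinary tokens.

```python
import re

# Matches "<START>" or "<START:sometype>" as a whole whitespace-separated token.
START_TAG = re.compile(r"^<START(?::([^:>\s]+))?>$")
END_TAG = "<END>"


def parse_name_sample(line, default_type="default"):
    """Split an annotated line into tokens and (start, end, type) spans.

    Span indices refer to positions in the returned token list; the end
    index is exclusive. HTML tags such as <p> are kept as ordinary tokens.
    """
    tokens = []
    spans = []
    open_start = None  # token index where the currently open name began
    open_type = None
    for piece in line.split():
        match = START_TAG.match(piece)
        if match:
            if open_start is not None:
                raise ValueError("found <START> before the previous name was closed")
            open_start = len(tokens)
            open_type = match.group(1) or default_type
        elif piece == END_TAG:
            if open_start is None:
                raise ValueError("found <END> without a matching <START>")
            spans.append((open_start, len(tokens), open_type))
            open_start = None
        else:
            tokens.append(piece)
    if open_start is not None:
        raise ValueError("line ended before <END>")
    return tokens, spans


# Hypothetical sample line; the organisation name is invented.
line = "<p> Contact <START:organization> Acme Widgets Ltd <END> today </p>"
tokens, spans = parse_name_sample(line)
for start, end, name_type in spans:
    print(name_type, tokens[start:end])
```

Comparing the spans recovered from the training file with the spans returned
for the unannotated file is one way to see exactly which tokens of a name were
missed.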