Rupert Westenthaler created STANBOL-685:
-------------------------------------------
Summary: Improve POS tag handling of the KeywordLinkingEngine
Key: STANBOL-685
URL: https://issues.apache.org/jira/browse/STANBOL-685
Project: Stanbol
Issue Type: Improvement
Components: Engine - KeywordExtraction
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler
Priority: Minor
The KeywordLinkingEngine can use POS tags to decide whether a Token (word)
needs to be processed or can be skipped. If no POS tag is available, or the
POS tag probability is too low (the current default threshold is 0.8), then
the minimum token length (default is 3) is used as a fall-back.
Analyzing POS tag results has shown that tokens with non-noun tags were often
below the 0.8 limit. For those tokens the fall-back was used, and in most
cases this resulted in the KeywordLinkingEngine processing them.
However, it can also be observed that while some of those POS tags were not
correct, the incorrect tags were usually confusions between two tags that
were both non-noun tags. Because of that, decreasing the minimum probability
for accepting a non-noun POS tag can improve both results and processing time.
The algorithm will therefore be adjusted as follows:
Introduce two tag probability thresholds:
1. "minPosTypeProb" for accepting POS tags that represent nouns, and
2. "minPosTypeProb/2" for rejecting POS tags that are not nouns
Assuming <code>minPosTypeProb=0.667</code> (so the non-noun threshold is about 0.333):
* a noun with probability 0.8 would return <code>true</code>
* a noun with probability 0.5 would return <code>null</code>
* a verb with probability 0.4 would return <code>false</code>
* a verb with probability 0.3 would return <code>null</code>
NOTE: <code>null</code> indicates that no POS tag is available or that the POS
tag has a low probability; in that case the minimum token length fall-back applies.
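The two-threshold rule above can be sketched in Java as follows. This is only a minimal illustration, not the actual Stanbol code: the class name PosTagDecision, the methods decide(..) and isProcessable(..), and their parameters are hypothetical. The use of ">=" for the threshold comparisons and for the token length fall-back is also an assumption, since the issue does not say whether the limits are inclusive.

```java
// Minimal sketch of the proposed rule; all names here are hypothetical
// and do not mirror the real OpenNlpAnalysedContentFactory/EntityLinker code.
public final class PosTagDecision {

    private PosTagDecision() {}

    /**
     * Returns TRUE if the token should be processed (trusted noun tag),
     * FALSE if it can be skipped (trusted non-noun tag), or null if the
     * POS tag is not reliable enough to decide.
     * Assumption: thresholds are inclusive (>=).
     */
    public static Boolean decide(boolean isNoun, double tagProb, double minPosTypeProb) {
        if (isNoun) {
            // Nouns are accepted only above the full threshold.
            return tagProb >= minPosTypeProb ? Boolean.TRUE : null;
        }
        // Non-nouns are rejected already at half the threshold.
        return tagProb >= minPosTypeProb / 2 ? Boolean.FALSE : null;
    }

    /**
     * Applies the minimum token length fall-back when the POS based
     * decision is null. Assumption: tokens with a length equal to
     * minTokenLength are processed.
     */
    public static boolean isProcessable(Boolean decision, String token, int minTokenLength) {
        if (decision != null) {
            return decision;
        }
        return token.length() >= minTokenLength;
    }

    public static void main(String[] args) {
        double min = 0.667;
        System.out.println(decide(true, 0.8, min));   // noun, prob 0.8
        System.out.println(decide(true, 0.5, min));   // noun, prob 0.5
        System.out.println(decide(false, 0.4, min));  // verb, prob 0.4
        System.out.println(decide(false, 0.3, min));  // verb, prob 0.3
    }
}
```

Returning a three-valued Boolean keeps the POS decision separate from the fall-back, so a caller can tell "skip because of a trusted non-noun tag" apart from "no usable tag, check the token length instead".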
These changes will need to be applied to the
"OpenNlpAnalysedContentFactory#processPOS(..)" and the
"EntityLinker#isProcessableToken(..)" methods.