Rupert Westenthaler created STANBOL-685:
-------------------------------------------

             Summary: Improve POS tag handling of the KeywordLinkingEngine
                 Key: STANBOL-685
                 URL: https://issues.apache.org/jira/browse/STANBOL-685
             Project: Stanbol
          Issue Type: Improvement
          Components: Engine - KeywordExtraction
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler
            Priority: Minor


The KeywordLinkingEngine can make use of POS tags to decide if a Token (word) 
needs to be processed or can be skipped. If no POS tags are available or the 
POS tag probability is too low (currently the default is 0.8) then the minimum 
token length (default is 3) is used as fall-back.
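
As a rough illustration of the current behaviour described above, here is a minimal Java sketch. It is not the actual Stanbol code: the class name, method signature and constants are hypothetical, and it assumes that both thresholds are inclusive and that only noun tokens are meant to be processed.

```java
// Hypothetical sketch of the current rule; not the actual Stanbol implementation.
final class CurrentTokenFilter {

    // Defaults taken from the description above.
    static final double MIN_POS_PROB = 0.8;
    static final int MIN_TOKEN_LENGTH = 3;

    private CurrentTokenFilter() {}

    /**
     * @param token  the token text
     * @param isNoun TRUE/FALSE if a POS tag is available, null if there is none
     * @param prob   probability of the POS tag (ignored if isNoun is null)
     * @return true if the token should be processed, false if it can be skipped
     */
    static boolean isProcessable(String token, Boolean isNoun, double prob) {
        if (isNoun != null && prob >= MIN_POS_PROB) {
            // Confident POS tag: process nouns, skip everything else.
            return isNoun;
        }
        // No tag or low probability: fall back to the minimum token length.
        return token.length() >= MIN_TOKEN_LENGTH;
    }
}
```

Under these assumptions a verb tagged with probability 0.6 is not rejected by its POS tag; it falls through to the length check and is processed whenever it has three or more characters.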

Analyzing POS tag results has shown that non-noun tags were often below the 
0.8 limit. For those the fall-back was used, and in most cases this resulted 
in the KeywordLinkingEngine processing those tokens.

However it can also be observed that while some of those POS tags were not 
correct, the incorrect tags were usually confusions between two tags that are 
both non-noun tags. Because of that it can improve results and processing time 
to decrease the minimum probability for accepting a non-noun POS tag.

Because of that the algorithm will be adjusted as follows:

Introduce two Tag Probabilities:

1. "minPosTypeProb" for Accepting POS tags that represent Nouns and
2. "minPosTypeProb/2" for rejecting POS tags that are not nouns

Assuming that <code>minPosTypeProb=0.667</code> (so the threshold for 
rejecting non-noun tags is 0.667/2, roughly 0.333), a

 * noun with prob 0.8 would return <code>true</code>
 * noun with prob 0.5 would return <code>null</code>
 * verb with prob 0.4 would return <code>false</code>
 * verb with prob 0.3 would return <code>null</code>

NOTE: <code>null</code> indicates that no POS tag is available or the POS tag 
has a low probability, so the minimum token length fall-back applies
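
The two thresholds and the four cases above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical (they are not the Stanbol methods named in this issue), and the comparisons are assumed to be inclusive, which the description does not specify.

```java
// Hypothetical sketch of the proposed tri-state decision; not the actual Stanbol code.
final class PosTagDecision {

    private PosTagDecision() {}

    /**
     * @param isNoun         whether the POS tag represents a noun
     * @param prob           probability of the POS tag
     * @param minPosTypeProb minimum probability for accepting a noun tag
     * @return TRUE to process the token, FALSE to skip it, or null if the tag
     *         is not reliable enough to decide
     */
    static Boolean classify(boolean isNoun, double prob, double minPosTypeProb) {
        if (isNoun) {
            // Noun tags need the full minimum probability to be accepted.
            return prob >= minPosTypeProb ? Boolean.TRUE : null;
        }
        // Non-noun tags are rejected at half the minimum probability.
        return prob >= minPosTypeProb / 2 ? Boolean.FALSE : null;
    }

    public static void main(String[] args) {
        double min = 0.667;
        System.out.println(classify(true, 0.8, min));  // true
        System.out.println(classify(true, 0.5, min));  // null
        System.out.println(classify(false, 0.4, min)); // false
        System.out.println(classify(false, 0.3, min)); // null
    }
}
```

A boxed Boolean is used here only because it gives a simple three-valued result; an enum with three constants would express the same thing more explicitly.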

These changes will need to be applied to the 
"OpenNlpAnalysedContentFactory#processPOS(..)" and the 
"EntityLinker#isProcessableToken(..)" methods.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
