Rupert Westenthaler created STANBOL-1117:
--------------------------------------------

             Summary: Use POS tag information for better selection of search 
tokens for EntityLookups
                 Key: STANBOL-1117
                 URL: https://issues.apache.org/jira/browse/STANBOL-1117
             Project: Stanbol
          Issue Type: Sub-task
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


Currently EntityLinking determines the Tokens used for lookups in the
controlled vocabularies as follows:

* start from a "linkable" Token
* search surrounding Tokens for other "linkable" or "matchable" Tokens
* stop when "Max Search Token Distance" (default 3 Tokens) is reached or
* more than one non-"matchable" Token was found
* at most "Max Search Tokens" (default 2 Tokens) are selected, but
* never use Tokens earlier than the last consumed (already linked) Token
* in the case of explicitly annotated Chunks, the selection of search tokens
is additionally limited to those Chunks
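
The selection rules above can be sketched roughly as follows. This is only an
illustrative Python sketch, not the actual EntityLinking code: the Token
class, the function name select_search_tokens and all parameter names are
hypothetical, and it assumes that the "more than one non matchable Token"
rule is counted separately in each search direction.

```python
from dataclasses import dataclass


@dataclass
class Token:
    # Hypothetical stand-in for a processed token; not the Stanbol class.
    text: str
    linkable: bool = False
    matchable: bool = False


def select_search_tokens(tokens, start, last_consumed=-1,
                         max_distance=3, max_search_tokens=2, chunk=None):
    """Return the sorted indices of the tokens selected for a lookup.

    chunk is an optional (first, last) index pair of an annotated Chunk.
    """
    if not tokens[start].linkable:
        return []  # lookups only start from a "linkable" token
    # Never look at tokens at or before the last consumed (linked) token.
    lo, hi = last_consumed + 1, len(tokens) - 1
    if chunk is not None:
        lo, hi = max(lo, chunk[0]), min(hi, chunk[1])
    selected = [start]
    skipped = {1: 0, -1: 0}       # non-matchable tokens seen per direction
    active = {1: True, -1: True}  # whether a direction is still searched
    for distance in range(1, max_distance + 1):
        for direction in (1, -1):
            if len(selected) >= max_search_tokens:
                return sorted(selected)
            if not active[direction]:
                continue
            i = start + direction * distance
            if i < lo or i > hi:
                active[direction] = False
                continue
            if tokens[i].linkable or tokens[i].matchable:
                selected.append(i)
            else:
                skipped[direction] += 1
                if skipped[direction] > 1:
                    active[direction] = False
    return sorted(selected)


tokens = [Token("University", linkable=True), Token("of"),
          Token("Salzburg", linkable=True), Token("is"), Token("nice")]
print(select_search_tokens(tokens, 0))  # [0, 2]
```

Starting at "University", the single non-matchable token "of" is skipped and
"Salzburg" becomes the second search token, which also exhausts the default
of 2 search tokens.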

This Issue will try to improve this algorithm by considering

* "Linkable" and "matchable" Tokens
* Tokens with "chunkable" POS annotations

when selecting search Tokens. This will allow better selection of search tokens 
in cases where no Chunker (NounPhrase detection and/or NER) is present.
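
One possible reading of this idea, sketched in Python: when no Chunker is
available, a chunk-like span could be derived from the POS annotations by
growing outwards from the "linkable" Token for as long as the neighbouring
Tokens are "linkable", "matchable" or carry a "chunkable" POS tag. The tag set
CHUNKABLE_POS, the tuple layout and the function name pos_based_span are
assumptions made for this example only; which POS tags count as "chunkable" is
not defined in this issue.

```python
# Assumed set of "chunkable" POS tags; purely illustrative.
CHUNKABLE_POS = {"NOUN", "PROPN", "ADJ", "ADP", "DET"}


def pos_based_span(tokens, start, chunkable_pos=CHUNKABLE_POS):
    """Return (first, last) indices of the span around tokens[start].

    Each token is a (text, pos, processable) tuple, where processable
    is True for "linkable" or "matchable" tokens.
    """
    def in_span(token):
        _text, pos, processable = token
        return processable or pos in chunkable_pos

    first = start
    while first > 0 and in_span(tokens[first - 1]):
        first -= 1
    last = start
    while last < len(tokens) - 1 and in_span(tokens[last + 1]):
        last += 1
    return first, last


tokens = [("the", "DET", False), ("University", "PROPN", True),
          ("of", "ADP", False), ("Salzburg", "PROPN", True),
          ("opened", "VERB", False), ("yesterday", "ADV", False)]
print(pos_based_span(tokens, 1))  # (0, 3)
```

The resulting span could then bound the selection of search tokens in the
same way explicitly annotated Chunks do today.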

With this in place, it needs to be checked whether increasing the default
"Max Search Tokens" could lead to better results and possibly better
performance (if one query could be used to link multiple Entities for
non-overlapping spans).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
