[jira] [Commented] (STANBOL-1049) Add support for Upper Case Linking for Languages without NLP support

Rupert Westenthaler (JIRA) Tue, 23 Apr 2013 23:17:18 -0700

    [ 
https://issues.apache.org/jira/browse/STANBOL-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640146#comment-13640146
 ]


Rupert Westenthaler commented on STANBOL-1049:
----------------------------------------------

Instead of reusing the 'linkUpperCaseTokens' property to enable/disable the 
improved algorithm, a separate configuration property will be introduced for 
this feature. The property will use the key:

     enhancer.engines.linking.linkOnlyUpperCaseTokensWithMissingPosTag

NOTE: this option will be disabled for all languages that use unicase 
scripts, i.e. scripts that do not use upper case letters. This includes 
languages like Arabic, Hebrew, Hindi, Chinese, Japanese, Korean and Georgian.

The documentation of the EntityLinkingEngine will be changed to include a 
separate section, "Processing Tokens without POS tag". This section will 
explain the details of the improved algorithm and contain information about the

* enhancer.engines.linking.linkOnlyUpperCaseTokensWithMissingPosTag and
* enhancer.engines.linking.minSearchTokenLength

configuration options.
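
As a rough illustration of the note above, the following Python sketch shows how the new property could be combined with a per-language unicase check. It is not taken from the Stanbol code base: the function name, the set of language codes (ISO 639-1 codes for the languages listed above) and the handling of region subtags are assumptions made for this example.

```python
# Hypothetical sketch, not Stanbol code.
# ISO 639-1 codes for the unicase-script languages named in the comment:
# Arabic, Hebrew, Hindi, Chinese, Japanese, Korean, Georgian.
UNICASE_LANGUAGES = {"ar", "he", "hi", "zh", "ja", "ko", "ka"}


def upper_case_only_linking_active(language, property_enabled):
    """Return True if linking should be restricted to upper case tokens.

    `property_enabled` stands for the value of
    enhancer.engines.linking.linkOnlyUpperCaseTokensWithMissingPosTag.
    The option is treated as disabled for unicase languages, where no
    token can be upper case.
    """
    # Assumption: compare only the primary subtag, so "zh-CN" counts as "zh".
    primary = language.lower().split("-")[0]
    return property_enabled and primary not in UNICASE_LANGUAGES
```

With this sketch, German text keeps the feature when the property is set, while Arabic or Chinese text falls back to linking all tokens of searchable length.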


                
> Add support for Upper Case Linking for Languages without NLP support
> --------------------------------------------------------------------
>
>                 Key: STANBOL-1049
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1049
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Enhancement Engines
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> This issue will allow the EntityLinkingEngine to use upper case token 
> information when linking languages without NLP support. 
> If TextProcessingConfig#LinkUpperCaseTokens is enabled, only upper case 
> tokens that are at least as long as the configured min search token length 
> will be linked with the controlled vocabulary. Lower case tokens that are at 
> least as long as the min search token length will be used for matching.
> Deactivating TextProcessingConfig#LinkUpperCaseTokens will preserve the 
> current behavior, where all tokens that are at least as long as the 
> configured min search token length will be linked.
> NOTE: this will require explicitly configuring 
>     {lang};uc=MATCH
> for languages that do not use upper case characters (e.g. Arabic)
> ---
> Definitions:
> -------
> The EntityLinking Engine distinguishes three (Token Types)[1]:
> * Linkable Token: A Word that triggers a lookup in the Controlled Vocabulary
> * Matchable Token: A Word that is used to search and match Entities, but does 
> not trigger a lookup
> * Other Tokens: Not used for search and matching. Might be used for fine 
> tuning confidence values.
> Token level information include
> * hasLinkablePos [true,null,false]: If a POS tag matches the linkable POS
> * hasMatchablePos [true,null,false]: If a POS tag matches the processable POS
> * isUpperCase [true,false]: If the first letter is an upper case one
> * hasAlphaNumeric [true,false]: if the word has an alpha numeric char
> * hasSearchableLength [true,false]: if the word is at least as long as the 
> configured "Min Search Token Length"
> * isSubSentenceStart [true, false]: If the POS tag of a Token is Pos#Quote.
> Algorithm:
> ------
> This describes the algorithm used to classify Tokens as linkable, matchable 
> and other based on the above properties. Rules are applied in the given 
> order. A summary of the result for Tokens with no POS tags is given in the 
> next section.
> __(1) Basic rules:__
> * all Tokens with hasAlphaNumeric == false are neither linkable nor matchable
> * all tokens with hasLinkablePos are linkable
> * all linkable tokens, tokens with matchable POS or a searchableLength are 
> matchable
> __(2) Uppercase Rules__
> These rules are applied to all non-linkable tokens that are (1) upper case 
> and (2) not at a sentence or subSentence start
> * if TextProcessingConfig#LinkUpperCaseTokens is enabled
>     * all matchable Tokens are also linkable
>     * all other Tokens are converted to matchable
> * if TextProcessingConfig#MatchUpperCaseTokens is enabled
>     * all other Tokens are converted to matchable
>     * all Tokens with linkablePos == null and searchableLength are converted 
> to linkable
> __(3) Searchable Token Rules__
> These rules are only applied to non-linkable Tokens with hasLinkablePos == 
> null and hasMatchablePos == null
> * if  TextProcessingConfig#LinkUpperCaseTokens == false
>     * all Tokens with searchableLength are marked as linkable
> Languages without NLP support
> -----
> The above algorithm ensures that for languages without NLP support (no POS 
> tags) Tokens are marked as follows:
> __ LinkUpperCaseTokens is enabled __
> * Linkable: All upper case tokens with a searchable length
> * Matchable: All upper case tokens shorter than the min searchable length; 
> All lower case tokens with a searchable length
> * Other Tokens: All lower case tokens shorter than the min searchable length
> __ LinkUpperCaseTokens is disabled __
> * Linkable: All tokens with a searchable length
> * Other Tokens: All tokens shorter than the min searchable length
> [1] 
> http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#token-types
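
The summary for languages without NLP support in the issue description above can be sketched in Python as follows. This is a minimal illustration, not the EntityLinkingEngine implementation: the names `TokenType` and `classify_without_pos` are made up for this example, sentence and sub-sentence start handling is left out, and MatchUpperCaseTokens is assumed to be disabled.

```python
from enum import Enum


class TokenType(Enum):
    LINKABLE = "linkable"    # triggers a lookup in the controlled vocabulary
    MATCHABLE = "matchable"  # used for matching, but triggers no lookup
    OTHER = "other"          # not used for search and matching


def classify_without_pos(token, min_search_token_length, link_upper_case_tokens):
    """Classify a token that has no POS tag (hypothetical helper).

    Assumptions: sentence/sub-sentence starts are ignored and
    MatchUpperCaseTokens is disabled.
    """
    # Basic rule: tokens without any alphanumeric char are never used.
    if not any(ch.isalnum() for ch in token):
        return TokenType.OTHER

    # "Searchable length" means at least the configured minimum length.
    searchable = len(token) >= min_search_token_length
    # isUpperCase only looks at the first letter of the token.
    upper_case = token[0].isupper()

    if not link_upper_case_tokens:
        # Previous behavior: every token of searchable length is linked.
        return TokenType.LINKABLE if searchable else TokenType.OTHER
    if upper_case:
        return TokenType.LINKABLE if searchable else TokenType.MATCHABLE
    return TokenType.MATCHABLE if searchable else TokenType.OTHER
```

For example, with a min search token length of 3 and LinkUpperCaseTokens enabled, "Paris" would be linkable, "UN" and "city" matchable, and "of" would fall under other tokens.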

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
