Re: Problem with entityLinking on Uppercase tokens

Rupert Westenthaler Sun, 02 Jun 2013 22:47:42 -0700

 Hi Joseph,

Sorry for the late response, but I have been offline since last Tuesday.

I will try to explain how it happens that the token 'AE' is not
classified as linkable given your example.

On Wed, May 29, 2013 at 7:54 PM, Joseph M'Bimbi-Bene
<jbi...@object-ive.com> wrote:
> Hello everybody, i am having some problems with the EntityhubLinkingEngine.
> I am trying to spot mentions of abbreviations, here is the extract of the
> RDF describing the pathological entity:
>
> <rdf:Description rdf:about="http://www.edf.fr/EdfAcronyme.owl#AE">
>     <j.1:name>AE</j.1:name>
>     <dc:description>Acoustic Emission; Architect Engineer </dc:description>
>     <rdf:type rdf:resource="http://www.edf.fr/EdfAcronyme.owl#Acronyme"/>
>     <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#NamedIndividual"/>
>   </rdf:Description>
>
> The language processing is left to default :
> *;lmmtip;uc=LINK;prop=0.75;pprob=0.75
>

Note that the minimum probabilities for POS tags are set to 0.75.

> "Proper Noun Linking" is deactivated.
>
> Here is the text where i try to spot my entity:
>
> "attesté depuis 1480, Knecht (valet) indiquant *AE* une servitude vis-à-vis
> de l’« employeur » et Land"
> Here are some portions of the log of the processing of the tokens:
>
> ProcessingState > 30: Token: [78, 80] *AE *(pos:[Value [pos:
> NC(olia:CommonNoun|olia:Noun)].prob=0.2837978274717965]) chunk: 'none'
>

The token AE is classified as olia:Noun, but the probability of 0.28 is
too low for this information to be considered for linking (< 0.75).
Because of that, this token is treated as if there were no POS
tagging support.

> 29.05.2013 19:36:41.243 *DEBUG* [Thread-112] ProcessingState - TokenData:
> 'AE'[*linkable=false*(linkabkePos=null)| matchable=true(matchablePos=null)|
> alpha=true| seachLength=false| *upperCase=true*]
>

That is why the log shows "linkable=false", "linkabkePos=null" and
"matchablePos=null". If no POS tag is available for a token, the
configured 'Min Token Length'
(enhancer.engines.linking.minSearchTokenLength) is used to decide whether
the token should be considered for searching in the controlled vocabulary.
As you can see above, "seachLength=false". From that I assume that you
use the default value of this configuration, '3'. The token 'AE' has a
length of only '2'.

As specified by STANBOL-1049, upper case tokens that are shorter than
'TextProcessingConfig#minSearchTokenLength' (seachLength=false) are only
marked as matchable, not as linkable. So basically they are just converted
from 'other' tokens (which are ignored by the linking process) to
'matchable' tokens.

If you want to ensure that upper case tokens with two letters are
linked you will need to change the 'Min Token Length' config to '2'.
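
To make the decision described above easier to follow, here is a minimal
sketch in Python. It is not the Stanbol implementation: the function
'classify', the class 'TokenState' and all parameter names are hypothetical,
the handling of tokens with a usable POS tag is heavily simplified, and
"upper case" is assumed to mean that the first character is upper case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenState:
    linkable: bool
    matchable: bool


def classify(token: str,
             pos_prob: Optional[float],
             pos_linkable: bool = True,
             min_pos_prob: float = 0.75,
             min_token_length: int = 3) -> TokenState:
    """Hypothetical, simplified model of the token classification."""
    # A POS tag only counts if its probability reaches the configured minimum.
    if pos_prob is not None and pos_prob >= min_pos_prob:
        return TokenState(linkable=pos_linkable, matchable=pos_linkable)

    # No usable POS tag: fall back to the 'Min Token Length' check.
    if len(token) >= min_token_length:
        return TokenState(linkable=True, matchable=True)

    # Upper case rule (STANBOL-1049): too short to be linkable,
    # but still promoted from 'other' to 'matchable'.
    # Assumption: "upper case" means the first character is upper case.
    if token[:1].isupper():
        return TokenState(linkable=False, matchable=True)

    return TokenState(linkable=False, matchable=False)


print(classify("AE", 0.2838))                      # 'Min Token Length' of 3
print(classify("AE", 0.2838, min_token_length=2))  # 'Min Token Length' of 2
```

Under these assumptions the first call gives linkable=False, matchable=True,
which is what your log shows for 'AE'. The second call gives linkable=True,
which is the effect of lowering 'Min Token Length' to '2'.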

> Here is the rest of the logs:
>
> *preocess Token 29: uant AE u* (lemma: null) linkable=true, matchable=true
> | chunk: none
>
> EntityLinker - 28:'di' (lemma: null) linkable=false, matchable=false
>
> EntityLinker + 30:'AE' (lemma: null) linkable=false, matchable=true
>
> EntityLinker >> searchStrings [uant AE u, AE]
>

Which tokenizers do you use? 'uant AE u' seems to be a strange token.
The logging 'searchStrings [uant AE u, AE]' could indicate that you
have two tokenizers in the same enhancement chain. This is something
that should definitely be avoided.

E.g. users should make sure to configure '!fr' for the OpenNLP
Tokenizer engine if they configure Talismane NLP (via the RESTful NLP
Analysis Engine) to process French texts.

best
Rupert

> EntityLinker - found 1 entities ...
>
> EntityLinker > http://www.edf.fr/EdfAcronyme.owl#AE (ranking: null)
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> for language null
>
> 29.05.2013 19:36:41.277 *TRACE* [Thread-112]
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> Language null not configured to be supported
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> for language null
>
> 29.05.2013 19:36:41.277 *TRACE* [Thread-112]
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> Language null not configured to be supported
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> for language null
>
> MainLabelTokenizer - tokenized ae -> [ae]
>
> EntityLinker - no match
>
> *EntityLinker --- preocess Token 33: ser* (lemma: null) linkable=true,
> matchable=true | chunk: none
>
> EntityLinker - 32:'e' (lemma: null) linkable=false, matchable=false
>
> EntityLinker + 34:'servitude' (lemma: null) linkable=true, matchable=true
>
> EntityLinker >> searchStrings [ ser, servitude]
>
> EntityLinker - found 0 entities ...
>
>
> As we can see, "AE" is never processed. What am I doing wrong? Thank you
> in advance

--
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
