Re: Problem with entityLinking on Uppercase tokens

Joseph M'Bimbi-Bene Mon, 03 Jun 2013 01:02:50 -0700

Thank you for your answer


2013/6/3 Rupert Westenthaler <[email protected]>

>  Hi Joseph,
>
> Sorry for the late response, but I was offline since last Tuesday.
>
> I will try to explain how it happens that the token 'AE' is not
> classified as linkable given your example.
>
> On Wed, May 29, 2013 at 7:54 PM, Joseph M'Bimbi-Bene
> <[email protected]> wrote:
> > Hello everybody, I am having some problems with the EntityhubLinkingEngine.
> > I am trying to spot mentions of abbreviations. Here is the extract of the
> > RDF describing the problematic entity:
> >
> > <rdf:Description rdf:about="http://www.edf.fr/EdfAcronyme.owl#AE">
> >     <j.1:name>AE</j.1:name>
> >     <dc:description>Acoustic Emission; Architect Engineer</dc:description>
> >     <rdf:type rdf:resource="http://www.edf.fr/EdfAcronyme.owl#Acronyme"/>
> >     <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#NamedIndividual"/>
> >   </rdf:Description>
> >
> > The language processing is left at its default:
> > *;lmmtip;uc=LINK;prop=0.75;pprob=0.75
> >
>
> Note that the minimum probabilities for POS tags are set to 0.75.
>
> > "Proper Noun Linking" is deactivated.
> >
> > Here is the text where I try to spot my entity:
> >
> > "attesté depuis 1480, Knecht (valet) indiquant *AE* une servitude vis-à-vis
> > de l’« employeur » et Land"
> >
> > Here are some portions of the log of the processing of the tokens:
> >
> > ProcessingState > 30: Token: [78, 80] *AE *(pos:[Value [pos:
> > NC(olia:CommonNoun|olia:Noun)].prob=0.2837978274717965]) chunk: 'none'
> >
>
> The token AE is classified as olia:Noun, but the probability 0.28 is
> too low for this information to be considered for linking (< 0.75).
> Because of that, this token is treated as if there were no POS
> tagging support.
>
> > 29.05.2013 19:36:41.243 *DEBUG* [Thread-112] ProcessingState - TokenData:
> > 'AE'[*linkable=false*(linkabkePos=null)| matchable=true(matchablePos=null)|
> > alpha=true| seachLength=false| *upperCase=true*]
> >
>
> That is why the log shows "linkable=false", "linkabkePos=null" and
> "matchablePos=null". If no POS tag is available for a Token, the
> configured 'Min Token Length'
> (enhancer.engines.linking.minSearchTokenLength) is used to decide whether
> the Token should be considered for searching in the controlled vocabulary.
> As you can see in the log above, "seachLength=false". Because of that I
> assume that you use the default value of this configuration, '3'. The
> token 'AE' has a length of only '2'.
>
> As specified by STANBOL-1049 upper case tokens with
> 'TextProcessingConfig#minSearchTokenLength == false' are only marked
> matchable and not as linkable. So basically they are just converted
> from 'other' tokens (that are ignored by the linking process) to
> 'matchable' tokens.
>
> If you want to ensure that upper case tokens with two letters are
> linked you will need to change the 'Min Token Length' config to '2'.
>
>
> > Here is the remainder of the logs:
> >
> > *preocess Token 29: uant AE u* (lemma: null) linkable=true, matchable=true
> > | chunk: none
> >
> > EntityLinker - 28:'di' (lemma: null) linkable=false, matchable=false
> >
> > EntityLinker + 30:'AE' (lemma: null) linkable=false, matchable=true
> >
> > EntityLinker >> searchStrings [uant AE u, AE]
> >
>
> What Tokenizers do you use? 'uant AE u' seems to be a strange token.
> The logging 'searchStrings [uant AE u, AE]' could indicate that you
> have two tokenizers in the same enhancement chain. This is something
> that should definitely be avoided.
>
> E.g. users should make sure to configure '!fr' for the OpenNLP
> Tokenizer engine if they configure Talismane NLP (via the RESTful NLP
> Analysis Engine) to process French texts.
>
>
>
I think it comes from the tokenizing process of Talismane NLP, since my
enhancement chain is:
- langdetect
- talismaneNLP
- MyVocabulary


> best
> Rupert
>
>
> > EntityLinker - found 1 entities ...
> >
> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#AE (ranking: null)
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > for language null
> >
> > 29.05.2013 19:36:41.277 *TRACE* [Thread-112]
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > Language null not configured to be supported
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > for language null
> >
> > 29.05.2013 19:36:41.277 *TRACE* [Thread-112]
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > Language null not configured to be supported
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> > for language null
> >
> > MainLabelTokenizer - tokenized ae -> [ae]
> >
> > EntityLinker - no match
> >
> > *EntityLinker --- preocess Token 33: ser* (lemma: null) linkable=true,
> > matchable=true | chunk: none
> >
> > EntityLinker - 32:'e' (lemma: null) linkable=false, matchable=false
> >
> > EntityLinker + 34:'servitude' (lemma: null) linkable=true, matchable=true
> >
> > EntityLinker >> searchStrings [ ser, servitude]
> >
> > EntityLinker - found 0 entities ...
> >
> >
> > As we can see, "AE" is never processed. What am I doing wrong? Thank you
> > in advance.
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>
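
The linkable/matchable decision that Rupert describes above can be sketched
roughly as follows. This is only an illustration in Python, not the Stanbol
implementation: the names classify_token and TokenState are hypothetical, and
the branch for a sufficiently confident POS tag is a simplifying assumption,
since the thread only explains what happens when the POS tag is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenState:
    """Hypothetical stand-in for the per-token flags shown in the log."""
    linkable: bool
    matchable: bool


def classify_token(token: str,
                   pos_linkable: Optional[bool] = None,
                   pos_prob: float = 0.0,
                   min_pos_prob: float = 0.75,
                   min_search_token_length: int = 3) -> TokenState:
    """Sketch of the rules described in the thread (not Stanbol code)."""
    # A POS tag is only used if its probability reaches the configured minimum.
    # Assumption: a confident tag decides both flags directly.
    if pos_linkable is not None and pos_prob >= min_pos_prob:
        return TokenState(linkable=pos_linkable, matchable=pos_linkable)

    # Otherwise the token is treated as if there were no POS tagging support,
    # and the 'Min Token Length' decides whether it is searched.
    if len(token) >= min_search_token_length:
        return TokenState(linkable=True, matchable=True)

    # Per STANBOL-1049 as summarised above: short upper case tokens are only
    # promoted from 'other' to matchable, never to linkable.
    if token.isalpha() and token.isupper():
        return TokenState(linkable=False, matchable=True)

    # Everything else is an 'other' token that the linking process ignores.
    return TokenState(linkable=False, matchable=False)


# 'AE' with the POS probability from the log and the default length of 3
print(classify_token("AE", pos_linkable=True, pos_prob=0.2837978274717965))
# 'AE' again after lowering 'Min Token Length' to 2
print(classify_token("AE", pos_linkable=True, pos_prob=0.2837978274717965,
                     min_search_token_length=2))
```

Under these assumptions the first call gives linkable=False, matchable=True,
which is what the TokenData log line reports for 'AE', and the second call
gives linkable=True, which is the effect of the suggested configuration
change.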
