Thank you for your answer.
2013/6/3 Rupert Westenthaler <[email protected]>

> Hi Joseph,
>
> Sorry for the late response, but I have been offline since last Tuesday.
>
> I will try to explain why the token 'AE' is not classified as
> linkable in your example.
>
> On Wed, May 29, 2013 at 7:54 PM, Joseph M'Bimbi-Bene
> <[email protected]> wrote:
> > Hello everybody, I am having some problems with the
> > EntityhubLinkingEngine.
> > I am trying to spot mentions of abbreviations. Here is the extract of the
> > RDF describing the pathological entity:
> >
> > <rdf:Description rdf:about="http://www.edf.fr/EdfAcronyme.owl#AE">
> >   <j.1:name>AE</j.1:name>
> >   <dc:description>Acoustic Emission; Architect Engineer</dc:description>
> >   <rdf:type rdf:resource="http://www.edf.fr/EdfAcronyme.owl#Acronyme"/>
> >   <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#NamedIndividual"/>
> > </rdf:Description>
> >
> > The language processing is left at its default:
> > *;lmmtip;uc=LINK;prop=0.75;pprob=0.75
> >
>
> NOTE: the minimum probabilities for POS tags are set to 0.75.
>
> > "Proper Noun Linking" is deactivated.
> >
> > Here is the text where I try to spot my entity:
> >
> > "attesté depuis 1480, Knecht (valet) indiquant *AE* une servitude vis-à-vis
> > de l’« employeur » et Land"
> >
> > Here are some portions of the log of the processing of the tokens:
> >
> > ProcessingState > 30: Token: [78, 80] *AE* (pos:[Value [pos:
> > NC(olia:CommonNoun|olia:Noun)].prob=0.2837978274717965]) chunk: 'none'
> >
>
> The token AE is classified as olia:Noun, but the probability 0.28 is
> too low for this information to be considered for linking (< 0.75).
> Because of that, this token is treated as if there were no POS
> tagging support.
>
> > 29.05.2013 19:36:41.243 *DEBUG* [Thread-112] ProcessingState - TokenData:
> > 'AE'[*linkable=false*(linkabkePos=null)| matchable=true(matchablePos=null)|
> > alpha=true| seachLength=false| *upperCase=true*]
> >
>
> That is why you see "linkable=false", "linkabkePos=null" and
> "matchablePos=null". If no POS tag is available for a Token, the
> configured 'Min Token Length'
> (enhancer.engines.linking.minSearchTokenLength) is used to decide if a
> Token should be considered for searching in the controlled vocabulary.
> As you can see above, "seachLength=false". Because of that I
> assume that you use the default value of this configuration, '3'. The
> token 'AE' has a length of only '2'.
>
> As specified by STANBOL-1049, upper case tokens with
> 'TextProcessingConfig#minSearchTokenLength == false' are only marked
> as matchable and not as linkable. So basically they are just converted
> from 'other' tokens (that are ignored by the linking process) to
> 'matchable' tokens.
>
> If you want to ensure that upper case tokens with two letters are
> linked, you will need to change the 'Min Token Length' config to '2'.
>
> > Here is the remainder of the logs:
> >
> > *preocess Token 29: uant AE u* (lemma: null) linkable=true, matchable=true
> > | chunk: none
> >
> > EntityLinker - 28:'di' (lemma: null) linkable=false, matchable=false
> >
> > EntityLinker + 30:'AE' (lemma: null) linkable=false, matchable=true
> >
> > EntityLinker >> searchStrings [uant AE u, AE]
> >
>
> Which Tokenizers do you use? '*uant AE u*' seems to be a strange token.
> The logging 'searchStrings [uant AE u, AE]' could indicate that you
> have two tokenizers in the same enhancement chain. This is something
> that should definitely be avoided.
>
> E.g. users should make sure to configure '!fr' for the OpenNLP
> Tokenizer engine if they configure Talismane NLP (via the RESTful NLP
> Analysis Engine) to process French texts.
>

I think it is the tokenizing process of Talismane NLP, since my enhancement
chain is:
-langdetect
-talismaneNLP
-MyVocabulary

> best
> Rupert
>
> > EntityLinker - found 1 entities ...
> >
> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#AE (ranking: null)
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > for language null
> >
> > 29.05.2013 19:36:41.277 *TRACE* [Thread-112]
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > Language null not configured to be supported
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > for language null
> >
> > 29.05.2013 19:36:41.277 *TRACE* [Thread-112]
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > Language null not configured to be supported
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> > for language null
> >
> > MainLabelTokenizer - tokenized ae -> [ae]
> >
> > EntityLinker - no match
> >
> > *EntityLinker --- preocess Token 33: ser* (lemma: null) linkable=true,
> > matchable=true | chunk: none
> >
> > EntityLinker - 32:'e' (lemma: null) linkable=false, matchable=false
> >
> > EntityLinker + 34:'servitude' (lemma: null) linkable=true, matchable=true
> >
> > EntityLinker >> searchStrings [ ser, servitude]
> >
> > EntityLinker - found 0 entities ...
> >
> > As we can see, "AE" is never processed. What am I doing wrong? Thank you
> > in advance
> >
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen
>
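
PS: to make the fallback rules Rupert describes easier to follow, here is a small Python sketch of them. It only restates the behaviour as explained in this thread and is not Stanbol's actual code: the names TokenState, pos_tag_is_usable and classify_without_pos are hypothetical, and the use of ">=" for both thresholds is an assumption on my side.

```python
from dataclasses import dataclass


@dataclass
class TokenState:
    """Flags mirroring the TokenData fields shown in the log (hypothetical type)."""
    linkable: bool
    matchable: bool
    search_length: bool
    upper_case: bool


def pos_tag_is_usable(prob: float, min_prob: float = 0.75) -> bool:
    # A POS tag below the configured minimum probability is ignored,
    # so the token is handled as if no POS tagging were available.
    # Assumption: the comparison is inclusive (>=).
    return prob >= min_prob


def classify_without_pos(token: str, min_search_token_length: int = 3) -> TokenState:
    # Fallback used when no usable POS tag exists for the token.
    search_length = len(token) >= min_search_token_length
    upper_case = token.isupper()
    # Only tokens that reach the minimum length are linkable.
    linkable = search_length
    # Per STANBOL-1049 as described above, upper case tokens that are
    # too short are still kept as matchable, but not as linkable.
    matchable = search_length or upper_case
    return TokenState(linkable, matchable, search_length, upper_case)


if __name__ == "__main__":
    # The POS tag for 'AE' had prob=0.28..., below the 0.75 minimum.
    print(pos_tag_is_usable(0.2837978274717965))
    # Default 'Min Token Length' of 3 versus the suggested value of 2.
    print(classify_without_pos("AE"))
    print(classify_without_pos("AE", min_search_token_length=2))
```

With the default of 3, 'AE' comes out as matchable but not linkable, which matches the TokenData line in my log. With 'Min Token Length' set to 2 it becomes linkable, which is the change Rupert suggests.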
