Hi Joseph,

Sorry for the late response; I have been offline since last Tuesday.
I will try to explain why the token 'AE' is not classified as linkable in your example.

On Wed, May 29, 2013 at 7:54 PM, Joseph M'Bimbi-Bene <jbi...@object-ive.com> wrote:
> Hello everybody, i am having some problems with the EntityhubLinkingEngine.
> I am trying to spot mentions of abbreviations, here is the extract of the
> RDF describing the pathological entity:
>
> <rdf:Description rdf:about="http://www.edf.fr/EdfAcronyme.owl#AE">
> <j.1:name>AE</j.1:name>
> <dc:description>Acoustic Emission; Architect Engineer </dc:description>
> <rdf:type rdf:resource="http://www.edf.fr/EdfAcronyme.owl#Acronyme"/>
> <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#NamedIndividual"/>
> </rdf:Description>
>
> The language processing is left ot default :
> *;lmmtip;uc=LINK;prop=0.75;pprob=0.75

NOTE the min probabilities for POS tags are set to 0.75.

> "Proper Noun Linking" is deactivated.
>
> Here is the text where i try to spot my entity:
>
> "attesté depuis 1480, Knecht (valet) indiquant *AE* une servitude vis-à-vis
> de l’« employeur » et Land"
> Here are some portions of the log of the processing of the tokens:
>
> ProcessingState > 30: Token: [78, 80] *AE *(pos:[Value [pos:
> NC(olia:CommonNoun|olia:Noun)].prob=0.2837978274717965]) chunk: 'none'

The token AE is classified as olia:Noun, but the probability 0.28 is too low (< 0.75) for this information to be considered for linking. Because of that, this token is treated as if there were no POS tagging support.

> 29.05.2013 19:36:41.243 *DEBUG* [Thread-112] ProcessingState - TokenData:
> 'AE'[*linkable=false*(linkabkePos=null)| matchable=true(matchablePos=null)|
> alpha=true| seachLength=false| *upperCase=true*]

That is why you see "linkable=false", "linkabkePos=null" and "matchablePos=null".

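The decision described in this reply (a low-probability POS tag is ignored, then the 'Min Token Length' fallback explained right below applies) can be summarised in a small sketch. This is only an illustration and not the actual Stanbol code: the class TokenClassificationSketch, the method classify and its parameters are hypothetical names, "upper case" is assumed to mean that the first character is upper case, and a trusted POS tag is simply assumed to decide both flags.

```java
/** Illustrative sketch only; NOT the actual Apache Stanbol implementation. */
final class TokenClassificationSketch {

    /** Flags named after the fields shown in the DEBUG log. */
    static final class TokenFlags {
        final boolean linkable;
        final boolean matchable;

        TokenFlags(boolean linkable, boolean matchable) {
            this.linkable = linkable;
            this.matchable = matchable;
        }
    }

    /**
     * @param token                the token text, e.g. "AE"
     * @param posIsLinkable        whether the POS tag belongs to a linkable category
     *                             (assumption: a noun when proper noun linking is off)
     * @param posProb              probability reported by the POS tagger
     * @param minPosProb           configured minimum POS probability (0.75 in the thread)
     * @param minSearchTokenLength configured 'Min Token Length' (assumed default 3)
     */
    static TokenFlags classify(String token, boolean posIsLinkable,
                               double posProb, double minPosProb,
                               int minSearchTokenLength) {
        // A POS tag is only trusted if its probability reaches the minimum.
        if (posProb >= minPosProb) {
            // Simplification: a trusted tag decides both flags.
            return new TokenFlags(posIsLinkable, posIsLinkable);
        }
        // Otherwise the token is handled as if there were no POS tagging:
        // the minimum token length decides (logged as 'seachLength').
        boolean searchLength = token.length() >= minSearchTokenLength;
        if (searchLength) {
            return new TokenFlags(true, true);
        }
        // Too short: upper case tokens are only matchable (STANBOL-1049),
        // all other short tokens are ignored by the linking process.
        boolean upperCase = !token.isEmpty()
                && Character.isUpperCase(token.codePointAt(0));
        return new TokenFlags(false, upperCase);
    }
}
```

Under these assumptions, classify("AE", true, 0.2837978274717965, 0.75, 3) gives linkable=false and matchable=true, which matches the log above, while the same call with a minimum length of 2 gives linkable=true.
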
If no POS tag is available for a Token, the configured 'Min Token Length' (enhancer.engines.linking.minSearchTokenLength) is used to decide if the Token should be considered for searching in the controlled vocabulary. As you can see in the log above, "seachLength=false". Because of that I assume that you use the default value of this configuration, '3'. The token 'AE' has a length of only '2'.

As specified by STANBOL-1049, upper case tokens that are shorter than 'TextProcessingConfig#minSearchTokenLength' (those logged with "seachLength=false") are only marked as matchable and not as linkable. So basically they are just converted from 'other' tokens (that are ignored by the linking process) to 'matchable' tokens.

If you want to ensure that upper case tokens with two letters are linked, you will need to change the 'Min Token Length' config to '2'.

> Here is the remaining of the logs:
>
> *preocess Token 29: uant AE u* (lemma: null) linkable=true, matchable=true
> | chunk: none
>
> EntityLinker - 28:'di' (lemma: null) linkable=false, matchable=false
>
> EntityLinker + 30:'AE' (lemma: null) linkable=false, matchable=true
>
> EntityLinker >> searchStrings [uant AE u, AE]

Which Tokenizers do you use? *'uant AE u'* seems to be a strange token. The logging 'searchStrings [uant AE u, AE]' could indicate that you have two tokenizers in the same enhancement chain. This is something that should definitely be avoided. E.g. users should make sure to configure '!fr' for the OpenNLP Tokenizer engine if they configure Talismane NLP (via the RESTful NLP Analysis Engine) to process French texts.

best
Rupert

> EntityLinker - found 1 entities ...
>
> EntityLinker > http://www.edf.fr/EdfAcronyme.owl#AE (ranking: null)
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> for language null
>
> 29.05.2013 19:36:41.277 *TRACE* [Thread-112]
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> Language null not configured to be supported
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> for language null
>
> 29.05.2013 19:36:41.277 *TRACE* [Thread-112]
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> Language null not configured to be supported
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> for language null
>
> MainLabelTokenizer - tokenized ae -> [ae]
>
> EntityLinker - no match
>
> *EntityLinker --- preocess Token 33: ser* (lemma: null) linkable=true,
> matchable=true | chunk: none
>
> EntityLinker - 32:'e' (lemma: null) linkable=false, matchable=false
>
> EntityLinker + 34:'servitude' (lemma: null) linkable=true, matchable=true
>
> EntityLinker >> searchStrings [ ser, servitude]
>
> EntityLinker - found 0 entities ...
>
>
> As we can see, "AE" is never processed. What am i doign wrong ? Thank you
> in advance

--
| Rupert Westenthaler    rupert.westentha...@gmail.com
| Bodenlehenstraße 11    ++43-699-11108907
| A-5500 Bischofshofen