Hello everybody,

I'm having some problems with the EntityhubLinkingEngine. Until about two weeks ago, I used it for NER tasks on a custom vocabulary and it worked fine. Now I cannot spot entities whose label spans several words (even with the lmmtip parameter set in the "languages configuration"), and matching now seems to be case sensitive, even though it is configured not to be.

Here is what my entity looks like:

<rdf:Description rdf:about="http://example.org/resource#Mario">
  <skos:prefLabel>Mario</skos:prefLabel>
  <skos:altLabel>le plombier moustachu</skos:altLabel>
  <rdf:type>http://example.org/concept#gentil</rdf:type>
  <rdf:type>http://example.org/concept#humain</rdf:type>
</rdf:Description>

I want to spot it with the mention "plombier moustachu". Here is a log illustrating what I used to get:

18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker --- preocess Token 825: plombier (lemma: none | pos:[]) chunk: none
18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker - 824:'le' (lemma: none | pos:[])
18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker + 826:'moustachu' (lemma: none | pos:[])
18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker >> searchStrings [plombier, moustachu]
18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker - found 1 entities ...
18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker > http://example.org/resource#Mario
18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker < le plombier moustachu[m=FULL,s=3,c=3(1.0)/3] score=1.0[l=1.0,t=1.0] for http://example.org/resource#Mario
18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker >> Suggestions:
18.04.2013 14:37:15.794 *DEBUG* [Thread-303] org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker - 0: le plombier moustachu[m=FULL,s=3,c=3(1.0)/3] score=1.0[l=1.0,t=1.0] for http://example.org/resource#Mario

And here is what I get now, for the processing of the token "plombier":

EntityLinker --- *preocess Token 17: plombier* (lemma: none | pos:[]) chunk: none
EntityLinker - 16:'le' (lemma: none | pos:[])
EntityLinker - 18*:'moustachu'* (lemma: none | pos:[])
EntityLinker - 15:'sont' (lemma: none | pos:[])
EntityLinker - 19:'des' (lemma: none | pos:[])
EntityLinker - 14:',' (lemma: none | pos:[])
EntityLinker - 20:'collines' (lemma: none | pos:[])
EntityLinker >> *searchStrings [plombier]*
.MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
MainLabelTokenizer - tokenized le plombier moustachu -> *[le, plombier, moustachu]*
MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
MainLabelTokenizer - tokenized Mario -> [Mario]
EntityLinker - *no match*

Why aren't both "plombier" and "moustachu" in the searchStrings, just as before?

And now with the processing of "mario":

.EntityLinker --- preocess Token 16: *mario* (lemma: none | pos:[]) chunk: none
.EntityLinker - 15:'sont' (lemma: none | pos:[])
.EntityLinker - 17:'des' (lemma: none | pos:[])
.EntityLinker - 14:',' (lemma: none | pos:[])
.EntityLinker - 18:'collines' (lemma: none | pos:[])
.EntityLinker - 13:'mendips' (lemma: none | pos:[])
.EntityLinker - 19:'situées' (lemma: none | pos:[])
.EntityLinker >> searchStrings *[mario]*
.EntityLinker - found 1 entities ...
.EntityLinker > http://example.org/resource/Mario (ranking: null)
.MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
.MainLabelTokenizer - tokenized le plombier moustachu -> [le, plombier, moustachu]
.MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
.MainLabelTokenizer - tokenized Mario -> *[Mario]*
.EntityLinker - *no match*

Why isn't "mario" matched against "Mario"? I configured the engine so that it is not case sensitive.

As you can see from the log for "plombier", the tokens "le" and "moustachu" are still within the MaxTokenSearchDistance, but they do not go into the searchStrings used for the lookup. As a result, the enhancement results are now pretty bad.

What is going on?

Thank you very much in advance.
