Based on the reported issues I identified the following problems:

(1) "Unknown POS tag Rules" also need to be applied if only one of
    linkablePos and matchablePos is NULL. However, those rules can only
    convert a matchable token to a linkable one if linkablePos==NULL, and
    convert another token to a matchable one if matchablePos==NULL
    (reopened STANBOL-1049).
(2) If "Unknown POS tag Rules" mark a Token as isLinkable=true they MUST
    also set isMatchable=true (also STANBOL-1049).
(3) The default value for minExcludePosProbability was incorrectly set to
    0.75/4=0.1875 instead of 0.75/2=0.375 (created STANBOL-1063).
(4) "Link multiple matchable tokens in chunks" should be ignored in chunks
    that already contain a linkable token. E.g. in the chunk "Express
    Tribune newspaper reports", "newspaper" should not be converted to a
    linkable token (created STANBOL-1064 about this).

I hope to provide fixes for all those issues later today.

best
Rupert

On Tue, May 7, 2013 at 6:09 PM, Rupert Westenthaler <[email protected]> wrote:
> Hi all,
>
> FYI, Joseph provided a detailed report about his problem. A first look
> indicates that this problem could potentially be a bug introduced
> with STANBOL-1049 [1]; however, I have not yet had time to look into this
> as I was traveling for the last 7 days.
>
> best
> Rupert
>
> [1] https://issues.apache.org/jira/browse/STANBOL-1049
>
> On Mon, May 6, 2013 at 10:48 AM, Joseph M'Bimbi-Bene
> <[email protected]> wrote:
>> I thought it might be a bug in the absence of POS tagging etc., so I used
>> Talismane for NLP tasks. I configured the EntityhubLinkingEngine to link
>> adjectives, since that is what Talismane tags "mario" as, but it doesn't
>> change anything.
>> Here are the logs:
>>
>> .EntityLinker --- preocess Token 117: *moustachu* (lemma: none |
>> pos:[Value [pos: ADJ(olia:Adjective)].prob=0.4520518431389538]) chunk: none
>> .EntityLinker - 116:'*plombier*' (lemma: none | pos:[Value [pos:
>> NC(olia:CommonNoun|olia:Noun)].prob=0.6784572817881412])
>> .EntityLinker + 118:'supérieure' (lemma: none | pos:[Value [pos:
>> ADJ(olia:Adjective)].prob=0.9366843193563169])
>> .EntityLinker >> searchStrings *[moustachu, supérieure]*
>> .EntityLinker - found 1 entities ...
>> .EntityLinker > http://example.org/resource/Mario (ranking: null)
>> .MainLabelTokenizer > use Tokenizer class
>> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
>> for language null
>> .MainLabelTokenizer - tokenized le plombier moustachu -> *[le, plombier,
>> moustachu]*
>> .MainLabelTokenizer > use Tokenizer class
>> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
>> for language null
>> .MainLabelTokenizer - tokenized Mario -> [Mario]
>> .EntityLinker - *no match*
>>
>> Why isn't "plombier" in "searchStrings"? Even though I configured the
>> engine so that adjectives are linkable tokens, according to the
>> documentation "plombier" should be a "matchable token". The behavior of
>> this engine is quite disturbing ...
>>
>>
>> 2013/5/6 Joseph M'Bimbi-Bene <[email protected]>
>>
>>> Hello everybody, I'm having some problems with the EntityhubLinkingEngine.
>>> Until about 2 weeks ago, I used it for NER tasks on a custom vocabulary
>>> and it worked fine. Now I cannot spot entities whose label spans several
>>> words (even with the parameter lmmtip in "languages configuration"), and
>>> it now seems to be case sensitive, even though it is configured not to be.
>>>
>>> Here is what my entity looks like:
>>>
>>> <rdf:Description rdf:about="http://example.org/resource#Mario">
>>>   <skos:prefLabel>Mario</skos:prefLabel>
>>>   <skos:altLabel>le plombier moustachu</skos:altLabel>
>>>   <rdf:type>http://example.org/concept#gentil</rdf:type>
>>>   <rdf:type>http://example.org/concept#humain</rdf:type>
>>> </rdf:Description>
>>>
>>> And I want to spot it with the mention "plombier moustachu".
>>> Here is a log illustrating what I used to have:
>>>
>>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303]
>>> org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker ---
>>> preocess Token 825: plombier (lemma: none | pos:[]) chunk: none
>>>
>>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303]
>>> org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker -
>>> 824:'le' (lemma: none | pos:[])
>>>
>>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303]
>>> org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker +
>>> 826:'moustachu' (lemma: none | pos:[])
>>>
>>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303]
>>> org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker >>
>>> searchStrings
>>> [plombier, moustachu]
>>>
>>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303]
>>> org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker -
>>> found 1 entities ...
>>>
>>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303]
>>> org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker >
>>> http://example.org/resource#Mario
>>>
>>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303]
>>> org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker <
>>> le plombier moustachu[m=FULL,s=3,c=3(1.0)/3] score=1.0[l=1.0,t=1.0] for
>>> http://example.org/resource#Mario
>>>
>>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303]
>>> org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker >>
>>> Suggestions:
>>> 18.04.2013 14:37:15.794 *DEBUG* [Thread-303]
>>> org.apache.stanbol.enhancer.engines.entitylinking.impl.EntityLinker - 0:
>>> le plombier moustachu[m=FULL,s=3,c=3(1.0)/3] score=1.0[l=1.0,t=1.0] for
>>> http://example.org/resource#Mario
>>>
>>> And here is what I now have,
>>> here with the processing of the token "plombier":
>>>
>>> EntityLinker --- *preocess Token 17: plombier* (lemma: none | pos:[])
>>> chunk: none
>>> EntityLinker - 16:'le' (lemma: none | pos:[])
>>> EntityLinker - 18:*'moustachu'* (lemma: none | pos:[])
>>> EntityLinker - 15:'sont' (lemma: none | pos:[])
>>> EntityLinker - 19:'des' (lemma: none | pos:[])
>>> EntityLinker - 14:',' (lemma: none | pos:[])
>>> EntityLinker - 20:'collines' (lemma: none | pos:[])
>>> EntityLinker >> *searchStrings [plombier]*
>>> .MainLabelTokenizer > use Tokenizer class
>>> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
>>> for language null
>>> MainLabelTokenizer - tokenized le plombier moustachu -> *[le,
>>> plombier, moustachu]*
>>> MainLabelTokenizer > use Tokenizer class
>>> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
>>> for language null
>>> MainLabelTokenizer - tokenized Mario -> [Mario]
>>> EntityLinker - *no match*
>>>
>>> Why isn't "plombier" or "moustachu" in the searchStrings, just as before?
>>> And now with the processing of "mario":
>>>
>>> .EntityLinker --- preocess Token 16: *mario* (lemma: none | pos:[])
>>> chunk: none
>>> .EntityLinker - 15:'sont' (lemma: none | pos:[])
>>> .EntityLinker - 17:'des' (lemma: none | pos:[])
>>> .EntityLinker - 14:',' (lemma: none | pos:[])
>>> .EntityLinker - 18:'collines' (lemma: none | pos:[])
>>> .EntityLinker - 13:'mendips' (lemma: none | pos:[])
>>> .EntityLinker - 19:'situées' (lemma: none | pos:[])
>>> .EntityLinker >> searchStrings *[mario]*
>>> .EntityLinker - found 1 entities ...
>>> .EntityLinker > http://example.org/resource/Mario (ranking: null)
>>> .MainLabelTokenizer > use Tokenizer class
>>> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
>>> for language null
>>> .MainLabelTokenizer - tokenized le plombier moustachu -> [le, plombier,
>>> moustachu]
>>> .MainLabelTokenizer > use Tokenizer class
>>> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
>>> for language null
>>> .MainLabelTokenizer - tokenized Mario -> *[Mario]*
>>> .EntityLinker - *no match*
>>>
>>> Why isn't "mario" matched against "Mario"? I configured the engine so
>>> that it is not case sensitive.
>>>
>>> As you can see, within the MaxTokenSearchDistance I still have the "le"
>>> and "moustachu" tokens, but they don't go into the searchStrings for
>>> lookup. The result of the enhancement is now pretty bad. What is going on?
>>>
>>> Thank you a lot in advance
>>>
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
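
For illustration, the four points listed at the top of this mail could be
sketched roughly as follows. This is only a sketch under stated assumptions,
not the Stanbol implementation: the class TokenRulesSketch, its Token type,
the method names and the ruleFires flag are all hypothetical, and the linkable
and matchable states given to the example chunk are assumed for the sake of
the example.

```java
import java.util.Arrays;
import java.util.List;

public class TokenRulesSketch {

    /** Hypothetical token state; not Stanbol's real token class. */
    public static final class Token {
        public final String text;
        public boolean linkable;
        public boolean matchable;

        public Token(String text, boolean linkable, boolean matchable) {
            this.text = text;
            this.linkable = linkable;
            this.matchable = matchable;
        }
    }

    /** Point (3): divide by 2, not by 4 (0.75 / 2 = 0.375). */
    public static double defaultMinExcludePosProbability(double minPosProbability) {
        return minPosProbability / 2;
    }

    /**
     * Points (1) and (2). linkablePos and matchablePos are the POS-based
     * decisions, null when the POS tag gave none. ruleFires is a stand-in
     * (assumption) for whatever the "Unknown POS tag Rules" actually test.
     */
    public static void applyUnknownPosRules(Token t, Boolean linkablePos,
            Boolean matchablePos, boolean ruleFires) {
        if (!ruleFires) {
            return;
        }
        if (linkablePos == null && t.matchable) {
            // matchable -> linkable is only allowed without a POS decision
            t.linkable = true;
            t.matchable = true; // (2): linkable always implies matchable
        } else if (matchablePos == null && !t.matchable) {
            // other -> matchable is only allowed without a POS decision
            t.matchable = true;
        }
    }

    /**
     * Point (4): "link multiple matchable tokens in chunks" is skipped
     * as soon as the chunk already contains a linkable token.
     */
    public static void linkMultipleMatchableTokensInChunk(List<Token> chunk) {
        int matchableCount = 0;
        for (Token t : chunk) {
            if (t.linkable) {
                return; // leave the chunk unchanged
            }
            if (t.matchable) {
                matchableCount++;
            }
        }
        if (matchableCount > 1) {
            for (Token t : chunk) {
                if (t.matchable) {
                    t.linkable = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        // Assumed states: "Express" and "Tribune" linkable, "newspaper"
        // only matchable, "reports" neither.
        List<Token> chunk = Arrays.asList(
                new Token("Express", true, true),
                new Token("Tribune", true, true),
                new Token("newspaper", false, true),
                new Token("reports", false, false));
        linkMultipleMatchableTokensInChunk(chunk);
        System.out.println("newspaper linkable: " + chunk.get(2).linkable);
    }
}
```

With the assumed states above, "newspaper" stays a matchable-only token,
which is the behaviour asked for in point (4).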
