Fwd: stanbol bug report

Joseph M'Bimbi-Bene Tue, 14 May 2013 03:45:26 -0700

---------- Forwarded message ----------
From: Joseph M'Bimbi-Bene <[email protected]>
Date: 2013/5/14
Subject: Re: stanbol bug report
To: Rupert Westenthaler <[email protected]>



Hello, sorry for the late answer. I was on vacation.


2013/5/9 Rupert Westenthaler <[email protected]>

> Hi Joseph,
>
> finally I got time to have a detailed look at your problem. Please
> NOTE my earlier mail to the Stanbol mailing list referring to four
> issues I identified in the EntityLinking process based on your bug
> report.
>
> In this mail I will try to answer some additional issues related to
> other sections of your report:
>
> Related to "Problem with '#' entity URIs"
> ===========================
>
> In curl requests for Entities that use '#' you will need to URL
> encode this char as '%23'. Otherwise curl will cut off the URL at
> this position. See the following log of such a curl request in
> verbose (-v) mode.
>
> curl -v "http://localhost:8080/entityhub/entity?id=http://semanticweb.org/joseph/ontologies/2013/3/ontologieTest2#concept1"
> * About to connect() to localhost port 8080 (#0)
> *   Trying ::1...
> * connected
> * Connected to localhost (::1) port 8080 (#0)
> > GET /entityhub/entity?id=http://semanticweb.org/joseph/ontologies/2013/3/ontologieTest2 HTTP/1.1
> > User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
> > Host: localhost:8080
> > Accept: */*
>
> Here is the same request with a URL-encoded '#'
>
> curl -v "http://localhost:8080/entityhub/entity?id=http://semanticweb.org/joseph/ontologies/2013/3/ontologieTest2%23concept1"
> * About to connect() to localhost port 8080 (#0)
> *   Trying ::1...
> * connected
> * Connected to localhost (::1) port 8080 (#0)
> > GET /entityhub/entity?id=http://semanticweb.org/joseph/ontologies/2013/3/ontologieTest2%23concept1 HTTP/1.1
> > User-Agent: curl/7.24.0 (x86_64-apple-darwin12.0) libcurl/7.24.0 OpenSSL/0.9.8r zlib/1.2.5
> > Host: localhost:8080
> > Accept: */*
>
>
Thank you for that one, it works perfectly with URL encoding.
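
For reference, the encoding step can also be done programmatically. The following is a minimal Python sketch, not part of Stanbol: the helper name entity_request_url is hypothetical, and the endpoint and entity URI are simply the ones from the curl example above. It percent-encodes '#' as '%23' and shows that an unencoded '#' is parsed as the start of a URL fragment, which is not part of the request sent to the server.

```python
from urllib.parse import quote, urlsplit

# Endpoint and entity URI taken from the curl example in this thread.
ENTITYHUB_ENTITY = "http://localhost:8080/entityhub/entity"
ENTITY_ID = "http://semanticweb.org/joseph/ontologies/2013/3/ontologieTest2#concept1"


def entity_request_url(base, entity_uri):
    """Build the request URL with the entity URI percent-encoded.

    Hypothetical helper for illustration. ':' and '/' are left as-is to
    mirror the curl example; '#' is not in the safe set, so it becomes '%23'.
    """
    return base + "?id=" + quote(entity_uri, safe=":/")


# Unencoded: everything after '#' is treated as a fragment, so the query
# parameter loses the 'concept1' part.
raw = urlsplit(ENTITYHUB_ENTITY + "?id=" + ENTITY_ID)
print("query:   ", raw.query)
print("fragment:", raw.fragment)

# Encoded: the full entity URI stays inside the query string.
print(entity_request_url(ENTITYHUB_ENTITY, ENTITY_ID))
```

Passing safe="" instead would also encode ':' and '/', which is equally valid as a query parameter value.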


> In this case the Stanbol Log will contain
>
> GET /entity Request
>    > id: http://semanticweb.org/joseph/ontologies/2013/3/ontologieTest2#concept1
>    > accept: [*/*]
>
> and you should be able to retrieve the corresponding Entity.
>
> (1) regarding "1 – with OpenNLP – english :"
>
> IMO the engine behaves exactly as it should. As far as I can see the
> probabilities of the POS tags for the French words "le plombier
> moustachu" are too low for them to be accepted. Because of that the
> fallback is used.
>
> However NOTE that the behavior of the fallback was changed with
> STANBOL-1049 (see [1] for details). Since STANBOL-1049 if "proper noun
> linking" is activated only upper case tokens with >=
> MinSearchTokenLength characters are linked.
>
>
> (2) uppercase problem
> =================
>
> I was not able to replicate this problem. Can you please provide
>
> * the RDF file with the data,
> * the EnhancementChain configuration used
> * the Text sent to the Enhancer that triggered the described issue.
>
>
Well, it was just a newbie mistake: I misspelled the name of the type.
Sorry for that ...



> (3) Talismane Integration :
> ===================
>
> I added an Entity
>
> <rdf:Description rdf:about="http://example.org/resource/Mario">
>     <skos:prefLabel>Mario</skos:prefLabel>
>     <skos:altLabel>le plombier moustachu</skos:altLabel>
>     <rdfs:label>Mario</rdfs:label>
>     <rdfs:label>le plombier moustachu</rdfs:label>
>     <rdf:type>http://example.org/concept#gentil</rdf:type>
>     <rdf:type>http://example.org/concept#humain</rdf:type>
>   </rdf:Description>
>
> configured an EnhancementChain with
>
> * langdetect
> * talismane-nlp
> * EntityLinkingEngine for the site with the Entity and DEACTIVATED
> proper noun linking
>
> and sent the text
>
>     Mario Kart 7, le plombier moustachu est toujours un pilote d'élite
>
>
It works well with this exact text. However, with the text "Mario
Kart 7, le plombier conducteur moustachu est toujours un pilote d'élite",
only "Mario" gets recognized.
Here is an extract from the logs:

 Token: [17, 25] plombier (pos:[Value [pos:
NC(olia:CommonNoun|olia:Noun)].prob=0.9488428433253189]) chunk: 'none'
 TokenData: 'plombier'[linkable=true(linkabkePos=true)|
matchable=true(matchablePos=true)| alpha=true| seachLength=true|
upperCase=false]
[...]

EntityLinker --- preocess Token 5:* plombier *(lemma: null) linkable=true,
matchable=true | chunk: none

EntityLinker - 4:'le' (lemma: null) linkable=false, matchable=false

EntityLinker + 6:'conducteur' (lemma: null) linkable=true, matchable=true

EntityLinker >> searchStrings *[plombier, conducteur]*

EntityLinker - found 1 entities ...

EntityLinker > http://example.org/resource/Mario (ranking: null)

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
for language null

14.05.2013 12:26:00.498 *TRACE* [Thread-476]
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
Language null not configured to be supported

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
for language null

14.05.2013 12:26:00.498 *TRACE* [Thread-476]
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
Language null not configured to be supported

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
for language null

MainLabelTokenizer - tokenized le plombier moustachu -> [le, plombier,
moustachu]

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
for language null

14.05.2013 12:26:00.498 *TRACE* [Thread-476]
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
Language null not configured to be supported

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
for language null

14.05.2013 12:26:00.499 *TRACE* [Thread-476]
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
Language null not configured to be supported

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
for language null

MainLabelTokenizer - tokenized mario -> [mario]

EntityLinker - *no match*


I guess I misunderstood the process. What precisely is the role of the
"searchStrings" and the tokens they contain? The documentation mentions the
query "{lt}@{lang} || {lt}@{dl} || [{at}@{lang} || {at}@{dl} ... ]" and says
"Tokens in the Label are matched with Tokens in the text until the first
matchable or 2nd non-matchable token is not found". Here are the logs
describing the tokens:

 ProcessingState > 0: Token: [0, 5] mario (pos:[Value [pos:
NC(olia:CommonNoun|olia:Noun)].prob=0.22461293861915013]) chunk: 'none'

ProcessingState - TokenData: 'mario'[linkable=true(linkabkePos=null)|
matchable=true(matchablePos=null)| alpha=true| seachLength=true|
upperCase=false]

ProcessingState > 1: Token: [6, 10] Kart (pos:[Value [pos:
NPP(olia:ProperNoun|olia:Noun)].prob=0.9133145649655761]) chunk: 'none'

ProcessingState - TokenData: 'Kart'[linkable=true(linkabkePos=true)|
matchable=true(matchablePos=true)| alpha=true| seachLength=true|
upperCase=true]

ProcessingState > 2: Token: [11, 12] 7 (pos:[Value [pos:
ADJ(olia:Adjective)].prob=0.684378804826061]) chunk: 'none'

ProcessingState - TokenData: '7'[linkable=false(linkabkePos=false)|
matchable=false(matchablePos=false)| alpha=true| seachLength=false|
upperCase=false]

ProcessingState > 3: Token: [12, 13] , (pos:[Value [pos:
PONCT(olia:Punctuation)].prob=0.9937588119978392]) chunk: 'none'

ProcessingState - TokenData: ','[linkable=false(linkabkePos=false)|
matchable=false(matchablePos=false)| alpha=false| seachLength=false|
upperCase=false]

ProcessingState > 4: Token: [14, 16] le (pos:[Value [pos:
DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.9911958397157623])
chunk: 'none'

ProcessingState - TokenData: 'le'[linkable=false(linkabkePos=false)|
matchable=false(matchablePos=false)| alpha=true| seachLength=false|
upperCase=false]

ProcessingState > 5: Token: [17, 25] plombier (pos:[Value [pos:
NC(olia:CommonNoun|olia:Noun)].prob=0.9488428433253189]) chunk: 'none'

ProcessingState - TokenData: 'plombier'[linkable=true(linkabkePos=true)|
matchable=true(matchablePos=true)| alpha=true| seachLength=true|
upperCase=false]

ProcessingState > 6: Token: [26, 36] conducteur (pos:[Value [pos:
V(olia:IndicativeVerb|olia:Verb)].prob=0.2741312171804974]) chunk: 'none'

ProcessingState - TokenData: 'conducteur'[linkable=true(linkabkePos=null)|
matchable=true(matchablePos=null)| alpha=true| seachLength=true|
upperCase=false]

ProcessingState > 7: Token: [37, 46] moustachu (pos:[Value [pos:
NC(olia:CommonNoun|olia:Noun)].prob=0.6753938233376575]) chunk: 'none'

ProcessingState - TokenData: 'moustachu'[linkable=true(linkabkePos=null)|
matchable=true(matchablePos=null)| alpha=true| seachLength=true|
upperCase=false]

ProcessingState > 8: Token: [47, 50] est (pos:[Value [pos:
V(olia:IndicativeVerb|olia:Verb)].prob=0.9590976352002423]) chunk: 'none'

ProcessingState - TokenData: 'est'[linkable=false(linkabkePos=false)|
matchable=false(matchablePos=false)| alpha=true| seachLength=true|
upperCase=false]

ProcessingState > 9: Token: [51, 59] toujours (pos:[Value [pos:
ADV(olia:Adverb)].prob=0.9971420691589198]) chunk: 'none'

ProcessingState - TokenData: 'toujours'[linkable=false(linkabkePos=false)|
matchable=false(matchablePos=false)| alpha=true| seachLength=true|
upperCase=false]

ProcessingState > 10: Token: [60, 62] un (pos:[Value [pos:
DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.958902636002953])
chunk: 'none'

ProcessingState - TokenData: 'un'[linkable=false(linkabkePos=false)|
matchable=false(matchablePos=false)| alpha=true| seachLength=false|
upperCase=false]

ProcessingState > 11: Token: [63, 69] pilote (pos:[Value [pos:
NC(olia:CommonNoun|olia:Noun)].prob=0.9954414570279821]) chunk: 'none'

ProcessingState - TokenData: 'pilote'[linkable=true(linkabkePos=true)|
matchable=true(matchablePos=true)| alpha=true| seachLength=true|
upperCase=false]

ProcessingState > 12: Token: [70, 72] d' (pos:[Value [pos:
P(olia:Preposition|olia:Adposition)].prob=0.9908561726534587]) chunk: 'none'

ProcessingState - TokenData: 'd''[linkable=false(linkabkePos=false)|
matchable=false(matchablePos=false)| alpha=true| seachLength=false|
upperCase=false]

ProcessingState > 13: Token: [72, 77] élite (pos:[Value [pos:
NC(olia:CommonNoun|olia:Noun)].prob=0.9572556202780127]) chunk: 'none'

ProcessingState - TokenData: 'élite'[linkable=true(linkabkePos=true)|
matchable=true(matchablePos=true)| alpha=true| seachLength=true|
upperCase=false]

From the logs of the processing of "plombier", I thought that "le",
"plombier" and "moustachu" were considered in the query. With "conducteur"
in between, the matching score should be something like 0.81 (that is what
I get with the KeywordLinkingEngine) and "le plombier moustachu" should be
recognized, but that is not the case. That is why I am asking for some
clarification about the matching process and the role of searchStrings.
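
To make the question concrete, here is one possible reading of the documented rule quoted above, sketched in Python. This is an assumption-laden illustration, not Stanbol's actual implementation: the function matched_label_tokens and its parameters are hypothetical, and it only counts how many label tokens line up with the text around the processed token. The matchable sets below follow the logs in this thread.

```python
def matched_label_tokens(label, text, anchor, matchable):
    """Count label tokens matched around text[anchor] (simplified reading).

    Scans forward and then backward from the anchor token. In each
    direction the scan stops at the first matchable text token that does
    not match the next label token, or at the second non-matchable one.
    """
    label = [tok.lower() for tok in label]
    text = [tok.lower() for tok in text]
    start = label.index(text[anchor])  # raises ValueError if not in the label
    matched = 1
    for step in (1, -1):  # forward scan, then backward scan
        li, ti, skipped = start + step, anchor + step, 0
        while 0 <= li < len(label) and 0 <= ti < len(text):
            if text[ti] == label[li]:
                matched += 1
                li += step
            elif text[ti] in matchable:
                break  # a matchable text token is missing from the label
            else:
                skipped += 1
                if skipped == 2:
                    break  # second non-matchable token not in the label
            ti += step
    return matched


label = ["le", "plombier", "moustachu"]

# Original text: all three label tokens line up around "plombier".
text_a = ["le", "plombier", "moustachu", "est"]
print(matched_label_tokens(label, text_a, 1, {"plombier"}))

# With "conducteur" (logged as matchable=true) in between.
text_b = ["le", "plombier", "conducteur", "moustachu", "est"]
print(matched_label_tokens(label, text_b, 1,
                           {"plombier", "conducteur", "moustachu"}))
```

Under this reading, the matchable token "conducteur" ends the forward scan before "moustachu" is reached, so only "le" and "plombier" are covered. Whether such a partial match is then accepted would depend on engine settings that are not shown in these logs.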




> With this test I got two mentions for http://example.org/resource/Mario
>
> 1. for Mario
> 2. le plombier moustachu
>
> So I guess that you were missing the mention for "le plombier
> moustachu" because "proper noun linking" was activated
>
> Here are the details for the matching of "le plombier moustachu"
>
> --- preocess Token 5: plombier (lemma: null) linkable=true,
> matchable=true | chunk: none
>      - 4:'le' (lemma: null) linkable=false, matchable=false
>      - 6:'moustachu' (lemma: null) linkable=false, matchable=false
>      - 3:',' (lemma: null) linkable=false, matchable=false
>      - 7:'est' (lemma: null) linkable=false, matchable=false
>      - 2:'7' (lemma: null) linkable=false, matchable=false
>      - 8:'toujours' (lemma: null) linkable=false, matchable=false
>    >> searchStrings [plombier]
>     - found 1 entities ...
>      > http://example.org/resource/Mario (ranking: null)
>        + le plombier moustachu[m=FULL,s=3,c=3(1.0)/3]
> score=1.0[l=1.0,t=1.0] for http://example.org/resource/Mario ranking:
> null
>    >> Suggestions:
>     - 0: le plombier moustachu[m=FULL,s=3,c=3(1.0)/3]
> score=1.0[l=1.0,t=1.0] for http://example.org/resource/Mario ranking:
> null
>
> NOTE that the ">> searchStrings [plombier]" is expected as moustachu
> is classified as an Adjective and is therefore not a 'matchable'
> token.
>
>     > Token 6: Token: [26, 35] moustachu (pos:[Value [pos:
> ADJ(olia:Adjective)].prob=0.494646828200689]) chunk: 'none'
>
> So I guess that this issue will also be fixed as soon as the four bug
> fixes described in the other mail are applied.
>
> best
> Rupert
>
>
> [1] https://issues.apache.org/jira/browse/STANBOL-1049#comment-13640146
>
> On Mon, May 6, 2013 at 9:32 PM, Joseph M'Bimbi-Bene
> <[email protected]> wrote:
> > OK, no problem. Enjoy your trip
> >
> >
> > 2013/5/5 Rupert Westenthaler <[email protected]>
> >>
> >> Hi Joseph,
> >>
> >> Thanks for the detailed report. I was traveling the last 4 days and
> >> will be traveling again the next three days. So most likely I will not
> >> have time to look into this until Thursday.
> >>
> >> best
> >> Rupert
> >>
> >> On Fri, May 3, 2013 at 7:40 PM, Joseph M'Bimbi-Bene
> >> <[email protected]> wrote:
> >> > Hello Rupert, it's me again Joseph, here i am again with the "plombier
> >> > moustachu" problem. Sorry for the long document and have a nice
> >> > week-end
> >>
> >>
> >>
> >> --
> >> | Rupert Westenthaler             [email protected]
> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> | A-5500 Bischofshofen
> >
> >
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>
