Hello, and sorry for the late answer. Thank you for yours.
2013/6/3 Rupert Westenthaler <[email protected]>

> Hi Joseph
>
> On Mon, Jun 3, 2013 at 3:43 PM, Joseph M'Bimbi-Bene
> <[email protected]> wrote:
> [..]
> > Now, the logs of the processing of the token "La"
> >
> > ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos:
> > ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'
> > ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> > upperCase=true]
>
> The reason why the 'La' of the last sentence of your document is
> marked as 'linkable' is the combination of the following things:
>
> 1. the POS tag has a very low probability (0.017) and is therefore
> ignored, as the configured minimum probability is higher than that.

Actually, I set both parameters "prop" and "pprob" to 0.01, so I don't think I made a mistake there, did I?

You mentioned in a previous mail something about a strange tokenizing behaviour; it might be the source of a new problem. Here is, for example, a log excerpt from the Stanbol web console for an integration test. I isolated the pathological case:

org.apache.stanbol.enhancer.servicesapi.ChainException: Enhancement Chain
failed because of required Engine 'talismane-nlp' failed with Message:
Unable to process ContentItem
'<urn:content-item-sha1-27bdb282be8f827392a55c8cd8d0ee5c740e247a>' with
Enhancement Engine 'talismane-nlp' because the engine was unable to process
the content (Engine class:
org.apache.stanbol.enhancer.engines.restful.nlp.impl.RestfulNlpAnalysisEngine)
(Reason: 'RestfulNlpAnalysisEngine' failed to process content item
'urn:content-item-sha1-27bdb282be8f827392a55c8cd8d0ee5c740e247a' with type
'text/plain': Exception while executing Request on RESTful NLP Analysis
Service at http://localhost:9101/analysis)!
    at org.apache.stanbol.enhancer.jobmanager.event.impl.EventJobManagerImpl.enhanceContent(EventJobManagerImpl.java:179)
    [...]
Caused by: org.apache.stanbol.enhancer.servicesapi.EngineException:
'RestfulNlpAnalysisEngine' failed to process content item
'urn:content-item-sha1-27bdb282be8f827392a55c8cd8d0ee5c740e247a' with type
'text/plain': Exception while executing Request on RESTful NLP Analysis
Service at http://localhost:9101/analysis
    at org.apache.stanbol.enhancer.engines.restful.nlp.impl.RestfulNlpAnalysisEngine.computeEnhancements(RestfulNlpAnalysisEngine.java:285)
    at org.apache.stanbol.enhancer.jobmanager.event.impl.EnhancementJobHandler.processEvent(EnhancementJobHandler.java:271)
    [...]
Caused by: org.apache.http.client.HttpResponseException: Internal Server Error
    at org.apache.stanbol.enhancer.engines.restful.nlp.impl.RestfulNlpAnalysisEngine$AnalysisResponseHandler.handleResponse(RestfulNlpAnalysisEngine.java:367)
    at org.apache.stanbol.enhancer.engines.restful.nlp.impl.RestfulNlpAnalysisEngine$AnalysisResponseHandler.handleResponse(RestfulNlpAnalysisEngine.java:341)

And when I curl the text to Talismane, I get the following message:

16:49:21,166 [main] INFO server.Main - ... starting server
16:53:55,560 [btpool0-2] ERROR resource.AnalysisResource - Exception while analysing Blob
java.lang.IllegalArgumentException: Illegal span [2199,2201] for Token
relative to Text: [0, 2200]: Span of the contained Token MUST NOT extend the
others!
    at org.apache.stanbol.enhancer.nlp.model.impl.SpanImpl.<init>(SpanImpl.java:78)
    at org.apache.stanbol.enhancer.nlp.model.impl.TokenImpl.<init>(TokenImpl.java:33)
    at org.apache.stanbol.enhancer.nlp.model.impl.SectionImpl.addToken(SectionImpl.java:146)
    at at.salzburgresearch.stanbol.enhancer.nlp.talismane.analyser.TalismaneAnalyzer.processSentence(TalismaneAnalyzer.java:329)

I will send you the text in private so you can try to reproduce the bug, since I don't want it to be online.

EDIT: I tried again just now; I get the same message in the Stanbol console, but everything is fine when curling Talismane ...
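A side note on that IllegalArgumentException: it looks like Stanbol's span sanity check firing because Talismane reported a token ending at offset 2201 in a text whose span is [0, 2200], i.e. an off-by-one at the very end of the text. A minimal sketch of that invariant (illustrative names, not the actual Stanbol source):

```java
public class SpanCheck {

    /**
     * A token span must lie entirely inside the span of its enclosing text;
     * otherwise the model rejects it, as seen in the Talismane log above.
     */
    static void checkTokenSpan(int textStart, int textEnd,
                               int tokenStart, int tokenEnd) {
        if (tokenStart < textStart || tokenEnd > textEnd) {
            throw new IllegalArgumentException(
                "Illegal span [" + tokenStart + "," + tokenEnd
                + "] for Token relative to Text: [" + textStart + ", "
                + textEnd + "]");
        }
    }

    public static void main(String[] args) {
        checkTokenSpan(0, 2200, 2195, 2199); // a valid final token
        // the failing case from the log: token end 2201 > text end 2200
        try {
            checkTokenSpan(0, 2200, 2199, 2201);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

One possible reading (an assumption, not something the logs prove) is that the tokenizer counts one character, e.g. a trailing newline or a normalized character, that the blob does not; that could also explain why the problem does not show up when curling Talismane directly.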
I also tried the "dbpedia-disambiguation" enhancement chain with the French OpenNLP model published by Mr. Hernandez that I talked to you about some time ago (as a reminder, here is the link). It seems quite OK; no entities spotted, but

> 2. Proper Noun Linking (enhancer.engines.linking.properNounsState) is
> deactivated
> 3. UpperCase linking of Tokens without POS tag
> (enhancer.engines.linking.linkOnlyUpperCaseTokensWithMissingPosTag) is
> also deactivated, as the default is the same as the value for Proper
> Noun Linking (see STANBOL-1049).
> 4. The Min Search Token Length
> (enhancer.engines.linking.minSearchTokenLength) is set to two.
>
> As there is no POS tag and UpperCase linking of Tokens without POS tag
> is deactivated, the Min Search Token Length is the only criterion used
> to classify the Token. As 'La' has >= two chars, it is therefore
> classified as a 'linkable' token. Whether the token is upper/lower case
> and/or at the beginning of a sentence is of no importance in this
> specific case.
>
> This might seem strange in the given context, but in situations where
> there is no POS tagging support for the language of the parsed text
> this behavior is completely fine and important. Otherwise it would not
> be possible to link entities mentioned as the first word of a sentence.
>
> In your specific case, explicitly setting
> 'enhancer.engines.linking.linkOnlyUpperCaseTokensWithMissingPosTag=true'
> would prevent 'La' from being classified as 'linkable'. But the root of
> the problem is Talismane failing to detect the POS tag for this token.
>
> Situations like that do however suggest investigating whether
> EntityLinking should use different fallback strategies for
>
> * linking texts without POS tags - where no POS tagger is available
> for the language of the text
> * linking tokens with missing POS tag - where a POS tagger is present,
> but it was not able to classify a token
>
> best
> Rupert
>
> > [...]
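Rupert's points boil down to a small decision function. The following is only a hedged sketch with made-up names, simplified to the cases discussed in this thread (how "upper case" is decided for a sentence-initial token like 'La' is a subtlety not modelled here); the real logic lives in the EntityLinking engine and covers more cases:

```java
public class LinkableSketch {

    /**
     * Simplified classification of a token as 'linkable'.
     * posTagUsable means a POS tag exists AND its probability reaches the
     * configured minimum (for 'La', ADJ at prob 0.017 is below it -> false).
     */
    static boolean isLinkable(boolean posTagUsable, boolean isProperNoun,
                              boolean upperCaseToken,
                              boolean linkOnlyUpperCaseWithMissingPosTag,
                              int tokenLength, int minSearchTokenLength) {
        if (posTagUsable) {
            // with a trusted tag, the tag class decides (simplified)
            return isProperNoun;
        }
        // no trusted POS tag: optionally require an upper-case token
        if (linkOnlyUpperCaseWithMissingPosTag && !upperCaseToken) {
            return false;
        }
        // otherwise the minimum search token length is the only criterion
        return tokenLength >= minSearchTokenLength;
    }

    public static void main(String[] args) {
        // 'La': tag ignored (prob too low), upper-case requirement off,
        // length 2 >= minSearchTokenLength 2 -> linkable
        System.out.println(isLinkable(false, false, true, false, 2, 2)); // prints true
    }
}
```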
> > EntityLinker --- preocess Token 0: La (lemma: null) linkable=true,
> > matchable=true | chunk: none
> > EntityLinker + 1:'recherche' (lemma: null) linkable=true, matchable=true
> > EntityLinker >> searchStrings [La, recherche]
> > EntityLinker - found 1 entities ...
> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
> > abelTokenizer for language null
> > abelTokenizer Language null not configured to be supported
> > abelTokenizer for language null
> > abelTokenizer Language null not configured to be supported
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> > for language null
> > MainLabelTokenizer - tokenized la -> [la]
> > EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> > EntityLinker >> Suggestions:
> > EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> >
> > So, same as before. Is OpenNLP working well together with Talismane? I
> > saw that the ranking of the sentence detection engine was lower than the
> > ranking of Talismane and the linking engine (-100 vs 0), since the
> > documentation of the engine says:
> > "Language (required): The language of the text needs to be available. It
> > is read as specified by STANBOL-613
> > <https://issues.apache.org/jira/browse/STANBOL-613> from the metadata of
> > the ContentItem. Effectively this means that any Stanbol Language
> > Detection engine will need to be executed *before the OpenNLP POS
> > Tagging Engine*", which is Talismane in my case.
> > The logs are exactly the same, but just for the sake of it (or in case I
> > missed something), I will copy them:
> >
> > OpenNlpSentenceDetectionEngine > add Sentence: [249, 431]
> > OpenNlpSentenceDetectionEngine > add Sentence: [432, 513]
> > OpenNlpSentenceDetectionEngine > add Sentence: [514, 552]
> > OpenNlpSentenceDetectionEngine > add Sentence: [554, 780]
> > OpenNlpSentenceDetectionEngine > add Sentence: [781, 971]
> > OpenNlpSentenceDetectionEngine > add Sentence: [972, 1085]
> > OpenNlpSentenceDetectionEngine > add Sentence: [1087, 1264]
> >
> > ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos:
> > ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'
> > ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> > upperCase=true]
> >
> > EntityLinker --- preocess Token 0: La (lemma: null) linkable=true,
> > matchable=true | chunk: none
> > EntityLinker + 1:'recherche' (lemma: null) linkable=true, matchable=true
> > EntityLinker >> searchStrings [La, recherche]
> > EntityLinker - found 1 entities ...
> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > for language null
> > 03.06.2013 15:41:53.188 *TRACE* [Thread-5674]
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > Language null not configured to be supported
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > for language null
> > 03.06.2013 15:41:53.188 *TRACE* [Thread-5674]
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > Language null not configured to be supported
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> > for language null
> > MainLabelTokenizer - tokenized la -> [la]
> > EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> > EntityLinker >> Suggestions:
> > EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
>
> >> Can you send the text sample you used, so that I can check why
> >> Talismane fails to correctly split the sentences.
> >> best
> >> Rupert
> >>
> >> > Here are some log excerpts:
> >> >
> >> > On the token 'La', which is (I think) a determiner; anyway, definitely
> >> > not a noun:
> >> >
> >> > ProcessingState > *15: Token: [1087, 1089] La* (pos:[Value [pos: *
> >> > ADJ(olia:Adjective)].prob=0.016871281997002517*]) chunk: 'none'
> >> > ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
> >> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> >> > upperCase=true]
> >> >
> >> > EntityLinker --- *preocess Token 15: La* (lemma: null) linkable=true,
> >> > matchable=true | chunk: none
> >>
> >> Here it says that La is the 15th token of the sentence. This is the
> >> reason why it is marked as linkable.
>
> > OK, I think I understand ... but if I got it right, then after
> > lowercasing it, the token should be linked/linkable too. But it is not;
> > search for "<look here>" for the part of the message related to it.
>
> >> > EntityLinker + 14:'cognitives.' (lemma: null) linkable=true,
> >> > matchable=true
> >> > EntityLinker + 16:'recherche' (lemma: null) linkable=true,
> >> > matchable=true
> >> > EntityLinker >> searchStrings [La, recherche]
> >> > EntityLinker - found 1 entities ...
> >> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
> >> >
> >> > MainLabelTokenizer > use Tokenizer class
> >> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> >> > for language null
> >> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
> >> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> >> > Language null not configured to be supported
> >> > MainLabelTokenizer > use Tokenizer class
> >> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> >> > for language null
> >> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
> >> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> >> > Language null not configured to be supported
> >> > MainLabelTokenizer > use Tokenizer class
> >> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> >> > for language null
> >> > MainLabelTokenizer - tokenized la -> [la]
> >> >
> >> > EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> >> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> >> > EntityLinker >> Suggestions:
> >> > EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> >> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> >> >
> >> > Then I went to the page of JIRA issue STANBOL-1049 and guessed that my
> >> > token corresponded to the "unknown POS tag rule".
> >> > "TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag" ->
> >> > does this have anything to do with the *Upper Case Token Mode*
> >> > parameter?
> >> > Since my tokens 'La' are always at the beginning of the sentence, I
> >> > guessed they fell into the category:
> >> > "else - lower case token or sentence or sub-sentence start
> >> > * tokens equal to or longer than
> >> > TextProcessingConfig#minSearchTokenLength are marked as matchable"
> >> >
> >> > I don't understand that rule: is it supposed to override the *Upper
> >> > Case Token Mode* parameter? Anyway, I tried with all 'La' lowercased,
> >> > i.e. to 'la', and the tokens 'la' are never processed. Here is the
> >> > log excerpt:
> >> >
> >> > ProcessingState > *15: Token: [1087, 1089] la* (pos:[Value [pos:
> >> > DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.9445673708042409])
> >> > chunk: 'none'
>
> > <look here>
>
> >> > ProcessingState - TokenData: 'la'[linkable=false(*linkabkePos=false*)|
> >> > matchable=false(*matchablePos=false*)| alpha=true| seachLength=true|
> >> > upperCase=false]
> >> >
> >> > After a few minutes of reflection, I see that linkabkePos and
> >> > matchablePos are no longer equal to "null". What is the rule that sets
> >> > them to null or not? It is strange that just an upper-case letter can
> >> > change the POS tag of the token that drastically for Talismane, but I
> >> > cannot do anything about that. I still have the question about the
> >> > supposed overriding of the *Upper Case Token Mode* parameter by the
> >> > "unknown POS tag rule".
> >> >
> >> > On a related topic, the *Upper Case Token Mode* parameter doesn't seem
> >> > to behave properly (or I missed something). I left "uc=NONE" in the
> >> > config of the engine and monitored the processing of the token; here
> >> > are the logs. On the token "utilisée" for the text: "AE est une mesure
> >> > couramment utilisée."
> >> > ProcessingState > 5: Token: [543, 551] utilisée (pos:[Value [pos:
> >> > VPP(olia:PastParticiple|olia:Verb)].prob=0.9864354941576942]) chunk:
> >> > 'none'
> >> > ProcessingState - TokenData:
> >> > 'utilisée'[linkable=false(linkabkePos=false)|
> >> > matchable=false(matchablePos=false)| alpha=true| seachLength=true|
> >> > upperCase=false]
> >> >
> >> > The token is not processed, which I am fine with, since its POS tag is
> >> > VPP.
> >> >
> >> > Now, on the token "Utilisée" for the text: "AE est une mesure
> >> > couramment Utilisée."
> >> >
> >> > ProcessingState > 5: Token: [543, 551] Utilisée (pos:[Value [*pos:
> >> > NPP*(olia:ProperNoun|olia:Noun)].*prob=0.19181597467804898*]) chunk:
> >> > 'none'
> >> > ProcessingState - TokenData:
> >> > 'Utilisée'[linkable=true(linkabkePos=null)|
> >> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> >> > upperCase=true]
> >> >
> >> > So the POS tag is OK, but the probability doesn't reach the threshold
> >> > (which I set to 0.55). Here is the log of the processing of the token:
> >> >
> >> > EntityLinker --- preocess Token 5: Utilisée (lemma: null)
> >> > linkable=true, matchable=true | chunk: none
> >> > EntityLinker - 4:'couramment' (lemma: null) linkable=false,
> >> > matchable=false
> >> > EntityLinker - 6:'.' (lemma: null) linkable=false, matchable=false
> >> > EntityLinker + 3:'mesure' (lemma: null) linkable=true, matchable=true
> >> > EntityLinker >> searchStrings [mesure, Utilisée]
> >> >
> >> > Is this a problem of POS tag processing or of UpperCase linking, or
> >> > did I misunderstand something?
> >> >
> >> > Thank you for the time you spend helping us users; it is very much
> >> > appreciated.
> >> > Best regards, Joseph
> >> >
> >> > 2013/6/3 Rupert Westenthaler <[email protected]>
> >> >
> >> >> Hi Joseph
> >> >>
> >> >> On Mon, Jun 3, 2013 at 10:01 AM, Joseph M'Bimbi-Bene
> >> >> <[email protected]> wrote:
> >> >> > I think it is the tokenizing process of Talismane NLP, since my
> >> >> > enhancement chain is:
> >> >> > - langdetect
> >> >> > - talismaneNLP
> >> >> > - MyVocabulary
> >> >>
> >> >> I also used Talismane when testing and I was not seeing tokens like
> >> >> that.
> >> >>
> >> >> Here is an excerpt of my log (with minSearchTokenLength set to 2):
> >> >>
> >> >> --- preocess Token 11: AE (lemma: null) linkable=true, matchable=true
> >> >> | chunk: none
> >> >> - 10:'*' (lemma: null) linkable=false, matchable=false
> >> >> - 12:'*' (lemma: null) linkable=false, matchable=false
> >> >> - 9:'indiquant' (lemma: null) linkable=false, matchable=false
> >> >> - 13:'une' (lemma: null) linkable=false, matchable=false
> >> >> - 8:')' (lemma: null) linkable=false, matchable=false
> >> >> + 14:'servitude' (lemma: null) linkable=false, matchable=true
> >> >> >> searchStrings [AE, servitude]
> >> >>
> >> >> best
> >> >> Rupert
> >> >>
> >> >> --
> >> >> | Rupert Westenthaler             [email protected]
> >> >> | Bodenlehenstraße 11             ++43-699-11108907
> >> >> | A-5500 Bischofshofen
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11             ++43-699-11108907
> | A-5500 Bischofshofen
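P.S. The "utilisée" / "Utilisée" runs quoted above seem to reduce to a probability gate on the POS annotation. A hedged sketch of that assumed behavior (illustrative names, not the Stanbol API): a tag below the configured minimum probability is ignored, so the token is handled as if it had no POS tag at all and the missing-POS-tag fallback, rather than the Upper Case Token Mode, decides.

```java
public class PosProbGate {

    /** Returns the tag if it is trustworthy, null if it must be ignored. */
    static String usableTag(String tag, double prob, double minProb) {
        return prob >= minProb ? tag : null;
    }

    public static void main(String[] args) {
        double minProb = 0.55; // the threshold configured in the thread
        // "utilisée": VPP at 0.986 -> tag is used, a verb -> not linked
        System.out.println(usableTag("VPP", 0.9864354941576942, minProb)); // prints VPP
        // "Utilisée": NPP at 0.192 -> tag is ignored -> missing-POS fallback
        System.out.println(usableTag("NPP", 0.19181597467804898, minProb)); // prints null
    }
}
```

Under this reading, "Utilisée" with its low-probability NPP tag is classified by the missing-POS-tag rules, which may explain why uc=NONE appeared to have no effect.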
