Re: Problem with entityLinking on Uppercase tokens

Rupert Westenthaler Mon, 03 Jun 2013 07:53:20 -0700

Hi Joseph

On Mon, Jun 3, 2013 at 3:43 PM, Joseph M'Bimbi-Bene
<[email protected]> wrote:
[..]
>
> Now, the logs of the processing of the token "La"
>
> ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos:
> ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'
>
> ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
> matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> upperCase=true]
>


The reason why the 'La' of the last sentence of your document is
marked as 'linkable' is the combination of the following things:

1. the POS tag has a very low probability (0.017) and is therefore
ignored, as the configured minimum probability is higher than that.
2. Proper Noun Linking (enhancer.engines.linking.properNounsState) is
deactivated
3. UpperCase linking of Tokens without POS tag
(enhancer.engines.linking.linkOnlyUpperCaseTokensWithMissingPosTag) is
also deactivated, as its default is the same as the value for Proper
Noun Linking (see STANBOL-1049).
4. The Min Search Token Length
(enhancer.engines.linking.minSearchTokenLength) is set to two.

As there is no usable POS tag and UpperCase linking of Tokens without
POS tag is deactivated, the Min Search Token Length is the only
criterion used to classify the Token. As 'La' has >= two chars it is
therefore classified as a 'linkable' token. Whether the token is upper
or lower case, or at the beginning of a sentence, is of no importance
in this specific case.
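
The decision described above can be sketched roughly as follows. This is
only an illustrative Python model of the behaviour explained in this mail,
not the Stanbol implementation: the Token class, the is_linkable function
and the LINKABLE_TAGS set are hypothetical names, and 0.55 is the threshold
Joseph mentions having configured.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tag set: POS tags assumed to qualify a token for linking.
LINKABLE_TAGS = {"NC", "NPP"}


@dataclass
class Token:
    text: str
    pos_tag: Optional[str] = None  # tag reported by the POS tagger, if any
    pos_prob: float = 0.0          # probability reported for that tag


def is_linkable(token: Token,
                min_pos_prob: float = 0.55,
                min_search_token_length: int = 2) -> bool:
    """Model of the case where upper-case linking of tokens without a
    POS tag is deactivated."""
    # A tag below the configured minimum probability is ignored, so the
    # token is treated as if it had no POS tag at all.
    tag = token.pos_tag if token.pos_prob >= min_pos_prob else None
    if tag is not None:
        return tag in LINKABLE_TAGS
    # No usable POS tag: only the token length decides. Case and position
    # in the sentence play no role here.
    return len(token.text) >= min_search_token_length


upper = Token("La", pos_tag="ADJ", pos_prob=0.016871281997002517)
lower = Token("la", pos_tag="DET", pos_prob=0.9445673708042409)
print(is_linkable(upper))  # True: tag ignored, length 2 >= 2
print(is_linkable(lower))  # False: DET is accepted and is not linkable
```

The same path explains the 'Utilisée' case further down in the quoted
message: its NPP tag (prob 0.19) is below the threshold, so only the length
check is applied.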

This might seem strange in the given context, but in situations where
there is no POS tagging support for the language of the parsed text
this behavior is completely fine and important. Otherwise it would not
be possible to link entities mentioned as the first word in a sentence.

In your specific case, explicitly setting
'enhancer.engines.linking.linkOnlyUpperCaseTokensWithMissingPosTag=true'
would prevent 'La' from being classified as 'linkable'. But the root of
the problem is Talismane failing to detect the POS tag for this token.
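
As a rough illustration of what that option changes, here is a Python
sketch of the rule for tokens without a usable POS tag. It is a model built
on assumptions, based on the rule text quoted from STANBOL-1049 in this
thread, and not the actual Stanbol code. The function name and its
parameters are hypothetical, and applying the length check in the activated
branch is an assumption as well.

```python
def linkable_without_pos(text: str,
                         sentence_start: bool,
                         link_only_upper_case: bool,
                         min_search_token_length: int = 2) -> bool:
    """Hypothetical model of the linking rule for tokens with a missing
    (or ignored) POS tag."""
    long_enough = len(text) >= min_search_token_length
    if not link_only_upper_case:
        # Option deactivated: the token length is the only criterion.
        return long_enough
    # Option activated (assumed behaviour): only upper-case tokens that
    # are not at the start of a sentence are linked.
    return long_enough and text[:1].isupper() and not sentence_start


# 'La' is the first token of its sentence in the log above.
print(linkable_without_pos("La", sentence_start=True,
                           link_only_upper_case=False))  # True
print(linkable_without_pos("La", sentence_start=True,
                           link_only_upper_case=True))   # False
```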

Situations like that do, however, suggest investigating whether
EntityLinking should use different fallback strategies for

* linking texts without POS tags - where no POS tagger is available
for the language of the text
* linking tokens with missing POS tag - where a POS tagger is present,
but it was not able to classify a token

best
Rupert


> [...]
>
> EntityLinker --- preocess Token 0: La (lemma: null) linkable=true,
> matchable=true | chunk: none
>
> EntityLinker + 1:'recherche' (lemma: null) linkable=true, matchable=true
>
> EntityLinker >> searchStrings [La, recherche]
>
> EntityLinker - found 1 entities ...
>
> EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
>
> abelTokenizer for language null
>
> abelTokenizer Language null not configured to be supported
>
> abelTokenizer for language null
>
> abelTokenizer Language null not configured to be supported
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> for language null
>
> MainLabelTokenizer - tokenized la -> [la]
>
> EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
>
> EntityLinker >> Suggestions:
>
> EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
>
>
> So, same as before. Is OpenNLP working well together with Talismane? I saw
> that the ranking of the sentence detection engine was lower than the ranking
> of Talismane and the linking engine (-100 vs 0), since the documentation of
> the engine says
> *"Language* (required): The language of the text needs to be available. It
> is read as specified by
> STANBOL-613<https://issues.apache.org/jira/browse/STANBOL-613>from the
> metadata of the ContentItem. Effectively this means that any
> Stanbol Language Detection engine will need to be executed *before the
> OpenNLP POS Tagging Engine.*" which is Talismane in my case.
>
> The logs are exactly the same, but just for the sake of it (or in case I
> missed something), I will copy them:
>
> OpenNlpSentenceDetectionEngine > add Sentence: [249, 431]
>
> OpenNlpSentenceDetectionEngine > add Sentence: [432, 513]
>
> OpenNlpSentenceDetectionEngine > add Sentence: [514, 552]
>
> OpenNlpSentenceDetectionEngine > add Sentence: [554, 780]
>
> OpenNlpSentenceDetectionEngine > add Sentence: [781, 971]
>
> OpenNlpSentenceDetectionEngine > add Sentence: [972, 1085]
>
> OpenNlpSentenceDetectionEngine > add Sentence: [1087, 1264]
>
>
> ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos:
> ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'
>
> ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
> matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> upperCase=true]
>
>
> EntityLinker --- preocess Token 0: La (lemma: null) linkable=true,
> matchable=true | chunk: none
>
> EntityLinker + 1:'recherche' (lemma: null) linkable=true, matchable=true
>
> EntityLinker >> searchStrings [La, recherche]
>
> EntityLinker - found 1 entities ...
>
> EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> for language null
>
> 03.06.2013 15:41:53.188 *TRACE* [Thread-5674]
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> Language null not configured to be supported
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> for language null
>
> 03.06.2013 15:41:53.188 *TRACE* [Thread-5674]
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> Language null not configured to be supported
>
> MainLabelTokenizer > use Tokenizer class
> org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> for language null
>
> MainLabelTokenizer - tokenized la -> [la]
>
> EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
>
> EntityLinker >> Suggestions:
>
> EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
>
>
>
>
>> Can you send the text sample you used, so that I can check why
>> Talismane fails to correctly split the sentences.
>>
>> best
>> Rupert
>>
>> > Here are some log excerpts:
>> >
>> > On token 'La', which is (i think) a determiner, anyway, definitely not a
>> > Noun :
>> >
>> > ProcessingState > *15: Token: [1087, 1089] La* (pos:[Value [pos: *
>> > ADJ(olia:Adjective)].prob=0.016871281997002517*]) chunk: 'none'
>> > ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
>> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
>> > upperCase=true]
>> >
>> > EntityLinker --- *preocess Token 15: La* (lemma: null) linkable=true,
>> > matchable=true | chunk: none
>> >
>>
>> Here it says that La is the 15th token of the Sentence. This is the
>> reason why it is marked as linkable.
>>
>>
> OK, I think I understand ... but if I get it right, then after lowercasing
> it, the token should be linkable too. But it is not. Search for "<look
> here>" to find the part of the message related to it.
>
>
>>
>> > EntityLinker + 14:'cognitives.' (lemma: null) linkable=true, matchable=true
>> >
>> > EntityLinker + 16:'recherche' (lemma: null) linkable=true, matchable=true
>> >
>> > EntityLinker >> searchStrings [La, recherche]
>> >
>> > EntityLinker - found 1 entities ...
>> >
>> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
>> >
>> > MainLabelTokenizer > use Tokenizer class
>> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
>> > for language null
>> >
>> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
>> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
>> > Language null not configured to be supported
>> >
>> > MainLabelTokenizer > use Tokenizer class
>> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
>> > for language null
>> >
>> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
>> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
>> > Language null not configured to be supported
>> >
>> > MainLabelTokenizer > use Tokenizer class
>> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
>> > for language null
>> >
>> > MainLabelTokenizer - tokenized la -> [la]
>> >
>> > EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
>> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
>> > EntityLinker >> Suggestions:
>> > EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
>> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
>> >
>> >
>> > Then I went to the page of the Jira issue 1049 and guessed that my token
>> > corresponded to the "unknown POS tag rule".
>> > "TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag" -> does
>> > this have anything to do with the *Upper Case Token Mode* parameter?
>> >
>> > Since my tokens 'La' are always at the beginning of the sentence, I
>> > guessed they fell into the category:
>> > "else - lower case token or sentence or sub-sentence start
>> >         * tokens equals or longer as
>> > TextProcessingConfig#minSearchTokenLength are marked as matchable"
>> >
>> > I don't understand that rule: is it supposed to override the *Upper Case
>> > Token Mode* parameter? Anyway, I tried with all 'La' lowercased, i.e. to
>> > 'la', and the tokens 'la' are never processed. Here is the log excerpt:
>> >
>> > ProcessingState > *15: Token: [1087, 1089] la* (pos:[Value [pos:
>> > DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.9445673708042409])
>> > chunk: 'none'
>> >
>>
>
> <look here>
>
>
>> > ProcessingState - TokenData: 'la'[linkable=false(*linkabkePos=false*)|
>> > matchable=false(*matchablePos=false*)| alpha=true| seachLength=true|
>> > upperCase=false]
>> >
>> >
>> > After a few minutes of reflection, I see that linkabkePos and
>> > matchablePos are no longer equal to "null". What is the rule for setting
>> > them to null or not? It is strange that just an uppercase letter can
>> > change the POS tag of the token that drastically for Talismane, but I
>> > cannot do anything about it. I still have the question about the
>> > supposed overriding of the *Upper Case Token Mode* parameter for the
>> > "unknown POS tag rule".
>> >
>> >
>> >
>> > On a closely related topic, the *Upper Case Token Mode* parameter doesn't
>> > seem to behave properly (or I missed something). I left "uc=NONE" in the
>> > config of the engine and monitored the processing of the token; here are
>> > the logs. On the token "utilisée" for the text: "AE est une mesure
>> > couramment utilisée."
>> >
>> > ProcessingState   > 5: Token: [543, 551] utilisée (pos:[Value [pos:
>> > VPP(olia:PastParticiple|olia:Verb)].prob=0.9864354941576942]) chunk: 'none'
>> > ProcessingState     - TokenData:
>> > 'utilisée'[linkable=false(linkabkePos=false)|
>> > matchable=false(matchablePos=false)| alpha=true| seachLength=true|
>> > upperCase=false]
>> >
>> > The token is not processed, which I am fine with since its POS tag is VPP.
>> >
>> >
>> > Now on the token "Utilisée" for the text: "AE est une mesure couramment
>> > Utilisée."
>> > ProcessingState   > 5: Token: [543, 551] Utilisée (pos:[Value [*pos: NPP*
>> > (olia:ProperNoun|olia:Noun)].*prob=0.19181597467804898*]) chunk: 'none'
>> > ProcessingState     - TokenData:
>> > 'Utilisée'[linkable=true(linkabkePos=null)|
>> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
>> > upperCase=true]
>> >
>> > So the POS tag is OK, but the prob doesn't reach the threshold (which I
>> > set to 0.55). Here is the log of the processing of the token:
>> >
>> > EntityLinker --- preocess Token 5: Utilisée (lemma: null) linkable=true,
>> > matchable=true | chunk: none
>> > EntityLinker     - 4:'couramment' (lemma: null) linkable=false,
>> > matchable=false
>> > EntityLinker     - 6:'.' (lemma: null) linkable=false, matchable=false
>> > EntityLinker     + 3:'mesure' (lemma: null) linkable=true, matchable=true
>> > EntityLinker   >> searchStrings [mesure, Utilisée]
>> >
>> > Is it a problem with the POS tagging, with the UpperCase linking, or did I
>> > misunderstand something?
>> >
>> > Thank you for the time you spend helping us users, it is very much
>> > appreciated.
>> > Best regards, Joseph
>> >
>> > 2013/6/3 Rupert Westenthaler <[email protected]>
>> >
>> >> Hi Joseph
>> >>
>> >> On Mon, Jun 3, 2013 at 10:01 AM, Joseph M'Bimbi-Bene
>> >> <[email protected]> wrote:
>> >> > I think it is the tokenizing process of Talismane NLP, since my
>> >> > enhancement chain is:
>> >> > -langdetect
>> >> > -talismaneNLP
>> >> > -MyVocabulary
>> >> >
>> >>
>> >> I also used Talismane when testing and I was not seeing tokens like that
>> >>
>> >> Here is an excerpt of my log (with minSearchTokenLength set to 2)
>> >>
>> >> --- preocess Token 11: AE (lemma: null) linkable=true, matchable=true
>> >> | chunk: none
>> >>     - 10:'*' (lemma: null) linkable=false, matchable=false
>> >>     - 12:'*' (lemma: null) linkable=false, matchable=false
>> >>     - 9:'indiquant' (lemma: null) linkable=false, matchable=false
>> >>     - 13:'une' (lemma: null) linkable=false, matchable=false
>> >>     - 8:')' (lemma: null) linkable=false, matchable=false
>> >>     + 14:'servitude' (lemma: null) linkable=false, matchable=true
>> >>   >> searchStrings [AE, servitude]
>> >>
>> >> best
>> >> Rupert
>> >>
>> >>
>> >> --
>> >> | Rupert Westenthaler             [email protected]
>> >> | Bodenlehenstraße 11                             ++43-699-11108907
>> >> | A-5500 Bischofshofen
>> >>
>>
>>
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>



--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
