Re: Problem with entityLinking on Uppercase tokens

Joseph M'Bimbi-Bene Mon, 03 Jun 2013 02:51:48 -0700

It is weird ... anyway, it is not really an annoyance. But I am having (yet)
another little problem.
Here is the configuration of my linking engine:


*;lmmtip;uc=NONE;lc=Noun;prop=0.55;pprob=0.75
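
As a reading aid, the configuration line above is a semicolon-separated
entry: a language selector ("*"), bare flags ("lmmtip"), and key=value
parameters. The short Python sketch below only splits the string as it is
written in this message. "parse_language_config" is a hypothetical helper,
not part of Stanbol, and it does not check whether the engine recognises
each key (for example, "prop" is kept exactly as typed).

```python
def parse_language_config(line):
    """Split one language-configuration entry into selector, flags and parameters.

    Hypothetical helper for illustration only; it is not Stanbol code.
    """
    parts = [p.strip() for p in line.split(";") if p.strip()]
    language, flags, params = parts[0], [], {}
    for part in parts[1:]:
        if "=" in part:
            # key=value parameter, e.g. uc=NONE
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
        else:
            # bare flag, e.g. lmmtip
            flags.append(part)
    return language, flags, params


config = parse_language_config("*;lmmtip;uc=NONE;lc=Noun;prop=0.55;pprob=0.75")
print(config)
```

Listing the keys this way makes it easy to compare them one by one against
the parameter names given in the engine documentation.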

Since I didn't want determiners to be linkable when they are
uppercased at the beginning of a sentence, I explicitly specified that
uppercase tokens should not be treated specially.
Here are some log excerpts:

On the token 'La', which is (I think) a determiner, and in any case
definitely not a noun:

ProcessingState > *15: Token: [1087, 1089] La* (pos:[Value [pos: *
ADJ(olia:Adjective)].prob=0.016871281997002517*]) chunk: 'none'
ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
matchable=true(matchablePos=null)| alpha=true| seachLength=true|
upperCase=true]

EntityLinker --- *preocess Token 15: La* (lemma: null) linkable=true,
matchable=true | chunk: none

EntityLinker + 14:'cognitives.' (lemma: null) linkable=true, matchable=true

EntityLinker + 16:'recherche' (lemma: null) linkable=true, matchable=true

EntityLinker >> searchStrings [La, recherche]

EntityLinker - found 1 entities ...

EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
for language null

03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
Language null not configured to be supported

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
for language null

03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
Language null not configured to be supported

MainLabelTokenizer > use Tokenizer class
org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
for language null

MainLabelTokenizer - tokenized la -> [la]

EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
EntityLinker >> Suggestions:EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1]
score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking:
null


Then I went to the page of JIRA issue 1049 and guessed that my token
corresponded to the "unknown POS tag rule".
"TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag" -> does
this have anything to do with the *Upper Case Token Mode* parameter?

Since my tokens 'La' are always at the beginning of the sentence, I guessed
they fell into the category:
"else - lower case token or sentence or sub-sentence start
        * tokens equals or longer as
TextProcessingConfig#minSearchTokenLength are marked as matchable"

I don't understand that rule: is it supposed to override the *Upper Case
Token Mode* parameter? Anyway, I tried with all 'La' lowercased, i.e. to 'la',
and the tokens 'la' are never processed. Here is the log excerpt:

ProcessingState > *15: Token: [1087, 1089] la* (pos:[Value [pos:
DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.9445673708042409])
chunk: 'none'

ProcessingState - TokenData: 'la'[linkable=false(*linkabkePos=false*)|
matchable=false(*matchablePos=false*)| alpha=true| seachLength=true|
upperCase=false]


After a few minutes of reflection, I see that linkabkePos and matchablePos are
no longer equal to "null". What is the rule for setting them to null or not? It
is strange that a mere uppercase letter can change the POS tag of the token
that drastically for Talismane, but I cannot do anything about it. I still
have the question of whether the "unknown POS tag rule" is supposed to
override the *Upper Case Token Mode* parameter.



On a closely related topic, the *Upper Case Token Mode* parameter doesn't
seem to behave properly (or I missed something). I left "uc=NONE" in the
config of the engine and monitored the processing of the token; here are
the logs. On the token "utilisée" for the text: "AE est une mesure
couramment utilisée."

ProcessingState   > 5: Token: [543, 551] utilisée (pos:[Value [pos:
VPP(olia:PastParticiple|olia:Verb)].prob=0.9864354941576942]) chunk: 'none'
ProcessingState     - TokenData:
'utilisée'[linkable=false(linkabkePos=false)|
matchable=false(matchablePos=false)| alpha=true| seachLength=true|
upperCase=false]

The token is not processed, which I am fine with since its POS tag is VPP.


Now on the token "Utilisée" for the text: "AE est une mesure couramment
Utilisée."
ProcessingState   > 5: Token: [543, 551] Utilisée (pos:[Value [*pos: NPP*
(olia:ProperNoun|olia:Noun)].*prob=0.19181597467804898*]) chunk: 'none'
ProcessingState     - TokenData:
'Utilisée'[linkable=true(linkabkePos=null)|
matchable=true(matchablePos=null)| alpha=true| seachLength=true|
upperCase=true]

So the POS tag is OK, but the prob doesn't reach the threshold (which I set
to 0.55). Here is the log of the processing of the token:

EntityLinker --- preocess Token 5: Utilisée (lemma: null) linkable=true,
matchable=true | chunk: none
EntityLinker     - 4:'couramment' (lemma: null) linkable=false,
matchable=false
EntityLinker     - 6:'.' (lemma: null) linkable=false, matchable=false
EntityLinker     + 3:'mesure' (lemma: null) linkable=true, matchable=true
EntityLinker   >> searchStrings [mesure, Utilisée]

Is it a problem with the POS tagging, with the upper-case linking, or did I
misunderstand something?
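
One reading that is consistent with the four TokenData lines quoted in this
message: the POS tag seems to be taken into account only when its
probability reaches the configured threshold (0.55 here); below that,
linkabkePos is logged as null and some other rule decides. The Python sketch
below encodes just that hypothesis. It is inferred from the logs, not taken
from the Stanbol source, and "pos_decision" is a hypothetical name.

```python
MIN_POS_PROB = 0.55             # threshold quoted in this message
LINKABLE_CATEGORIES = {"Noun"}  # from lc=Noun in the configuration


def pos_decision(categories, prob, min_prob=MIN_POS_PROB):
    """Hypothetical model of the logged linkabkePos value.

    Returns None when the tag is too uncertain to be used, otherwise
    True/False depending on whether any category is linkable.
    """
    if prob < min_prob:
        return None
    return bool(LINKABLE_CATEGORIES & set(categories))


# (token, olia categories, probability) copied from the log excerpts
tokens = [
    ("La", ["Adjective"], 0.016871281997002517),
    ("la", ["Determiner", "PronounOrDeterminer"], 0.9445673708042409),
    ("utilisée", ["PastParticiple", "Verb"], 0.9864354941576942),
    ("Utilisée", ["ProperNoun", "Noun"], 0.19181597467804898),
]

results = {token: pos_decision(cats, prob) for token, cats, prob in tokens}
for token, decision in results.items():
    print(token, "->", decision)
```

Under this assumption 'La' and 'Utilisée' come out as None (logged as null)
and 'la' and 'utilisée' as False, which matches the excerpts. What decides
linkable=true in the None case is exactly the open question.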

Thank you for the time you spend helping us users; it is very much appreciated.
Best regards, Joseph

2013/6/3 Rupert Westenthaler <[email protected]>

> Hi Joseph
>
> On Mon, Jun 3, 2013 at 10:01 AM, Joseph M'Bimbi-Bene
> <[email protected]> wrote:
> > I think it is the tokenizing process of Talismane NLP, since my
> > enhancement chain is:
> > -langdetect
> > -talismaneNLP
> > -MyVocabulary
> >
>
> I also used Talismane when testing and I was not seeing tokens like that
>
> Here is an excerpt of my log (with minSearchTokenLength set to 2)
>
> --- preocess Token 11: AE (lemma: null) linkable=true, matchable=true
> | chunk: none
>     - 10:'*' (lemma: null) linkable=false, matchable=false
>     - 12:'*' (lemma: null) linkable=false, matchable=false
>     - 9:'indiquant' (lemma: null) linkable=false, matchable=false
>     - 13:'une' (lemma: null) linkable=false, matchable=false
>     - 8:')' (lemma: null) linkable=false, matchable=false
>     + 14:'servitude' (lemma: null) linkable=false, matchable=true
>   >> searchStrings [AE, servitude]
>
> best
> Rupert
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>
