Re: Problem with entityLinking on Uppercase tokens

Joseph M'Bimbi-Bene Mon, 03 Jun 2013 07:13:01 -0700

Hello, i forgot to mention that


2013/6/3 Rupert Westenthaler <[email protected]>

> Hi Joseph,
>
> you are right the 'Upper Case Token Mode' interferes with the
> configured UpperCase mode. Maybe it would be better to remove the
> 'Upper Case Token Mode' parameter introduced by STANBOL-1049 and
> implement a similar functionality by using the existing "Upper Case"
> parameter. But I am not yet completely sure if this is possible. In any
> case I will link your previous mail with this issue and note this as an
> unresolved issue for STANBOL-1049.
>
> I think in your specific case it would be best to use a very low
> probability setting (e.g. prop=0.001) as it seems that a lot of the
> suggestions of Talismane are ok, even if they do have a very low
> probability. This would keep the "unknown POS tag fallback" from taking
> effect and therefore work around the described issues.
>
>
I tried that, here is the config:
"*;lmmtip;uc=NONE;lc=Noun;prop=0.001;pprob=0.75"

But it doesn't change anything at all. Here is a log excerpt:

ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos:
ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'

ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
matchable=true(matchablePos=null)| alpha=true| seachLength=true|
upperCase=true]
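
For reference, the configuration line above can be split into its parts with a few lines of Python. This is only a sketch of the visible syntax (a language, then bare flags, then key=value pairs, all separated by ';'); it is not Stanbol's own parser, the function name is made up for illustration, and the meaning of each key beyond what is said in this thread is not assumed here.

```python
def parse_language_config(line):
    """Split a 'lang;flag;key=value;...' line into language, flags and parameters.

    Hypothetical helper: it mirrors only the syntax visible in this thread.
    """
    parts = [p.strip() for p in line.strip().strip('"').split(";") if p.strip()]
    language, flags, params = parts[0], [], {}
    for part in parts[1:]:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
        else:
            flags.append(part)
    return language, flags, params


language, flags, params = parse_language_config(
    '"*;lmmtip;uc=NONE;lc=Noun;prop=0.001;pprob=0.75"'
)
print(language, flags, params)
```

Printing the result makes it easy to check that the low "prop" value and "uc=NONE" are really what the engine was given.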



> In addition you should consider to activate case sensitive matching.
> This would also ensure that 'La' in the text is NOT matched with 'LA'
> in the controlled vocabulary.
>
>
The problem is that there are some longer abbreviations in my vocabulary
and I would like to tag them without case sensitivity.
All I want is for my tokens to be linkable when the probability of their
POS tag is above a specified threshold, and that's it.
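
To make the behaviour I am asking for concrete, here is a minimal sketch in Python. It is NOT how the engine currently behaves and not Stanbol code: the class and function names are invented for illustration, and the rule deliberately ignores upper case. The POS categories and probabilities are the ones from the log excerpts in this thread.

```python
from dataclasses import dataclass


@dataclass
class PosToken:
    """Hypothetical token record: text, POS categories and tagger probability."""
    text: str
    categories: frozenset
    prob: float


def is_linkable(token, linkable_categories, min_prob):
    """Wanted rule: linkable category AND probability at or above the threshold."""
    return bool(token.categories & linkable_categories) and token.prob >= min_prob


# Values copied from the ProcessingState log lines in this thread.
la_upper = PosToken("La", frozenset({"Adjective"}), 0.016871281997002517)
la_lower = PosToken("la", frozenset({"Determiner", "PronounOrDeterminer"}),
                    0.9445673708042409)
utilisee = PosToken("Utilisée", frozenset({"ProperNoun", "Noun"}),
                    0.19181597467804898)

nouns = frozenset({"Noun"})
results_055 = [is_linkable(t, nouns, 0.55) for t in (la_upper, la_lower, utilisee)]
results_0001 = [is_linkable(t, nouns, 0.001) for t in (la_upper, la_lower, utilisee)]
print(results_055, results_0001)
```

Under this rule 'La' is never linkable because its tag is not a Noun, whatever the threshold, and 'Utilisée' becomes linkable only once the threshold drops below its probability.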


> Let me also add something about Upper Case and sentence start.
>
> On Mon, Jun 3, 2013 at 11:50 AM, Joseph M'Bimbi-Bene
> <[email protected]> wrote:
> > Here is the configuration of my linking engine:
> >
> > *;lmmtip;uc=NONE;lc=Noun;prop=0.55;pprob=0.75
> >
> > Since I didn't want determiners to be linkable when they are
> > uppercased at the beginning of a sentence, I explicitly specified
> > that uppercase tokens should not be treated specially.
>
> Upper Case Tokens at the beginning of sentences or sub-sentences (e.g.
> at the beginning of a quote) are ignored. So a 'La' at the beginning of a
> sentence MUST NOT be considered as an upper case token. So if you see
> 'La' being linked at a sentence start, then this would indicate that
> the sentence detection does not work properly.
>
> Can you send the text sample you used, so that I can check why
> Talismane fails to correctly split the sentences.
>
> best
> Rupert
>
> > Here are some log excerpts:
> >
> > On token 'La', which is (I think) a determiner, and in any case
> > definitely not a Noun:
> >
> > ProcessingState > *15: Token: [1087, 1089] La* (pos:[Value [pos: *
> > ADJ(olia:Adjective)].prob=0.016871281997002517*]) chunk: 'none'
> > ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|
> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> > upperCase=true]
> >
> > EntityLinker --- *preocess Token 15: La* (lemma: null) linkable=true,
> > matchable=true | chunk: none
> >
>
> Here it says that La is the 15th token of the Sentence. This is the
> reason why it is marked as linkable.
>
>
> > EntityLinker + 14:'cognitives.' (lemma: null) linkable=true,
> > matchable=true
> >
> > EntityLinker + 16:'recherche' (lemma: null) linkable=true, matchable=true
> >
> > EntityLinker >> searchStrings [La, recherche]
> >
> > EntityLinker - found 1 entities ...
> >
> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > for language null
> >
> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > Language null not configured to be supported
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > for language null
> >
> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419]
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer
> > Language null not configured to be supported
> >
> > MainLabelTokenizer > use Tokenizer class
> > org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer
> > for language null
> >
> > MainLabelTokenizer - tokenized la -> [la]
> >
> > EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
> > http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> > EntityLinker >> Suggestions:EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1]
> > score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking:
> > null
> >
> >
> > Then I went to the page of Jira issue 1049 and guessed that my token
> > corresponded to the "unknown POS tag rule".
> > "TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag" -> does
> > this have anything to do with the *Upper Case Token Mode* parameter?
> >
> > Since my tokens 'La' are always at the beginning of the sentence, I
> > guessed they fell into the category:
> > "else - lower case token or sentence or sub-sentence start
> >         * tokens equals or longer as
> > TextProcessingConfig#minSearchTokenLength are marked as matchable"
> >
> > I don't understand that rule: is it supposed to override the *Upper Case
> > Token Mode* parameter? Anyway, I tried with all 'La' lowercased, i.e. to
> > 'la', and the tokens 'la' are never processed. Here is the log excerpt:
> >
> > ProcessingState > *15: Token: [1087, 1089] la* (pos:[Value [pos:
> > DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.9445673708042409])
> > chunk: 'none'
> >
> > ProcessingState - TokenData: 'la'[linkable=false(*linkabkePos=false*)|
> > matchable=false(*matchablePos=false*)| alpha=true| seachLength=true|
> > upperCase=false]
> >
> >
> > After a few minutes of reflection, I see that linkabkePos and
> > matchablePos are no longer equal to "null". What is the rule for setting
> > them to null or not? It is strange that just an uppercase letter can
> > change the POS tag of the token that drastically for Talismane, but I
> > cannot do anything about it. I still have the question of whether the
> > "unknown POS tag rule" is supposed to override the *Upper Case Token
> > Mode* parameter.
> >
> >
> >
> > On a closely related topic, the *Upper Case Token Mode* parameter doesn't
> > seem to behave properly (or I missed something). I left "uc=NONE" in the
> > config of the engine and monitored the processing of the token; here are
> > the logs. On the token "utilisée" for the text: "AE est une mesure
> > couramment utilisée."
> >
> > ProcessingState   > 5: Token: [543, 551] utilisée (pos:[Value [pos:
> > VPP(olia:PastParticiple|olia:Verb)].prob=0.9864354941576942]) chunk: 'none'
> > ProcessingState     - TokenData:
> > 'utilisée'[linkable=false(linkabkePos=false)|
> > matchable=false(matchablePos=false)| alpha=true| seachLength=true|
> > upperCase=false]
> >
> > The token is not processed, which I am fine with since its POS tag is VPP.
> >
> >
> > Now on the token "Utilisée" for the text: "AE est une mesure couramment
> > Utilisée."
> > ProcessingState   > 5: Token: [543, 551] Utilisée (pos:[Value [*pos: NPP*
> > (olia:ProperNoun|olia:Noun)].*prob=0.19181597467804898*]) chunk: 'none'
> > ProcessingState     - TokenData:
> > 'Utilisée'[linkable=true(linkabkePos=null)|
> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|
> > upperCase=true]
> >
> > So the POS tag is OK, but the prob doesn't reach the threshold (which I
> > set to 0.55). Here is the log of the processing of the token:
> >
> > EntityLinker --- preocess Token 5: Utilisée (lemma: null) linkable=true,
> > matchable=true | chunk: none
> > EntityLinker     - 4:'couramment' (lemma: null) linkable=false,
> > matchable=false
> > EntityLinker     - 6:'.' (lemma: null) linkable=false, matchable=false
> > EntityLinker     + 3:'mesure' (lemma: null) linkable=true, matchable=true
> > EntityLinker   >> searchStrings [mesure, Utilisée]
> >
> > Is it a problem with the POS tagging, with the UpperCase linking, or
> > did I misunderstand something?
> >
> > Thank you for the time you spend helping us users, it is very much
> > appreciated.
> > Best regards, Joseph
> >
> > 2013/6/3 Rupert Westenthaler <[email protected]>
> >
> >> Hi Joseph
> >>
> >> On Mon, Jun 3, 2013 at 10:01 AM, Joseph M'Bimbi-Bene
> >> <[email protected]> wrote:
> >> > I think it is the tokenizing process of Talismane NLP, since my
> >> > enhancement chain is:
> >> > -langdetect
> >> > -talismaneNLP
> >> > -MyVocabulary
> >> >
> >>
> >> I also used Talismane when testing and I was not seeing tokens like that
> >>
> >> Here is an excerpt of my log (with minSearchTokenLength set to 2)
> >>
> >> --- preocess Token 11: AE (lemma: null) linkable=true, matchable=true
> >> | chunk: none
> >>     - 10:'*' (lemma: null) linkable=false, matchable=false
> >>     - 12:'*' (lemma: null) linkable=false, matchable=false
> >>     - 9:'indiquant' (lemma: null) linkable=false, matchable=false
> >>     - 13:'une' (lemma: null) linkable=false, matchable=false
> >>     - 8:')' (lemma: null) linkable=false, matchable=false
> >>     + 14:'servitude' (lemma: null) linkable=false, matchable=true
> >>   >> searchStrings [AE, servitude]
> >>
> >> best
> >> Rupert
> >>
> >>
> >> --
> >> | Rupert Westenthaler             [email protected]
> >> | Bodenlehenstraße 11                             ++43-699-11108907
> >> | A-5500 Bischofshofen
> >>
>
>
>
> --
> | Rupert Westenthaler             [email protected]
> | Bodenlehenstraße 11                             ++43-699-11108907
> | A-5500 Bischofshofen
>
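
When comparing several of the TokenData log lines quoted in this thread, it can help to turn them into dictionaries. The following is a rough Python sketch based only on the line format visible in the excerpts above; the function name is made up, and the log format itself may differ in other Stanbol versions.

```python
import re

# Matches: TokenData: 'La'[linkable=true(linkabkePos=null)| ... | upperCase=true]
TOKEN_DATA = re.compile(r"TokenData:\s*'(?P<text>[^']*)'\[(?P<fields>[^\]]*)\]")


def parse_token_data(log_text):
    """Return (token text, {field: value}) for the first TokenData entry, or None."""
    # Undo mail quoting ('> > ') and line wrapping before matching.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    match = TOKEN_DATA.search(flat)
    if match is None:
        return None
    fields = {}
    for field in match.group("fields").split("|"):
        key, _, value = field.strip().partition("=")
        fields[key] = value
    return match.group("text"), fields


sample = (
    "> > ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)|\n"
    "> > matchable=true(matchablePos=null)| alpha=true| seachLength=true|\n"
    "> > upperCase=true]"
)
token_text, token_fields = parse_token_data(sample)
print(token_text, token_fields)
```

The field names (including their spelling, e.g. "linkabkePos" and "seachLength") are kept exactly as they appear in the log.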
