It is weird ... anyway, it is not really an annoyance. But I am having (yet) another little problem. Here is the configuration of my linking engine:
*;lmmtip;uc=NONE;lc=Noun;prop=0.55;pprob=0.75

Since I didn't want determiners to be linkable when they are uppercased at the beginning of a sentence, I explicitly specified that uppercase tokens should not be treated specially. Here are some log excerpts.

On the token 'La', which is (I think) a determiner, and in any case definitely not a noun:

ProcessingState > *15: Token: [1087, 1089] La* (pos:[Value [pos: * ADJ(olia:Adjective)].prob=0.016871281997002517*]) chunk: 'none'
ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)| matchable=true(matchablePos=null)| alpha=true| seachLength=true| upperCase=true]
EntityLinker --- *preocess Token 15: La* (lemma: null) linkable=true, matchable=true | chunk: none
EntityLinker + 14:'cognitives.' (lemma: null) linkable=true, matchable=true
EntityLinker + 16:'recherche' (lemma: null) linkable=true, matchable=true
EntityLinker >> searchStrings [La, recherche]
EntityLinker - found 1 entities ...
EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer for language null
03.06.2013 11:11:30.809 *TRACE* [Thread-5419] org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer Language null not configured to be supported
MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer for language null
03.06.2013 11:11:30.809 *TRACE* [Thread-5419] org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer Language null not configured to be supported
MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
MainLabelTokenizer - tokenized la -> [la]
EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
EntityLinker >> Suggestions:
EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking: null

Then I went to the page of JIRA issue 1049 and guessed that my token fell under the "unknown POS tag rule".

"TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag" -> does this have anything to do with the *Upper Case Token Mode* parameter?

Since my 'La' tokens are always at the beginning of a sentence, I guessed they fell into this category:

"else - lower case token or sentence or sub-sentence start * tokens equals or longer as TextProcessingConfig#minSearchTokenLength are marked as matchable"

I don't understand that rule: is it supposed to override the *Upper Case Token Mode* parameter?

Anyway, I tried again with every 'La' lowercased to 'la', and the 'la' tokens are never processed. Here is the log excerpt:

ProcessingState > *15: Token: [1087, 1089] la* (pos:[Value [pos: DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.9445673708042409]) chunk: 'none'
ProcessingState - TokenData: 'la'[linkable=false(*linkabkePos=false*)| matchable=false(*matchablePos=false*)| alpha=true| seachLength=true| upperCase=false]

After a few minutes of reflection, I see that linkabkePos and matchablePos are no longer null. What is the rule that decides whether they are set to null or not? It is strange that a mere uppercase letter can change the POS tag of a token that drastically for Talismane, but I cannot do anything about that. I still have my question about whether the "unknown POS tag rule" is supposed to override the *Upper Case Token Mode* parameter.

On a closely related topic, the *Upper Case Token Mode* parameter doesn't seem to behave properly (or I missed something). I left "uc=NONE" in the config of the engine and monitored the processing of the token; here are the logs.

On the token "utilisée" for the text "AE est une mesure couramment utilisée.":
ProcessingState > 5: Token: [543, 551] utilisée (pos:[Value [pos: VPP(olia:PastParticiple|olia:Verb)].prob=0.9864354941576942]) chunk: 'none'
ProcessingState - TokenData: 'utilisée'[linkable=false(linkabkePos=false)| matchable=false(matchablePos=false)| alpha=true| seachLength=true| upperCase=false]

The token is not processed, which I am fine with since its POS tag is VPP.

Now, on the token "Utilisée" for the text "AE est une mesure couramment Utilisée.":

ProcessingState > 5: Token: [543, 551] Utilisée (pos:[Value [*pos: NPP* (olia:ProperNoun|olia:Noun)].*prob=0.19181597467804898*]) chunk: 'none'
ProcessingState - TokenData: 'Utilisée'[linkable=true(linkabkePos=null)| matchable=true(matchablePos=null)| alpha=true| seachLength=true| upperCase=true]

So the POS tag is OK, but the prob doesn't reach the threshold (which I set to 0.55). Here is the log of the processing of the token:

EntityLinker --- preocess Token 5: Utilisée (lemma: null) linkable=true, matchable=true | chunk: none
EntityLinker - 4:'couramment' (lemma: null) linkable=false, matchable=false
EntityLinker - 6:'.' (lemma: null) linkable=false, matchable=false
EntityLinker + 3:'mesure' (lemma: null) linkable=true, matchable=true
EntityLinker >> searchStrings [mesure, Utilisée]

Is it a problem with the POS tagging, with the upper case linking, or did I misunderstand something?

Thank you for the time you spend helping us users; it is much appreciated.
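To make the question concrete, here is a minimal Python sketch of the rule as I currently read it from the log excerpts above. It is only my guess, not Stanbol's actual code: the names `PosTag`, `linkable_pos` and `is_linkable` are made up for illustration, the OLIA categories are simplified to one label per token, and I am assuming that the first threshold (0.55) is the one for accepting a linkable category and the second (0.75) the one for rejecting a non-linkable one.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class PosTag:
    category: str   # simplified lexical category, e.g. "Noun" (hypothetical model)
    prob: float     # probability reported by the POS tagger


def linkable_pos(tag: Optional[PosTag], linkable: Set[str],
                 min_prob: float, min_exclude_prob: float) -> Optional[bool]:
    """Return True/False when the POS tag is trusted, None when it is not."""
    if tag is None:
        return None
    if tag.category in linkable:
        # A linkable category only counts if the tagger is confident enough.
        return True if tag.prob >= min_prob else None
    # A non-linkable category only excludes the token if it is confident enough.
    return False if tag.prob >= min_exclude_prob else None


def is_linkable(token: str, tag: Optional[PosTag], linkable: Set[str],
                min_prob: float = 0.55, min_exclude_prob: float = 0.75) -> bool:
    decision = linkable_pos(tag, linkable, min_prob, min_exclude_prob)
    if decision is not None:
        return decision
    # Assumed fallback when the POS tag is not trusted: link upper case tokens.
    return token[:1].isupper()


if __name__ == "__main__":
    nouns = {"Noun"}
    samples = [
        ("La", PosTag("Adjective", 0.016871281997002517)),
        ("la", PosTag("Determiner", 0.9445673708042409)),
        ("utilisée", PosTag("Verb", 0.9864354941576942)),
        ("Utilisée", PosTag("Noun", 0.19181597467804898)),
    ]
    for token, tag in samples:
        print(token, linkable_pos(tag, nouns, 0.55, 0.75),
              is_linkable(token, tag, nouns))
```

If this reading is right, both 'La' and 'Utilisée' end up with an untrusted POS tag (the null in linkabkePos) and become linkable only through the upper case fallback, which would explain why "uc=NONE" appears to have no effect on them.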
Best regards,
Joseph

2013/6/3 Rupert Westenthaler <[email protected]>

> Hi Joseph
>
> On Mon, Jun 3, 2013 at 10:01 AM, Joseph M'Bimbi-Bene
> <[email protected]> wrote:
> > I think it is the tokenizing process of Talismane NLP, since my enhancement
> > chain is :
> > -langdetect
> > -talismaneNLP
> > -MyVocabulary
> >
>
> I also used Talismane when testing and I was not seeing tokens like that
>
> Here are an excerpt of my log (with minSearchTokenLength set to 2)
>
> --- preocess Token 11: AE (lemma: null) linkable=true, matchable=true | chunk: none
> - 10:'*' (lemma: null) linkable=false, matchable=false
> - 12:'*' (lemma: null) linkable=false, matchable=false
> - 9:'indiquant' (lemma: null) linkable=false, matchable=false
> - 13:'une' (lemma: null) linkable=false, matchable=false
> - 8:')' (lemma: null) linkable=false, matchable=false
> + 14:'servitude' (lemma: null) linkable=false, matchable=true
> >> searchStrings [AE, servitude]
>
> best
> Rupert
>
>
> --
> | Rupert Westenthaler [email protected]
> | Bodenlehenstraße 11 ++43-699-11108907
> | A-5500 Bischofshofen
>
