Hello, I forgot to mention that

2013/6/3 Rupert Westenthaler <[email protected]>

> Hi Joseph,
>
> you are right the 'Upper Case Token Mode' interferes with the
> configured UpperCase mode. Maybe it would be better to remove the
> 'Upper Case Token Mode' parameter introduced by STANBOL-1049 and
> implement a similar functionality by using the existing "Upper Case"
> parameter. But I am not yet completely sure if this is possible. In any
> case I will link your previous mail with this issue and note this as an
> unresolved issue for STANBOL-1049.
>
> I think in your specific case it would be best to use a very low
> probability setting (e.g. prop=0.001) as it seems that a lot of the
> suggestions of Talismane are ok, even if they do have a very low
> probability. This would prevent the "unknown POS tag fallback" from
> taking effect and therefore work around the described issues.
>

I tried that; here is the config:

"*;lmmtip;uc=NONE;lc=Noun;prop=0.001;pprob=0.75"

But it doesn't change anything at all. Here is a log excerpt:

ProcessingState > 0: Token: [1087, 1089] La (pos:[Value [pos: ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'
ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)| matchable=true(matchablePos=null)| alpha=true| seachLength=true| upperCase=true]

> In addition you should consider to activate case sensitive matching.
> This would also ensure that 'La' in the text is NOT matched with 'LA'
> in the controlled vocabulary.
>

The problem is that there are some longer abbreviations in my vocabulary
and I would like to tag them without case sensitivity. All I want is for
my tokens to be linkable when the probability of their POS tag is above a
specified threshold, and that's it.

> Let me also add something about Upper Case and sentence start.
>
> On Mon, Jun 3, 2013 at 11:50 AM, Joseph M'Bimbi-Bene
> <[email protected]> wrote:
> > Here is the configuration of my linking engine:
> >
> > *;lmmtip;uc=NONE;lc=Noun;prop=0.55;pprob=0.75
> >
> > Since I didn't want determiners to be linkable when they are
> > uppercased at the beginning of a sentence, I explicitly specified
> > that uppercase tokens should not be treated specially.
>
> Upper Case Tokens at the beginning of sentences or sub-sentences (e.g.
> at the beginning of a quote) are ignored. So a 'La' at the beginning of a
> sentence MUST NOT be considered as an upper case token. So if you see
> 'La' to be linked at a sentence start, then this would indicate that
> the sentence detection does not work properly.
>
> Can you send the text sample you used, so that I can check why
> Talismane fails to correctly split the sentences.
>
> best
> Rupert
>
> > Here are some log excerpts:
> >
> > On token 'La', which is (I think) a determiner, anyway, definitely not a
> > Noun:
> >
> > ProcessingState > *15: Token: [1087, 1089] La* (pos:[Value [pos: *ADJ(olia:Adjective)].prob=0.016871281997002517*]) chunk: 'none'
> > ProcessingState - TokenData: 'La'[linkable=true(linkabkePos=null)| matchable=true(matchablePos=null)| alpha=true| seachLength=true| upperCase=true]
> >
> > EntityLinker --- *preocess Token 15: La* (lemma: null) linkable=true, matchable=true | chunk: none
> >
>
> Here it says that La is the 15th token of the Sentence. This is the
> reason why it is marked as linkable.
>
> > EntityLinker + 14:'cognitives.' (lemma: null) linkable=true, matchable=true
> >
> > EntityLinker + 16:'recherche' (lemma: null) linkable=true, matchable=true
> >
> > EntityLinker >> searchStrings [La, recherche]
> >
> > EntityLinker - found 1 entities ...
> >
> > EntityLinker > http://www.edf.fr/EdfAcronyme.owl#LA (ranking: null)
> >
> > MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer for language null
> >
> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419] org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer Language null not configured to be supported
> >
> > MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer for language null
> >
> > 03.06.2013 11:11:30.809 *TRACE* [Thread-5419] org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.lucene.LuceneLabelTokenizer Language null not configured to be supported
> >
> > MainLabelTokenizer > use Tokenizer class org.apache.stanbol.enhancer.engines.entitylinking.labeltokenizer.opennlp.OpenNlpLabelTokenizer for language null
> >
> > MainLabelTokenizer - tokenized la -> [la]
> >
> > EntityLinker + LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> > EntityLinker >> Suggestions:
> > EntityLinker - 0: LA[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for http://www.edf.fr/EdfAcronyme.owl#LA ranking: null
> >
> >
> > Then I went to the page of the JIRA issue 1049 and guessed that my token
> > corresponded to the "unknown POS tag rule".
> > "TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag" -> does
> > this have anything to do with the *Upper Case Token Mode* parameter?
> >
> > Since my tokens 'La' are always at the beginning of the sentence, I guessed
> > they fell in the category:
> > "else - lower case token or sentence or sub-sentence start
> >     * tokens equals or longer as
> > TextProcessingConfig#minSearchTokenLength are marked as matchable"
> >
> > I don't understand that rule: is it supposed to override the *Upper Case
> > Token Mode* parameter? Anyway, I tried with all 'La' lowercased, i.e. to
> > 'la', and the tokens 'la' are never processed. Here is the log excerpt:
> >
> > ProcessingState > *15: Token: [1087, 1089] la* (pos:[Value [pos: DET(olia:Determiner|olia:PronounOrDeterminer)].prob=0.9445673708042409]) chunk: 'none'
> >
> > ProcessingState - TokenData: 'la'[linkable=false(*linkabkePos=false*)| matchable=false(*matchablePos=false*)| alpha=true| seachLength=true| upperCase=false]
> >
> >
> > After a few minutes of reflection, I see that linkabkePos and
> > matchablePos are no longer equal to "null". What is the rule for setting
> > them to null or not? It is strange that just an uppercase letter can
> > change the POS tag of the token that drastically for Talismane, but I
> > cannot do anything about it. I still wonder whether the "unknown POS tag
> > rule" is supposed to override the *Upper Case Token Mode* parameter.
> >
> >
> >
> > On a quite related topic, the *Upper Case Token Mode* parameter doesn't
> > seem to behave properly (or I missed something). I left "uc=NONE" in the
> > config of the engine and monitored the processing of the token; here are
> > the logs. On the token "utilisée" for the text: "AE est une mesure
> > couramment utilisée."
> >
> > ProcessingState > 5: Token: [543, 551] utilisée (pos:[Value [pos: VPP(olia:PastParticiple|olia:Verb)].prob=0.9864354941576942]) chunk: 'none'
> > ProcessingState - TokenData: 'utilisée'[linkable=false(linkabkePos=false)| matchable=false(matchablePos=false)| alpha=true| seachLength=true| upperCase=false]
> >
> > The token is not processed, which I am fine with since its POS tag is VPP.
> >
> >
> > Now on the token "Utilisée" for the text: "AE est une mesure couramment
> > Utilisée."
> > ProcessingState > 5: Token: [543, 551] Utilisée (pos:[Value [*pos: NPP*(olia:ProperNoun|olia:Noun)].*prob=0.19181597467804898*]) chunk: 'none'
> > ProcessingState - TokenData: 'Utilisée'[linkable=true(linkabkePos=null)| matchable=true(matchablePos=null)| alpha=true| seachLength=true| upperCase=true]
> >
> > So the POS tag is OK, but the prob doesn't reach the threshold (which I
> > set to 0.55). Here is the log of the processing of the token:
> >
> > EntityLinker --- preocess Token 5: Utilisée (lemma: null) linkable=true, matchable=true | chunk: none
> > EntityLinker - 4:'couramment' (lemma: null) linkable=false, matchable=false
> > EntityLinker - 6:'.' (lemma: null) linkable=false, matchable=false
> > EntityLinker + 3:'mesure' (lemma: null) linkable=true, matchable=true
> > EntityLinker >> searchStrings [mesure, Utilisée]
> >
> > Is it a problem with the POS tagging, with the upper-case linking, or did
> > I misunderstand something?
> >
> > Thank you for the time you spend helping us users, it is very
> > appreciated.
> > best regards, Joseph
> >
> > 2013/6/3 Rupert Westenthaler <[email protected]>
> >
> >> Hi Joseph
> >>
> >> On Mon, Jun 3, 2013 at 10:01 AM, Joseph M'Bimbi-Bene
> >> <[email protected]> wrote:
> >> > I think it is the tokenizing process of Talismane NLP, since my
> >> > enhancement chain is:
> >> > -langdetect
> >> > -talismaneNLP
> >> > -MyVocabulary
> >> >
> >>
> >> I also used Talismane when testing and I was not seeing tokens like that
> >>
> >> Here is an excerpt of my log (with minSearchTokenLength set to 2)
> >>
> >> --- preocess Token 11: AE (lemma: null) linkable=true, matchable=true | chunk: none
> >> - 10:'*' (lemma: null) linkable=false, matchable=false
> >> - 12:'*' (lemma: null) linkable=false, matchable=false
> >> - 9:'indiquant' (lemma: null) linkable=false, matchable=false
> >> - 13:'une' (lemma: null) linkable=false, matchable=false
> >> - 8:')' (lemma: null) linkable=false, matchable=false
> >> + 14:'servitude' (lemma: null) linkable=false, matchable=true
> >> >> searchStrings [AE, servitude]
> >>
> >> best
> >> Rupert
> >>
> >>
> >> --
> >> | Rupert Westenthaler [email protected]
> >> | Bodenlehenstraße 11 ++43-699-11108907
> >> | A-5500 Bischofshofen
> >>
> >
>
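
A minimal sketch of the rule asked for in this thread: a token should be linkable only if its POS tag maps to one of the configured categories and the tag probability reaches the threshold. This is not Stanbol's implementation. The regular expression, the names parse_token_line and wanted_linkable, and the default values are assumptions made for illustration; the two sample lines are the ProcessingState log lines quoted in the thread, with the emphasis asterisks removed.

```python
import re

# Matches the "Token: [start, end] text (pos:[Value [pos: TAG(olia:...)].prob=P])"
# part of the ProcessingState log lines quoted in this thread (assumed layout).
TOKEN_LINE = re.compile(
    r"Token: \[(?P<start>\d+), (?P<end>\d+)\] (?P<text>\S+) "
    r"\(pos:\[Value \[pos: (?P<tag>\w+)\((?P<olia>[^)]*)\)\]"
    r"\.prob=(?P<prob>[0-9.]+)\]\)"
)


def parse_token_line(line):
    """Return token text, POS tag, OLiA categories and probability, or None."""
    match = TOKEN_LINE.search(line)
    if match is None:
        return None
    # "olia:ProperNoun|olia:Noun" -> {"ProperNoun", "Noun"}
    categories = {c.split(":", 1)[-1] for c in match.group("olia").split("|")}
    return {
        "text": match.group("text"),
        "tag": match.group("tag"),
        "categories": categories,
        "prob": float(match.group("prob")),
    }


def wanted_linkable(token, categories=frozenset({"Noun"}), min_prob=0.55):
    """Hypothetical rule: configured category AND probability >= threshold."""
    return token["prob"] >= min_prob and bool(token["categories"] & categories)


LA_LINE = (
    "ProcessingState > 15: Token: [1087, 1089] La (pos:[Value [pos: "
    "ADJ(olia:Adjective)].prob=0.016871281997002517]) chunk: 'none'"
)
UTILISEE_LINE = (
    "ProcessingState > 5: Token: [543, 551] Utilisée (pos:[Value [pos: "
    "NPP(olia:ProperNoun|olia:Noun)].prob=0.19181597467804898]) chunk: 'none'"
)

if __name__ == "__main__":
    for line in (LA_LINE, UTILISEE_LINE):
        token = parse_token_line(line)
        print(token["text"], token["tag"], token["prob"], wanted_linkable(token))
```

Under this rule neither 'La' (tagged ADJ) nor 'Utilisée' (tagged NPP with a probability below 0.55) would be linkable, whereas both logs above report linkable=true.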
